DOC PREVIEW
UCSD CSE 280B - Sequence-Based Filtering Method

This preview shows page 1-2-3-4 out of 11 pages.

Save
View full document
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

A Sequence Based Filtering Method for ncRNA Identification and its Application to Searching for Riboswitch Elements Shaojie Zhang Ilya Borovok Yair Aharonowitz Roded Sharan Vineet Bafna Abstract Recent studies have uncovered an RNA world in which non coding RNA ncRNA sequences play a central role in the regulation of gene expression Computational studies on ncRNA have been directed toward developing detection methods for ncRNAs State of the art methods for the problem like covariance models suffer from high computational cost underscoring the need for efficient filtering approaches that can identify promising sequence segments and accelerate the detection process In this paper we make several contributions toward this goal First we formalize the concept of a filter and provide figures of merit that allow comparing between filters Second we design efficient sequence based filters that dominate the current state of the art HMM filters Third we provide a new formulation of the covariance model that allows speeding up RNA alignment We demonstrate the power of our approach on both synthetic data and real bacterial genomes We then apply our algorithm to the detection of novel riboswitch elements from the whole bacterial and archaeal genomes Our results point to a number of novel riboswitch candidates and include genomes that were not previously known to contain riboswitches 1 Introduction A database filter is a computational procedure that takes a database as input and outputs a subset of the database The goal is to ensure that the object being searched for remains in the database after filtering the filtered database is significantly smaller and the filtering operation is very fast Filters have played a central role in bioinformatics BLAST is the prototypical example with a keyword match filter greatly improving the search for remote homologs Indeed improving the filters for sequence similarity search remains an intensively researched area with many recent publications Filtering is also being applied in other bioinformatics domains including structural genomics 9 proteomics mass spectrometry 5 17 and non coding RNA ncRNA 21 22 23 Here we revisit the notion of filtering focusing on applications to detecting ncRNAs ncRNAs are genomic sequences that are transcribed but not translated and function as RNA molecules Recent discoveries of many novel families and sub families of ncRNA have underscored their importance and hint at an RNA world where coding and non coding genes play equally important roles 3 15 20 However computational tools for detecting novel ncRNA are not yet mature and the signal for ncRNA is considerably weaker than that for protein coding genes Therefore a comparative approach to discovering novel homologs of a query ncRNA is also increasing in importance much like BLAST is often used to identify novel homologs of coding genes While viable this approach poses a technical challenge since the known algorithms for aligning ncRNA are at least an order of magnitude slower than sequence alignment 8 23 and even slower when other secondary structures such as pseudoknots are allowed 1 Indeed using a search based on a covariance model CM 2 it would take 54 hours to query two bacterial genomes E coli K12 and Staphylococcus aureus MW2 7 5 Mb for a sub family such as the FMN riboswitch 145 bp This makes the filtering problem both easier and harder On the one hand the alignment is so expensive cubic time that even a computationally intensive filter quadratic time could be useful At the same time since the alignment is so expensive the filtering itself must be very efficient in removing a large portion of the database while retaining the true hits For example a filter that removes 50 of the database is still not sufficient to make CM searches tractable for large genomic sequences Algorithms that align ncRNA are expensive because they score for both sequence and structure conservation and the latter task is computationally intensive Filtering for RNA was systematically explored by Weinberg and Ruzzo 21 22 who used a pigeonhole argument to show that it is enough to scan for sequence similarity expressed by a hidden Markov model leaving the more expensive structural alignment for the filtered sequence Henceforth we refer to their filter as Department of Computer Science and Engineering University of California San Diego La Jolla CA 92093 0404 USA shzhang vbafna cs ucsd edu Supported by NSF DBI 0516440 Department of Molecular Microbiology and Biotechnology Tel Aviv University Tel Aviv 69978 Israel borovok yaira post tau ac il School of Computer Science Tel Aviv University Tel Aviv 69978 Israel roded post tau ac il Supported by an Alon Fellowship 1 HMM filter Subsequently they and independently some of us also used partial structure conservation for the filtering 21 23 Even after applying these filters the problem remains computationally expensive and it is worthwhile to ask if one can do better Here we make several contributions in this regard First we formalize the concept of a filter and provide figures of merit that allow comparison between filters Second we design novel filters and show that they dominate the HMM filters of Weinberg and Ruzzo 22 we defer a formal definition of the notion of dominance to Section 2 In practice this leads to 1 2 orders of magnitude decrease in search time However our main point is not that we can build better filters but that it is relatively easy to do so Indeed the filters we design are very simple conceptually indicating perhaps that we have only scratched the surface on this problem The main contribution of this paper is a principled approach to combining filters that have different performance characteristics to achieve dominance Section 3 We also revisit the issue of alignment by aligning an RNA profile to a filtered substring We emphasize that there is a strong practically 1 1 correspondence with CMs in both the alignment algorithm and the observed results Indeed the advantage of the CMs is that their parameters can be trained using the same formalization However our reformulation helps us take advantage of simple tricks like banding and others which help speed up the alignment appreciable loss in accuracy Section 4 Similar extensions would require a departure from the formalism of stochastic context free grammars that support CMs This also has an impact on filtering Unlike previous approaches we do not tie the accuracy of our filtering procedure


View Full Document

UCSD CSE 280B - Sequence-Based Filtering Method

Download Sequence-Based Filtering Method
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Sequence-Based Filtering Method and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Sequence-Based Filtering Method and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?