Querying for ncRNA Filtering March 2006 Vineet Bafna Homology based ncRNA discovery Basic Question Given a non coding RNA RNA family as query can you find all sequences in the database that are similar in both sequence and structure March 2006 Vineet Bafna Searching ncRNA genes Scanning Filtering Blast based heuristic filter preprocessing step for CM search Profile HMM filter Weinberg and Ruzzo Recomb 2004 sequence filter quadratic time With efficient filters for single query RSEARCH Klein and Eddy BMC Bioinformatics 2003 cubic time CM Search Griffiths Jones et al NAR 2003 cubic time FastR Bafna and Zhang CSB 2004 alignment cubic time Using conserved secondary structure nested loops or Multiloops structure filter structure filter linear time Efficient filters for an RNA profile alignment March 2006 Vineet Bafna An RNA family There are some sequence structure conservation in an RNA family We want to search genome using an RNA family as a query March 2006 Vineet Bafna RNA profiles Covariance Models If we have a good multiple alignment we can build a profile We build a separate sequence and structural profile The structure is maintained as a tree AC GGG CCC T ACG GG CC CAT A TGG CCA T March 2006 A C G T 3 0 0 0 0 0 2 0 0 1 0 0 1 0 2 0 0 1 1 1 0 0 3 0 0 0 0 3 0 0 0 3 0 0 0 Vineet Bafna 0 3 0 0 0 1 1 0 0 1 0 1 0 0 2 1 0 0 0 2 0 0 0 3 0 A A A C A G A T C A C C C G C T G A G C G G G T T A T C T G T T 0 0 1 3 1 1 0 0 0 3 0 0 Scoring RNA profiles Align the target sequence to the profile The score is a weighted sum of sequence and structural conservation March 2006 Vineet Bafna Using a binary tree to represent an RNA profile Solid nodes represent the base pairing columns Void nodes represent either branch point or unpaired columns C G A C Branch points C G A U C G March 2006 Vineet Bafna C U RNA Profile alignment procedure For all intervals i j for all nodes v v If v is a real node i If v is a dummy node with 1 child If v is a dummy node with 2 children March 2006 Vineet Bafna j The filtering problem road map RNA alignment algorithm is expensive in computation at least cubic time Allow banding How to design fast efficient filter for an RNA family alignment We want formalize the concept of sequence filter chain filter CF Comparison between filters Combining filters to achieve better performance March 2006 Vineet Bafna Formalizing ncRNA filters Score of an alignment of a query model CM or RNA profile and a target subsequence is based on sequence and structure similarity If gaps are allowed SeqScore computation is O n 2 time and StructScore computation is at least O n 3 time HMM filter Weinberg and Ruzzo used SeqScore as the sequence filter one order of magnitude faster than the structure alignment March 2006 Vineet Bafna How to measure the performance of the filter Assume filter F to test on random database D inserted a set of true sequences A Running time Efficiency Accuracy Dominating filter Filter combinations Union Composition March 2006 Vineet Bafna Conditions for domination F can be dominated if there exists F1 that is as accurate but much faster Proof March 2006 Vineet Bafna Exploiting the condition If we are an order of magnitude faster we do not have to be as efficient If T1 T 1 n eF1 only needs to be n 1 n March 2006 Vineet Bafna Sequence filters HMM filter dominates the structure alignment Accurate and efficient But slow D n2 We want to define a better sequence based filter which dominates the HMM filter The pigeonhole principle Keywords searching is much faster around linear time March 2006 Vineet Bafna Keywords 1 2 3 4 Create sequence profile based on the input multiple alignment length n Compute the scores of all possible words with length w for all possible starting positions in the profile by just adding up the corresponding profile matrix scores Compute a cutoff score and output the words and starting positions of the words in the profile Cutoff will be T Average Score for each sequence n w The output is sorted by starting positions for each word that may possible appear based on the profile matrix March 2006 A C G T 1 1 0 0 0 0 2 2 1 8 1 1 1 3 1 0 1 1 0 0 4 5 5 2 6 1 7 1 8 6 9 7 1 1 5 0 2 2 2 2 9 1 1 0 0 8 1 2 0 0 6 0 1 1 2 1 Suppose w 4 we will have following keywords Vineet Bafna Keywords Sarting position ACAA 1 ACAT 1 CGAA 6 CGTA 6 Using one keyword to filter Generate all keywords that score above a threshold Some keywords may not appear in the original alignment as long as the score for the keyword meets the cutoff Tl L This may can handle certain mutations We try to use multiple keywords and distant constraints to improve the efficiency In contrast BLAST keywords are parts of the original query sequence March 2006 Vineet Bafna Multiple keywords chain filtering A filter has the following parameters l length of keyword m number of keywords slack in distance K subset of keywords Efficiency of this filter March 2006 Vineet Bafna The efficiency of a chain filter decreases with increasing number of keywords 5 0 5 F 10 15 log e sK 5 20 sK 10 25 sK 15 30 sK 20 35 1 March 2006 2 3 4 5 m 6 Vineet Bafna 7 8 9 10 Fast search with chain filters Build a trie or hash on the database For every hit of w to position i mark the region i label w Select positions that have many m overlapping intervals i label w March 2006 w Vineet Bafna Running time for chain filters The computation time is o L D Only keeping the dominating gaps we define LABEL as the expected position in the query sequence March 2006 Vineet Bafna Known riboswitch families in Rfam database March 2006 Vineet Bafna Comparing CF and HMM filters CF filter dominates the HMM filters The composite filter achieves the best efficiency March 2006 Vineet Bafna These sequence based filters are not perfect ncRNA evolution is constrained by it secondary structure Drastic sequence changes by compensatory mutations can be tolerated Can we filter based on structure Sequence identity 0 24 Our Solution Use secondary structure based filters March 2006 Vineet Bafna What is the probability that you will see a stack of size 5 in a random sequence p 1 4 5 0 001 Conclusion Stacks of size 5 with fixed loop size are rare in a random …
View Full Document