RNA Search and !Motif Discovery"Lecture 9!CSEP 590A!Autumn 2008"Outline"Whirlwind tour of ncRNA search & discovery"RNA motif description (Covariance Model Review)"Algorithms for searching"Rigorous & heuristic filtering"Motif discovery"Applications"2 Motif Description"7 RNA Motif Models"“Covariance Models” (Eddy & Durbin 1994)"aka profile stochastic context-free grammars"aka hidden Markov models on steroids"Model position-specific nucleotide preferences and base-pair preferences"Pro: accurate"Con: model building hard, search sloooow"8Example: searching for tRNAs 13 Mj: "Match states (20 emission probabilities)"Ij: "Insert states (Background emission probabilities)"Dj: "Delete states (silent - no emission)"Profile Hmm Structure"17 18 CM Structure"A: Sequence + structure"B: the CM “guide tree”"C: probabilities of letters/ pairs & of indels"Think of each branch being an HMM emitting both sides of a helix (but 3’ side emitted in reverse order)"CM Viterbi Alignment"! ! xi= ithletter of inputxij= substring i,..., j of inputTyz= P(transition y " z)Exi, xjy= P(emission of xi, xjfrom state y)Sijy= max#log P(xijgen'd starting in state y via path#)20! Sijy= max"log P(xijgenerated starting in state y via path")Sijy=maxz[Si+1, j#1z+ logTyz+ log Exi, xjy] match pairmaxz[Si+1, jz+ logTyz+ log Exiy] match/insert leftmaxz[Si, j#1z+ logTyz+ log Exjy] match/insert rightmaxz[Si, jz+ logTyz] deletemaxi<k$ j[Si,kyleft+ Sk +1, jyright] bifurcation% & ' ' ' ( ' ' ' Time O(qn3), q states, seq len n 21 23 mRNA leader mRNA leader switch? 24 Mutual Information"Max when no seq conservation but perfect pairing"MI = expected score gain from using a pair state"Finding optimal MI, (i.e. opt pairing of cols) is hard(?)"Finding optimal MI without pseudoknots can be done by dynamic programming"! Mij= fxi,xjxi,xj"log2fxi,xjfxifxj; 0 # Mij# 22629 Pseudoknots disallowed allowed ! maxjMi, ji=1n"# $ % & ' ( /231 Rfam – an RNA family DB!Griffiths-Jones, et al., NAR ‘03,’05"Biggest scientific computing user in Europe - 1000 cpu cluster for a month per release"Rapidly growing:"Rel 1.0, 1/03: 25 families, 55k instances"Rel 7.0, 3/05: 503 families, >300k instances"33 IRE (partial seed alignment):!Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC SS_cons <<<<<...<<<<<......>>>>>.>>>>> Rfam"Input (hand-curated):"MSA “seed alignment”"SS_cons"Score Thresh T"Window Len W"Output:"CM"scan results & “full alignment”"34Faster Search"37 Faster Genome Annotation !of Non-coding RNAs !Without Loss of Accuracy"Zasha Weinberg "& W.L. Ruzzo"Recomb ‘04, ISMB ‘04, Bioinfo ‘06"38 RaveNnA: Genome Scale !RNA Search"Typically 100x speedup over raw CM, w/ no loss in accuracy: "" "drop structure from CM to create a (faster) HMM""use that to pre-filter sequence; ""discard parts where, provably, CM score < threshold;""actually run CM on the rest (the promising parts)""assignment of HMM transition/emission scores is key """(large convex optimization problem)"Weinberg & Ruzzo, Bioinformatics, 2004, 2006 39 CM’s are good, but slow!EMBL CM hits junk Rfam Goal 10 years, 1000 computers 1 month, 1000 computers Our Work ~2 months, 1000 computers EMBL CM hits Ravenna Rfam Reality EMBL hits junk BLAST CM 41CM to HMM"25 emisions per state 5 emissions per state, 2x states A C G U – A C G U – A C G U – A C G U – A C G U – A C G U – A C G U – A C G U – CM HMM 43 Need: log Viterbi scores CM ! HMM"Key Issue: 25 scores " 10"P A C G U – A C G U – L A C G U – A C G U – R CM HMM 44 Viterbi/Forward Scoring"Path ! defines transitions/emissions"Score(!) = product of “probabilities” on !"NB: ok if “probs” aren’t, e.g. !"1!(e.g. in CM, emissions are odds ratios vs !0th-order background)"For any nucleotide sequence x:"Viterbi-score(x) = max{ score(!) | ! emits x}"Forward-score(x) = !{ score(!) | ! emits x}"45 Key Issue: 25 scores " 10"Need: log Viterbi scores CM ! HMM"PCA ! LC + RA PCC ! LC + RC PCG ! LC + RG PCU ! LC + RU PC– ! LC + R– … … … … … PAA ! LA + RA PAC ! LA + RC PAG ! LA + RG PAU ! LA + RU PA– ! LA + R– NB: HMM not a prob. model P A C G U – A C G U – L A C G U – A C G U – R CM HMM 46Rigorous Filtering"Any scores satisfying the linear inequalities give rigorous filtering!Proof: ! CM Viterbi path score ! ! “corresponding” HMM path score! ! Viterbi HMM path score ! (even if it does not correspond to any CM path)"PAA ! LA + RA PAC ! LA + RC PAG ! LA + RG PAU ! LA + RU PA– ! LA + R– … 47 Some scores filter better"PUA = 1 ! LU + RA"PUG = 4 ! LU + RG"" " "" " " Assuming ACGU # 25%"Option 1: " """Opt 1:" LU = RA = RG = 2 " " LU + (RA + RG)/2 = 4 "Option 2: " """Opt 2:" LU = 0, RA = 1, RG = 4 " LU + (RA + RG)/2 = 2.5"48 Optimizing filtering"For any nucleotide sequence x:"Viterbi-score(x) = max{ score(!) | ! emits x }"Forward-score(x) = !{ score(!) | ! emits x }"Expected Forward Score"E(Li, Ri) = !all sequences x Forward-score(x)*Pr(x)"NB: E is a function of Li, Ri only"Optimization: !Minimize E(Li, Ri) subject to score Lin.Ineq.s"This is heuristic (“forward$ % Viterbi$ % filter$”)"But still rigorous because “subject to score Lin.Ineq.s”"Under 0th-order background model 49 Calculating E(Li, Ri)"E(Li, Ri) = !x Forward-score(x)*Pr(x)"Forward-like: for every state, calculate expected score for all paths ending there; easily calculated from expected scores of predecessors & transition/emission probabilities/scores"50Minimizing E(Li, Ri)"Calculate E(Li, Ri) symbolically, in terms of emission scores, so we can do partial derivatives for numerical convex optimization algorithm"! "E (L1, L2, ...)"LiForward: Viterbi: 51 “Convex” Optimization"Convex: !local max = global max;"simple “hill climbing” works"Nonconvex:!can be many local maxima, << global max;!“hill-climbing” fails"52 Estimated Filtering Efficiency!(139 Rfam 4.0
View Full Document