Slide 1Dictionary MatchingDict. Matching & string matchingDirect AlgorithmThe Trie AutomatonAn O(lpn) algorithm for keyword matchingIllustration:Idea for improving the timeFailure functionIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationTime analysisBlast: Putting it all togetherBlast StepsBLAST outputDistant hitsProtein Sequence AnalysisSilly QuizNot all features(residues) are importantDiverged family members provide key featuresProtein sequence motifsPrositeBasic ideaEX: Zinc Finger domainProteins containing zf domainsFrom alignment to regular expressionsThe sequence analysis perspectiveRegular ExpressionsSlide 36Regular ExpressionRegular Expression & AutomataExamples: Regular Expression & AutomataConstructing automata from R.EEnd of L6Protein structure basicsSide chains determine amino-acid typeBond angles form structural constraintsVarious constraints determine 3d structureAlpha-helixBeta-sheetDomains3D structureSearching structure databasesTrivia QuizHow are Proteins Sequenced? Mass Spec 101:Nobel Citation 2002Nobel Citation, 2002Mass SpectrometrySample PreparationSingle Stage MSTandem MS1/14/19 CSE182CSE182-L7Dicitionary matchingPattern matchingFa05 CSE 182Dictionary Matching•Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string.•How fast can this be done?1:POTATO2:POTASSIUM3:TASTEP O T A S T P O T A T OdictionarydatabaseFa05 CSE 182Dict. Matching & string matching•How fast can you do it, if you only had one word of length m?–Trivial algorithm O(nm) time–Pre-processing O(m), Search O(n) time.•Dictionary matching–Trivial algorithm (l1+l2+l3…)n–Using a keyword tree, lpn (lp is the length of the longest pattern)–Aho-Corasick: O(n) after preprocessing O(l1+l2..)•We will consider the most general caseFa05 CSE 182Direct AlgorithmP O P O P O T A S T P O T A T OP O T A T OP O T A T OP O T A T OP O T A T O P O T A T OObservations:•When we mismatch, we (should) know something about where the next match will be.•When there is a mismatch, we (should) know something about other patterns in the dictionary as well.Fa05 CSE 182P OTAT OTUIS MS ETAThe Trie Automaton•Construct an automaton A from the dictionary–A[v,x] describes the transition from node v to a node w upon reading x.–A[u,’T’] = v, and A[u,’S’] = w–Special root node r–Some nodes are terminal, and labeled with the index of the dictionary word.1:POTATO2:POTASSIUM3:TASTE123wvuSrFa05 CSE 182An O(lpn) algorithm for keyword matching•Start with the first position in the db, and the root node.•If successful transition–Increment current pointer–Move to a new node–If terminal node “success”•Else–Retract ‘current’ pointer–Increment ‘start’ pointer–Move to root & repeatFa05 CSE 182Illustration:P OTAT OTUIS MS ETAP O T A S T P O T A T OlcvS123Fa05 CSE 182Idea for improving the timeP O T A S T P O T A T O•Suppose we have partially matched pattern i (indicated by l, and c), but fail subsequently. If some other pattern j is to match–Then prefix(pattern j) = suffix [ first c-l characters of pattern(i))lc1:POTATO2:POTASSIUM3:TASTEP O T A S S I U MT A S T EPattern iPattern j1/14/19 CSE182P O T A T OTUIS MS ETAvS1n1n7n6n5n4n3n2n9n8n10•Every node v corresponds to a string sv that is a prefix of some pattern.•Define F[v] to be the node u such that su is the longest suffix of sv•If we fail to match at v, we should jump to F[v], and commence matching from there •Let lp[v] = |su|Failure function1/14/19 CSE182IllustrationP O T A T OTUIS MS ETAvS1n1n7n6n5n4n3n2n9n8n10•What is F(n10)?•What is F(n5)?•F(n3)?•Lp(n10)?1/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 1n1n7n6n5n4n3n2n9n8vc = 1n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 1n1n7n6n5n4n3n2n9n8vc = 2n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 1n1n7n6n5n4n3n2n9n8vc = 6n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 3n1n7n6n5n4n3n2n9n8vc = 6n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 3n1n7n6n5n4n3n2n9n8vc = 7n10n111/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 7n1n7n6n5n4n3n2n9n8vc = 7n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 7n1n7n6n5n4n3n2n9n8vc = 8n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 7n1n7n6n5n4n3n2n9n8vc = 7n101/14/19 CSE182Time analysis•In each step, either c is incremented, or l is incremented•Neither pointer is ever decremented (lp[v] < c-l).•l and c do not exceed n•Total time <= 2nP O T A S T P O T A T Ol c1/14/19 CSE182Blast: Putting it all together•Input: Query of length m, database of size n•Select word-size, scoring matrix, gap penalties, E-value cutof•Blast1/14/19 CSE182Blast Steps1. Generate an automaton of all query keywords.2. Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.3. Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties.4. For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-of is reached.5. Output results.1/14/19 CSE182BLAST output•Look up Blast Results with RID–HA5YXH5C0121/14/19 CSE182Distant hits1/14/19 CSE182Protein Sequence Analysis•What can you do if BLAST does not return a hit?–Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.•A: Accept hits at higher E-value. –This increases the probability that the sequence similarity is a chance event.–How can we get around this paradox?–Reformulated Q: suppose two sequences B,C have the same level of sequence similarity to sequence A. If A& B are related in function, can we assume that A& C are? If not, how can we distinguish?ABC1/14/19 CSE182Silly QuizSkin patternsFacial Features1/14/19 CSE182Not all features(residues) are importantSkin patternsFacial Features1/14/19 CSE182Diverged family members provide key features1/14/19 CSE182Protein sequence motifs•Premise: •The sequence of a protein sequence gives clues about its structure and function.• Not all residues are equally important in determining function.•Suppose we knew the key residues of a family. If our query matches in those residues, it is a
View Full Document