DOC PREVIEW
UCSD CSE 182 - Dicitionary Matching

This preview shows page 1-2-3-4-27-28-29-30-55-56-57-58 out of 58 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 58 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Slide 1Dictionary MatchingDict. Matching & string matchingDirect AlgorithmThe Trie AutomatonAn O(lpn) algorithm for keyword matchingIllustration:Idea for improving the timeFailure functionIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationIllustrationTime analysisBlast: Putting it all togetherBlast StepsBLAST outputDistant hitsProtein Sequence AnalysisSilly QuizNot all features(residues) are importantDiverged family members provide key featuresProtein sequence motifsPrositeBasic ideaEX: Zinc Finger domainProteins containing zf domainsFrom alignment to regular expressionsThe sequence analysis perspectiveRegular ExpressionsSlide 36Regular ExpressionRegular Expression & AutomataExamples: Regular Expression & AutomataConstructing automata from R.EEnd of L6Protein structure basicsSide chains determine amino-acid typeBond angles form structural constraintsVarious constraints determine 3d structureAlpha-helixBeta-sheetDomains3D structureSearching structure databasesTrivia QuizHow are Proteins Sequenced? Mass Spec 101:Nobel Citation 2002Nobel Citation, 2002Mass SpectrometrySample PreparationSingle Stage MSTandem MS1/14/19 CSE182CSE182-L7Dicitionary matchingPattern matchingFa05 CSE 182Dictionary Matching•Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string.•How fast can this be done?1:POTATO2:POTASSIUM3:TASTEP O T A S T P O T A T OdictionarydatabaseFa05 CSE 182Dict. Matching & string matching•How fast can you do it, if you only had one word of length m?–Trivial algorithm O(nm) time–Pre-processing O(m), Search O(n) time.•Dictionary matching–Trivial algorithm (l1+l2+l3…)n–Using a keyword tree, lpn (lp is the length of the longest pattern)–Aho-Corasick: O(n) after preprocessing O(l1+l2..)•We will consider the most general caseFa05 CSE 182Direct AlgorithmP O P O P O T A S T P O T A T OP O T A T OP O T A T OP O T A T OP O T A T O P O T A T OObservations:•When we mismatch, we (should) know something about where the next match will be.•When there is a mismatch, we (should) know something about other patterns in the dictionary as well.Fa05 CSE 182P OTAT OTUIS MS ETAThe Trie Automaton•Construct an automaton A from the dictionary–A[v,x] describes the transition from node v to a node w upon reading x.–A[u,’T’] = v, and A[u,’S’] = w–Special root node r–Some nodes are terminal, and labeled with the index of the dictionary word.1:POTATO2:POTASSIUM3:TASTE123wvuSrFa05 CSE 182An O(lpn) algorithm for keyword matching•Start with the first position in the db, and the root node.•If successful transition–Increment current pointer–Move to a new node–If terminal node “success”•Else–Retract ‘current’ pointer–Increment ‘start’ pointer–Move to root & repeatFa05 CSE 182Illustration:P OTAT OTUIS MS ETAP O T A S T P O T A T OlcvS123Fa05 CSE 182Idea for improving the timeP O T A S T P O T A T O•Suppose we have partially matched pattern i (indicated by l, and c), but fail subsequently. If some other pattern j is to match–Then prefix(pattern j) = suffix [ first c-l characters of pattern(i))lc1:POTATO2:POTASSIUM3:TASTEP O T A S S I U MT A S T EPattern iPattern j1/14/19 CSE182P O T A T OTUIS MS ETAvS1n1n7n6n5n4n3n2n9n8n10•Every node v corresponds to a string sv that is a prefix of some pattern.•Define F[v] to be the node u such that su is the longest suffix of sv•If we fail to match at v, we should jump to F[v], and commence matching from there •Let lp[v] = |su|Failure function1/14/19 CSE182IllustrationP O T A T OTUIS MS ETAvS1n1n7n6n5n4n3n2n9n8n10•What is F(n10)?•What is F(n5)?•F(n3)?•Lp(n10)?1/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 1n1n7n6n5n4n3n2n9n8vc = 1n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 1n1n7n6n5n4n3n2n9n8vc = 2n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 1n1n7n6n5n4n3n2n9n8vc = 6n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 3n1n7n6n5n4n3n2n9n8vc = 6n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 3n1n7n6n5n4n3n2n9n8vc = 7n10n111/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 7n1n7n6n5n4n3n2n9n8vc = 7n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 7n1n7n6n5n4n3n2n9n8vc = 8n101/14/19 CSE182IllustrationP O T A S T P O T A T OP O T A T OTUIS MS ETAS1l = 7n1n7n6n5n4n3n2n9n8vc = 7n101/14/19 CSE182Time analysis•In each step, either c is incremented, or l is incremented•Neither pointer is ever decremented (lp[v] < c-l).•l and c do not exceed n•Total time <= 2nP O T A S T P O T A T Ol c1/14/19 CSE182Blast: Putting it all together•Input: Query of length m, database of size n•Select word-size, scoring matrix, gap penalties, E-value cutof•Blast1/14/19 CSE182Blast Steps1. Generate an automaton of all query keywords.2. Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.3. Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties.4. For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-of is reached.5. Output results.1/14/19 CSE182BLAST output•Look up Blast Results with RID–HA5YXH5C0121/14/19 CSE182Distant hits1/14/19 CSE182Protein Sequence Analysis•What can you do if BLAST does not return a hit?–Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.•A: Accept hits at higher E-value. –This increases the probability that the sequence similarity is a chance event.–How can we get around this paradox?–Reformulated Q: suppose two sequences B,C have the same level of sequence similarity to sequence A. If A& B are related in function, can we assume that A& C are? If not, how can we distinguish?ABC1/14/19 CSE182Silly QuizSkin patternsFacial Features1/14/19 CSE182Not all features(residues) are importantSkin patternsFacial Features1/14/19 CSE182Diverged family members provide key features1/14/19 CSE182Protein sequence motifs•Premise: •The sequence of a protein sequence gives clues about its structure and function.• Not all residues are equally important in determining function.•Suppose we knew the key residues of a family. If our query matches in those residues, it is a


View Full Document

UCSD CSE 182 - Dicitionary Matching

Download Dicitionary Matching
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Dicitionary Matching and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Dicitionary Matching 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?