DOC PREVIEW
UCSD CSE 182 - Lecture

This preview shows page 1-2-3-4-29-30-31-32-33-60-61-62-63 out of 63 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CSE182-L5: Scoring matrices Dictionary MatchingScoring DNADNA scoring matricesScoring proteinsPAM 1 distancePAM1 matrixPAM 1PAM distanceHigher PAMsPAM250 based scoring matrixScoring using PAM matricesBLOSUM series of MatricesPAM vs. BLOSUMThe last step in BlastDictionary Matching, R.E. matching, and position specific scoringKeyword searchDictionary MatchingDict. Matching & string matchingDirect AlgorithmThe Trie AutomatonAn O(lpn) algorithm for keyword matchingIllustration:Idea for improving the timeImproving speed of dictionary matchingAn O(n) alg. For keyword matchingIllustrationTime analysisBlast: Putting it all togetherBlast StepsProtein Sequence AnalysisSilly QuizSlide 32Protein sequence motifsPrositeBasic ideaEX: Zinc Finger domainProteins containing zf domainsFrom alignment to regular expressionsThe sequence analysis perspectiveRegular ExpressionsRegular ExpressionRegular Expression & AutomataExamples: Regular Expression & AutomataConstructing automata from R.ERegular Expression MatchingAlg. For matching R.E.Slide 47D.P. to match regular expressionSlide 49AlgorithmThe final stepProfiles versus regular expressionsProfilesScoring ProfilesPsi-BLAST ideaPsi-BLAST speedDatabases of MotifsDatabases of protein domainsPfamPROSITEPowerPoint PresentationBLOCKSSlide 63Fa05 CSE 182CSE182-L5: Scoring matrices Dictionary MatchingFa05 CSE 182Scoring DNA•DNA has structure.QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Fa05 CSE 182DNA scoring matrices•So far, we considered a simple match/mismatch criterion.•The nucleotides can be grouped into Purines (A,G) and Pyrimidines.•Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Fa05 CSE 182Scoring proteins•Scoring protein sequence alignments is a much more complex task than scoring DNA–Not all substitutions are equal•Problem was first worked on by Pauling and collaborators•In the 1970s, Margaret Dayhoff created the first similarity matrices.–“One size does not fit all”–Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant –Different proteins might evolve at different rates and we need to normalize for thatFa05 CSE 182PAM 1 distance•Two sequences are 1 PAM apart if they differ in 1 % of the residues.•PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart]1% mismatchFa05 CSE 182PAM1 matrix•Align many proteins that are very similar–Is this a problem?•PAM1 distance is the probability of a substitution when 1% of the residues have changed•Estimate the frequency Pb|a of residue a being substituted by residue b.•S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb)Fa05 CSE 182PAM 1Fa05 CSE 182PAM distance•Two sequences are 1 PAM apart when they differ in 1% of the residues.•When are 2 sequences 2 PAMs apart?1 PAM1 PAM2 PAMFa05 CSE 182Higher PAMs•PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b)•PAM2 = PAM1 * PAM1 (Matrix multiplication)•PAM250–= PAM1*PAM249 –= PAM1250Fa05 CSE 182•S250(a,b) = log10(Pab/PaPb) = log10(PAM250(b|a)/Pb)PAM250 based scoring matrixFa05 CSE 182Scoring using PAM matrices•Suppose we know that two sequences are 250 PAMs apart. •S(a,b) = log10(Pab/PaPb)= log10(Pb|a/Pb) = log10(PAM250(a,b)/Pb)•How does it help?–S250(A,V) >> S1(A,V)–Scoring of hum vs. Dros should be using a higher PAM matrix than scoring hum vs. mus. –An alignment with a smaller % identity could still have a higher score and be more significant hummusdrosFa05 CSE 182BLOSUM series of Matrices•Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions•A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.•BLOSUM60 Merge all proteins that have greater than 60%. Then, compute the substitution probability.–In practice BLOSUM62 seems to work very well.Fa05 CSE 182PAM vs. BLOSUM•What is the correspondence?•PAM1 Blosum1•PAM2 Blosum2• Blosum62•PAM250 Blosum100Fa05 CSE 182The last step in Blast•We have discussed–Alignments–Db filtering using keywords–E-values and P-values–Scoring matrices•The last step: Database filtering requires us to scan a large sequence fast for matching keywordsFa05 CSE 182Dictionary Matching, R.E. matching, and position specific scoringFa05 CSE 182Keyword search•Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword.•Question: Given a collection of strings (keywords), find all occrrences in a database string where they keyword might match.Fa05 CSE 182Dictionary Matching•Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string.•How fast can this be done?1:POTATO2:POTASSIUM3:TASTEP O T A S T P O T A T OdictionarydatabaseFa05 CSE 182Dict. Matching & string matching•How fast can you do it, if you only had one word of length m?–Trivial algorithm O(nm) time–Pre-processing O(m), Search O(n) time.•Dictionary matching–Trivial algorithm (l1+l2+l3…)n–Using a keyword tree, lpn (lp is the length of the longest pattern)–Aho-Corasick: O(n) after preprocessing O(l1+l2..)•We will consider the most general caseFa05 CSE 182Direct AlgorithmP O P O P O T A S T P O T A T OP O T A T OP O T A T OP O T A T OP O T A T O P O T A T OObservations:•When we mismatch, we (should) know something about where the next match will be.•When there is a mismatch, we (should) know something about other patterns in the dictionary as well.Fa05 CSE 182P OTAT OTUIS MS ETAThe Trie Automaton•Construct an automaton A from the dictionary–A[v,x] describes the transition from node v to a node w upon reading x.–A[u,’T’] = v, and A[u,’S’] = w–Special root node r–Some nodes are terminal, and labeled with the index of the dictionary word.1:POTATO2:POTASSIUM3:TASTE123wvuSrFa05 CSE 182An O(lpn) algorithm for keyword matching•Start with the first position in the db, and the root node.•If successful transition–Increment current pointer–Move to a new node–If terminal node “success”•Else–Retract ‘current’ pointer–Increment ‘start’ pointer–Move to root & repeatFa05 CSE 182Illustration:P OTAT


View Full Document

UCSD CSE 182 - Lecture

Download Lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?