UCSD CSE 182 - Lecture


CSE182-L5: Scoring matrices
Dictionary Matching
1/14/19 CSE 182

Expectation?
• Some quantities can be reasonably guessed by taking a statistical sample, others cannot.
  – Average weight of a group of 100 people
  – Average height of a group of 100 people
  – Average grade on a test
• Give an example of a quantity that cannot.
• When the distribution and the expectation are known, it is easy to tell when an observation is significant.
• If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn.

Scoring proteins
• Scoring protein sequence alignments is a much more complex task than scoring DNA.
  – Not all substitutions are equal.
• The problem was first worked on by Pauling and collaborators.
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
  – "One size does not fit all"
  – Homologous proteins that are evolutionarily close should be scored differently than proteins that are evolutionarily distant.
  – Different proteins might evolve at different rates, and we need to normalize for that.

Frequency based scoring
• Our goal is to score each column in the alignment.
• Comparing against expectation:
  – Think about alignments of pairs of random sequences, and compute the probability that A and B appear together just by chance: $P_R(A,B)$
  – Compute the probability of A and B appearing together in the alignment of related sequences (orthologs): $P_O(A,B)$
• A good score function? For an alignment column pairing A with B:
  $\log\left(\frac{P_O(A,B)}{P_R(A,B)}\right)$

Log-odds scoring
• Log-odds score makes sense.
• It is also sensitive to evolution.
• However, to compute a log-odds score function you need good alignments.
• To get good alignments of sequences, you need a (log-odds) score function.

PAM 1 distance
• Define: Two sequences are 1 PAM apart if they differ in 1% of the residues.
• PAM1(a,b) = Pr[residue a substitutes residue b, when the sequences are 1 PAM apart]
[Figure: 1% mismatch]

PAM1 matrix
• Align many proteins that are very similar.
  – Is this a problem?
• 1 PAM evolutionary distance represents the time in which 1% of the residues have changed.
• Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b.
• PAM1(a,b) = $P_{a|b}$ = Pr(b will mutate to an a after 1 PAM evolutionary distance)
• Scoring matrix
  – $S(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right) = \log_{10}\left(\frac{P_{b|a}}{P_b}\right)$

PAM 1
• The top row shows the original residue, and the left column shows the replacement residue: PAM1(a,b) = Pr(a|b)

• For closely related sequences (1 PAM apart), we can make a set of alignments, and use that to compute an appropriate evolutionary distance.
• What do we do for higher PAM sequences?

PAM distance
• Two sequences are 1 PAM apart when they differ in 1% of the residues.
• When are 2 sequences 2 PAMs apart?
[Figure labels: 1 PAM, 1 PAM, 2 PAM]

Generating Higher PAMs
• $\mathrm{PAM2}(a,b) = \sum_c \mathrm{PAM1}(a,c) \cdot \mathrm{PAM1}(c,b)$
• PAM2 = PAM1 * PAM1 (matrix multiplication)
• PAM250
  – = PAM1 * PAM249
  – = $\mathrm{PAM1}^{250}$
[Figure: PAM2 between a and b as two PAM1 steps through c]

• Note: This is not the score matrix. What happens as you keep increasing the power?

Scoring using PAM matrices
• Suppose we know that two sequences are 250 PAMs apart.
• $S(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right) = \log_{10}\left(\frac{P_{a|b}}{P_a}\right) = \log_{10}\left(\frac{\mathrm{PAM250}(a,b)}{P_a}\right)$
• How does it help?
  – $S_{250}(A,V) \gg S_1(A,V)$
  – Scoring of hum vs. Dros should use a higher PAM matrix than scoring hum vs. mus.
  – An alignment with a smaller % identity could still have a higher score and be more significant.
[Figure labels: hum, mus, dros]

PAM250 based scoring matrix
• $S_{250}(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right) = \log_{10}\left(\frac{\mathrm{PAM250}(a,b)}{P_a}\right)$

BLOSUM series of Matrices
• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions.
• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
• BLOSUM60: Merge all proteins that have greater than 60% identity. Then, compute the substitution probability.
  – In practice BLOSUM62 seems to work very well.

PAM vs. BLOSUM
• What is the correspondence?
• PAM1      Blosum1
• PAM2      Blosum2
•           Blosum62
• PAM250    Blosum100

P-value computation
• BLAST: The matching regions are expanded into alignments, which are scored using SW and an appropriate scoring matrix.
• The results are presented in order of decreasing scores.
• The score is just a number.
• How significant is the top scoring hit if it has a score S?
• Expect/E-value (score S) = Number of times we would expect to see a random query generate a score S, or better
• How can we compute E-value?

What is a distribution function
• Given a collection of numbers (scores)
  – 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, …
• Plot its distribution as follows:
  – X-axis = each number
  – Y-axis = count/frequency/probability of seeing that number
  – More generally, the x-axis can be a range to accommodate real numbers
• End of L5
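The rules PAM2 = PAM1 * PAM1 and S(a,b) = log10(PAMn(a,b)/Pa) from the slides can be tried out on a toy example. The sketch below is only illustrative: the 3-letter alphabet, the entries of `PAM1`, the uniform `BACKGROUND` frequencies, and the helper names `mat_mul`, `pam`, and `score_matrix` are all assumptions made here, not values or code from the lecture. It follows the convention PAM1(a,b) = Pr(a|b), with the original residue indexing columns.

```python
import math

# Toy 3-letter alphabet (assumption); real PAM matrices are 20x20.
ALPHABET = ["A", "B", "C"]

# Hypothetical PAM1-style matrix with PAM1[a][b] = Pr(a | b):
# column b is the original residue, row a is the replacement.
# Every column sums to 1 and 99% of residues stay unchanged.
PAM1 = [
    [0.990, 0.004, 0.006],
    [0.006, 0.990, 0.004],
    [0.004, 0.006, 0.990],
]

# Assumed background frequencies P_a (uniform for simplicity).
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]


def mat_mul(x, y):
    """Plain matrix product: (x*y)[i][j] = sum_k x[i][k] * y[k][j]."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam(n):
    """Return PAM1 raised to the n-th power (n >= 1)."""
    result = PAM1
    for _ in range(n - 1):
        result = mat_mul(result, PAM1)
    return result


def score_matrix(n):
    """Log-odds scores S_n(a, b) = log10(PAMn(a, b) / P_a)."""
    p = pam(n)
    size = len(p)
    return [[math.log10(p[a][b] / BACKGROUND[a]) for b in range(size)]
            for a in range(size)]


s1 = score_matrix(1)
s250 = score_matrix(250)
print("S1(A,B)   =", round(s1[0][1], 3))
print("S250(A,B) =", round(s250[0][1], 3))
```

In this toy setup the mismatch score S250(A,B) is much higher than S1(A,B), mirroring the slide's S250(A,V) >> S1(A,V). As the power keeps growing, the entries of this particular matrix drift toward the background frequencies, so all scores shrink toward 0.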

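The distribution described on the last slide can be tabulated directly from the listed scores. This is a minimal sketch: the helper names `distribution` and `tail_fraction` are made up here, and the tail fraction is only an empirical stand-in for "a score S, or better", not the way BLAST computes E-values.

```python
from collections import Counter

# Scores listed on the slide (the slide's list continues past these values).
scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]


def distribution(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {v: counts[v] / total for v in sorted(counts)}


def tail_fraction(values, s):
    """Fraction of values that are >= s (empirical 'score S or better')."""
    return sum(1 for v in values if v >= s) / len(values)


dist = distribution(scores)
# Text histogram: x-axis = each number, bar length = its count.
for value, freq in dist.items():
    print(value, "#" * round(freq * len(scores)))

print("fraction of scores >= 7:", tail_fraction(scores, 7))
```

If the scores came from random queries, multiplying such a tail fraction by the number of random trials would give an expected count of hits scoring S or better, which is the quantity the E-value definition on the slide describes.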
