UCSD CSE 182 - Lecture


October 09, CSE 182

CSE182-L5: Scoring matrices, Dictionary Matching

Expectation?
• Some quantities can be reasonably guessed by taking a statistical sample, others not
  – Average weight of a group of 100 people
  – Average height of a group of 100 people
  – Average grade on a test
• Give an example of a quantity that cannot.
• When the distribution and the expectation are known, it is easy to tell when you see something significant.
• If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn.

Scoring proteins
• Scoring protein sequence alignments is a much more complex task than scoring DNA
  – Not all substitutions are equal
• Problem was first worked on by Pauling and collaborators
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
  – “One size does not fit all”
  – Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant
  – Different proteins might evolve at different rates and we need to normalize for that

Frequency based scoring
• Our goal is to score each column in the alignment
• Comparing against expectation:
  – Think about alignments of pairs of random sequences, and compute the probability PR(A,B) that A and B appear together just by chance
  – Compute the probability PO(A,B) of A and B appearing together in the alignment of related sequences (orthologs)
• A good score function for a column that aligns A with B?
  – log( PO(A,B) / PR(A,B) )

Log-odds scoring
• Log-odds score makes sense.
• It is also sensitive to evolution
• However, to compute a log-odds score function you need good alignments
• To get good alignments of sequences, you need a (log-odds) score function.

PAM 1 distance
• Define: Two sequences are 1 PAM apart if they differ in 1% of the residues.
• PAM1(a,b) = Pr[residue a substitutes residue b, when the sequences are 1 PAM apart]
[Figure: 1% mismatch]

PAM1 matrix
• Align many proteins that are very similar
  – Is this a problem?
• 1 PAM evolutionary distance represents the time in which 1% of the residues have changed
• Estimate the frequency Pb|a of residue a being substituted by residue b.
• PAM1(a,b) = Pa|b = Pr(b will mutate to an a after 1 PAM evolutionary distance)
• Scoring matrix
  – S(a,b) = log10(Pab/(Pa Pb)) = log10(Pb|a/Pb)

PAM 1
• The top row shows the original residue and the left column shows the replacement residue: PAM1(a,b) = Pr(a|b)
• For closely related sequences (1 PAM apart), we can make a set of alignments, and use that to compute an appropriate evolutionary distance.
• What do we do for higher PAM sequences?

PAM distance
• Two sequences are 1 PAM apart when they differ in 1% of the residues.
• When are 2 sequences 2 PAMs apart?
[Figure labels: 1 PAM, 1 PAM, 2 PAM]

Generating Higher PAMs
• PAM2(a,b) = ∑c PAM1(a,c) · PAM1(c,b)
• PAM2 = PAM1 * PAM1 (Matrix multiplication)
• PAM250
  – = PAM1 * PAM249
  – = PAM1^250
[Figure labels: a, b, c; PAM2, PAM1, PAM1]

Note: This is not the score matrix.
What happens as you keep increasing the power?

Scoring using PAM matrices
• Suppose we know that two sequences are 250 PAMs apart.
• S(a,b) = log10(Pab/(Pa Pb)) = log10(Pa|b/Pa) = log10(PAM250(a,b)/Pa)
• How does it help?
  – S250(A,V) >> S1(A,V)
  – Scoring of hum vs. dros should be using a higher PAM matrix than scoring hum vs. mus.
  – An alignment with a smaller % identity could still have a higher score and be more significant
[Figure labels: hum, mus, dros]

PAM250 based scoring matrix
• S250(a,b) = log10(Pab/(Pa Pb)) = log10(PAM250(a,b)/Pa)
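The matrix-power and log-odds steps above can be tried on a toy example. The sketch below is only an illustration under stated assumptions: the 3-letter alphabet, the background frequencies FREQ, and the exchange weights EXCH are invented values, not Dayhoff's data, and the helper names (build_pam1, matpow, score_matrix) are hypothetical. It builds a column-stochastic PAM1 with PAM1(a,b) = Pr(a|b) and a 1% expected change, raises it to the 250th power, and converts it to scores with S(a,b) = log10(PAMn(a,b)/Pa).

```python
import math

# Hypothetical toy model: 3 residues, made-up background frequencies Pa
# and made-up symmetric exchange weights. NOT real PAM data.
LETTERS = ["A", "V", "W"]
FREQ = [0.5, 0.3, 0.2]
EXCH = [[0.0, 3.0, 0.5],
        [3.0, 0.0, 1.0],
        [0.5, 1.0, 0.0]]


def build_pam1(freq, exch, change=0.01):
    """Return m with m[a][b] = Pr(a | b): column b is the original residue."""
    n = len(freq)
    # Expected fraction of changed residues before scaling.
    raw = sum(freq[b] * exch[a][b] * freq[a]
              for a in range(n) for b in range(n) if a != b)
    scale = change / raw
    m = [[scale * exch[a][b] * freq[a] if a != b else 0.0 for b in range(n)]
         for a in range(n)]
    for b in range(n):
        # The diagonal absorbs the remaining probability so each column sums to 1.
        m[b][b] = 1.0 - sum(m[a][b] for a in range(n))
    return m


def matmul(x, y):
    """Plain matrix product: (x*y)(a,b) = sum over c of x(a,c) * y(c,b)."""
    n = len(x)
    return [[sum(x[a][c] * y[c][b] for c in range(n)) for b in range(n)]
            for a in range(n)]


def matpow(m, k):
    """k-th matrix power by repeated multiplication (fine for tiny matrices)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, m)
    return result


def score_matrix(pam_n, freq):
    """Log-odds scores S(a,b) = log10(PAMn(a,b) / Pa)."""
    n = len(freq)
    return [[math.log10(pam_n[a][b] / freq[a]) for b in range(n)]
            for a in range(n)]


pam1 = build_pam1(FREQ, EXCH)
pam250 = matpow(pam1, 250)
s1 = score_matrix(pam1, FREQ)
s250 = score_matrix(pam250, FREQ)

print("S1(A,V)   =", round(s1[0][1], 3))
print("S250(A,V) =", round(s250[0][1], 3))
```

In this toy model the A-to-V score rises sharply from PAM1 to PAM250, matching the S250(A,V) >> S1(A,V) remark. Raising the matrix to a very large power (for example matpow(pam1, 5000)) drives every column toward FREQ, so all scores drift toward 0; that behaviour depends on the toy matrix being built to keep FREQ stationary.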
BLOSUM series of Matrices
• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions
• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
• BLOSUM60: Merge all proteins that have greater than 60% identity. Then compute the substitution probability.
  – In practice BLOSUM62 seems to work very well.

PAM vs. BLOSUM
• What is the correspondence?
• PAM1 Blosum1
• PAM2 Blosum2
• Blosum62
• PAM250 Blosum100

P-value computation
• BLAST: The matching regions are expanded into alignments, which are scored using SW (Smith-Waterman) and an appropriate scoring matrix.
• The results are presented in order of decreasing scores
• The score is just a number.
• How significant is the top-scoring hit if it has a score S?
• Expect/E-value (score S) = Number of times we would expect to see a random query generate a score S, or better
• How can we compute E-value?

What is a distribution function
• Given a collection of numbers (scores)
  – 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, …
• Plot its distribution as follows:
  – X-axis = each number
  – Y-axis = count/frequency/probability of seeing that number
  – More generally, the x-axis can be a range to accommodate real numbers

End of L5
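The distribution plot described on the last slide can be sketched in a few lines. This is a minimal illustration, assuming only the 14 scores that are actually listed (the slide's list continues past them); the function names distribution, tail_fraction, and binned_counts are hypothetical. The tail fraction (share of sampled scores at or above S) is shown only as an empirical quantity one can read off such a distribution; it is not presented as the E-value formula BLAST uses.

```python
from collections import Counter

# Only the scores that are written out on the slide; the original list is longer.
SCORES = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]


def distribution(scores):
    """Map each distinct score to (count, relative frequency)."""
    counts = Counter(scores)
    total = len(scores)
    return {x: (counts[x], counts[x] / total) for x in sorted(counts)}


def tail_fraction(scores, s):
    """Fraction of the sample with score >= s."""
    return sum(1 for x in scores if x >= s) / len(scores)


def binned_counts(values, width):
    """Counts per range [k*width, (k+1)*width), for real-valued data."""
    bins = Counter(int(v // width) for v in values)
    return {(k * width, (k + 1) * width): bins[k] for k in sorted(bins)}


# Text histogram: x-axis is the score, bar length is the count.
for x, (count, freq) in distribution(SCORES).items():
    print(f"{x:>2} {'#' * count} ({freq:.2f})")

print("fraction of scores >= 6:", tail_fraction(SCORES, 6))
```

For real-valued scores, binned_counts(values, width) plays the role of the "x-axis can be a range" remark: each key is a half-open interval and each value is the number of samples falling in it.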

