UCSD CSE 182 - Lecture - D2665876

School: University of California, San Diego

Course: CSE 182 - Biological Databases

Pages 19

October 09, CSE 182

CSE182-L5: Scoring matrices, Dictionary Matching

Expectation?
• Some quantities can be reasonably guessed by taking a statistical sample, others cannot
  – Average weight of a group of 100 people
  – Average height of a group of 100 people
  – Average grade on a test
• Give an example of a quantity that cannot.
• When the distribution and the expectation are known, it is easy to tell when an observation is significant.
• If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn.

Scoring proteins
• Scoring protein sequence alignments is a much more complex task than scoring DNA
  – Not all substitutions are equal
• The problem was first worked on by Pauling and collaborators
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
  – "One size does not fit all"
  – Homologous proteins that are evolutionarily close should be scored differently from proteins that are evolutionarily distant
  – Different proteins might evolve at different rates, and we need to normalize for that

Frequency based scoring
• Our goal is to score each column in the alignment
• Comparing against expectation:
  – Think about alignments of pairs of random sequences, and compute the probability that A and B appear together just by chance: $P_R(A,B)$
  – Compute the probability of A and B appearing together in the alignment of related sequences (orthologs): $P_O(A,B)$
• A good score function?
  $\log \frac{P_O(A,B)}{P_R(A,B)}$
[Figure: residues A and B]

Log-odds scoring
• A log-odds score makes sense.
• It is also sensitive to evolution
• However, to compute a log-odds score function you need good alignments
• To get good alignments of sequences, you need a (log-odds) score function.

PAM 1 distance
• Define: Two sequences are 1 PAM apart if they differ in 1% of the residues.
• PAM1(a,b) = Pr[residue a substitutes residue b, when the sequences are 1 PAM apart]
[Figure: two aligned sequences, 1% mismatch]

PAM1 matrix
• Align many proteins that are very similar
  – Is this a problem?
• 1 PAM evolutionary distance represents the time in which 1% of the residues have changed
• Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b.
• PAM1(a,b) = $P_{a|b}$ = Pr(b will mutate to an a after 1 PAM evolutionary distance)
• Scoring matrix
  – $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$

PAM 1
• The top row shows the original residue, and the left column shows the replacement residue: PAM1(a,b) = Pr(a|b)

• For closely related sequences (1 PAM apart), we can make a set of alignments, and use that to compute an appropriate evolutionary distance.
• What do we do for higher PAM sequences?

PAM distance
• Two sequences are 1 PAM apart when they differ in 1% of the residues.
• When are 2 sequences 2 PAMs apart?
[Figure: 1 PAM, 1 PAM, 2 PAM]

Generating Higher PAMs
• $\mathrm{PAM2}(a,b) = \sum_c \mathrm{PAM1}(a,c) \cdot \mathrm{PAM1}(c,b)$
• PAM2 = PAM1 * PAM1 (matrix multiplication)
• PAM250
  – = PAM1 * PAM249
  – = $\mathrm{PAM1}^{250}$
[Figure: PAM2 = PAM1 * PAM1, with residues a, b, c]

Note: This is not the score matrix.
What happens as you keep increasing the power?

Scoring using PAM matrices
• Suppose we know that two sequences are 250 PAMs apart.
• $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{a|b}}{P_a} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_a}$
• How does it help?
  – $S_{250}(A,V) \gg S_1(A,V)$
  – Scoring of hum vs. Dros should use a higher PAM matrix than scoring hum vs. mus.
  – An alignment with a smaller % identity could still have a higher score and be more significant
[Figure: hum, mus, dros]

PAM250 based scoring matrix
• $S_{250}(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_a}$
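The matrix-power and log-odds formulas on the "Generating Higher PAMs" and "Scoring using PAM matrices" slides can be tried out numerically. The following is only a minimal sketch: the three-letter alphabet, the PAM1 entries, the background frequencies, and the helper names `pam` and `score_matrix` are all made up for illustration and are not taken from the lecture or from Dayhoff's data.

```python
import numpy as np

# Hypothetical 3-letter alphabet with invented numbers (illustration only).
ALPHABET = ["A", "B", "C"]

# Toy PAM1 matrix: column b is the original residue, row a the replacement,
# so entry [a, b] plays the role of Pr(a|b). Each column sums to 1 and
# 1% of residues change per step.
PAM1 = np.array([
    [0.990, 0.006, 0.002],
    [0.007, 0.990, 0.008],
    [0.003, 0.004, 0.990],
])

# Assumed background frequencies P_a (also invented).
background = np.array([0.5, 0.3, 0.2])


def pam(n: int) -> np.ndarray:
    """Return PAMn, the n-th matrix power of the toy PAM1."""
    return np.linalg.matrix_power(PAM1, n)


def score_matrix(n: int) -> np.ndarray:
    """Return S_n(a,b) = log10(PAMn(a,b) / P_a); row a is divided by P_a."""
    return np.log10(pam(n) / background[:, None])


if __name__ == "__main__":
    np.set_printoptions(precision=3, suppress=True)
    print("PAM2 =\n", pam(2))
    print("PAM250 =\n", pam(250))
    # Off-diagonal scores grow with PAM distance, as in S250(A,V) >> S1(A,V).
    print("S1(A,B)   =", score_matrix(1)[0, 1])
    print("S250(A,B) =", score_matrix(250)[0, 1])
    # With a very large power, every column approaches the same vector.
    print("PAM5000 =\n", pam(5000))
```

Printing `pam(5000)` is one way to explore the slide's question about increasing the power: in this toy example the columns become nearly identical, so the original residue no longer matters.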
BLOSUM series of Matrices
• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions
• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
• BLOSUM60: Merge all proteins that have greater than 60% identity. Then, compute the substitution probability.
  – In practice BLOSUM62 seems to work very well.

PAM vs. BLOSUM
• What is the correspondence?
• PAM1 Blosum1
• PAM2 Blosum2
• Blosum62
• PAM250 Blosum100

P-value computation
• BLAST: The matching regions are expanded into alignments, which are scored using SW (Smith-Waterman) and an appropriate scoring matrix.
• The results are presented in order of decreasing score
• The score is just a number.
• How significant is the top-scoring hit if it has a score S?
• Expect/E-value (score S) = Number of times we would expect to see a random query generate a score S, or better
• How can we compute E-value?

What is a distribution function
• Given a collection of numbers (scores)
  – 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, ….
• Plot its distribution as follows:
  – X-axis = each number
  – Y-axis = count/frequency/probability of seeing that number
  – More generally, the x-axis can be a range to accommodate real numbers

• End of L5
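The distribution-function slide above can be sketched in a few lines. This is an illustrative example under stated assumptions: it uses only the scores that the slide lists before its list trails off, and the helper names `distribution`, `tail_fraction`, and `expected_hits` are hypothetical. The last helper is a naive empirical stand-in for the "score S, or better" idea, not the calculation BLAST performs.

```python
from collections import Counter

# Scores listed on the slide; the slide's list continues, so this prefix
# is only an illustrative sample.
scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]


def distribution(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {v: counts[v] / total for v in sorted(counts)}


def tail_fraction(values, s):
    """Fraction of values that are >= s (an empirical tail probability)."""
    return sum(1 for v in values if v >= s) / len(values)


def expected_hits(values, s, num_trials):
    """Hypothetical E-value-style estimate: expected number of scores >= s
    among num_trials random comparisons, based on the empirical tail."""
    return tail_fraction(values, s) * num_trials


if __name__ == "__main__":
    # Text histogram: x-axis is each number, bar length is its count.
    counts = Counter(scores)
    for value in sorted(counts):
        freq = counts[value] / len(scores)
        print(f"{value}: {'#' * counts[value]} ({freq:.2f})")
    print("fraction of scores >= 7:", tail_fraction(scores, 7))
```

For real-valued scores, the same idea applies after grouping values into ranges (bins) along the x-axis, as the last sub-bullet of the slide suggests.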
