CMU BSC 03711 - Logistics

Logistics
• Problem Set 4 returned today
• Problem Set 5 out today, due Dec. 5
• Today: no office hours
• Thursday: Thanksgiving – no class
• Next week: gene finding
• Dec. 7: last class, project presentations
• Final exam: Dec. 18

[Figure: coiled-coil HMM, with alternating H and L states]

Course topics so far:
• Pairwise sequence alignment (global and local)
• Multiple sequence alignment (global and local)
• Substitution matrices
• Database searching: BLAST
• Evolutionary tree reconstruction
• RNA structure prediction
• Gene finding
• Protein structure prediction
• Sequence statistics
• Computational genomics
• …

Today
1. BLAST statistics
2. Introduction to information theory
3. Information content of alignments
4. How to select a substitution matrix
5. Normalized bit scores
6. Gapped BLAST

Review

1. Relating S and E

The expected number of ungapped alignments with score at least S found with random sequences is

    E = K m n e^(−λS)

where the parameter λ is specified by the equation

    Σ_{i,j} p_i p_j e^(λ S[i,j]) = 1

and K is a constant that depends on S[i,j] and can be computed from the theory for any scoring function. Note that E is proportional to the size of the search space, mn, and decreases exponentially with the score, S.
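The relationship between S, λ, and E can be sketched numerically. Below is a minimal illustration with a hypothetical two-letter alphabet and made-up background frequencies and scores (not a real substitution matrix); λ is found by bisection from the constraint Σ p_i p_j e^(λ S[i,j]) = 1.

```python
import math

# Hypothetical two-letter alphabet: background frequencies and scores
# chosen so the expected score is negative (required for a positive λ).
p = {"A": 0.6, "B": 0.4}
S = {("A", "A"): 1, ("A", "B"): -2, ("B", "A"): -2, ("B", "B"): 2}

def solve_lambda(p, S, lo=1e-6, hi=10.0, tol=1e-12):
    """Find λ > 0 such that sum_ij p_i p_j exp(λ S[i,j]) = 1."""
    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * S[i, j])
                   for i in p for j in p) - 1.0
    # f(0) = 0 is a trivial root; bisect on (0, hi] for the positive root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def e_value(K, m, n, score, lam):
    """Expected number of ungapped alignments scoring >= S: E = K m n e^(-λS)."""
    return K * m * n * math.exp(-lam * score)

lam = solve_lambda(p, S)
```

Raising the score threshold drives E down exponentially, while doubling either sequence length only doubles it, matching the note above.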
2. Introduction to information theory

Concepts from Shannon information theory:
– Uncertainty (also called the surprisal)
– Entropy
– Relative entropy

Let's consider some examples:
• Tossing a coin
  – N = 2 states: H, T
  – Uncertainty: what is the outcome of the next toss?
• Telegraph
  – N = |Σ| states: the set of symbols
  – Uncertainty: what is the next letter?

The uncertainty associated with state i is

    u(i) = −log p_i

where p_i is the probability of state i. Note:
• Since p_i ≤ 1, u(i) ≥ 0.
• u(i) approaches infinity as p_i approaches 0.
• u(i) approaches 0 as p_i approaches 1.
• Uncertainty is additive for independent events: u(i and j) = u(i) + u(j).

The units depend on the base of the logarithm. Using log2 gives uncertainty in bits; in this case, the uncertainty is roughly the number of yes/no questions needed to determine the state.

The entropy is the expected uncertainty:

    H = Σ_i p_i u(i) = −Σ_i p_i log p_i

Coin toss example:

    H = −p_H log2 p_H − (1 − p_H) log2 (1 − p_H)

Note: the entropy is maximal when all states are equally likely.

3. Relative entropy, aka Kullback–Leibler divergence

    D(Q||P) = Σ_i q_i log (q_i / p_i)

Interpretations:
• If P is the prior distribution (before seeing the data) and Q is the posterior distribution (after seeing the data), the KL divergence measures the decrease in uncertainty from seeing the data.
• Expected discrimination information: the information available to discriminate in favor of hypothesis Ĥa against hypothesis Ĥ0, given that Ĥa is true.

Note: D(Q||P) is not symmetric and is therefore not a distance.

Relative entropy for ungapped local alignments

Given sequences a and b:
• Alternate hypothesis (Ĥa): a and b are related at n PAMs divergence; i and j are aligned with "target" frequencies q^n_ij.
• Null hypothesis (Ĥ0): a and b are unrelated; i and j are aligned with background frequencies p_i p_j.

The relative entropy gives the number of bits per position available to distinguish related alignments from chance at n PAMs:

    D_KL(Q||P) = Σ_{i,j} q^n_ij log ( q^n_ij / (p_i p_j) ) ∝ Σ_{i,j} q^n_ij S_n[i,j]

To calculate the relative entropy, we need to know the target frequencies.
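These definitions translate directly into code. A minimal sketch using base-2 logarithms, so all quantities are in bits:

```python
import math

def uncertainty(p_i):
    """Surprisal of a state with probability p_i, in bits: u(i) = -log2 p_i."""
    return -math.log2(p_i)

def entropy(probs):
    """Expected uncertainty H = -sum_i p_i log2 p_i (p_i = 0 contributes 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def kl_divergence(q, p):
    """Relative entropy D(Q||P) = sum_i q_i log2(q_i / p_i), in bits."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)
```

For a fair coin, entropy([0.5, 0.5]) gives 1 bit, the maximum over two states; a biased coin such as [0.9, 0.1] gives less. Swapping the arguments of kl_divergence generally changes its value, illustrating the asymmetry noted above.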
How to determine the target frequencies, q_ij?

1. From the PAM transition probabilities: q_ij = p_i P_n[i,j].
2. Generate "random" sequences from the background probabilities, find MSPs in pairs of random sequences, and count the target frequencies in the MSPs.
3. From theory (Karlin & Altschul, 1990):

    q^n_ij = p_i p_j e^(λ S_n[i,j])

The information available in the alignment depends on S[i,j] and is given by the relative entropy:

    D = Σ P(Ĥa) log2 ( P(Ĥa) / P(Ĥ0) ) = Σ_{i,j} q_ij log2 ( q_ij / (p_i p_j) )

Bits per site for PAM and BLOSUM matrices as a function of sequence identity:

    Seq. identity   PAM   bits/site   BLOSUM   bits/site
    20%             250   0.36        45       0.38
    25%             200   0.51        50       0.52
    30%             160   0.70        60       0.66
    38%             120   0.98        80       0.99
    43%             100   1.18        90       1.18
    63%             60    2.00        —        —
    83%             20    2.95        —        —

(PAM 30 corresponds to 2.57 bits/site.)

3. Information content of alignments

Given the expected number of false positives,

    E = K m n e^(−λS)

if we choose λ = ln 2, then e^(−λS) = 2^(−S); setting E ≈ 1 and ignoring the constant K gives

    S ≈ log2 (mn)

Interpretation: the minimum score needed to distinguish MSPs from chance is equivalent to the number of bits required to specify the starting position of the alignment.

How many bits are required to find meaningful alignments in today's databases? For a typical amino acid sequence of length m = 250, if n = 1 billion, then a minimum of 38 bits is required to distinguish significant MSPs from chance.

The information available in the alignment depends on S_n[i,j] and is given by the relative entropy

    D = Σ_{i,j} q^n_ij log2 ( q^n_ij / (p_i p_j) )

Implications

The lower the relative entropy, H, the longer the minimum alignment that is distinguishable from chance:

    minimum query sequence length = minimum number of bits / bits per position

To distinguish significant HSPs from chance in a database of length 1 billion (38 bits required), a query sequence must be at least
• 38/2.57 = 15 residues long at 30 PAMs,
• 38/0.70 = 54 residues long at 160 PAMs,
• 38/0.36 = 105 residues long at 250 PAMs.
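The 38-bit figure and the minimum query lengths above can be reproduced in a few lines. The rounding to the nearest integer is an assumption chosen here to match the slide's numbers; the underlying quantities are continuous.

```python
import math

def min_score_bits(m, n):
    """Minimum score (in bits) to distinguish an MSP from chance in an
    m-by-n search space: S ~ log2(m * n), i.e. the number of bits needed
    to name the alignment's starting position."""
    return math.log2(m * n)

def min_query_length(bits_needed, bits_per_position):
    """Minimum query length = (minimum number of bits) / (bits per position),
    rounded to the nearest residue."""
    return round(bits_needed / bits_per_position)

bits = min_score_bits(250, 10**9)   # ~37.9, i.e. about 38 bits
```

Dividing those 38 bits by the relative entropy of the matrix (2.57 bits/site at 30 PAMs, 0.70 at 160, 0.36 at 250) recovers the 15 / 54 / 105 residue minimums.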
4. How to choose S[i,j]?

The best scoring matrix for distinguishing significant alignments from chance is the matrix corresponding to the q_ij from related sequences at the evolutionary distance of interest [KA90]:

    S[i,j] = log2 ( q_ij / (p_i p_j) )

Proof by contradiction:
– Let S_n[ ] be the optimal matrix, i.e., the matrix that best distinguishes significant alignments from chance alignments at n PAMs.
– Then MSPs in sequences n PAMs diverged will have maximum scores when scored with S_n[ ].
– Suppose S_n[ ] does not correspond to q_ij = p_i p_j e^(λ S[i,j]).
– Then there must be some pair x, y in Σ that occurs in the MSPs more frequently than q_xy predicts.
– We can then increase the score of the MSPs by increasing S_n[x,y], so S_n[ ] is not optimal — a contradiction.

Implications

BLAST will give reasonable accuracy as long as the empirical target frequencies, q_ij, in the alignments of interest do not deviate too far from the theoretical target frequencies

    q_ij = p_i p_j e^(λ S[i,j])

Reasonable accuracy can therefore be achieved with two or three matrices.

Efficiency

Efficiency is the average score (in bits) per alignment position when using a PAM M matrix to compare sequences in fact separated by D PAMs (calculated by simulation):

    efficiency = score with PAM M / score with PAM D

[Figure: efficiency curves; a nearby matrix retains, e.g., 94% efficiency]

Choosing your scoring matrix
1. BLAST will give reasonable accuracy as long as the empirical target frequencies do not deviate too far from the theoretical target frequencies.
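The log-odds construction can be illustrated on a hypothetical two-letter alphabet (the frequencies below are made up for illustration). Building S[i,j] = log2(q_ij / (p_i p_j)) makes the Karlin–Altschul identity q_ij = p_i p_j e^(λ S[i,j]) hold with λ = ln 2 by construction:

```python
import math

# Hypothetical background and target frequencies for a toy alphabet;
# q concentrates mass on matches, as in alignments of related sequences.
p = {"A": 0.6, "B": 0.4}
q = {("A", "A"): 0.45, ("A", "B"): 0.05,
     ("B", "A"): 0.05, ("B", "B"): 0.45}

def log_odds_matrix(q, p):
    """S[i,j] = log2( q_ij / (p_i p_j) ): the scores that best distinguish
    alignments with target frequencies q from chance alignments."""
    return {(i, j): math.log2(q[i, j] / (p[i] * p[j])) for (i, j) in q}

S = log_odds_matrix(q, p)
```

Pairs over-represented relative to chance (here the matches) get positive scores, under-represented pairs get negative scores, which is exactly the behavior the proof above relies on.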

