Unformatted text preview:

15-780: Graduate Artificial IntelligenceComputational biology: Sequence alignment and profile HMMsCentral dogmaProteinmRNADNAtranscriptiontranslationCCTGAGCCAACTATTGATGAACCUGAGCCAACUAUUGAUGAAPEPTIDE2Assigning function to proteins• One of the main goals of molecular (and computational) biology.• There are 25000 human genes and the vast majority of their functions is still unknown• Several ways to determine function- Direct experiments (knockout, overexpression)- Interacting partners- 3D structures- Sequence homologyHardEasier3Comparison of Different Organisms Genome size Num. of genesE. coli .05*1084,200Worm 1*10818,400Yeast .15*1086,000Fly 1.8*10813,600Human 30*10830,000Plant 1.3*10825,0004Function from sequence homology• We have a query gene: ACTGGTGTACCGAT• Given a database with genes with a known function, our goal is to find another gene with similar sequence (possibly in another organism)• When we find such gene we predict the function of the query gene to be similar to the resulting database gene• Problems- How do we determine similarity?5Sequence analysis techniques• A major area of research within computational biology.• Initially, based on deterministic or heuristic alignment methods• More recently, based on probabilistic inference methods 6Sequence analysis• Traditional- Dynamic programming-Blast• Probabilsitic- Profile HMMs7Pairwise sequence alignmentAGCCTTACCATTA G C C  T TA  C C A T T• We cannot expect the alignments to be perfect.• Major reasons include insertion, deletion and substitutions.• We need to allow gaps in the resulting alignment.8Scoring Alignments• Alignments can be scored by comparing the resulting alignment to a background (random) model. Independent Related∏=iyxiipMyxP )|,(∏∏=jxixjiqqIyxP )|,(Score for alignment:)log(),(,babaqqpbas =),(∑=iiiyxsSwhere:9Scoring Alignments• Alignments can be scored by comparing the resulting alignment to a background (random) model. ∏∏=jxixjiqqIyxP )|,(∏=iyxiipMyxP )|,()log(),(,babaqqpbas =Independent RelatedScore for alignment:),(∑=iiiyxsSwhere:In other words, we are trying to find an alignment that maximizes the likelihood ratio of the aligned pair compared to the background model10Computing optimal alignment: The Needham-Wuncsh algorithmF(i-1,j-1)+s(xi,xj)F(i,j) = maxd is a penalty for a gapF(i-1,j)-dF(i,j-1)-dA G C C T TACCATTF(i-1,j-1) F(i-1,j)F(i,j-1) F(i,j)11ExampleAssume a simple model where S(a,b) = 1 if a=b and -5 otherwise.Also, assume that d = -1AGCCTT0 -1-2-3-4-5-6-1-2-3-4-5-6ACCATT12ExampleAssume a simple model where S(a,b) = 1 if a=b and -5 otherwise.Also, assume that d = -1AGCCTT0-1-2-3-4-5-6-1-2-3-4-5-6A 1CCATTF(i,j) = maxF(i-1,j-1)+s(xi,xj)F(i-1,j)-dF(i,j-1)-d13ExampleAssume a simple model where S(a,b) = 1 if a=b and -5 otherwise.Also, assume that d = -1AGCCTT0-1-2-3-4-5-6-1-2-3-4-5-6A10C 0CATTF(i,j) = maxF(i-1,j-1)+s(xi,xj)F(i-1,j)-dF(i,j-1)-d14ExampleAssume a simple model where S(a,b) = 1 if a=b and -5 otherwise.Also, assume that d = -1AGCCTT0-1-2-3-4-5-6-1-2-3-4-5-6A10-1 -2 -3 -4C0-1C -1A -2T -3T -4F(i,j) = maxF(i-1,j-1)+s(xi,xj)F(i-1,j)-dF(i,j-1)-d15ExampleAssume a simple model where S(a,b) = 1 if a=b and -5 otherwise.Also, assume that d = -1AGCCTT0-1-2-3-4-5-6-1-2-3-4-5-6A10-1-2-3-4C 0 -1 1 0 -1 -2C-1-20210A-2-3-110-1T-3-4-2 021T-4-5-3-113F(i,j) = maxF(i-1,j-1)+s(xi,xj)F(i-1,j)-dF(i,j-1)-d16ExampleAssume a simple model where S(a,b) = 1 if a=b and -5 otherwise.Also, assume that d = -1AGCCTT0-1-2-3-4-5-6-1-2-3-4-5-6A10-1-2-3-4C 0 -1 1 0 -1 -2C-1-20210A-2-3-110-1T-3-4-2 021T-4-5-3-113F(i,j) = maxF(i-1,j-1)+s(xi,xj)F(i-1,j)-dF(i,j-1)-d17ExampleAssume a simple model where S(a,b) = 1 if a=b and -5 otherwise.Also, assume that d = -1AGCCTT0-1-2-3-4-5-6-1-2-3-4-5-6A10-1-2-3-4C 0 -1 1 0 -1 -2C-1-20210A-2-3-110-1T-3-4-2 021T-4-5-3-113A G C C  T TA  C C A T T18Running time• The running time of an alignment algorithms if O(n2)• This doesn’t sound too bad, or is it?• The time requirement for doing global sequence alignment is too high in many cases.• Consider a database with tens of thousands of sequences. Looking through all these sequences for the best alignment is too time consuming.• In many cases, a much faster heuristic approach can achieve equally good results.19BLAST: Basic Local Alignment Search Tool• Heuristic alignment method, first presented in 1990.• The biggest success of computational biology to date.• Since it was suggested, a number of new and improved version where presented (psi-BLAST).• Currently available with almost all public databases. 20BLAST (cont.)• Sequence is composed of a list of ‘words’.• Uses a dictionary (3 for AA and 11 for nucleotides).• All matches to database are recorded. 21BLAST• Hits are extended in both direction if they are less than A bases away from each other.• All sequences reaching a certain score are returned, and a complete alignment is performed.2223Sequence analysis• Traditional- Dynamic programming-Blast• Probabilsitic- Profile HMMs√√24Protein families• Proteins can be classified into families (and further into sub families etc.)• A specific family includes proteins with similar high level functions• For example:- Transcription factors-Receptors-Etc.Family assignment is an important first step towards function prediction25Multiple Alignment Process• Process of aligning three or more sequences with each other• We can determine such alignment by generalizing the algorithm to align two sequences• What’s the complexity of this?26Multiple Alignment: Reasons for differencesSubstitutionsInsertionsDeletions27Biological Motivation:• Given a single amino acid target sequence of unknown protein we want to infer the family of the resulting protein.28Methods for Characterizing a Protein Family• Objective: Given a number of related sequences, encapsulate what they have in common in such a way that we can recognize other members of the family. • Some standard methods for characterization:– Multiple Alignments– Regular Expressions– Consensus Sequences– Hidden Markov Models29Designing HMMs: Consensus (match) statesWe first include states to output the consensus sequenceA: 0.8T: 0.2C: 0.8G: 0.2A: 0.8C: 0.2T: 0.8G: 0.230Designing HMMs: InsertionsWe next add states to allow insertionsstartA: 0.8T: 0.2C: 0.8G: 0.2A: 0.8C: 0.2T: 0.8G: 0.2111 0.40.6 0.80.2A: 0.2C: 0.4 : G:0.2T: 0.231Designing HMMs:


View Full Document

CMU CS 15780 - Lecture

Download Lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?