CMU CS 15780 - PHMMs

15-780: Graduate Artificial Intelligence
Computational biology: Sequence alignment and profile HMMs

Central dogma
DNA --(transcription)--> mRNA --(translation)--> Protein
DNA:     CCTGAGCCAACTATTGATGAA
mRNA:    CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE

Comparison of Different Organisms
Organism   Genome size (bp)   Num. of genes
Yeast      0.15*10^8          6,000
Fly        1.8*10^8           13,600
Human      30*10^8            25,000
Plant      1.3*10^8           25,000
E. coli    0.05*10^8          4,200
Worm       1*10^8             18,400

Assigning function to proteins
• One of the main goals of molecular (and computational) biology.
• There are 25,000 human genes, and the functions of the vast majority are still unknown.
• Several ways to determine function, listed from hard to easier:
 - Direct experiments (knockout, overexpression)
 - Interacting partners
 - 3D structures
 - Sequence homology

Function from sequence homology
• We have a query gene: ACTGGTGTACCGAT
• Given a database containing genes with known function, our goal is to find similar genes in this database (possibly in another organism).
• When we find such a gene, we predict the function of the query gene to be similar to that of the matching database gene.
• Problems
 - How do we determine similarity?

Sequence analysis techniques
• A major area of research within computational biology.
• Initially based on deterministic or heuristic alignment methods.
• More recently based on probabilistic inference methods.

Sequence analysis
• Traditional
 - Dynamic programming
 - BLAST
• Probabilistic
 - Profile HMMs

Pairwise sequence alignment
[Figure: two example sequence pairs, AGCCTT with ACCATT and AGCCTT with AGCATT.]
• We cannot expect the alignments to be perfect.
• Major reasons include insertions, deletions and substitutions.
• We need to allow gaps in the resulting alignment.

Scoring Alignments
• Alignments can be scored by comparing the resulting alignment to a background (random) model.
Independent model: $P(x, y \mid I) = \prod_i q_{x_i} \prod_j q_{y_j}$
Related model: $P(x, y \mid M) = \prod_i p_{x_i y_i}$
Score for alignment: $S = \sum_i s(x_i, y_i)$, where $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
In other words, we are trying to find an alignment that maximizes the likelihood ratio of the aligned pair compared to the background model.

Computing optimal alignment: The Needleman-Wunsch algorithm
$F(i, j) = \max \{ F(i-1, j-1) + s(x_i, y_j),\; F(i-1, j) + d,\; F(i, j-1) + d \}$
Here d is a penalty for a gap. Each cell F(i,j) depends only on its three neighbors F(i-1,j-1), F(i-1,j) and F(i,j-1).
[Figure: dynamic programming grid for x = AGCCTT and y = ACCATT.]

Example
Assume a simple model where s(a,b) = 1 if a = b and -5 otherwise. Also, assume that d = -1.
The first row and column are initialized with multiples of d, and the remaining cells are filled in with the recurrence above. The completed matrix, with x = AGCCTT across the top and y = ACCATT down the side:

          A   G   C   C   T   T
      0  -1  -2  -3  -4  -5  -6
  A  -1   1   0  -1  -2  -3  -4
  C  -2   0  -1   1   0  -1  -2
  C  -3  -1  -2   0   2   1   0
  A  -4  -2  -3  -1   1   0  -1
  T  -5  -3  -4  -2   0   2   1
  T  -6  -4  -5  -3  -1   1   3

Tracing back from the bottom-right cell (score 3) gives the alignment:
A G C C - T T
A - C C A T T

Running time
• The running time of the alignment algorithm is O(n^2) for two sequences of length n.
• This doesn't sound too bad, or does it?
• The time requirement for doing global sequence alignment is too high in many cases.
• Consider a database with tens of thousands of sequences. Looking through all these sequences for the best alignment is too time consuming.
• In many cases, a much faster heuristic approach can achieve equally good results.

BLAST: Basic Local Alignment Search Tool
• Heuristic alignment method, first presented in 1990.
• The biggest success of computational biology to date.
• Since it was suggested, a number of new and improved versions were presented (e.g. PSI-BLAST).
• Currently available with almost all public databases.

BLAST (cont.)
• The sequence is broken into a list of 'words'.
• Uses a dictionary of fixed-length words (length 3 for amino acids and 11 for nucleotides).
• All matches to the database are recorded.
• Hits are extended in both directions if they are less than X bases away from each other.
• All sequences reaching a certain score are returned, and a complete alignment is performed.

Sequence analysis
• Traditional (covered)
 - Dynamic programming
 - BLAST
• Probabilistic
 - Profile HMMs

Protein families
• Proteins can be classified into families (and further into subfamilies, etc.).
• A specific family includes proteins with similar high-level functions.
• For example:
 - Transcription factors
 - Receptors
 - Etc.
Family assignment is an important first step towards function prediction.

Multiple Alignment Process
• The process of aligning three or more sequences with each other.
• Fine for offline computations.
• We can determine such an alignment by generalizing the algorithm for aligning two sequences.
• What's the complexity of this?

Multiple Alignment: Reasons for differences
• Substitutions
• Insertions
• Deletions

Biological Motivation
• Given a single amino acid target sequence of an unknown protein, we want to infer the family of the resulting protein.

Methods for Characterizing a Protein Family
• Objective: Given a number of related sequences, encapsulate what they have in common in such a way that we can recognize other members of the family.
• Some standard methods for characterization:
 - Multiple Alignments
 - Regular Expressions
 - Consensus Sequences
 - Hidden Markov Models

Designing HMMs: Consensus (match) states
We first include states to output the consensus sequence. Emission probabilities of the four match states:
State 1: A: 0.8, T: 0.2
State 2: C: 0.8, G: 0.2
State 3: A: 0.8, C: 0.2
State 4: T: 0.8, G: 0.2

Designing HMMs: Insertions
We next add states to allow insertions, beginning from a start state.
A: 0.8, T: 0.2
C: 0.8, G:
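
The Needleman-Wunsch recurrence and the worked example from the alignment slides can be sketched in Python as follows. This is a minimal illustration, not code from the course: the function name needleman_wunsch and its parameter names are hypothetical, the scores default to the example's values (match 1, mismatch -5, gap d = -1), and the traceback prefers a diagonal move when several moves tie.

```python
def needleman_wunsch(x, y, match=1, mismatch=-5, gap=-1):
    """Global alignment of x and y (hypothetical helper, illustration only).

    Returns (score, aligned_x, aligned_y, F), where F[i][j] is the best
    score for aligning the prefix x[:i] with the prefix y[:j].
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Boundary cells: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Fill the table with F(i,j) = max of the diagonal, up and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])   # symbol of x aligned to a gap in y
            ay.append("-")
            i -= 1
        else:
            ax.append("-")        # symbol of y aligned to a gap in x
            ay.append(y[j - 1])
            j -= 1

    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay)), F


score, ax, ay, F = needleman_wunsch("AGCCTT", "ACCATT")
print(score)  # 3
print(ax)     # AGCC-TT
print(ay)     # A-CCATT
```

Both nested loops visit every cell once, which is where the O(n^2) running time for two sequences of length n comes from.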

