Course topics:
• Pairwise sequence alignment (global and local)
• Multiple sequence alignment (local, global)
• Substitution matrices
• Database searching (BLAST)
• Sequence statistics
• Evolutionary tree reconstruction
• RNA structure prediction
• Gene finding
• Protein structure prediction
• Computational genomics
…

[Figure: example protein sequences (e.g., …RLSKIISMFQAHIRGYLIRKAYKRGYQARCLLK…, …RNKHAIAVIWAFWLVQSSFRGYQAGSKARRELK…) sharing a common ungapped motif, illustrating the three tasks: Discovery, Modeling, Recognition.]

Local Multiple Alignment
• Position Specific Scoring Matrices (PSSMs) – Modeling, Recognition
• Gibbs sampler – Discovery
• Hidden Markov Models (HMMs) – Discovery, Modeling, Recognition
  – Can represent gaps, positional dependencies

Position Specific Scoring Matrices
Given an alignment A[l,k] (k sequences, l positions):
– n_ij: frequency of aa i at position j
– F[i,j] = n_ij / k
– P[i,j] = F[i,j] / b_i: propensity of aa i at position j (b_i is the background frequency of aa i)
– S[i,j] = log2 P[i,j]: log-odds scoring matrix
To scan a sequence t of length n, score the width-w window starting at position i (0 ≤ i ≤ n − w) as Σ_{l=1}^{w} S[t[i+l], l].

Discovery
• Input: k sequences containing a common ungapped pattern (e.g., a transcription factor binding site, a domain, …)
• Output: a set of k subsequences that are "most similar" to each other.
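The PSSM recipe above (counts → frequencies → background-normalized propensities → log2 log-odds, then window scoring) can be sketched in Python. The alignment of WEIRD/WEIQH variants echoes the example used later in these notes; the uniform background, the pseudocount value, and the scanned sequence are illustrative assumptions, not part of the original.

```python
import math
from collections import Counter

def build_pssm(seqs, background, pseudocount=1.0):
    """Build a log2 log-odds PSSM from k aligned, equal-length sequences.

    F[i,j] = n_ij / k       (frequency of character i at position j)
    P[i,j] = F[i,j] / b_i   (propensity relative to background b_i)
    S[i,j] = log2 P[i,j]    (log-odds score)
    Pseudocounts keep unseen characters from scoring -infinity.
    """
    k, w = len(seqs), len(seqs[0])
    alphabet = sorted(background)
    pssm = {}
    for j in range(w):
        counts = Counter(s[j] for s in seqs)
        for i in alphabet:
            F = (counts[i] + pseudocount) / (k + pseudocount * len(alphabet))
            pssm[i, j] = -math.inf if F == 0 else math.log2(F / background[i])
    return pssm, w

def window_score(pssm, w, t, i):
    """Score of the width-w window of sequence t starting at offset i
    (0-indexed version of sum over l of S[t[i+l], l])."""
    return sum(pssm[t[i + l], l] for l in range(w))

# Toy alignment: 3 copies of WEIRD, 2 of WEIQH (hypothetical data).
seqs = ["WEIRD", "WEIRD", "WEIRD", "WEIQH", "WEIQH"]
alphabet = sorted(set("".join(seqs)))
bg = {c: 1.0 / len(alphabet) for c in alphabet}  # assumed uniform background
pssm, w = build_pssm(seqs, bg)

# Slide every window over a new sequence and keep the best-scoring one.
t = "QHWEIRDRQH"
best = max(range(len(t) - w + 1), key=lambda i: window_score(pssm, w, t, i))
print(best, t[best:best + w])
```

As expected, the highest-scoring window is the embedded motif instance, since matching characters get positive log-odds scores and mismatches get negative ones.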
• Approaches:
– Exhaustive enumeration
– Gibbs sampler
– Expectation maximization using HMMs

Gibbs sampler summary
Convergence (see optional reading for details):
– Model the sampling process as a Markov chain
– Each state is a set of k subsequences
– Show that
  • the Markov chain has a stationary distribution, and
  • the state corresponding to the most likely pattern has high probability in that distribution
In practice, the sampler can get stuck in local optima:
• Randomness helps.
• Run the procedure several times with different starting configurations.

Gibbs sampler summary
Other considerations:
– Problems can arise if a sequence has no copy of the pattern, or has more than one copy.
– You could find a biologically meaningful pattern that is not the pattern you are looking for.
– Use pseudocounts when building the PSSM to ensure all characters are represented.
Black magic (see Lawrence et al., optional reading):
– Pseudocounts
– Selecting the window size, w
– Selecting the starting configuration
– Termination condition

Problems with PSSMs
PSSMs do not capture positional dependencies. Consider an alignment of the variants WEIRD and WEIQH with per-position frequencies:
W 1.00 | E 1.00 | I 1.00 | R 0.60, Q 0.40 | D 0.60, H 0.40
Note: we never see QD or RH, only RD and QH. But P(RH) = P(QD) = 0.24, while P(QH) = 0.16.

Problems with PSSMs
It is hard to recognize pattern instances that contain indels. The sequence WETIRD is the motif WEIRD with an inserted T.
[Figure: PSSM score matrix for WEIRD. Contiguous 5-residue windows of WETIRD score about 11 and 12.5, while the gapped alignment W E _ I R D, which skips the inserted T, scores 18.4.]

Problems with PSSMs
Variable-length motifs, e.g. WETIRD.
Gaps can be represented by expanding Σ,
but what size window should be used to score new instances of the motif?
WE_IRD
WETIQH
x x x W E T I R D x x x
x x x W E I Q H x x x

Problems with PSSMs
PSSMs do not handle boundary detection problems well.
Goal: label every element in the sequence with a zero (not in pattern) or a one (in pattern):
x x x x x x x x x x x y y y y y y y x x x x x x x x
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

Examples of boundary detection problems
• Recognition of regulatory motifs
• Recognition of protein domains
• Intron/exon boundaries
• Gene boundaries
• Transmembrane regions
• Secondary structures (α helices, β sheets)

Plan
• Review Markov chains
• Extend to Hidden Markov Models
– Boundary detection
– Scoring sequences
• HMM construction
• Biological applications: revisit gaps and dependencies

Markov chains
• States: S1, S2, ..., SN
• States visited: q0, q1, ..., qt, qt+1, ...
• Initial distribution of states: π(i) = P(q0 = Si)
• Transition probabilities: a_ij = P(qt = Sj | qt−1 = Si)
Questions we can ask:
– What is the probability of being in a particular state at a particular time?
– What is the probability of seeing a particular sequence of states?

An example: transmembrane regions
[Figure: a protein crossing the membrane between the extracellular space and the cytosol; transmembrane regions tend to be hydrophobic.]
Model each amino acid as hydrophobic (H) or hydrophilic (L) → a peptide sequence can be represented as a sequence of H's and L's:
MLVKRFWKCE….
HHHLLHLHHLHL...

Questions to ask: which subsequences correspond to transmembrane regions?
A simpler question: is a given sequence a transmembrane sequence?

A Markov chain for recognizing transmembrane sequences
• States: SH, SL
• Σ = {H, L}
• π(H) = 0.7, π(L) = 0.3
• Transition probabilities: a_HH = 0.7, a_HL = 0.3, a_LH = 0.7, a_LL = 0.3
Is a given sequence, say HHLHH, a transmembrane sequence?

Under the transmembrane model:
P(HHLHH) = 0.7 × 0.7 × 0.3 × 0.7 × 0.7 = 0.072
Is it a transmembrane protein?
Problem: we need a threshold, and the threshold must be length dependent.

Compare against a null model:
– Transmembrane model: π(H) = 0.7, π(L) = 0.3; a_HH = a_LH = 0.7, a_HL = a_LL = 0.3
– Null model: π(H) = π(L) = 0.5; all transition probabilities 0.5
Score the sequence by the likelihood ratio P(HHLHH | TM) / P(HHLHH | Null).
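The two-state chain and the model-versus-null comparison can be sketched in Python using the parameters given above; the dictionary layout and function name are implementation choices, not from the original.

```python
import math

# Transmembrane (TM) model parameters from the slides: hydrophobic (H)
# residues are favored, both initially and after every residue.
TM = {"pi": {"H": 0.7, "L": 0.3},
      "a": {("H", "H"): 0.7, ("H", "L"): 0.3,
            ("L", "H"): 0.7, ("L", "L"): 0.3}}

# Null model: H and L equally likely everywhere.
NULL = {"pi": {"H": 0.5, "L": 0.5},
        "a": {(x, y): 0.5 for x in "HL" for y in "HL"}}

def chain_prob(model, seq):
    """P(seq) under a first-order Markov chain:
    pi(q0) * product over t of a(q_{t-1}, q_t)."""
    p = model["pi"][seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= model["a"][prev, cur]
    return p

seq = "HHLHH"
p_tm = chain_prob(TM, seq)      # 0.7 * 0.7 * 0.3 * 0.7 * 0.7
p_null = chain_prob(NULL, seq)  # 0.5 ** 5
log_odds = math.log2(p_tm / p_null)
print(round(p_tm, 5), p_null, round(log_odds, 3))
```

A positive log-odds score favors the transmembrane model. Because both probabilities shrink with sequence length, the ratio sidesteps the length-dependent-threshold problem noted above: the per-position log-odds can simply be compared to zero.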