• Pairwise sequence alignment (global and local)
• Multiple sequence alignment
  – local
  – global
• Substitution matrices
• Database searching
  – BLAST
• Evolutionary tree reconstruction
• RNA structure prediction
• Gene finding
• Protein structure prediction
• Sequence statistics
• Computational genomics
• …

Hypothesis testing using a (log) odds ratio

• Observation: data, D (6H, 2T)
• What process generated this data?
  – Alternative hypothesis: Ha (q ≠ 0.5)
  – Null hypothesis: H0 (q = 0.5)
• P(Ha|D): posterior probability
• P(Ha): prior probability
• P(D|Ha): likelihood of the data given the hypothesis

Hypothesis testing using a likelihood ratio

Likelihood ratio: How likely is the data under the alternative hypothesis compared with its likelihood under the null hypothesis?

$$\frac{P(D \mid H_a)}{P(D \mid H_0)} = \frac{P(\text{6 heads in 8 tosses} \mid q)}{P(\text{6 heads in 8 tosses} \mid 0.5)}$$

P(toss yields heads): Ha: q ≠ 0.5; H0: q = 0.5

Note: There are other ways to test a hypothesis, e.g., a p-value.

Need to estimate q.

Maximum Likelihood Estimation

What process generated this data?
  – Model with parameters: e.g., binomial with parameter q
  – The best estimate of q is the value that maximizes the likelihood of the data. To obtain q, solve:

$$P(k, n, q) = \binom{n}{k} q^{k} (1-q)^{n-k}$$

$$\frac{dP(D \mid H)}{dq} = 0$$

$$\frac{d}{dq}\left[\binom{8}{6} q^{6} (1-q)^{2}\right] = 0$$

$$q = 0.75$$

Hypothesis testing using a (log) odds ratio

Likelihood ratio with the estimated q:

$$\frac{P(D \mid H_a)}{P(D \mid H_0)} = \frac{P(\text{6 heads in 8 tosses} \mid 0.75)}{P(\text{6 heads in 8 tosses} \mid 0.5)} = \frac{(0.75)^{6}(0.25)^{2}}{(0.5)^{6}(0.5)^{2}} = 2.85$$

Observing 6 heads in 8 coin tosses is 2.85 times as likely if q = 0.75 as it is if the coin is fair.

Note: the sample size is very small!

Note:
• The estimate improves as the sample size increases. A method is consistent if its estimate converges to the true parameter value as the sample size goes to infinity.
• For mathematical convenience we may use the logarithm of the likelihood ratio.
• In general, the probability distribution is unknown. Select a model and maximize the likelihood with respect to that model.
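The coin example above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names (`binomial_likelihood`, `mle_heads_probability`, `likelihood_ratio`) are illustrative. It assumes the binomial model from the slides and uses the closed form q = k/n, which is what setting the derivative of q^k (1-q)^(n-k) to zero yields.

```python
from math import comb, log


def binomial_likelihood(k, n, q):
    """P(k heads in n tosses | q) under the binomial model."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def mle_heads_probability(k, n):
    """Value of q that maximizes the binomial likelihood: k / n."""
    return k / n


def likelihood_ratio(k, n, q_alt, q_null=0.5):
    """P(D | Ha) / P(D | H0); the binomial coefficient cancels."""
    return binomial_likelihood(k, n, q_alt) / binomial_likelihood(k, n, q_null)


# Data from the slides: 6 heads in 8 tosses.
k, n = 6, 8
q_hat = mle_heads_probability(k, n)
ratio = likelihood_ratio(k, n, q_hat)

print(f"estimated q = {q_hat}")
print(f"likelihood ratio = {ratio:.2f}")
print(f"log likelihood ratio = {log(ratio):.3f}")
```

With 6 heads in 8 tosses the estimate is 0.75 and the ratio is 729/256, which rounds to the 2.85 quoted on the slide.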
Results can vary with the choice of model. Note also that we estimated a parameter and determined the likelihood in a single, unified process.

In symbols, consistency of the estimate means $\lim_{n \to \infty} \hat{q} = q$, and the log likelihood ratio is $\log \frac{P(D \mid H_a)}{P(D \mid H_0)}$.

Maximum Likelihood Estimation for Phylogeny Reconstruction

Data: multiple sequence alignment (MSA), n sites, k taxa
Model: sequence evolution, e.g., Jukes-Cantor
Parameters:
  – Internal labels, l = (l1, l2, …, lj)
  – Branch lengths, x = (x1, x2, …, xj)
Given a topology T, select l, x such that P(MSA | T, l, x) is maximum.

Maximum Likelihood Estimation for Phylogeny Reconstruction

Assumptions:
• Sites are independent: score each site separately

$$P(\mathrm{MSA} \mid T) = \prod_{i} P(\mathrm{site}_i \mid T)$$

• Lineages are independent (Markov property): compute each branch separately

$$P(\mathrm{site}_i \mid T) = \prod_{j} P(\mathrm{site}_i \mid x_j)$$

Maximum Likelihood Estimation for Phylogeny Reconstruction

Given a topology T, select l, x such that

$$P(\mathrm{MSA} \mid H) = \prod_{i} \prod_{j} \prod_{h} P(\mathrm{site}_i \mid T, l_h, x_j)$$

is maximum.

Probabilities are given by, e.g., the Jukes-Cantor model.

[Figure: a tree with root r ∈ {C, G, A, T} and two leaves, A and T, on branches of length x1 and x2]

$$P(\mathrm{site} \mid T) = P(r{=}A)\,P(x_1)_{AA}\,P(x_2)_{AT} + P(r{=}T)\,P(x_1)_{TA}\,P(x_2)_{TT} + P(r{=}C)\,P(x_1)_{CA}\,P(x_2)_{CT} + P(r{=}G)\,P(x_1)_{GA}\,P(x_2)_{GT}$$

$$P(x_i)_{CC} = \tfrac{1}{4} + \tfrac{3}{4} e^{-4x_i}, \qquad P(x_i)_{CG} = \tfrac{1}{4} - \tfrac{1}{4} e^{-4x_i}, \quad \text{etc.}$$

Note that this is a sum over the possible root labels, not a product.

Maximum Likelihood Estimation for Phylogeny Reconstruction

• Consistent (more data, better estimation)
• Computationally intensive
  – Consider T(k) trees
  – For each internal node, |Σ|k labels. Because the DNA alphabet is much smaller than the amino acid alphabet, MLE is used more often for DNA than for protein sequences.
  – Branch lengths are typically determined numerically.
• If the evolutionary model is a reversible Markov chain, then the MLE distance matrix converges to additive.
  – ⇒ Neighbor Joining is a consistent method
• Note that parsimony is not consistent.

Selecting data for tree reconstruction

• For reconstructing recent events, use DNA sequences
• For reconstructing distant events, use amino acid sequences
• Select sequences that
  – Are present in all taxa
  – Contain a conserved region
  – Exhibit variation within that region
  – e.g., ribosomal RNA (16S rRNA) genes were used to reconstruct the tree of life.
These genes encode products used in all organisms, from bacteria to mammals.
• Pitfalls: duplicated genes, horizontal gene transfer, mosaic genes.

Comparison of Phylogeny Reconstruction Methods

• Parsimony
  – Selection dominates, e.g., ribosomal genes
  – Exhaustive or heuristic search, branch and bound
• Distance
  – Neutral mutation dominates, e.g., immunoglobulin sequences
  – Exhaustive or heuristic search, greedy methods
  – Neighbor Joining finds the correct tree in quadratic time if the data is additive.
  – UPGMA finds the correct tree in quadratic time if the data is ultrametric.
• Maximum Likelihood
  – Neutral mutation dominates, e.g., immunoglobulin sequences
  – Exhaustive or heuristic search

                              Max Likelihood   Distance   Parsimony
Data                          Character        Distance   Character
Selective pressure            No               No         Yes
Consistent                    Yes              Yes        No
DNA                           Yes              Yes        Yes
Amino acids                   Very slow        Yes        Yes
Topology                      Yes              Yes        Yes
Branch lengths                Prob             Yes        Yes
Ancestral states              Prob             No         Yes
Model of mutational change    Yes              Yes        No
NP-complete                   Yes              Yes        Yes

Applications of Local MSA

Conserved patterns in biological sequences
Example: Transcription factor binding sites

SP ...gcttt AATTTTCACTATATACTATAA cgatt...
ST ...cagat ATAAATGATATAGTGGTTATA gttaa...
ST ...atctt TTTTATTATTAAATCGTATTA gcagc...
EC ...aggct ATAAATGATATAGTGGTTATA gttag...
EC ...acctt TTTTATTATTAAATCGTATTA gtcac...
VC ...ttata ACTAATAATTATAAAATATGT gtgtc...
YP ...gctga TGAAATGATATAATCGTTATA taaga...
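One simple way to see the conservation in the binding sites aligned above is to count nucleotides column by column. The following is a minimal Python sketch, not taken from the slides; the names `column_counts` and `consensus` are illustrative. It assumes the 21-base uppercase core of each sequence is the aligned site and ignores the lowercase flanking bases.

```python
from collections import Counter

# Uppercase cores of the seven aligned sites, labeled as in the slide.
SITES = [
    ("SP", "AATTTTCACTATATACTATAA"),
    ("ST", "ATAAATGATATAGTGGTTATA"),
    ("ST", "TTTTATTATTAAATCGTATTA"),
    ("EC", "ATAAATGATATAGTGGTTATA"),
    ("EC", "TTTTATTATTAAATCGTATTA"),
    ("VC", "ACTAATAATTATAAAATATGT"),
    ("YP", "TGAAATGATATAATCGTTATA"),
]


def column_counts(seqs):
    """Return one Counter of nucleotide counts per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sites must have the same length")
    return [Counter(column) for column in zip(*seqs)]


def consensus(counts):
    """Most frequent nucleotide per column (ties follow Counter.most_common order)."""
    return "".join(c.most_common(1)[0][0] for c in counts)


seqs = [seq for _, seq in SITES]
counts = column_counts(seqs)

for position, column in enumerate(counts, start=1):
    print(position, dict(column))
print("consensus:", consensus(counts))
```

Columns where one nucleotide dominates are the conserved positions. Dividing each count by the number of sequences would give per-column frequencies.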
…agcgagcctgagcactcgaggcatctctgcacattcagcatgggatgggcctcctgtccctgtatgcgcctgatga…

[Figure labels: introns, promoter, transcription factor binding sites, polymerase, tf]

Some known binding site motifs

Applications of Local MSA

Conserved patterns in biological sequences
Example: Protein domains
• Fold independently
• Carry out specific functions
• Found in diverse contexts
• Conserved in evolution

[Figure: Protein tyrosine kinases, showing the insulin receptor with domain labels FN3, RL, kinase, Furin-like, RL. Adapted from Robinson et al., 2000]

Protein domain databases

• Conserved Domain Database (CDD)
  – Representation: position-specific scoring matrices (PSSMs)
  – Structurally corrected local MSAs
  – CDART: Conserved Domain Architecture Retrieval Tool
• PFAM, SMART
  – Representation: hidden Markov models (HMMs)
  – Curated local MSAs
• More: see Mount, Table 9.5

Multi-domain protein example: PAX gene family

• Developmental regulatory genes that encode transcription factors
• Contain a DNA binding domain
• Early expressed during