Calculating midterm scores
511/495
• Homework: average of top two scores
• Total: (0.6*midterm + 0.4*homework)/100
711/856
• Homework: average of top two scores
• Total: (0.6*midterm + 0.4*homework + 0.1*Lit)/110

[Figure: Midterm Grades]

[Course topic map: Pairwise sequence alignment (global and local); Multiple sequence alignment (local, global); Substitution matrices; Database searching (BLAST); Sequence statistics; Evolutionary tree reconstruction; RNA structure prediction; Gene finding; Protein structure prediction; Computational genomics; …]

Likelihood of MSA
[Figure: an MSA with rows ...TCAGG..., ...TGTCG..., ...TGACG..., ...TCCGA...; a tree T with leaves x1 to x4 (states C, A, A, T at one site) and internal nodes x5, x6]

$P(\mathrm{MSA} \mid T, l, x) = \prod_{i=1}^{n} P(\mathrm{site}_i \mid T, l, x)$

Assumptions:
• Sites are independent: score each site separately
• Lineages are independent (Markov property): compute each branch separately

Maximum Likelihood Estimation for Phylogeny Reconstruction
Note we need to consider
• All sites: O(n)
• All trees: O(T_rooted(k))
• All combinations of internal labels: O(|Σ|^k)
• All branch lengths: O(k) branches
Branch lengths are estimated numerically

Maximum Likelihood Estimation for Phylogeny Reconstruction
• Computationally intensive
• Consistent (more data, better estimation)
• If the evolutionary model is a reversible Markov chain (e.g., JC), then the MLE distance matrix converges to additive. Neighbor Joining is a consistent method (Farach and Kannan, 96)
• Note that parsimony is not consistent.

Selecting data for tree reconstruction
• For reconstructing recent events, use DNA sequences
• For reconstructing distant events, use amino acid sequences
• Select sequences that
  – Are present in all taxa
  – Contain a conserved region
  – Exhibit variation within that region
  – e.g., ribosomal (16S rRNA) genes were used to reconstruct the tree of life.
    These genes encode products used in all organisms, from bacteria to mammals.
• Pitfalls: duplicated genes, horizontal gene transfer, mosaic genes.

Comparison of Phylogeny Reconstruction Methods
• Parsimony
  – Selection dominates, e.g., ribosomal genes
  – Exhaustive or heuristic search, branch and bound
• Distance
  – Neutral mutation dominates, e.g., immunoglobulin sequences
  – Exhaustive or heuristic search, greedy methods
  – Neighbor Joining finds the correct tree in quadratic time if the data is additive
  – UPGMA finds the correct tree in quadratic time if the data is ultrametric
• Maximum Likelihood
  – Neutral mutation dominates, e.g., immunoglobulin sequences
  – Exhaustive or heuristic search

                              Parsimony   Distance   Max Likelihood
Data                          Character   Distance   Character
NP-complete                   Yes         Yes        Yes
Topology                      Yes         Yes        Yes
Branch lengths                Yes         Yes        Prob
Ancestral states              Yes         No         Prob
DNA                           Yes         Yes        Yes
Amino acids                   Yes         Yes        Very slow
Consistent                    No          Yes        Yes
Model of mutational change    No          Yes        Yes

Bootstrapping, Branches and Partitions
[Tree figures over the taxa HA1, HA2, CA1, CA2, MA1, FA]
• Every edge partitions a tree into two groups of taxa
  – (CA2 HA2) (CA1 MA1 HA1 FA)
  – (MA1 HA1) (CA1 CA2 HA2 FA)
• These two trees are different, but they share a partition
  – (MA1 HA1 CA1 FA) (CA2 HA2)
• Neither of these partitions exists in the other tree
  – (MA1 CA1) (CA2 HA2 FA HA1)
  – (MA1 HA1 CA1) (CA2 HA2 FA)

Bootstrapping a gene tree
• For i = 1 to N
  – Construct MSA' by sampling columns from the original MSA with replacement
  – Construct a new tree, t', from MSA'
  – Tabulate the partitions in t'
• For every partition, p, in the original tree, score(p) = (the number of observations of p)/N
[Tree figure: taxa HA2, CA1, HA1, MA1, CA2, FA, with bootstrap scores 100, 99 and 57]
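The bootstrap procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: trees are represented as nested tuples of taxon names, `build_tree` is a hypothetical caller-supplied function (for example a Neighbor Joining routine) that maps an MSA to such a tree, and `TREE_A` and `TREE_B` are hypothetical topologies chosen only to be consistent with the partitions listed on the slides.

```python
import random
from collections import Counter


def leaves(tree):
    """Return the set of taxon names below a node (a str leaf or a tuple of children)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def partitions(tree):
    """Return the nontrivial bipartitions of the taxa induced by the tree's edges."""
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        # Keep only splits with at least two taxa on each side.
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def resample_columns(msa, rng):
    """Build MSA' by sampling alignment columns with replacement."""
    n_cols = len(msa[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in msa]


def bootstrap_scores(msa, build_tree, n_replicates, seed=0):
    """score(p) = (number of replicate trees containing p) / N, for each p in the original tree."""
    rng = random.Random(seed)
    original = partitions(build_tree(msa))
    counts = Counter()
    for _ in range(n_replicates):
        replicate_tree = build_tree(resample_columns(msa, rng))
        counts.update(partitions(replicate_tree))
    return {p: counts[p] / n_replicates for p in original}


# Hypothetical example trees (assumed topologies, not taken from the figures).
TREE_A = (("CA2", "HA2"), ("HA1", "FA"), ("MA1", "CA1"))
TREE_B = (("CA2", "HA2"), "FA", ("CA1", ("MA1", "HA1")))
shared = partitions(TREE_A) & partitions(TREE_B)
```

With these two assumed trees, `shared` contains only the split (CA2 HA2) versus (CA1 MA1 HA1 FA). Storing each partition as an unordered pair of taxon sets means the same split compares equal whichever side of the edge it was found from.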
Applications of Local MSA
Conserved patterns in biological sequences
Example: Transcription factor binding sites
SP ...gcttt AATTTTCACTATATACTATAA cgatt...
ST ...cagat ATAAATGATATAGTGGTTATA gttaa...
ST ...atctt TTTTATTATTAAATCGTATTA gcagc...
EC ...aggct ATAAATGATATAGTGGTTATA gttag...
EC ...acctt TTTTATTATTAAATCGTATTA gtcac...
VC ...ttata ACTAATAATTATAAAATATGT gtgtc...
YP ...gctga TGAAATGATATAATCGTTATA taaga...

…agcgagcctgagcactcgaggcatctctgcacattcagcatgggatgggcctcctgtccctgtatgcgcctgatga…
[Figure: gene region with polymerase, tf, introns, promoter, and transcription factor binding sites labeled]

[Figure: Some known binding site motifs]

Applications of Local MSA
Conserved patterns in biological sequences
Example: Protein domains
• Fold independently
• Carry out specific functions
• Found in diverse contexts
• Conserved in evolution
[Figure: Protein Tyrosine Kinases; insulin receptor domain architecture with RL, Furin-like, RL, FN3 and kinase domains. Adapted from Robinson et al., 2000]

Protein domain databases
• Conserved Domain Database (CDD)
  – Representation: position specific scoring matrices (PSSMs)
  – Structurally corrected local MSAs
  – CDART: Conserved Domain Architecture Retrieval Tool
• PFAM, SMART
  – Representation: Hidden Markov Models (HMMs)
  – Curated local MSAs
More: see Mount, Table 9.5

Pax structure
http://www.gene-regulation.com/info/pax.html
[Figure: Pax domain architecture, showing the paired domain and the homeodomain]

Local multiple sequence alignment
• Discovery: identifying conserved patterns in multiple sequences
• Modeling: constructing probabilistic models of local MSAs
• Recognition: finding new instances of known patterns (using those models)

... RLSKIISMFQAHIRGYLIRKAYKRGYQARCLLK ...
... RNKHAIAVIWAFWLVQSSFRGYQAGSKARRELK ...
... GWQKRVRGWIVIVRRNFKKKRNEKLSATAZZZZZYQ ...
... MKRSQVVKQEKAARKVQKFWRGHRVQHNQR ...
... QEEVSAIIIQRAYRRYLLKQKVKILRVQSS ...
... RLSKIISMIQAHIRGYLIRKAYKRGYQARCLLK ...
... RNKHAIAVIWAFWLVQSSFRGYQAGSKARRELK ...
... GWIQKRVRGWIVIRRNFKKKRNEKLSATAZZZZZYQ ...
... MKRSQVVKQEKAARKIQKFWRGHRVQHNQR ...
... QEEVSAIIIQRAYRRYLLKQKVKILRVQSS ...
[Figure labels: Discovery, Modeling]
.. GWQKRVRGWIVIVRRNQVNQAAVTIQRWYRCQVQRRRAGFKKKRNEKLSATAZZZZZ
[Figure label: Recognition]

Local Multiple Sequence Alignment: Probabilistic Framework
• Discovery
  – Given multiple sequences, often unaligned, find a conserved pattern (e.g., the Pax domain)
• Representation
  – Given a local MSA for the Pax domain, construct a probabilistic model
• Recognition (using model)
  – Given a new
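
As a rough illustration of the representation and recognition steps, the sketch below builds a log-odds position specific scoring matrix (PSSM) from an ungapped local MSA block and slides it along a new sequence. The details are assumptions made for this example only: a DNA alphabet, a uniform background distribution, a pseudocount of 1 per symbol, and a toy alignment block and query sequence that are hypothetical rather than taken from the slides.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA; a protein model would use the 20 amino acids


def build_pssm(block, pseudocount=1.0):
    """Log-odds PSSM from an ungapped block of equal-length aligned strings."""
    width = len(block[0])
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    pssm = []
    for j in range(width):
        column = [seq[j] for seq in block]
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            a: math.log2(((column.count(a) + pseudocount) / total) / background)
            for a in ALPHABET
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position log-odds scores for one window of the sequence."""
    return sum(column[symbol] for column, symbol in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical toy data for illustration.
block = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pssm = build_pssm(block)
score, start = best_hit(pssm, "GGCGTATAATGCGC")
```

Here the best window starts at index 4, where the query contains the block's consensus TATAAT. The pseudocount keeps symbols never observed in a column from receiving a score of minus infinity.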