Machine Learning 10-701/15-781, Spring 2008
Machine Learning in Computational Biology
Eric Xing
Lecture 20, April 7, 2008
Reading:

The Central Dogma
(figure)

Genome and Proteome
(figure: the nucleotides A, C, G, T)

Gene Structure in DNA
- The inference problem: predicting the locations of the genes on DNA

Proteins are coded by DNA
- There are between 30,000 and 40,000 genes in the human genome
- DNA -> (genetic code) -> protein
- The human gene inventory corresponds to 1.5% of the genome (coding regions)

Protein Structure Hierarchy
- The inference problem: predicting the structures from sequences, e.g. APAFSVSPASGACGPECA

Genetic Polymorphisms
(figure)

Genetic Demography
- Are there genetic prototypes among them?
- What are they?
- How many? (How many ancestors do we have?)

Computational Biology and ML
- Mixture and infinite mixture: clustering of genetic polymorphisms
- Hidden Markov Models: gene finding
- Trees: sequence evolution
- Conditional Random Fields: protein structure prediction

Biological Terms
- Genetic polymorphism: a difference in DNA sequence among individuals, groups, or populations
- Single Nucleotide Polymorphism (SNP): DNA sequence variation occurring when a single nucleotide (A, T, C, or G) differs between members of the species
  - Each variant is called an "allele"
  - Almost always bi-allelic
  - SNPs account for most of the genetic diversity among different (normal) individuals, e.g. drug response, disease susceptibility

From SNPs to Haplotypes
- Alleles of adjacent SNPs on a chromosome form haplotypes
- Useful in the study of disease association or genetic evolution

Phase ambiguity of SNPs/haplotypes
Figure: sequencing the two chromosomes (Cp and Cm) of a heterozygous diploid
individual gives the genotype: pairs of alleles (TC, TG, AA in the figure), with the association of alleles to chromosomes unknown.
- Haplotype $h \equiv (h_1, h_2)$: possible associations of alleles to chromosome
- This is a mixture modeling problem!

Haplotype Inference
Why is it approachable?
- Many of the haplotypes appear many times
- Data for many individuals allows inference
(figure: candidate haplotype resolutions of the observed genotypes)
The solution that uses fewer haplotypes seems better.

Finite mixture model
- The probability of a genotype $g$:
  $$p(g) = \sum_{h_1, h_2 \in \mathcal{H}} p(h_1, h_2)\, p(g \mid h_1, h_2)$$
  where $\mathcal{H}$ is the population haplotype pool, $p(h_1, h_2)$ is the haplotype model, and $p(g \mid h_1, h_2)$ is the genotyping model.
  (graphical model: $H_{n_1}, H_{n_2} \to G_n$)
- Standard settings:
  - $p(h_1, h_2) = p(h_1)\, p(h_2)$ (Hardy-Weinberg equilibrium)
  - $|\mathcal{H}| = K$: fixed-sized population haplotype pool
- Problem: $K$? $\mathcal{H}$?

Ancestral Inference
(graphical model: $\theta_k$, $A_k$, $H_{n_1}$, $H_{n_2}$, $G_n$, with a plate over the $N$ individuals)
Essentially a clustering problem, but:
- Better recovery of the ancestors leads to better haplotyping results (because of more accurate grouping of common haplotypes)
- True haplotypes are obtainable at high cost, but they can validate the model more objectively (as opposed to examining the saliency of the clustering)
- Many other biological/scientific utilities

Being Bayesian about:
- Population haplotype identities
- Population haplotype frequencies
- Number of population haplotypes
- Associations between population haplotype and individual haplotype/genotype

A Hierarchical Bayesian Infinite Allele model
(Bayesian Haplotype Inference via the Dirichlet Process, Xing et al., ICML 2004)
(graphical model: $G_0$, $\alpha$, $G$, $\theta_k$, $A_k$, $H_{n_1}$, $H_{n_2}$, $G_n$)
Assume an individual haplotype $h$ is stochastically derived from a population haplotype $a_k$ with nucleotide-substitution frequency $\theta_k$: $h \sim p(h \mid a_k, \theta_k)$.
Not knowing the correspondences between individual and population haplotypes, each individual haplotype is a mixture of population haplotypes.
The number and identity of the population haplotypes are unknown: use a Dirichlet Process to construct a
prior distribution $G$ on $\mathcal{H} \times \mathbb{R}^J$.
Inference: Markov Chain Monte Carlo

Chinese Restaurant Process
$$P(c_i = k \mid \mathbf{c}_{-i}) = \begin{cases} \dfrac{m_k}{i + \alpha - 1} & \text{for an existing table } k \text{ with } m_k \text{ customers} \\[2ex] \dfrac{\alpha}{i + \alpha - 1} & \text{for a new table} \end{cases}$$
For example, customer 1 sits at table 1 with probability 1; customer 2 joins table 1 with probability $\frac{1}{1+\alpha}$ or opens table 2 with probability $\frac{\alpha}{1+\alpha}$.
- The CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers

The DP Mixture of Ancestral Haplotypes
- The customers around a table form a cluster
  - associate a mixture component (i.e., a population haplotype) with a table
  - sample $\{a, \theta\}$ at each table from a base measure $G_0$ to obtain the population haplotype and nucleotide substitution frequency for that component
(figure: customers 1 to 9 seated at tables, each table labeled with its $\{A, \theta\}$)
- With $p(h \mid \{A, \theta\})$ and $p(g \mid h_1, h_2)$, the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies)

Convergence of Ancestral Inference
(figure)

Results on simulated data
- DP vs. Finite Mixture via EM
(figure: individual error, axis from 0 to 0.45, on data sets 1 to 5; series: DP, EM)

Results: the Gabriel data
(figure)

Population structure
- DATA: 256 European individuals with 103 loci
(figure)

Computational Biology and ML
- Mixture and infinite mixture: clustering of genetic polymorphisms
- HMMs: gene finding
- Trees: sequence evolution
- CRFs: protein structure prediction

cacatcgctgcgtttcggcagctaattgccttttagaaattattttcccatttcgagaaactcgtgtgggatgccggatgcggctttcaatcacttctggcccgggatcggattgggtcacattgtctgcgggctctattgtctcgatccgc ggcgcagttcgcgtgcttagcggtcagaaaggcagagattcggttcggattgatgcgctggcagcagggcacaaagatctaatgactggcaaatcgctacaaataaattaaagtccggcggctaattaatgagcggactgaagccactttgg attaaccaaaaaacagcagataaacaaaaacggcaaagaaaattgccacagagttgtcacgctttgttgcacaaacatttgtgcagaaaagtgaaaagcttttagccattattaagtttttcctcagctcgctggcagcacttgcgaatgta
ctgatgttcctcataaatgaaaattaatgtttgctctacgctccaccgaactcgcttgtttgggggattggctggctaatcgcggctagatcccaggcggtataaccttttcgcttcatcagttgtgaaaccagatggctggtgttttggca cagcggactcccctcgaacgctctcgaaatcaagtggctttccagccggcccgctgggccgctcgcccactggaccggtattcccaggccaggccacactgtaccgcaccgcataatcctcgccagactcggcgctgataaggcccaatgtc actccgcaggcgtctatttatgccaaggaccgttcttcttcagctttcggctcgagtatttgttgtgccatgttggttacgatgccaatcgcggtacagttatgcaaatgagcagcgaataccgctcactgacaatgaacggcgtcttgtca tattcatgctgacattcatattcattcctttggttttttgtcttcgacggactgaaaagtgcggagagaaacccaaaaacagaagcgcgcaaagcgccgttaatatgcgaactcagcgaactcattgaagttatcacaacaccatatccata
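
Returning to the finite mixture model from earlier in the lecture: the sum over haplotype pairs can be made concrete with a small sketch. The Python below is illustrative only and rests on assumptions that are not in the slides: a noiseless genotyping model (p(g | h1, h2) is 1 when the pair is compatible with the genotype and 0 otherwise), Hardy-Weinberg equilibrium, and a made-up frequency table. The names `compatible_pairs`, `genotype_probability`, and `hap_freq` are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """Enumerate ordered haplotype pairs (h1, h2) consistent with a genotype.

    genotype: list of 2-character strings, one unordered allele pair per
    SNP site, e.g. ["TC", "TG", "AA"].
    """
    # At each site, either allele may sit on the first chromosome.
    # A homozygous site (e.g. "AA") contributes a single choice.
    site_choices = [{(a, b), (b, a)} for a, b in genotype]
    pairs = set()
    for choice in product(*site_choices):
        h1 = "".join(c[0] for c in choice)
        h2 = "".join(c[1] for c in choice)
        pairs.add((h1, h2))
    return sorted(pairs)


def genotype_probability(genotype, hap_freq):
    """p(g) = sum over (h1, h2) of p(h1) p(h2) p(g | h1, h2).

    Assumes Hardy-Weinberg equilibrium, p(h1, h2) = p(h1) p(h2), and a
    noiseless genotyping model, so only compatible pairs contribute.
    hap_freq maps haplotype strings to population frequencies
    (haplotypes missing from the pool have frequency 0).
    """
    return sum(
        hap_freq.get(h1, 0.0) * hap_freq.get(h2, 0.0)
        for h1, h2 in compatible_pairs(genotype)
    )


# Hypothetical pool of K = 3 population haplotypes.
hap_freq = {"TTA": 0.5, "CGA": 0.3, "TGA": 0.2}
genotype = ["TC", "TG", "AA"]
pairs = compatible_pairs(genotype)
p_g = genotype_probability(genotype, hap_freq)
```

With two heterozygous sites the genotype has four ordered phasings, and only (TTA, CGA) and its mirror image have nonzero probability under this made-up pool, so `p_g` comes to 2 x 0.5 x 0.3 = 0.3. This is the phase ambiguity in miniature: the pool frequencies decide which phasing is plausible.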
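
The Chinese restaurant process seating rule from the lecture can be simulated in a few lines. The sketch below only draws a partition from the CRP prior: customer i joins an existing table k with probability m_k / (i + alpha - 1) and opens a new table with probability alpha / (i + alpha - 1). It does not implement the haplotype model or the MCMC inference from the lecture. The function name `sample_crp` and the `seed` parameter are illustrative assumptions.

```python
import random


def sample_crp(n_customers, alpha, seed=0):
    """Draw one seating arrangement from a Chinese restaurant process.

    Returns (assignments, counts), where assignments[i] is the table index
    of customer i + 1 and counts[k] is m_k, the number of customers at
    table k. Requires alpha > 0.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = m_k for each occupied table
    assignments = []
    for _ in range(n_customers):
        # Unnormalized weights: m_k for each existing table, alpha for a
        # new one. They sum to (i - 1) + alpha for the i-th customer,
        # which matches the denominator of the CRP rule.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join existing table k
        assignments.append(k)
    return assignments, counts


assignments, counts = sample_crp(n_customers=20, alpha=1.0)
```

In the DP mixture of ancestral haplotypes, each table would additionally carry a population haplotype and substitution frequency drawn from the base measure G0. That step is omitted here because the slides do not specify G0.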