CMU CS 10701 - Lecture

Machine Learning 10-701/15-781, Spring 2008
Machine Learning in Computational Biology
Eric Xing
Lecture 20, April 7, 2008

The Central Dogma
[Figure only.]

Genome and Proteome
[Figure only: the nucleotides A, C, G, T.]

Gene Structure in DNA
- The inference problem: predicting the locations of the genes on DNA.

Proteins are coded by DNA
- There are between 30,000 and 40,000 genes in the human genome.
- DNA -> genetic code -> protein.
- The human gene inventory corresponds to 1.5% of the genome (coding regions).

Protein Structure Hierarchy
- The inference problem: predicting the structures from sequences.
- Example sequence: APAFSVSPASGACGPECA

Genetic Polymorphisms
[Figure only.]

Genetic Demography
- Are there genetic prototypes among them?
- What are they?
- How many? How many ancestors do we have?

Computational Biology and ML
- Mixture and infinite mixture: clustering of genetic polymorphisms
- Hidden Markov Models (HMMs): gene finding
- Trees: sequence evolution
- Conditional Random Fields (CRFs): protein structure prediction

Biological Terms
- Genetic polymorphism: a difference in DNA sequence among individuals, groups, or populations.
- Single Nucleotide Polymorphism (SNP): DNA sequence variation occurring when a single nucleotide (A, T, C, or G) differs between members of the species.
  - Each variant is called an allele.
  - Almost always bi-allelic.
  - SNPs account for most of the genetic diversity among different (normal) individuals, e.g., drug response, disease susceptibility.

From SNPs to Haplotypes
- Alleles of adjacent SNPs on a chromosome form haplotypes.
- Useful in the study of disease association or genetic evolution.

Phase ambiguity of SNPs/haplotypes
[Figure: sequencing a heterozygous diploid individual with chromosomes Cp and Cm.]
- Genotype g: pairs of alleles, with the association of alleles to chromosomes unknown.
- Haplotype h = (h1, h2): possible associations of alleles to chromosomes.
- This is a mixture modeling problem.

Haplotype Inference: Why is it approachable?
- Many of the haplotypes appear many times.
- Data for many individuals allows inference.
[Figure: candidate haplotype resolutions for a set of genotypes.] The solution that uses fewer haplotypes seems better.

Finite mixture model
- The probability of a genotype g:
  $p(g) = \sum_{h_1, h_2 \in H} p(h_1, h_2) \, p(g \mid h_1, h_2)$
  where H is the population haplotype pool, $p(h_1, h_2)$ is the haplotype model, and $p(g \mid h_1, h_2)$ is the genotyping model.
- Standard settings:
  - $p(h_1, h_2) = p(h_1) \, p(h_2)$ (Hardy-Weinberg equilibrium)
  - $|H| = K$: a fixed-size population haplotype pool
- Problem: K and H are unknown.

Ancestral Inference
[Figure: graphical model with nodes A_k, H_n1, H_n2, G_n and plates over k and N.]
Essentially a clustering problem, but:
- Better recovery of the ancestors leads to better haplotyping results, because of more accurate grouping of common haplotypes.
- True haplotypes are obtainable at high cost, but they can validate the model more objectively than examining the saliency of the clustering.
- Many other biological and scientific utilities.

Being Bayesian about ...
- Population haplotype identities
- Population haplotype frequencies
- Number of population haplotypes
- Associations between population haplotypes and individual haplotypes/genotypes

A Hierarchical Bayesian Infinite Allele model
("Bayesian Haplotype Inference via the Dirichlet Process", Xing et al., ICML 2004)
[Figure: graphical model with nodes G0, G, A_k, H_n1, H_n2, G_n.]
- Assume an individual haplotype h is stochastically derived from a population haplotype $a_k$ with nucleotide-substitution frequency $\theta_k$: $h \sim p(h \mid a_k, \theta_k)$.
- Not knowing the correspondences between individual and population haplotypes, each individual haplotype is a mixture of population haplotypes.
- The number and identity of the population haplotypes are unknown: use a Dirichlet Process to construct a prior distribution G on $H \times R^J$.
- Inference: Markov Chain Monte Carlo.

Chinese Restaurant Process
- Customer i joins an occupied table k with probability $P(c_i = k \mid c_{-i}) = \frac{m_k}{i - 1 + \alpha}$, where $m_k$ is the number of customers already seated at table k, and starts a new table with probability $\frac{\alpha}{i - 1 + \alpha}$.
- The CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers.

The DP Mixture of Ancestral Haplotypes
- The customers around a table form a cluster:
  - associate a mixture component (i.e., a population haplotype) with a table;
  - sample $\{a, \theta\}$ at each table from a base measure G0 to obtain the population haplotype and nucleotide-substitution frequency for that component.
- With $p(h \mid \{A, \theta\})$ and $p(g \mid h_1, h_2)$, the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide-substitution frequencies).

Convergence of Ancestral Inference
[Figure only.]

Results on simulated data
- DP vs. Finite Mixture via EM
[Figure: individual error (axis from 0 to 0.45) for DP and EM on data sets 1 to 5.]

Results: The Gabriel data
[Figure only.]

Population structure
- Data: 256 European individuals with 103 loci.

Computational Biology and ML
- Mixture and infinite mixture: clustering of genetic polymorphisms
- HMMs: gene finding
- Trees: sequence evolution
- CRFs: protein structure prediction

cacatcgctgcgtttcggcagctaattgccttttagaaattattttcccatttcgagaaactcgtgtgggatgccggatgcggctttcaatcacttctggcccgggatcggattgggtcacattgtctgcgggctctattgtctcgatccgc ggcgcagttcgcgtgcttagcggtcagaaaggcagagattcggttcggattgatgcgctggcagcagggcacaaagatctaatgactggcaaatcgctacaaataaattaaagtccggcggctaattaatgagcggactgaagccactttgg attaaccaaaaaacagcagataaacaaaaacggcaaagaaaattgccacagagttgtcacgctttgttgcacaaacatttgtgcagaaaagtgaaaagcttttagccattattaagtttttcctcagctcgctggcagcacttgcgaatgta 
ctgatgttcctcataaatgaaaattaatgtttgctctacgctccaccgaactcgcttgtttgggggattggctggctaatcgcggctagatcccaggcggtataaccttttcgcttcatcagttgtgaaaccagatggctggtgttttggca cagcggactcccctcgaacgctctcgaaatcaagtggctttccagccggcccgctgggccgctcgcccactggaccggtattcccaggccaggccacactgtaccgcaccgcataatcctcgccagactcggcgctgataaggcccaatgtc actccgcaggcgtctatttatgccaaggaccgttcttcttcagctttcggctcgagtatttgttgtgccatgttggttacgatgccaatcgcggtacagttatgcaaatgagcagcgaataccgctcactgacaatgaacggcgtcttgtca tattcatgctgacattcatattcattcctttggttttttgtcttcgacggactgaaaagtgcggagagaaacccaaaaacagaagcgcgcaaagcgccgttaatatgcgaactcagcgaactcattgaagttatcacaacaccatatccata
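
The finite mixture model for genotypes described earlier in the lecture can be illustrated with a small Python sketch. It assumes biallelic SNPs coded 0/1, a genotype given as the count of allele 1 at each site (0, 1, or 2), Hardy-Weinberg equilibrium, and a noiseless genotyping model in which p(g | h1, h2) is 1 when the two haplotypes add up to the genotype and 0 otherwise. The function names and the toy haplotype frequencies are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
from itertools import product


def compatible_pairs(genotype):
    """Yield every ordered haplotype pair (h1, h2) consistent with a genotype.

    genotype: tuple with the count of allele 1 at each site (0, 1, or 2).
    A heterozygous site (count 1) has two possible phasings, so a genotype
    with k heterozygous sites has 2**k ordered pairs.
    """
    per_site = []
    for count in genotype:
        if count == 0:
            per_site.append([(0, 0)])
        elif count == 2:
            per_site.append([(1, 1)])
        elif count == 1:
            per_site.append([(0, 1), (1, 0)])  # phase is ambiguous here
        else:
            raise ValueError("genotype entries must be 0, 1, or 2")
    for choice in product(*per_site):
        h1 = tuple(a for a, _ in choice)
        h2 = tuple(b for _, b in choice)
        yield h1, h2


def genotype_probability(genotype, hap_freq):
    """p(g) = sum over (h1, h2) of p(h1) * p(h2) * p(g | h1, h2).

    Assumes Hardy-Weinberg equilibrium and a noiseless genotyping model,
    so only compatible pairs contribute. hap_freq maps haplotype tuples
    to population frequencies; haplotypes outside the pool have frequency 0.
    """
    return sum(
        hap_freq.get(h1, 0.0) * hap_freq.get(h2, 0.0)
        for h1, h2 in compatible_pairs(genotype)
    )


# Toy pool of K = 3 population haplotypes over 2 SNPs (made-up frequencies).
pool = {(0, 0): 0.5, (1, 1): 0.3, (0, 1): 0.2}
print(list(compatible_pairs((1, 1))))
print(genotype_probability((1, 1), pool))
```

Enumerating all phasings is exponential in the number of heterozygous sites, which is one reason practical methods rely on EM or sampling rather than direct summation.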


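The seating rule of the Chinese Restaurant Process from the lecture can be simulated in a few lines of Python. This is a minimal sketch of the prior over partitions only, not the haplotype sampler from the cited paper: it assumes a concentration parameter `alpha` > 0, and the function name `crp_assignments` is hypothetical.

```python
import random


def crp_assignments(n, alpha, seed=0):
    """Seat n customers by the Chinese Restaurant Process.

    With i customers already seated, the next customer joins occupied
    table k with probability m_k / (i + alpha) and opens a new table with
    probability alpha / (i + alpha), where m_k is the size of table k.
    Returns (assignments, counts): the table index of each customer and
    the final table sizes.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    counts = []       # counts[k] = number of customers at table k
    assignments = []  # assignments[i] = table of customer i
    for _ in range(n):
        # Unnormalized weights: existing table sizes, then alpha for a new table.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


assignments, counts = crp_assignments(n=20, alpha=1.0)
print(assignments)
print(counts)
```

Larger values of `alpha` tend to produce more tables. In the DP mixture of ancestral haplotypes, each table would correspond to one population haplotype.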