Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations

Jim Huang
Probabilistic and Statistical Inference Group, Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada

Anitha Kannan and John Winn
Microsoft Research Cambridge, Machine Learning and Perception Group, Cambridge, UK

ISMB/ECCB 2007, 24/07/2007


Outline

Main contributions:
- Joint Bayesian modelling of genetic variation data and quantitative trait measurements
- Rich probabilistic model for genotype data
- State-of-the-art results on predicting missing genotypes


Outline

- Genotype: unordered pair of SNPs along both chromosomes
- Haplotype: ordered set of SNPs along a chromosome
- Presence of recombination hotspots partitions haplotypes into blocks (Daly et al., 2001)


Part I: Learning haplotype block structure

Our model for genotype data should:
- Account for phase/parent-child information
- Account for uncertainty in ancestral haplotypes
- Account for uncertainty in block structure
- Account for population-specific haplotype block statistics
- Allow for prior knowledge of haplotype block structure


Previous models for genotype data

Previous methods learn a low-dimensional representation of the genotype data:
- HAPLOBLOCK (Greenspan G. and Geiger D., RECOMB 2003)
- fastPHASE (Scheet P. and Stephens M., Am J Hum Genet 2006)
- Hard partitioning of data into a set of haplotype blocks using low-dimensional ancestral haplotypes
- Learn ancestral haplotypes from high-dimensional genotype data while accounting for uncertainty in haplotype blocks (Jojic N., Jojic V. and Heckerman D., UAI 2004)


Probabilistic generative model for genotype data

[Figure labels: low-dimensional latent representation; unsupervised learning via maximum likelihood; high-dimensional data]


A probabilistic model for genotype data

[Figure: graphical model for genotype data]


Learning the model for genotype data

- Maximum likelihood
- Lower bound on log-likelihood
- Inference
- Learning: parameter estimation


Variational inference and parameter estimation

- Exact inference is intractable
- Approximate the posterior distribution
- Baum-Welch-like algorithm:
  - Run the forward-backward algorithm separately on each chain of states
  - Estimate transition probabilities and ancestral haplotypes given distributions over states


Predicting missing genotype data

- Have we learned a good density model for genotype data?
- Gains from:
  - Accounting for uncertainty in haplotype block structure
  - Accounting for uncertainty in ancestral haplotypes
  - Accounting for parental relationships
- Assess the model using cross-validation test prediction error


Predicting missing genotype data: Crohn's 5q31 data set (Daly et al., 2001)

- Crohn's disease data from chromosome 5q31 containing genotypes for 129 children + 258 parents across 103 loci (phases given for children)
- For each test set, make a fraction of the data missing
- Retain the model parameters learned from the training data, then draw 1000 samples over the missing data
- Compute the fill-in error rate over the 1000 samples for all missing data


Prediction error for Crohn's 5q31 data

[Figure: prediction error for Crohn's 5q31 data]


Comparative performance for Crohn's 5q31 data

[Figure: comparative performance for Crohn's 5q31 data]


Reconstructing phase

- Run EM using 10 random initializations on the full data set
- Estimate phase from the posterior
- Compute phase error over all loci where phase is known/unambiguous and where alleles are completely observed
- Compute the average and standard deviation of phase error over the 10 initializations


Reconstructing phase

Daly 5q31 data                                          Mean phase    Std. dev. of      Minimum free
                                                        error rate    phase error rate  energy (nats)
Children w/ phase (phase frozen during EM)              0.59          1.00              1.50 x 10^4
Children w/out phase (phase learned during EM)          8.21          1.09              2.23 x 10^4
Children w/ phase + parents (phase frozen during EM)    0.39          0.07              1.45 x 10^4
Children w/out phase + parents (phase learned in EM)    9.51          1.78              1.36 x 10^4


How many ancestors?

[Figure: choosing the number of ancestors]


Establishing haplotype block boundaries

- Define the recombination prior on transition probabilities
- Different prior settings correspond to different "blockiness" of the data
- For each locus k, can compute the probability of transition p_k
- Can establish a threshold t and establish block boundaries
- Once blocks are defined, can assign block labels l_b (indexed by m, n)


Establishing haplotype block boundaries

[Figure panels: smaller number of larger blocks; larger number of smaller blocks]


Haplotype block structure in the ENm006 region

573 SNP markers for 270 individuals from 3 subpopulations:
- 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)
- 90 individuals (30 trios) of European descent from Utah (CEU)
- 45 Han Chinese individuals from Beijing (CHB) and 45 Japanese individuals from Tokyo (JPT), grouped as CHB+JPT


Pattern usage in Chromosome 5q31

[Figure: pattern usage in Chromosome 5q31]


Part II: Linking haplotype block structure and gene expression data


A model for linking haplotype structure to quantitative trait measurements

[Figure: for each haplotype block (blocks 1 and 2 shown), the block labels (Label 1 to Label 4) of Individuals 1 to 5, a relevance variable, a latent block profile, and the observed quantitative trait profile]


A Bayesian model for linking haplotype structure to quantitative measurements

[Figure: graphical model with plates over blocks b = 1, ..., B, quantitative traits g = 1, ..., G, and individuals j = 1, ..., J. Nodes: block label, relevance variable w_bg, latent block profile, observed trait z_gj, noise precision. In the gene expression setting the traits are genes g = 1, ..., G and z_gj is the observed gene expression.]
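
The Baum-Welch-like step in Part I runs the forward-backward algorithm separately on each chain of states. The sketch below illustrates that step for a single chain only. It is not the authors' implementation: it assumes a discrete state per locus, locus-specific transition matrices, and precomputed observation likelihoods, and the names forward_backward, init, trans and lik are hypothetical.

```python
import numpy as np

def forward_backward(init, trans, lik):
    """Posterior state marginals for one chain (illustrative sketch).

    init:  (S,) initial state distribution (assumed input)
    trans: (K-1, S, S), trans[k, a, b] = P(state b at locus k+1 | state a at locus k)
    lik:   (K, S) likelihood of the observation at locus k under each state
    Returns gamma of shape (K, S): posterior over states at each locus.
    """
    K, S = lik.shape
    alpha = np.zeros((K, S))
    beta = np.ones((K, S))

    # Forward pass, renormalised at every locus to avoid numerical underflow.
    alpha[0] = init * lik[0]
    alpha[0] /= alpha[0].sum()
    for k in range(1, K):
        alpha[k] = (alpha[k - 1] @ trans[k - 1]) * lik[k]
        alpha[k] /= alpha[k].sum()

    # Backward pass, renormalised in the same way.
    for k in range(K - 2, -1, -1):
        beta[k] = trans[k] @ (lik[k + 1] * beta[k + 1])
        beta[k] /= beta[k].sum()

    # Per-locus marginals are proportional to alpha * beta.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

In a variational scheme of the kind described in the slides, the marginals returned for one chain would be held fixed while the other chains are updated, and the resulting distributions over states would then feed the estimates of transition probabilities and ancestral haplotypes.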
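
The block-boundary procedure in Part I (compute a transition probability p_k per locus, apply a threshold t, then label the blocks) can be sketched as follows. This is an illustration under stated assumptions rather than the method used in the talk: it assumes p_k is the probability of a transition between locus k and locus k+1, that a boundary is declared whenever p_k exceeds t, and the function name block_boundaries is hypothetical.

```python
import numpy as np

def block_boundaries(p_transition, t):
    """Assign a block index to every locus by thresholding transition probabilities.

    p_transition: length K-1 sequence; entry k is the (assumed) probability of a
                  transition between locus k and locus k+1.
    t:            threshold; a block boundary is placed wherever p_transition > t.
    Returns an integer array of length K with the block index of each locus.
    """
    p = np.asarray(p_transition, dtype=float)
    is_boundary = p > t
    # Locus 0 opens block 0; each boundary starts a new block at the next locus.
    return np.concatenate(([0], np.cumsum(is_boundary)))

# Six loci with likely transitions after locus 1 and after locus 4.
labels = block_boundaries([0.01, 0.9, 0.02, 0.03, 0.8], t=0.5)
print(labels.tolist())  # [0, 0, 1, 1, 1, 2]
```

In this sketch, lowering t turns more loci into boundaries and so produces a larger number of smaller blocks, while raising t produces a smaller number of larger blocks.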