Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations

Jim Huang*, Anitha Kannan and John Winn
Probabilistic and Statistical Inference Group, Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada
Microsoft Research Cambridge, Machine Learning and Perception Group, Cambridge, UK
24/07/2007 ISMB/ECCB 2007

Slide titles:
•Outline
•Part I: Learning haplotype block structure
•Previous models for genotype data
•Probabilistic generative model for genotype data
•A probabilistic model for genotype data
•Learning the model for genotype data
•Variational inference and parameter estimation
•Predicting missing genotype data
•Prediction error for Crohn’s/5q31 data
•Comparative performance for Crohn’s/5q31 data
•Reconstructing phase
•How many ancestors?
•Establishing haplotype block boundaries
•Haplotype block structure in the ENm006 region
•Pattern usage in Chromosome 5q31
•Part II: Linking haplotype block structure and gene expression data
•A model for linking haplotype structure to quantitative trait measurements
•A Bayesian model for linking haplotype structure to quantitative measurements
•Variational Bayes for inferring relationships between haplotype blocks and quantitative measurements
•Variational Bayes updates
•Linking haplotype blocks to phenotype
•Linking haplotype blocks to gene expression
•Addressing population stratification
•Associations between haplotype blocks and gene expression
•Summary
•The road ahead…
•Acknowledgements

Outline
•Main contributions:
•Joint Bayesian modelling of genetic variation data and quantitative trait measurements
•Rich probabilistic model for genotype data
•State-of-the-art results on predicting missing genotypes

Outline
Genotype: Unordered pair of SNPs along both chromosomes
Haplotype: Ordered set of SNPs along a chromosome
Presence of recombination hotspots partitions haplotypes into blocks [Daly et al., 2001]

Part I: Learning haplotype block structure
•Our model for genotype data should:
–Account for phase & parent-child information
–Account for uncertainty in ancestral haplotypes
–Account for uncertainty in block structure
–Account for population-specific haplotype block statistics
–Allow for prior knowledge of haplotype block structure

Previous models for genotype data
•Previous methods learn a low-dimensional representation of the genotype data:
•HAPLOBLOCK (Greenspan, G. and Geiger, D. RECOMB 2003)
–Hard partitioning of data into set of haplotype blocks using low-dimensional “ancestral” haplotypes
•fastPHASE (Scheet, P. and Stephens, M. Am J Hum Genet 2006)
–Learn ancestral haplotypes from high-dimensional genotype data while accounting for uncertainty in haplotype blocks
•Jojic, N., Jojic, V. and Heckerman, D.
UAI 2004.

Probabilistic generative model for genotype data
[Diagram labels: low-dimensional latent representation; high-dimensional data; unsupervised learning via maximum likelihood]

A probabilistic model for genotype data

Learning the model for genotype data
•Maximum likelihood:
•Lower bound on log likelihood:
[Slide labels: Inference; Learning/parameter estimation]

Variational inference and parameter estimation
•Exact inference is intractable!
•Approximate the posterior distribution:
$Q(\{m_{jk}, s_{jk}, t_{jk}\}_{k=1}^{N}, c_j) = Q(\cdot)\,Q(\cdot)$
•Baum-Welch-like algorithm:
–Run the forward-backward algorithm separately on each chain of states
–Estimate transition probabilities and ancestral haplotypes given the distributions over states

Predicting missing genotype data
•Have we learned a good density model for genotype data?
•Gains from
–Accounting for uncertainty in haplotype block structure
–Accounting for uncertainty in ancestral haplotypes
–Accounting for parental relationships
•Assess model using cross-validation/test prediction error

Predicting missing genotype data
•Crohn’s/5q31 data set (Daly et al., 2001)
–Crohn’s disease data from Chromosome 5q31 containing genotypes for 129 children + 258 parents across 103 loci (phases given for children)
•For each test set, mark a fraction ρ of the data as missing
•Keep the model parameters learned from the training data fixed, then draw 1000 samples of the missing data
•Compute the fill-in error rate over the 1000 samples, for all missing data

Prediction error for Crohn’s/5q31 data

Comparative performance for Crohn’s/5q31 data

Reconstructing phase
•Run EM using 10 random initializations on the full data set
•Estimate phase from the posterior
•Compute phase error over all loci where phase is known and unambiguous and where alleles are completely observed
•Compute average and standard deviation of phase error over the 10
initializations

Reconstructing phase
Data (Daly 5q31) | Mean phase error rate | Standard deviation of phase error rate | Minimum free energy (nats)
Children with phase (phase frozen during EM) | 0.59% | 1.00% | 1.50 x 10^4
Children without phase (phase learned during EM) | 8.21% | 1.09% | 2.23 x 10^4
Children with phase + parents (phase frozen during EM) | 0.39% | 0.07% | 1.45 x 10^4
Children without phase + parents (phase learned during EM) | 9.51% | 1.78% | 1.36 x 10^4

How many ancestors?

Establishing haplotype block boundaries
•Define the recombination prior γ on transition probabilities
–Different values of γ correspond to different “blockiness” of the data
•For each locus k, compute the probability of transition p_k
–Set a threshold t on p_k to establish block boundaries
•Once blocks are defined, assign block labels l_b = (m, n)

Establishing haplotype block boundaries
[Figure labels: Smaller number of larger blocks…; Larger number of smaller blocks…]

Haplotype block structure in the ENm006 region
•573 SNP markers for 270 individuals from 3 sub-populations:
–90 Yoruba individuals (30 parent-parent-offspring trios) from
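
The thresholding step on the block-boundary slide above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `block_boundaries`, the convention that `p[k]` is the transition probability between locus k-1 and locus k, and the reading of a block label as a (first locus, last locus) pair are all assumptions made for illustration, and the probabilities are made-up numbers.

```python
from typing import List, Sequence, Tuple


def block_boundaries(p: Sequence[float], t: float) -> List[Tuple[int, int]]:
    """Split loci 0..len(p)-1 into contiguous blocks.

    Assumption: p[k] is the probability of a transition between locus k-1
    and locus k (p[0] is ignored). A boundary is placed wherever p[k] > t.
    Returns one (first_locus, last_locus) pair per block.
    """
    if not p:
        return []
    blocks = []
    start = 0
    for k in range(1, len(p)):
        if p[k] > t:
            # A likely transition before locus k closes the current block.
            blocks.append((start, k - 1))
            start = k
    blocks.append((start, len(p) - 1))
    return blocks


# Made-up per-locus transition probabilities for seven loci.
p_k = [0.0, 0.02, 0.01, 0.60, 0.03, 0.45, 0.02]
print(block_boundaries(p_k, t=0.5))  # [(0, 2), (3, 6)]
print(block_boundaries(p_k, t=0.3))  # [(0, 2), (3, 4), (5, 6)]
```

In this sketch, raising the threshold t gives a smaller number of larger blocks and lowering it gives a larger number of smaller blocks.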
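
The missing-genotype evaluation described in the slides (hide a fraction ρ of the data, draw 1000 samples of the hidden entries, report the fill-in error rate) can be sketched as follows. This is a hedged illustration of the bookkeeping only: the talk samples from its learned haplotype model, whereas `locus_frequency_sampler` here is a deliberately simple stand-in that draws from the observed genotypes at the same locus. The function names, the 0/1/2 genotype coding and the toy data are all hypothetical, and the input is assumed to be non-empty.

```python
import random


def fill_in_error_rate(truth, rho, sampler, n_samples=1000, seed=0):
    """Hide a fraction rho of the entries in truth, sample fill-ins, and
    return the fraction of sampled values that differ from the true ones."""
    rng = random.Random(seed)
    cells = [(i, k) for i, row in enumerate(truth) for k in range(len(row))]
    hidden = rng.sample(cells, max(1, round(rho * len(cells))))
    hidden_set = set(hidden)
    # Copy of the data with hidden entries replaced by None.
    masked = [
        [None if (i, k) in hidden_set else g for k, g in enumerate(row)]
        for i, row in enumerate(truth)
    ]
    errors = 0
    for _ in range(n_samples):
        for i, k in hidden:
            if sampler(masked, i, k, rng) != truth[i][k]:
                errors += 1
    return errors / (n_samples * len(hidden))


def locus_frequency_sampler(masked, i, k, rng):
    """Stand-in sampler (not the talk's model): draw a genotype uniformly
    from the observed genotypes at locus k, falling back to 0."""
    observed = [row[k] for row in masked if row[k] is not None]
    return rng.choice(observed) if observed else 0


# Toy data: 6 individuals x 4 loci, genotypes coded 0, 1 or 2 (hypothetical).
toy = [
    [0, 1, 2, 0],
    [0, 1, 2, 1],
    [0, 2, 2, 0],
    [1, 1, 2, 0],
    [0, 1, 1, 0],
    [0, 1, 2, 0],
]
print(fill_in_error_rate(toy, rho=0.25, sampler=locus_frequency_sampler))
```

Swapping `locus_frequency_sampler` for a sampler backed by a fitted model would leave the error-rate computation unchanged, which is the reason the sampler is passed in as an argument.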