BMI 731 - Winter 2004
Haplotype reconstruction

Catalin Barbacioru
Department of Biomedical Informatics
Ohio State University

The problem

We start with a collection of genotypes in the form of allelic determinations at tightly linked single nucleotide polymorphisms (SNPs) for each of a set of n individuals. For example, we might describe 3 SNPs as follows:

Name    SNP alleles (major, minor)
SNP1    T, A
SNP2    A, G
SNP3    C, G

An individual might have genotype AT at SNP1, AA at SNP2, and CG at SNP3, which we will denote by AT//AA//CG. Possible haplotype pairs for this person are AAC/TAG and AAG/TAC, and without further information, we can't distinguish between these two pairs.

The problem, cont.

What can be done? With information on the individual's parents, we can usually infer the haplotypes, the only problem being that the parents may not be fully informative. For example, if the maternal and paternal genotypes were TA//AA//CC and TT//AA//CG respectively, at SNPs 1, 2 and 3, and the individual is AT//AA//CG, then it would be clear that the haplotypes were AAC/TAG (why?). On the other hand, if the parents both had genotypes AT//AA//CG, then we wouldn't be able to determine unique haplotypes for the individual.
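
The parental example can be checked mechanically. The Python sketch below enumerates the ordered haplotype pairs compatible with the child's genotype and keeps those whose first haplotype could have come from the mother and whose second could have come from the father, checking alleles locus by locus. The function names (parse, ordered_pairs, could_transmit, phase_with_parents) and the "AT//AA//CG" string encoding are assumptions made for illustration; they do not come from the slides.

```python
from itertools import product

def parse(genotype):
    """Split 'AT//AA//CG' into per-SNP allele pairs: [('A','T'), ('A','A'), ('C','G')]."""
    return [tuple(site) for site in genotype.split("//")]

def ordered_pairs(genotype):
    """All ordered (first, second) haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    # At each SNP, either allele may sit on the first haplotype.
    for choice in product(*[((a, b), (b, a)) for a, b in parse(genotype)]):
        first = "".join(x for x, _ in choice)
        second = "".join(y for _, y in choice)
        pairs.add((first, second))
    return pairs

def could_transmit(parent_genotype, haplotype):
    """True if the parent carries every allele of the haplotype, locus by locus."""
    return all(allele in site
               for allele, site in zip(haplotype, parse(parent_genotype)))

def phase_with_parents(child, mother, father):
    """Unordered haplotype pairs of the child that are consistent with both parents."""
    resolved = set()
    for maternal, paternal in ordered_pairs(child):
        if could_transmit(mother, maternal) and could_transmit(father, paternal):
            resolved.add(tuple(sorted((maternal, paternal))))
    return resolved

# Informative parents: only AAC/TAG survives.
print(phase_with_parents("AT//AA//CG", "TA//AA//CC", "TT//AA//CG"))
# {('AAC', 'TAG')}

# Both parents AT//AA//CG: both pairs remain, so the phase is still ambiguous.
print(sorted(phase_with_parents("AT//AA//CG", "AT//AA//CG", "AT//AA//CG")))
# [('AAC', 'TAG'), ('AAG', 'TAC')]
```

In the first case the mother is CC at SNP3 and the father is TT at SNP1, so the child's A at SNP1 and C at SNP3 must travel together on the maternal haplotype.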
Even in the first case, we might have to make an assumption about the frequency of recombination: what is it? Our problem here is to determine haplotypes, or make good guesses at them, without parental genotypes.

Origin of the problem

Why do we want to determine haplotypes for individuals at tightly linked SNP loci?

a) Haplotypes are more powerful discriminators between cases and controls in disease association studies. Why?
b) With haplotypes we can conduct evolutionary studies.
c) Use of haplotypes in disease association studies reduces the number of tests to be carried out, and hence the penalty for multiple testing. Is this the same point as a)?

Two aspects of the problem

With a random sample of multilocus genotypes at a set of SNPs, we can attempt
a) to estimate the frequencies of all possible haplotypes, and
b) to infer the haplotypes of all individuals.

The first step on this problem was taken by A. Clark in 1990. He gave what we might call a parsimony solution to b) above.

It goes like this. With a reasonable sample size, we might expect to have some individuals homozygous at every locus, e.g. TT//AA//CC, or heterozygous at just one locus, e.g. TT//AA//CG. An individual of the former type unambiguously identifies one haplotype present in the population (TAC), and an individual of the latter type identifies two (TAC and TAG). The algorithm begins by finding all homozygotes and single-SNP heterozygotes and tallying the resulting known haplotypes.

Now proceed as follows. For each known haplotype, look at all remaining unresolved cases, and ask whether the known haplotype can be made from some combination of alleles at the ambiguous sites of an unresolved case. For example, if we have identified TAC as a known haplotype from a TT//AA//CC homozygote, and we have an individual AT//AA//CG still unresolved, then we infer that s/he is TAC/AAG. We have then "resolved" this person's haplotypes and added a putative haplotype (AAG) to our list.
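
The seeding and resolution steps described so far can be sketched in Python as follows. This is a minimal illustration and not Clark's published implementation: the function names (parse, complement, clark) and the string encoding of genotypes are assumptions. It scans individuals and known haplotypes in input order and accepts the first match, so its output can depend on that order.

```python
def parse(genotype):
    """Split 'AT//AA//CG' into per-SNP allele pairs."""
    return [tuple(site) for site in genotype.split("//")]

def complement(sites, hap):
    """Haplotype that pairs with `hap` to give the genotype, or None if `hap` does not fit."""
    other = []
    for (a, b), h in zip(sites, hap):
        if h == a:
            other.append(b)
        elif h == b:
            other.append(a)
        else:
            return None  # hap carries an allele this individual does not have
    return "".join(other)

def clark(genotypes):
    """Return (phased, known): resolved haplotype pairs by sample index, and known haplotypes."""
    samples = [parse(g) for g in genotypes]
    known = []   # haplotypes in order of discovery
    phased = {}  # sample index -> (haplotype, haplotype)

    # Seed with homozygotes and single-SNP heterozygotes, which are unambiguous.
    for i, sites in enumerate(samples):
        if sum(a != b for a, b in sites) <= 1:
            h1 = "".join(a for a, _ in sites)
            h2 = "".join(b for _, b in sites)
            phased[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Resolve remaining individuals against known haplotypes until nothing changes.
    progress = True
    while progress:
        progress = False
        for i, sites in enumerate(samples):
            if i in phased:
                continue
            for h in known:
                other = complement(sites, h)
                if other is not None:
                    phased[i] = (h, other)
                    if other not in known:
                        known.append(other)  # putative haplotype joins the list
                    progress = True
                    break
    return phased, known

phased, known = clark(["TT//AA//CC", "AT//AA//CG"])
print(phased[1])  # ('TAC', 'AAG')
print(known)      # ['TAC', 'AAG']
```

If the sample contains no homozygote or single-SNP heterozygote, the known list stays empty and no individual is resolved.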

A. Clark's algorithm, cont.

Similarly, a TT//AA//CG individual gives us both TAC and TAG as known haplotypes, and both of these go into the initial list. This chain of inferences is continued until either all haplotypes have been recovered, or no more new haplotypes can be found in this way.

Clark's algorithm with SNPs in practice

This method should work in principle, but three problems might arise in practice:
a) there may be no homozygotes or single-SNP heterozygotes in the sample, and so the chain might never get started;
b) there may be many unresolved haplotypes left at the end; and
c) haplotypes might be erroneously inferred if a crossover product of two actual haplotypes is identical to another true haplotype.

The frequency of these problems will depend on the average heterozygosity of the SNPs, the number of loci, their recombination rates and the sample size. Clark (1990) did some calculations and simulations which led him to believe the algorithm would perform well, even with relatively small sample sizes. And it did.

The EM algorithm solution to problem a) (SNPem)

We now describe an EM algorithm to infer haplotype frequencies in a population on the basis of a random sample and the assumption of random mating for haplotypes. Excoffier and Slatkin (1995) call the phase-unknown multilocus genotypes, e.g. TT//AA//CG, phenotypes, and keep the term genotype for the corresponding haplotype pair TAC/TAG. Others use the term diplotype for a pair of haplotypes, but neither usage has caught on.

The observed data in a random sample of n individuals will be multilocus genotype frequencies, and the natural model is multinomial. The number c of haplotype pairs leading to a given phenotype depends on the number s of heterozygous SNPs, and is $2^{s-1}$ for s ≥ 1 (and 1 when s = 0).
E.g. if our genotype is TT//AA//CG, then we can recover the haplotypes unambiguously (c = 1), but for our original case, AT//AA//CG, there were c = 2 possible haplotype pairs.

Under the assumption of random mating, the probability of a given genotype is just the sum of $2^{s-1}$ squares or products of haplotype probabilities, e.g.

P = pr(AT//AA//CG) = pr(AAC/TAG) + pr(AAG/TAC) = 2 pr(AAC) pr(TAG) + 2 pr(AAG) pr(TAC).

The EM algorithm, cont.

The probability of a sample of n individuals conditional on the phenotype frequencies $P_1, \ldots, P_m$ (i.e. the likelihood of the data given the parameters) is given by the multinomial probability

$$P(\text{sample} \mid P_1, \ldots, P_m) = \frac{n!}{n_1! \cdots n_m!} \, P_1^{n_1} \cdots P_m^{n_m},$$

where $n_j$ is the number of individuals in the sample with phenotype j.
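
As a rough illustration of these two formulas, the Python sketch below computes the random-mating probability of a phase-unknown genotype from a table of haplotype frequencies, and then the multinomial log-likelihood of a sample. The function names, the dictionary layout, and the uniform frequencies in the example are assumptions made for illustration. The sketch also assumes each phenotype is written in one canonical form (so that AT//AA//CG and TA//AA//CG are not counted separately) and that every observed phenotype has nonzero probability.

```python
from itertools import product
from math import lgamma, log

def haplotype_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype such as 'AT//AA//CG'."""
    sites = [tuple(site) for site in genotype.split("//")]
    pairs = set()
    for choice in product(*[((a, b), (b, a)) for a, b in sites]):
        h1 = "".join(x for x, _ in choice)
        h2 = "".join(y for _, y in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def genotype_prob(genotype, freq):
    """Random-mating probability: p^2 for identical haplotypes, 2*p*q otherwise."""
    total = 0.0
    for h1, h2 in haplotype_pairs(genotype):
        p = freq.get(h1, 0.0) * freq.get(h2, 0.0)
        total += p if h1 == h2 else 2.0 * p
    return total

def log_likelihood(counts, freq):
    """Multinomial log-likelihood; counts maps each phenotype to its count n_j."""
    n = sum(counts.values())
    ll = lgamma(n + 1)  # log n!
    for genotype, n_j in counts.items():
        ll += n_j * log(genotype_prob(genotype, freq)) - lgamma(n_j + 1)
    return ll

# Hypothetical haplotype frequencies, chosen only to make the arithmetic easy.
freq = {"AAC": 0.25, "TAG": 0.25, "AAG": 0.25, "TAC": 0.25}

print(haplotype_pairs("TT//AA//CG"))        # [('TAC', 'TAG')]
print(haplotype_pairs("AT//AA//CG"))        # [('AAC', 'TAG'), ('AAG', 'TAC')]
print(genotype_prob("AT//AA//CG", freq))    # 0.25
```

Working in logs with lgamma avoids overflow in the factorials for realistic sample sizes.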

