HMM in crosses and small pedigrees (cont.)
Lecture 9, Statistics 246, February 19, 2004

Another problem: reconstructing haplotypes

The problem here is to reconstruct the children's haplotypes, as in the figure, from marker data on both the children and the parents.

The Lander-Green HMM: Recap

The states of the Markov chain are the inheritance vectors. At any locus on a chromosome, each entry in the inheritance vector for a non-founder is 0 if the parental variant passed on at that locus was grandmaternal, and 1 otherwise.

Consider a two-parent, two-child nuclear family, and suppose that the mother and father have genotypes 1/2 and 3/4 respectively, while the first (girl) child is 1/3 and the second (boy) is 2/4. Then the inheritance vector is of length 4: $v = (v_{gm}, v_{gp}, v_{bm}, v_{bp})$, where gm represents the girl's maternal meiosis, gp her paternal meiosis, and so on.

What are the assignments for $v$ at the marker? We don't know which of the mother's alleles 1 and 2 came from her mother and which came from her father, but we can arbitrarily declare that the one from her mother was the one she passed on to her daughter, and similarly for the alleles 3 and 4 of the father. With this assignment, we find that $v = (0, 0, 1, 1)$, because in each case the boy received alleles from his parents different from his sister's. In fact, the specification of the paternal and maternal chromosomes of a founder is completely arbitrary, and we'll mention later how this can be turned into a symmetry which speeds up the calculations.

Transition probabilities

Now suppose that the same family has genotypes 1/2, 3/4, 1/3 and 2/4 at a locus near the first one. If the recombination fraction between the two loci is $r$, and $r$ is small, then we might expect the inheritance vector at the second locus to coincide with $v = (0, 0, 1, 1)$. But what if the genotypes were 1/2, 3/4, 1/3 and 2/3 respectively? This suggests that $v = (0, 0, 1, 0)$ at the second locus, with the change from 1 to 0 in the entry for the boy's paternally inherited chromosome denoting a recombination. How do we weigh up these competing possibilities? As with the mouse chromosomes, we
need a transition matrix $P(r)$ connecting adjacent inheritance vectors. The form of $P$ is as in the mouse case, namely a tensor power of the $2 \times 2$ matrix $R(r)$ having $1 - r$ on the diagonal and $r$ on the off-diagonal elements; here $P = R(r)^{\otimes 4}$. In general it is a $2n$-th tensor power, where $n$ is the number of non-founders. Thus we have our states and our transition probability matrix, and hence our (product) Markov chain. To complete the specification of our HMM, we need observations and the associated emission probabilities, and an initial distribution.

Alternative representation

The purpose of inheritance vectors is to describe the possible patterns of gene flow through a pedigree. Once ordered pairs of alleles are assigned to founders, the 0s and 1s in the inheritance vectors specify the alleles that are passed from parent to offspring down the pedigree. As mentioned in the last lecture, an alternative representation of this gene flow is via what is known as a descent graph; see Lange's book, ch. 9.

A recent paper gave an even more economical representation of what is needed, via what the authors called a sparse gene flow tree. I leave those interested to consult Abecasis et al., Nature Genetics 30 (2002) 97-101. There, in the program Merlin, the efficient matrix multiplication that we will describe shortly is replaced by a sparse matrix-vector multiplication algorithm more general than the one we will give.

Observations: phenotypes

We now turn to our observations. The data we have on our small pedigree will generally consist of genotypes at many marker loci, and perhaps additional disease or other phenotype data; see the pedigree on p. 2, where black filling of a square or circle indicates that the person is affected by some specified disease. In the case of interest to us here, reconstructing haplotypes, we'll assume that there are just marker data.

Suppose the unordered pairs of alleles (i.e. genotypes) at marker locus $t$ come from a set $\mathcal{M}_t$. Then for $f$ founders and $n$ non-founders, our observations come from $\mathcal{M}_t^f \times \mathcal{M}_t^n$. Here
I could have simply written the $(f+n)$-th power, but it is convenient to keep founders and non-founders notationally distinct. As with the mouse crosses, we could add in ambiguity and missing-data observations, but for simplicity we won't do so here. Since we are mainly interested in marker data, let's denote a typical vector observation at locus $t$ by $m_t$.

Emission probabilities

Referring to our general discussion of HMMs in the previous lecture, we now need to specify the equivalent of the emission probabilities $q(i, j, k; t)$ at each locus $t$. These probabilities are generally functions of the current and previous state, but here they just depend on the current state $v_t$, and take the form $q(v_t, m_t; t)$. We want this to be

$$q(v_t, m_t; t) = \mathrm{pr}(\text{observations at } t = m_t \mid \text{inheritance vector at } t = v_t),$$

but right now $v_t$ just describes gene flow; we haven't got started.

To complete our description, let us write $a_t = (a_{t,1}, a_{t,2}, \ldots, a_{t,2f-1}, a_{t,2f})$ for the assignment of an ordered pair of alleles to each founder. Suppose that the frequency of allele $a_{t,h}$ is $p_{t,h}$. Under the population equilibrium assumptions we have previously mentioned, $\mathrm{pr}(a_t) = \prod_h p_{t,h}$. Finally, we sum over all $a_t$ (in practice, over all $a_t$ compatible with the observed phenotypes). This gets us started towards defining $q(v_t, m_t; t)$.

Penetrances

Having got started, note that alleles $a_t$ for the founders and an inheritance vector $v_t$ give us a set of ordered genotypes $g_t(a_t, v_t)$ at locus $t$, by following the flow. We are almost there. The observations on each individual in the pedigree can now be assigned probabilities given their ordered genotypes.

This last step involves the terms we have previously called penetrances (probabilities of observed phenotypes given genotypes) and a non-trivial assumption: that the different individuals' phenotypes are conditionally independent given their ordered genotypes, and depend on the pedigree's ordered genotypes only through their own ordered genotypes, i.e. $\mathrm{pr}(m_t \mid g_t) = \prod_i \mathrm{pr}(m_{t,i} \mid g_{t,i})$, where the product is over individuals $i$ in the
pedigree, and the terms in the product are penetrances. Putting all this together, our HMM emission probabilities have the form

$$q(v_t, m_t; t) = \sum_{a_t} \mathrm{pr}(a_t)\, \mathrm{pr}(m_t \mid a_t, v_t) = \sum_{a_t} \prod_h p_{t,h} \prod_i \mathrm{pr}(m_{t,i} \mid g_{t,i}),$$

where $g_{t,i}$ is the ordered genotype of individual $i$ in $g_t(a_t, v_t)$.

Pedigree calculations

The HMM is fully specified as soon as we give an initial distribution for the states, and this is simply the uniform distribution: each inheritance vector is assigned initial probability $2^{-2n}$. We now turn to carrying out the calculations efficiently, so that we can deal with as large …
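
The transition matrix and initial distribution described in these notes can be sketched in a few lines of Python for the two-child family (n = 2 non-founders, so 16 inheritance vectors). This is only an illustration under stated assumptions: the function names `kron`, `transition_matrix` and `hamming_transition`, and the value r = 0.1, are hypothetical choices made here, not part of the lecture. The sketch builds P as the 2n-th tensor power of R(r) and compares it with the equivalent closed form r^d (1 - r)^(2n - d), where d is the number of meioses in which the two inheritance vectors differ.

```python
import math
from itertools import product


def kron(A, B):
    """Kronecker (tensor) product of two matrices stored as lists of lists."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def transition_matrix(r, n_nonfounders):
    """2n-th tensor power of R(r) = [[1 - r, r], [r, 1 - r]]."""
    R = [[1 - r, r], [r, 1 - r]]
    P = [[1.0]]
    for _ in range(2 * n_nonfounders):
        P = kron(P, R)
    return P


def hamming_transition(v, w, r):
    """Closed form: each meiosis recombines independently with probability r."""
    d = sum(x != y for x, y in zip(v, w))
    return r ** d * (1 - r) ** (len(v) - d)


r = 0.1  # assumed recombination fraction, for illustration only
n = 2    # non-founders: the girl and the boy
P = transition_matrix(r, n)

# Inheritance vectors (v_gm, v_gp, v_bm, v_bp) in lexicographic order,
# which matches the row/column order produced by the Kronecker products.
states = list(product((0, 1), repeat=2 * n))

# Uniform initial distribution: probability 2^(-2n) for every state.
initial = [2.0 ** (-2 * n)] * len(states)

same = P[states.index((0, 0, 1, 1))][states.index((0, 0, 1, 1))]
one_recomb = P[states.index((0, 0, 1, 1))][states.index((0, 0, 1, 0))]
print(same, one_recomb)
```

With small r, staying at (0, 0, 1, 1) is far more probable than moving to (0, 0, 1, 0), which is how the chain weighs the two competing possibilities raised in the transition-probabilities discussion. Building the full matrix is only feasible for tiny pedigrees, since its size is 2^(2n) by 2^(2n).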
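
The emission probability q(v_t, m_t; t) can be checked by brute force on the same nuclear family. The sketch below is a minimal illustration, not an efficient method: it enumerates every ordered assignment of alleles to the two founders, follows the gene flow given by the inheritance vector, and uses a 0/1 penetrance (an observed marker genotype either matches the unordered version of the ordered genotype or it does not). The function name `emission`, the dictionary layout, and the equal allele frequencies of 0.25 are assumptions made here for illustration.

```python
from itertools import product


def emission(v, observed, freqs):
    """Sum over founder allele assignments a of pr(a) * prod_i pr(m_i | g_i).

    v        : inheritance vector (v_gm, v_gp, v_bm, v_bp); 0 means the
               parent's maternally inherited allele was passed on.
    observed : unordered genotypes for (mother, father, girl, boy).
    freqs    : allele -> population frequency (hypothetical values).
    """
    v_gm, v_gp, v_bm, v_bp = v
    alleles = sorted(freqs)
    total = 0.0
    # a = (mother's maternal, mother's paternal, father's maternal, father's paternal)
    for a in product(alleles, repeat=4):
        mother, father = a[:2], a[2:]
        girl = (mother[v_gm], father[v_gp])
        boy = (mother[v_bm], father[v_bp])
        ordered = (mother, father, girl, boy)
        # 0/1 marker penetrance: every unordered genotype must match the data.
        if all(sorted(g) == sorted(m) for g, m in zip(ordered, observed)):
            prob = 1.0
            for allele in a:
                prob *= freqs[allele]
            total += prob
    return total


freqs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}    # assumed allele frequencies
first_locus = [(1, 2), (3, 4), (1, 3), (2, 4)]   # mother, father, girl, boy
second_locus = [(1, 2), (3, 4), (1, 3), (2, 3)]  # boy now carries allele 3

q = {v: emission(v, first_locus, freqs) for v in product((0, 1), repeat=4)}
compatible = sorted(v for v, value in q.items() if value > 0)
print(compatible)
```

At the first locus, four inheritance vectors receive positive probability, including (0, 0, 1, 1). The other three arise from swapping the arbitrary maternal/paternal labels of the mother's alleles, the father's alleles, or both, which is the founder symmetry mentioned in the recap. At the second locus, (0, 0, 1, 0) is compatible with the data while (0, 0, 1, 1) is not.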