Berkeley STATISTICS 246 - Three examples of the EM algorithm


Three examples of the EM algorithm
Week 12, Lecture 1. Statistics 246, Spring 2002.

1. The estimation of linkage from the offspring of selfed heterozygotes

R.A. Fisher and Bhai Balmukand, Journal of Genetics 20, 79-92, 1928. See also Collected Papers of R.A. Fisher, Volume II, Paper 71, pp. 285-298.

The problem

In modern terminology, we have two linked bi-allelic loci, A and B say, with alleles A and a, and B and b, respectively, where A is dominant over a and B is dominant over b. A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB and ab. Since the loci are linked, the types AB and ab will appear with a frequency different from that of Ab and aB, say (1-r)/2 and r/2, respectively, in males, and (1-r')/2 and r'/2, respectively, in females. Here we suppose that the parental origin of these heterozygotes is from the mating AABB x aabb, so that r and r' are the male and female recombination rates between the two loci. The problem is to estimate r and r', if possible, from the offspring of selfed double heterozygotes.

Offspring genotypic and phenotypic frequencies

Since gametes AB, Ab, aB and ab are produced in proportions (1-r)/2, r/2, r/2 and (1-r)/2, respectively, by the male parent, and (1-r')/2, r'/2, r'/2 and (1-r')/2, respectively, by the female parent, zygotes with genotypes AABB, AaBB, etc. are produced with frequencies (1-r)(1-r')/4, (1-r)r'/4, etc.

Exercise: Complete the Punnett square of offspring genotypes and their associated frequencies.

The problem here is this: although there are 16 distinct offspring genotypes taking parental origin into account, the dominance relations imply that we observe only 4 distinct phenotypes, which we denote by (A,B), (A,b), (a,B) and (a,b). Here A (resp. B) denotes the dominant, while a (resp. b) denotes the recessive, phenotype determined by the alleles at the locus A (resp. B).

Offspring genotypic and phenotypic probabilities, cont.

Thus individuals with genotypes AABB, AaBB, AABb or AaBb, which account for 9/16 of the gametic combinations (check!), all exhibit the phenotype (A,B), i.e. the dominant alternative in both characters, while those
with genotypes AAbb or Aabb (3/16) exhibit the phenotype (A,b), those with genotypes aaBB and aaBb (3/16) exhibit the phenotype (a,B), and finally the double recessives aabb (1/16) exhibit the phenotype (a,b). It is a slightly surprising fact that the probabilities of the four phenotypic classes are definable in terms of the single parameter θ = (1-r)(1-r'), as follows: (a,b) has probability θ/4 (easy to see), while (a,B) and (A,b) both have probability (1-θ)/4, and (A,B) has probability 1 minus the sum of the preceding, which is (2+θ)/4.

Exercise: Calculate these phenotypic probabilities.

Estimation of θ

Now suppose we have a random sample of n offspring from the selfing of our double heterozygote. The 4 phenotypic classes will thus be represented roughly in proportion to their theoretical probabilities, their joint distribution being multinomial:

  Mult(n; (2+θ)/4, (1-θ)/4, (1-θ)/4, θ/4).

Note that here neither r nor r' is separately estimable from these data, but only the product θ = (1-r)(1-r'). Note too that since we know r ≤ 1/2 and r' ≤ 1/2, it follows that θ ≥ 1/4.

How do we estimate θ? Fisher and Balmukand discuss a variety of methods that were in the literature at the time they wrote, and compare them with maximum likelihood, which is the method of choice in problems like this. We describe a variant on their approach to illustrate the EM algorithm.

The incomplete-data formulation

Let us denote (cf. p. 26 of Week 11b) the counts of the 4 phenotypic classes by y1, y2, y3 and y4, these having probabilities (2+θ)/4, (1-θ)/4, (1-θ)/4 and θ/4, respectively. Now the probability of observing the genotype AABB is θ/4, just as it is with aabb, and although this genotype is phenotypically indistinguishable from the 8 others with phenotype (A,B), it is convenient to imagine that we can distinguish them. So let us denote their count by x1, and let x2 denote the count of the remainder of that class, so that x1 + x2 = y1. Note that x2 has marginal probability 1/2. In the jargon of the EM algorithm, x1 and x2 are "missing data", as we only observe their sum y1. Next, as on p. 26 of Week 11b, we let y2 = x3, y3 = x4 and y4 = x5.
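The two exercises above (the Punnett square and the phenotypic probabilities) can be checked numerically. The sketch below enumerates the 4 x 4 table of male and female gametes, applies the dominance relations, and compares the resulting phenotype probabilities with the θ formulas; the function and variable names are ours, not the paper's.

```python
from itertools import product

def phenotype_probs(r_male, r_female):
    """Enumerate the 4x4 Punnett square for a selfed AaBb double
    heterozygote and accumulate the four phenotype probabilities."""
    def gametes(r):
        # AB and ab are parental types with frequency (1-r)/2 each;
        # Ab and aB are recombinant types with frequency r/2 each.
        return {("A", "B"): (1 - r) / 2, ("A", "b"): r / 2,
                ("a", "B"): r / 2, ("a", "b"): (1 - r) / 2}
    probs = {}
    for (m, pm), (f, pf) in product(gametes(r_male).items(),
                                    gametes(r_female).items()):
        # Dominance: the phenotype shows A if either gamete carries A, etc.
        pheno = ("A" if "A" in (m[0], f[0]) else "a",
                 "B" if "B" in (m[1], f[1]) else "b")
        probs[pheno] = probs.get(pheno, 0.0) + pm * pf
    return probs

r, rp = 0.2, 0.3                      # arbitrary recombination rates
theta = (1 - r) * (1 - rp)
p = phenotype_probs(r, rp)
assert abs(p[("a", "b")] - theta / 4) < 1e-12
assert abs(p[("A", "b")] - (1 - theta) / 4) < 1e-12
assert abs(p[("a", "B")] - (1 - theta) / 4) < 1e-12
assert abs(p[("A", "B")] - (2 + theta) / 4) < 1e-12
```

Note that the enumeration depends on r and r' separately, while the four phenotype probabilities depend on them only through θ, which is exactly why the two rates are not separately estimable from these data.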
We now illustrate the approach of the EM algorithm, referring to material in Week 9b and Week 11b for generalities.

The EM algorithm for this problem

The complete-data log-likelihood at θ is (Exercise: check!)

  (x1 + x5) log θ + (x3 + x4) log(1 - θ) + constant.

The expected value of the complete-data log-likelihood given the observed data, taken at θ' (E-step: think of θ' as the initial value), is

  (E(x1 | y1, θ') + y4) log θ + (y2 + y3) log(1 - θ),

and E(x1 | y1, θ') is just k y1, where k = θ'/(2 + θ') (Exercise: check!). The maximum over θ of this expected complete-data log-likelihood (M-step) occurs at

  θ = (k y1 + y4) / (k y1 + y2 + y3 + y4)

(Exercise: check!). Here we think of this θ as the next value. It should now be clear how the E-step (calculation of k) and the M-step (calculation of θ) can be iterated.

Comments on this example

This completes our discussion of this example. It appeared in the famous EM paper (Dempster, Laird and Rubin, JRSSB, 1977) without any explanation of its genetic origins. Of course it is only an illustration of the EM, for the actual likelihood equation generated by the observed data is a quadratic in θ, and so easy to solve (see Fisher and Balmukand). Thus it is not necessary to use the EM in this case (some would say in any case, but that is for another time). We have omitted the fascinating detail provided in Fisher and Balmukand, and similarly in Dempster et al. Read these papers: both are classics with much of interest to you. Rather than talk about details concerning the EM (most importantly starting and stopping it, the issue of a global maximum, and SEs for parameter estimates), I turn to another important EM example: mixtures.

Fitting a mixture model by EM to discover motifs in biopolymers

T.L. Bailey and C. Elkan, UCSD Technical Report CS94-351; ISMB 1994.

Here we outline some more EM theory, this being relevant to motif discovery. We follow the above report, as we will be discussing the program MEME, written by these authors, in a later lecture. This part is called MM. A finite mixture model supposes that data X = (X1, ..., Xn) arise from two or more groups with distributions of known form but different unknown parameters.
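Before going further with mixtures, the E- and M-steps of the linkage example above can be sketched in a few lines, and checked against the closed-form root of the quadratic score equation also mentioned above. The counts (125, 18, 20, 34) are those used by Dempster, Laird and Rubin (1977); the function names are ours.

```python
import math

def em_theta(y, theta=0.5, tol=1e-10, max_iter=500):
    """EM for the linkage parameter theta, given phenotype counts
    y = (y1, y2, y3, y4) with class probabilities
    ((2+theta)/4, (1-theta)/4, (1-theta)/4, theta/4)."""
    y1, y2, y3, y4 = y
    for _ in range(max_iter):
        # E-step: expected count of the hidden theta/4 part of class 1,
        # E[x1 | y1, theta] = k * y1 with k = theta / (2 + theta).
        k = theta / (2 + theta)
        # M-step: maximizer of the expected complete-data log-likelihood.
        new = (k * y1 + y4) / (k * y1 + y2 + y3 + y4)
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta

# Counts from Dempster, Laird and Rubin (1977).
y = (125, 18, 20, 34)
theta_em = em_theta(y)

# The observed-data score equation reduces to the quadratic
#   n*theta^2 - (y1 - 2*(y2 + y3) - y4)*theta - 2*y4 = 0,
# so the MLE is also available in closed form; EM should agree with it.
n = sum(y)
b = y[0] - 2 * (y[1] + y[2]) - y[3]
theta_ml = (b + math.sqrt(b * b + 8 * n * y[3])) / (2 * n)
assert abs(theta_em - theta_ml) < 1e-8
```

This makes concrete the comment above that EM is not strictly necessary here: the fixed point of the iteration is exactly the positive root of the quadratic.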
The group parameters are φ1, ..., φg, where g is the number of groups, and the mixing parameters are λ1, ..., λg, where the λs are non-negative and sum to 1. It is convenient to introduce indicator vectors Z = (Z1, ..., Zn), where Zi = (Zi1, ..., Zig), and Zij = 1 if Xi is from group j ...
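The indicator formulation above is what makes EM natural for mixtures: the unobserved Zij are treated as missing data, and the E-step computes their conditional expectations given the data, the so-called responsibilities E(Zij | Xi). A minimal sketch, assuming two univariate Gaussian components purely for illustration (the component models in Bailey and Elkan's motif setting are different):

```python
import math
import random

def e_step(xs, lam, mu, sigma):
    """Responsibilities E[Z_ij | x_i] for a g-component univariate
    Gaussian mixture with mixing weights lam (non-negative, sum to 1)."""
    def phi(x, m, s):
        # Normal density with mean m and standard deviation s.
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    resp = []
    for x in xs:
        w = [l * phi(x, m, s) for l, m, s in zip(lam, mu, sigma)]
        tot = sum(w)
        resp.append([wj / tot for wj in w])  # Bayes' rule, rows sum to 1
    return resp

# Simulate the generative story: Z_i picks the group, X_i is drawn
# from that group's density.
random.seed(0)
lam, mu, sigma = [0.3, 0.7], [-2.0, 2.0], [1.0, 1.0]
zs = [0 if random.random() < lam[0] else 1 for _ in range(5)]
xs = [random.gauss(mu[j], sigma[j]) for j in zs]
resp = e_step(xs, lam, mu, sigma)
assert all(abs(sum(r) - 1.0) < 1e-12 for r in resp)
```

The M-step (not shown) would re-estimate λ, and the group parameters, from these responsibilities, and the two steps would be iterated exactly as in the linkage example.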

