CSE300: Report on Haplotype Inference Via Hierarchical Genotype Parsing1 IntroductionIn this report we present the algorithmic results from [RU07] related to population genetics, we start with a briefbackground for population genetics and define the problems of interest and discuss the solutions in the preceedingsections.2 Population Genetics BackgroundGenetic variations among the a population sample (D) can often be modeled as recombinations and mutationsfrom a set of ancestor sequences (F ), often in the study of population genetics the population sample D, consistsof haplotypes for every individual in the population sample. Thus the set D = {h1, h2, h2. . . hn}, hi∈ Σm,Σ = {0, 1}. The set F also consists of haplotypes known as the founding sequences. The variations between thehaplotypes in D and F is due to Recombinations and Point Mutations we define these in the next section.2.1 RecombinationsA haplotype h ∈ D is a simple Recombination(without mutations) of F if we can represent h as fi0·fi1·fi2. . . fik,where · stands for concatenation and each fijis a substring from some f ∈ F . The cost of this recombination is ksince we need k substrings from the haplotypes in F , the recombination of this type in which the haplotype h canbe exactly represented as concatenation of k substrings from the haplotypes in F is also know as simple parse.2.2 Point MutationsOften its not always possible to find a simple parse for h ∈ D, because of the mutations happening in h, so if itsnot possible to find a simple parse for h we want to transform h to h′such that h′has a simple parse with respectto some haplotypes in F , so the cost of recombination with mutations is equal to cost of transforming h to h′(which is the hamming distance d(h, h′)) and cost of of simple parse of h′w.r.t F , let d(h, h′) = k′and cost ofsimple parse is k, then the total cost of recombination is k + c ∗ k′, where c is a parameter which give the weightof simple recombinations versus the recombinations with mutations. The value k + c ∗ k′is called parse cost ofh, pc(h) = k + c ∗ k′. The problem of finding a recombination for h (given F ) is called the haplotype parsingproblem.3 Problem and AlgorithmsThe following are the two important problems which is wide interest in the population genetics.• OPT(h,F) :Find a optimal parse(recombination with minimum pc(h)) for each h ∈ D with a given F .• FOPT(D) :Find a set F such that sum of optimal parse cost for each of h ∈ D is minimum, i.e find a Fsuch that Σni=1OP T (hi, F ) is minimum.13.1 OPT(h,F) Finding a recombination with minimum costGiven a predefined set F of founding haplotypes the problem OPT(h,f) asks to find a recombination with a mini-mum cost, that is we are interested in finding a recombination r such that value k + ck′is minimum. We can findthe minimum cost with the following dynamic programming formulation.S(x,y) = optimal parse cost to parse hitill x(i.e till [hi1, hi2, hi3, . . . , hix]) ending with a substring from founder y(i.e recombination = {fip· fir. . . · fiy}).S(0, a) = 0, (∀a ∈ F ){Intialization}S(i, a) = pc(hi, Fai) + min(S(i − 1, a) + (mina′S(i − 1, a′) + 1))pc(S, S′) =0 , if S = S′c , otherwiseOP T (h, F ) = mina′(S(m, a′))For each haplotype h ∈ D we need to spend O(m|F |2) (m is the length of the haplotype |h| = m)time to findOP T (h, F ) and to trace back the actual recombination we need O(m|F |) space but if we use hirschbergs idea forlinear space editscript computation we need only O(|F |) space. This is a polynomial time algorithm to find thecost of optimal recombination for every h ∈ D.3.2 FOPT(D) finding optimal F for a given D to minimize the total parse costIn the previous section we considered a polynomial time algorithm to theOPT(h,F) in which we tried to find aoptimal recombination for minimum parse cost for a given set of founders F , however the problem of selectingthe founders F is a difficult problem, often we are intrested in picking the founders F such that the sum of optimalparse cost for each of h ∈ D is minimum, i.e the problem FOPT(D) asks to find a set of founders F such thatΣni=1OP T (hi, F ) is minimum. The paper [RU07] proves this problem is NP-complete if D ⊂ {0, 1, −}m, thesymbol − is a dont care symbol and matches both 0 or 1 during the parse construction of h. However the sameproblem if D ⊂ {0, 1}2is not addressed by the paper. The decision version of the problem can be stated asfollows.”Given a D ⊂ {0, 1, −}mfind a set of founders F such that Σni=1OP T (hi, F ) ≤ T ”.The above problem can be reduced from Graph coloring problem with K colors, the Graph coloring problemasks if its possible to color the nodes of the graph which a connected with a edge by a different colors, and use atmost K colors.With out the loss of generality let G = (V, E) be the input graph to the K Graph coloring problem, and let|V | = n the proof tries to construct n haplotypes (D = {h1, h2, h3, . . . hn}) , let hijdenote the value of jthelement in haplotype hiand hij∈ {0, 1, − }, the values of hijare constructed from the the graph G as follows.hij=1 , if i == j0 , if (i, j) ∈ E− , otherwise2CLAIM: The graph G has a K coloring iff D has founder set F , |F | = K and Σni=1OP T (hi, F ) = 0(totalparse cost is 0).Proof(if case): Let the F be founder set picked whose total parse cost is 0, since the total parse cost is 0every hi∈ D has to be parsed with out any recombinations by exactly one f ∈ F , from the construction ofthe haplotypes each of hi∈ D has exactly one 1 in it (at position hii∀i), since |F | ≤ n(the number of colorsK ≤ |V |) by pegion hole principle we should have atleast 1 haplotype parsed by f ∈ F , let Vf⊂ V be the setof haplotype id’s parsed by founder f ∈ F (i.e if f ∈ F parses {h7, h10, h5} then Vf= {7, 10, 5}, now we try toprove that for any two(pair) haplotypes (hi, hj) from Vfwe prove that hji6= 0andhij6= 0. From the constructionhii= 1, hjj= 1 , if hji= 0 then the founder f which parses both hiand hjshould have both 1 and 0 at positioni but this is not possible the only way the parse could happen is if hii= 1 then hji= − then only both hiand hjcould be parsed simultaneously, similarly if hjj= 1 then hij= −. So we have just proved that for any i, j ∈ Vf,hij= − and hji= − this means that for any nodes in Vfthere is no edge between them so all of Vfcould becolored with same color.
View Full Document