DOC PREVIEW
UConn CSE 3000 - Lecture notes

This preview shows page 1 out of 3 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CSE300: Report on Haplotype Inference Via Hierarchical Genotype Parsing1 IntroductionIn this report we present the algorithmic results from [RU07] related to population genetics, we start with a briefbackground for population genetics and define the problems of interest and discuss the solutions in the preceedingsections.2 Population Genetics BackgroundGenetic variations among the a population sample (D) can often be modeled as recombinations and mutationsfrom a set of ancestor sequences (F ), often in the study of population genetics the population sample D, consistsof haplotypes for every individual in the population sample. Thus the set D = {h1, h2, h2. . . hn}, hi∈ Σm,Σ = {0, 1}. The set F also consists of haplotypes known as the founding sequences. The variations between thehaplotypes in D and F is due to Recombinations and Point Mutations we define these in the next section.2.1 RecombinationsA haplotype h ∈ D is a simple Recombination(without mutations) of F if we can represent h as fi0·fi1·fi2. . . fik,where · stands for concatenation and each fijis a substring from some f ∈ F . The cost of this recombination is ksince we need k substrings from the haplotypes in F , the recombination of this type in which the haplotype h canbe exactly represented as concatenation of k substrings from the haplotypes in F is also know as simple parse.2.2 Point MutationsOften its not always possible to find a simple parse for h ∈ D, because of the mutations happening in h, so if itsnot possible to find a simple parse for h we want to transform h to h′such that h′has a simple parse with respectto some haplotypes in F , so the cost of recombination with mutations is equal to cost of transforming h to h′(which is the hamming distance d(h, h′)) and cost of of simple parse of h′w.r.t F , let d(h, h′) = k′and cost ofsimple parse is k, then the total cost of recombination is k + c ∗ k′, where c is a parameter which give the weightof simple recombinations versus the recombinations with mutations. The value k + c ∗ k′is called parse cost ofh, pc(h) = k + c ∗ k′. The problem of finding a recombination for h (given F ) is called the haplotype parsingproblem.3 Problem and AlgorithmsThe following are the two important problems which is wide interest in the population genetics.• OPT(h,F) :Find a optimal parse(recombination with minimum pc(h)) for each h ∈ D with a given F .• FOPT(D) :Find a set F such that sum of optimal parse cost for each of h ∈ D is minimum, i.e find a Fsuch that Σni=1OP T (hi, F ) is minimum.13.1 OPT(h,F) Finding a recombination with minimum costGiven a predefined set F of founding haplotypes the problem OPT(h,f) asks to find a recombination with a mini-mum cost, that is we are interested in finding a recombination r such that value k + ck′is minimum. We can findthe minimum cost with the following dynamic programming formulation.S(x,y) = optimal parse cost to parse hitill x(i.e till [hi1, hi2, hi3, . . . , hix]) ending with a substring from founder y(i.e recombination = {fip· fir. . . · fiy}).S(0, a) = 0, (∀a ∈ F ){Intialization}S(i, a) = pc(hi, Fai) + min(S(i − 1, a) + (mina′S(i − 1, a′) + 1))pc(S, S′) =0 , if S = S′c , otherwiseOP T (h, F ) = mina′(S(m, a′))For each haplotype h ∈ D we need to spend O(m|F |2) (m is the length of the haplotype |h| = m)time to findOP T (h, F ) and to trace back the actual recombination we need O(m|F |) space but if we use hirschbergs idea forlinear space editscript computation we need only O(|F |) space. This is a polynomial time algorithm to find thecost of optimal recombination for every h ∈ D.3.2 FOPT(D) finding optimal F for a given D to minimize the total parse costIn the previous section we considered a polynomial time algorithm to theOPT(h,F) in which we tried to find aoptimal recombination for minimum parse cost for a given set of founders F , however the problem of selectingthe founders F is a difficult problem, often we are intrested in picking the founders F such that the sum of optimalparse cost for each of h ∈ D is minimum, i.e the problem FOPT(D) asks to find a set of founders F such thatΣni=1OP T (hi, F ) is minimum. The paper [RU07] proves this problem is NP-complete if D ⊂ {0, 1, −}m, thesymbol − is a dont care symbol and matches both 0 or 1 during the parse construction of h. However the sameproblem if D ⊂ {0, 1}2is not addressed by the paper. The decision version of the problem can be stated asfollows.”Given a D ⊂ {0, 1, −}mfind a set of founders F such that Σni=1OP T (hi, F ) ≤ T ”.The above problem can be reduced from Graph coloring problem with K colors, the Graph coloring problemasks if its possible to color the nodes of the graph which a connected with a edge by a different colors, and use atmost K colors.With out the loss of generality let G = (V, E) be the input graph to the K Graph coloring problem, and let|V | = n the proof tries to construct n haplotypes (D = {h1, h2, h3, . . . hn}) , let hijdenote the value of jthelement in haplotype hiand hij∈ {0, 1, − }, the values of hijare constructed from the the graph G as follows.hij=1 , if i == j0 , if (i, j) ∈ E− , otherwise2CLAIM: The graph G has a K coloring iff D has founder set F , |F | = K and Σni=1OP T (hi, F ) = 0(totalparse cost is 0).Proof(if case): Let the F be founder set picked whose total parse cost is 0, since the total parse cost is 0every hi∈ D has to be parsed with out any recombinations by exactly one f ∈ F , from the construction ofthe haplotypes each of hi∈ D has exactly one 1 in it (at position hii∀i), since |F | ≤ n(the number of colorsK ≤ |V |) by pegion hole principle we should have atleast 1 haplotype parsed by f ∈ F , let Vf⊂ V be the setof haplotype id’s parsed by founder f ∈ F (i.e if f ∈ F parses {h7, h10, h5} then Vf= {7, 10, 5}, now we try toprove that for any two(pair) haplotypes (hi, hj) from Vfwe prove that hji6= 0andhij6= 0. From the constructionhii= 1, hjj= 1 , if hji= 0 then the founder f which parses both hiand hjshould have both 1 and 0 at positioni but this is not possible the only way the parse could happen is if hii= 1 then hji= − then only both hiand hjcould be parsed simultaneously, similarly if hjj= 1 then hij= −. So we have just proved that for any i, j ∈ Vf,hij= − and hji= − this means that for any nodes in Vfthere is no edge between them so all of Vfcould becolored with same color.


View Full Document

UConn CSE 3000 - Lecture notes

Download Lecture notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?