Stanford CS 374 - Simple and Fast Inverse Alignment - D2444792

Home> Schools> Stanford University> Computer Science (CS) > CS 374> Simple and Fast Inverse Alignment

DOC PREVIEW

Stanford CS 374 - Simple and Fast Inverse Alignment

School name Stanford University

Course Cs 374- Algorithms in Biology

Pages 15

This preview shows page 1-2-3-4-5 out of 15 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

IntroductionInverse Alignment and Its VariationsReduction to Linear ProgrammingSolving the Linear ProgramApplication to Global AlignmentComputational ResultsConclusionSimple and Fast Inverse AlignmentJohn Kececioglu and Eagu KimDepartment of Computer Science,The University of Arizona, Tucson, AZ 85721, USA{kece, egkim}@cs.arizona.eduAbstract. For as long as biologists have been computing alignments ofsequences, the question of what values to use for scoring substitutionsand gaps has persisted. While some choices for substitution scores arenow common, largely due to convention, there is no standard for choosinggap penalties. An objective way to resolve this question is to learn theappropriate values by solving the Inverse String Alignment Problem:given examples of correct alignments, ﬁnd parameter values that makethe examples be optimal-scoring alignments of their strings.We present a new polynomial-time algorithm for Inverse String Align-ment that is simple to implement, fast in practice, and for the ﬁrst timecan learn hundreds of parameters simultaneously. The approach is alsoﬂexible: minor modiﬁcations allow us to solve inverse unique alignment(ﬁnd parameter values that make the examples be the unique optimalalignments of their strings), and inverse near-optimal alignment (ﬁnd pa-rameter values that make the example alignments be as close to optimalas possible). Computational results with an implementation for globalalignment show that, for the ﬁrst time, we can ﬁnd best-possible valuesfor all 212 parameters of the standard protein-sequence scoring-modelfrom hundreds of alignments in a few minutes of computation.Keywords: Sequence analysis, parametric sequence alignment, substi-tution score matrices, aﬃne gap penalties, supervised learning, linearprogramming, cutting plane algorithms.1 IntroductionPerhaps the most studied problem in computational biology is the alignment ofbiological sequences with substitutions, insertions, and deletions. The standardformulations of string alignment optimize a sum of scores for each type of op-eration, often giving a penalty for a run of insertions or deletions, called a gap,that is linear in the length of the gap. When performing sequence alignment inpractice, the question of what weights and penalties to use inevitably arises. Aninteresting attack on this question is parametric sequence alignment,whereforagiven pair of strings, the alignment problem is solved for eﬀectively all possiblechoices of the scoring parameters, thereby eliminating the need to specify anyweights and penalties. The problem with this approach is that it in eﬀect defersResearch supported by US National Science Foundation Grant DBI-0317498.A. Apostolico et al. (Eds.): RECOMB 2006, LNBI 3909, pp. 441–455, 2006.c Springer-Verlag Berlin Heidelberg 2006442 J. Kececioglu and E. Kimthe question, since eventually a user must choose one of the solutions, and onwhat basis should this be done? Essentially one needs an example of a correctalignment to discriminate among the welter of parametric solutions. When onedoes solve the parametric problem and knows a biologically correct alignmentof the sequences, this alignment is used to decide what region of the parameterspace makes the correct alignment have optimal score.This optimal parameter choice could be found much more directly, however, bysolving inverse parametric alignment. In this problem, the input is an alignmentof a pair of strings, and the output is a choice of parameters that makes the inputalignment be an optimal-scoring alignment of its strings. We present a simple andfast algorithm for inverse parametric alignment that for the ﬁrst time is capableof determining all substitution weights and gap penalties simultaneously. Suchan algorithm, applied to a collection of benchmark protein-sequence alignmentsthat are constructed by aligning their three-dimensional structures, could providethe ﬁrst rigorous way of determining substitution scores and gap penalties forcharacterized classes of proteins.Related Work. The ﬁrst algorithms for parametric alignment of two sequenceswere discovered in the early 1990’s by Waterman, Eggert and Lander [16] andGusﬁeld, Balasubramanian and Naor [7]. These algorithms handled two pa-rameters, usually the gap open and extension penalties with ﬁxed substitu-tion scores. Zimmer and Lengauer [17] addressed numerical stability. Gusﬁeldet al. [7] also bounded the number of regions in the decomposition of the pa-rameter space, and constructed the decomposition with one optimal alignmentcomputation per region. Fern´andez-Baca, Sepp¨al¨ainen and Slutzki [4] showedthese bounds are asymptotically tight, and initiated the study of parametricmultiple-sequence alignment. Gusﬁeld and Stelling [8] released a software im-plementation called XPARAL, and were the ﬁrst to consider inverse parametricalignment, for which they gave a heuristic that attempted to avoid computing adecomposition of the entire parameter space. Pachter and Sturmfels [13] exploredthe relation of algebraic statistics to parametric sequence alignment.Recently, Sun, Fern´andez-Baca and Yu [15] gave the ﬁrst direct algorithmfor inverse parametric alignment. While they consider three parameters, theirsolution eﬀectively ﬁxes one parameter value at zero. For two strings of length n,their algorithm runs in O(n2log n) time. Their approach is involved, and doesnot appear to have been implemented.In contrast to prior work, our algorithm for inverse parametric alignment issimple to implement, does not compute a decomposition of the entire parameterspace, solves both the near-optimal and unique-optimal inverse alignment prob-lems, handles a set of input alignments, and for the ﬁrst time can quickly solveproblems with hundreds of free parameters.Overview. In the next section we give a precise statement of the inverse para-metric alignment problem and two variations: inverse near-optimal and unique-optimal alignment. Section 3 reduces all three variations to linear programming.Section 4 explains how the resulting linear programs, even though they have anexponential number of inequalities, can be solved in polynomial time. Section 5Simple and Fast Inverse Alignment 443then discusses a version of this approach, called a cutting-plane algorithm, thatis highly eﬀective in practice. Finally Section 6 presents experimental resultswith an implementation for inverse optimal and near-optimal

View Full Document