UMD CMSC 351 - Lecture 25: Longest Common Subsequence - D1830458

School name University of Maryland, College Park

Course Cmsc 351- Algorithms

Lecture Notes CMSC 251

Lecture 25: Longest Common Subsequence
(April 28, 1998)

Read: Section 16.3 in CLR.

Strings: One important area of algorithm design is the study of algorithms for character strings. There are a number of important problems here. Among the most important has to do with efficiently searching for a substring, or more generally a pattern, in a large piece of text. (This is what text editors and functions like "grep" do when you perform a search.) In many instances you do not want to find a piece of text exactly, but rather something that is "similar". This arises, for example, in genetics research. Genetic codes are stored as long DNA molecules. The DNA strands can be broken down into long sequences, each element of which is one of four basic types: C, G, T, A.

But exact matches rarely occur in biology because of small changes in DNA replication. Exact substring search will only find exact matches. For this reason, it is of interest to compute similarities between strings that do not match exactly. The method of string similarities should be insensitive to random insertions and deletions of characters from some originating string. There are a number of measures of similarity in strings. The first is the edit distance, that is, the minimum number of single character insertions, deletions, or transpositions necessary to convert one string into another. The other, which we will study today, is that of determining the length of the longest common subsequence.

Longest Common Subsequence: Let us think of character strings as sequences of characters. Given two sequences X = ⟨x_1, x_2, ..., x_m⟩ and Z = ⟨z_1, z_2, ..., z_k⟩, we say that Z is a subsequence of X if there is a strictly increasing sequence of k indices ⟨i_1, i_2, ..., i_k⟩ (1 ≤ i_1 < i_2 < ... < i_k ≤ m) such that Z = ⟨x_{i_1}, x_{i_2}, ..., x_{i_k}⟩.
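The subsequence definition can be checked with a single left-to-right scan of X that greedily matches the characters of Z in order. The following is a minimal sketch of that check, not part of the original notes; the function name is_subsequence is an illustrative assumption.

```python
def is_subsequence(z: str, x: str) -> bool:
    """Return True if z is a subsequence of x (illustrative helper, not from the notes)."""
    k = 0  # number of characters of z matched so far
    for ch in x:
        # Greedily match the next unmatched character of z; this implicitly
        # builds a strictly increasing sequence of indices into x.
        if k < len(z) and ch == z[k]:
            k += 1
    return k == len(z)
```

Taking the earliest possible match for each character of Z never hurts, since it leaves the longest remaining suffix of X for the characters still to be matched.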
For example, let X = ⟨ABRACADABRA⟩ and let Z = ⟨AADAA⟩; then Z is a subsequence of X.

Given two strings X and Y, the longest common subsequence of X and Y is a longest sequence Z which is a subsequence of both X and Y.

For example, let X be as before and let Y = ⟨YABBADABBADOO⟩. Then the longest common subsequence is Z = ⟨ABADABA⟩.

The Longest Common Subsequence Problem (LCS) is the following. Given two sequences X = ⟨x_1, ..., x_m⟩ and Y = ⟨y_1, ..., y_n⟩, determine a longest common subsequence. Note that it is not always unique. For example, the LCS of ⟨ABC⟩ and ⟨BAC⟩ is either ⟨AC⟩ or ⟨BC⟩.

Dynamic Programming Solution: The simple brute-force solution to the problem would be to try all possible subsequences from one string, and search for matches in the other string, but this is hopelessly inefficient, since there are an exponential number of possible subsequences.

Instead, we will derive a dynamic programming solution. In typical DP fashion, we need to break the problem into smaller pieces. There are many ways to do this for strings, but it turns out for this problem that considering all pairs of prefixes will suffice for us. A prefix of a sequence is just an initial string of values, X_i = ⟨x_1, x_2, ..., x_i⟩. X_0 is the empty sequence.

The idea will be to compute the longest common subsequence for every possible pair of prefixes. Let c[i, j] denote the length of the longest common subsequence of X_i and Y_j. Eventually we are interested in c[m, n], since this will be the length of the LCS of the two entire strings. The idea is to compute c[i, j] assuming that we already know the values of c[i', j'] for i' ≤ i and j' ≤ j (but not both equal). We begin with some observations.

Basis: c[i, 0] = c[0, j] = 0. If either sequence is empty, then the longest common subsequence is empty.

Last characters match: Suppose x_i = y_j. Example: Let X_i = ⟨ABCA⟩ and let Y_j = ⟨DACA⟩. Since both end in A, we claim that the LCS must also end in A.
(We will explain why later.) Since the A is part of the LCS, we may find the overall LCS by removing A from both sequences and taking the LCS of X_{i-1} = ⟨ABC⟩ and Y_{j-1} = ⟨DAC⟩, which is ⟨AC⟩, and then adding A to the end, giving ⟨ACA⟩ as the answer. (At first you might object: But how did you know that these two A's matched with each other? The answer is that we don't, but it will not make the LCS any smaller if we do.)

Thus, if x_i = y_j then c[i, j] = c[i-1, j-1] + 1.

Last characters do not match: Suppose that x_i ≠ y_j. In this case x_i and y_j cannot both be in the LCS (since each would have to be the last character of the LCS). Thus either x_i is not part of the LCS, or y_j is not part of the LCS (and possibly both are not part of the LCS).

In the first case the LCS of X_i and Y_j is the LCS of X_{i-1} and Y_j, whose length is c[i-1, j]. In the second case the LCS is the LCS of X_i and Y_{j-1}, whose length is c[i, j-1]. We do not know which is the case, so we try both and take the one that gives us the longer LCS.

Thus, if x_i ≠ y_j then c[i, j] = max(c[i-1, j], c[i, j-1]).

We left undone the business of showing that if both strings end in the same character, then the LCS must also end in this same character. To see this, suppose by contradiction that both strings end in A, and further suppose that the LCS ended in a different character B. Because A is the last character of both strings, it follows that this particular instance of the character A cannot be used anywhere else in the LCS. Thus, we can add it to the end of the LCS, creating a longer common subsequence. But this would contradict the definition of the LCS as being longest.

Combining these observations we have the following rule:

    c[i, j] = 0                            if i = 0 or j = 0,
    c[i, j] = c[i-1, j-1] + 1              if i, j > 0 and x_i = y_j,
    c[i, j] = max(c[i, j-1], c[i-1, j])    if i, j > 0 and x_i ≠ y_j.

Implementing the Rule: The task now is to simply implement this rule. As with other DP solutions, we concentrate on computing the maximum length.
We will store some helpful pointers in a parallel array, b[0..m, 0..n].

Longest Common Subsequence

    LCS(char x[1..m], char y[1..n]) {
        int c[0..m, 0..n]
        for i = 0 to m do {
            c[i,0] = 0; b[i,0] = SKIPX          // initialize column 0
        }
        for j = 0 to n do {
            c[0,j] = 0; b[0,j] = SKIPY          // initialize row 0
        }
        for i = 1 to m do {
            for j = 1 to n do {
                if (x[i] == y[j]) {
                    c[i,j] = c[i-1,j-1]+1       // take X[i] and Y[j] for LCS
                    b[i,j] = ADDXY
                }
                else if (c[i-1,j] >= c[i,j-1]) { // X[i] not in LCS
                    c[i,j] = c[i-1,j]
                    b[i,j] = SKIPX
                }
                else {                          // Y[j] not in LCS
                    c[i,j] = c[i,j-1]
                    b[i,j] = SKIPY
                }
            }
        }
        return c[m,n];
    }

[Figure 32: Longest common subsequence example. Two panels, "LCS Length Table" and "with back pointers included", for X = BACDB and Y = BDCB, giving LCS = BCB.]

The running
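The LCS pseudocode above can be sketched as runnable Python. This is one possible transcription under stated assumptions, not the notes' own implementation: the names lcs_table and get_lcs are illustrative, Python strings are 0-indexed so x[i] in the pseudocode becomes x[i-1], and the back pointers are stored as plain strings. The get_lcs function, which follows the back pointers from b[m][n] to recover one LCS, is an addition that the preview text does not show.

```python
from typing import List, Tuple

# Back-pointer labels, named as in the pseudocode.
ADDXY, SKIPX, SKIPY = "ADDXY", "SKIPX", "SKIPY"


def lcs_table(x: str, y: str) -> Tuple[List[List[int]], List[List[str]]]:
    """Fill the length table c and back-pointer table b (hypothetical helper)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]      # column 0 and row 0 stay 0
    b = [[SKIPX] * (n + 1) for _ in range(m + 1)]  # column 0 initialized to SKIPX
    for j in range(n + 1):
        b[0][j] = SKIPY                            # row 0 initialized to SKIPY
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:               # last characters match
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = ADDXY
            elif c[i - 1][j] >= c[i][j - 1]:       # x_i not in LCS
                c[i][j] = c[i - 1][j]
                b[i][j] = SKIPX
            else:                                  # y_j not in LCS
                c[i][j] = c[i][j - 1]
                b[i][j] = SKIPY
    return c, b


def get_lcs(x: str, y: str) -> str:
    """Recover one LCS by walking the back pointers from (m, n) (assumed extension)."""
    _, b = lcs_table(x, y)
    i, j = len(x), len(y)
    out: List[str] = []
    while i > 0 and j > 0:
        if b[i][j] == ADDXY:
            out.append(x[i - 1])   # this character belongs to the LCS
            i -= 1
            j -= 1
        elif b[i][j] == SKIPX:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))  # characters were collected back to front
```

For the strings in Figure 32, get_lcs("BACDB", "BDCB") follows the pointers ADDXY, SKIPX, ADDXY, SKIPX, SKIPY, ADDXY and yields "BCB". When several longest common subsequences exist, which one is returned depends on the tie-breaking rule (here, preferring SKIPX when c[i-1][j] equals c[i][j-1]).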
