Inexact Matching of StringsMeasuring Distance of S and TExampleSlide 4Edit DistanceAlignmentAlignments and Edit transcriptsEdit Distance ProblemDefinition of D(i,j)Recurrence RelationWhat the various cases meanComputing D(i,j) valuesInitialization: Base CaseRow i=1Entry i=2, j=2Entry i=2, j=3Calculation methodologiesTracebackWhat the pointers meanTable and transcriptsRunning TimeOperation-Weight Edit DistanceModified Recurrence RelationAlphabet-Weight Edit DistanceSlide 25Measuring Similarity of S and TSlide 27String Similarity ProblemSlide 29Longest Common Subsequence ProblemComputing alignments using linear space.IllustrationLinear space and an alignmentSlide 34Slide 35Slide 36Slide 37Slide 38Slide 39Recursive StepSlide 41Time RequiredInexact Matching of Strings•General Problem–Input•Strings S and T–Questions•How distant is S from T?•How similar is S to T?•Solution Technique–Dynamic programming with cost/similarity/scoring matrixMeasuring Distance of S and T•Consider S and T•We can transform S into T using the following four operations–insertion of a character into S–deletion of a character from S–substitution (replacement) of a character in S by another character (typically in T)–matching (no operation)Example•S = vintner•T = writers•vintner•wintner (Replace v with w)•wrintner (Insert r)•writner (Delete first n)•writer (Delete second n)•writers (Insert S)Example•Edit Transcript (or just transcript):–a string that describes the transformation of one string into the other•Example–RIMDMDMMI–v intner–wri t ersEdit Distance•Edit distance of strings S and T–The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T–Levenshtein distance, Levenshtein appears to have been the first to define this concept•Optimal transcript–An edit transcript of S and T that has the minimum number of edit operations–cooptimal transcriptsAlignment•A global alignment of strings S and T is obtained–by inserting spaces (dashes) into S and T•they should have the same number of characters (including dashes) at the end–then placing two strings over each other matching one character (or dash) in S with a unique character (or dash) in T–Note ALL positions in both S and T are involvedAlignments and Edit transcripts•Example Alignment–v-intner-–wri-t-ers•Alignments and edit transcripts are interrelated–edit transcript: emphasizes process •the specific mutational events–alignment: emphasizes product •the relationship between the two strings–Alignments are often easier to work with and visualize•also generalize better to more than 2 stringsEdit Distance Problem•Input–2 strings S and T•Task–Output edit distance of S and T–Output optimal edit transcript–Output optimal alignment•Solution method–Dynamic ProgrammingDefinition of D(i,j)•Let D(i,j) be the edit distance of S[1..i] and T[1..j]–The edit distance of the first i characters of S with the first j characters of T–Let |S| = n, |T| = m•D(n,m) = edit distance of S and T•We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= mRecurrence Relation•Base Case:–For 0 <= i <= n, D(i,0) = i–For 0 <= j <= m, D(0,j) = j•Recursive Case:–0 < i <= n, 0 < j <= m–D(i,j) = min•D(i-1,j) + 1 (what does this mean?)•D(i,j-1) + 1 (what does this mean?)•D(i-1,j-1) + d(i,j) (what does this mean?)–d(i,j) = 0 if S(i) = T(j) and is 1 otherwiseWhat the various cases mean•D(i,j) = min–D(i-1,j) + 1: •Align S[1..i-1] with T[1..j] optimally•Match S(i) with a dash in T–D(i,j-1) + 1•Align S[1..i] with T[1..j-1] optimally•Match a dash in S with T(j)–D(i-1,j-1) + d(i,j)•Align S[1..i-1] with T[1..j-1] optimally•Match S(i) with T(j)Computing D(i,j) valuesD(i,j) w r i t e r s0 1 2 3 4 5 6 70v 1i 2n 3t 4n 5e 6r 7Initialization: Base CaseD(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1i 2 2n 3 3t 4 4n 5 5e 6 6r 7 7Row i=1D(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1 1 2 3 4 5 6 7i 2 2n 3 3t 4 4n 5 5e 6 6r 7 7Entry i=2, j=2D(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1 1 2 3 4 5 6 7i 2 2 2 ?n 3 3t 4 4n 5 5e 6 6r 7 7Entry i=2, j=3D(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1 1 2 3 4 5 6 7i 2 2 2 2 ?n 3 3t 4 4n 5 5e 6 6r 7 7Calculation methodologies•Location of edit distance–D(n,m)•Example was to calculate row by row•Can also calculate column by column•Can also use antidiagonals•Key is to build from upper left cornerTraceback•Using table to construct optimal transcript•Pointers in cell D(i,j)–Set a pointer from cell (i,j) to•cell (i, j-1) if D(i,j) = D(i, j-1) + 1•cell (i-1,j) if D(i,j) = D(i-1,j) + 1•cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)–Follow path of pointers from (n,m) back to (0,0)What the pointers mean•horizontal pointer: cell (i,j) to cell (i, j-1) –Align T(j) with a space in S–Insert T(j) into S•vertical pointer: cell (i,j) to cell (i-1, j)–Align S(i) with a space in T–Delete S(i) from S•diagonal pointer: cell (i,j) to cell (i-1, j-1)–Align S(i) with T(j)–Replace S(i) with T(j)Table and transcripts•The pointers represent all optimal transcripts•Theorem: –Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.–Conversely, any optimal transcript is specified by such a path. –The correspondence between paths and transcripts is one to one.Running Time•Initialization of table–O(n+m)•Calculating table and pointers–O(nm)•Traceback for one optimal transcript or optimal alignment–O(n+m)Operation-Weight Edit Distance•Consider S and T•We can assign weights to the various operations–insertion/deletion of a character: cost d–substitution (replacement) of a character: cost r–matching: cost e–Previous case: d = r = 1, e = 0Modified Recurrence Relation•Base Case:–For 0 <= i <= n, D(i,0) = i d–For 0 <= j <= m, D(0,j) = j d•Recursive Case:–0 < i <= n, 0 < j <= m–D(i,j) = min•D(i-1,j) + d•D(i,j-1) + d•D(i-1,j-1) + d(i,j)–d(i,j) = e if S(i) = T(j) and is r otherwiseAlphabet-Weight Edit Distance•Define weight of each possible substitution–r(a,b) where a is being replaced by b for all a,b in the alphabet–For example, with DNA, maybe r(A,T) > r(A,G)–Likewise, I(a) may vary by character•Operation-weight edit distance is a special case of this variation•Weighted edit distance refers to this
View Full Document