DOC PREVIEW
MSU CSE 830 - Inexact Matching of Strings

This preview shows page 1-2-3-20-21-40-41-42 out of 42 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 42 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Inexact Matching of StringsMeasuring Distance of S and TExampleSlide 4Edit DistanceAlignmentAlignments and Edit transcriptsEdit Distance ProblemDefinition of D(i,j)Recurrence RelationWhat the various cases meanComputing D(i,j) valuesInitialization: Base CaseRow i=1Entry i=2, j=2Entry i=2, j=3Calculation methodologiesTracebackWhat the pointers meanTable and transcriptsRunning TimeOperation-Weight Edit DistanceModified Recurrence RelationAlphabet-Weight Edit DistanceSlide 25Measuring Similarity of S and TSlide 27String Similarity ProblemSlide 29Longest Common Subsequence ProblemComputing alignments using linear space.IllustrationLinear space and an alignmentSlide 34Slide 35Slide 36Slide 37Slide 38Slide 39Recursive StepSlide 41Time RequiredInexact Matching of Strings•General Problem–Input•Strings S and T–Questions•How distant is S from T?•How similar is S to T?•Solution Technique–Dynamic programming with cost/similarity/scoring matrixMeasuring Distance of S and T•Consider S and T•We can transform S into T using the following four operations–insertion of a character into S–deletion of a character from S–substitution (replacement) of a character in S by another character (typically in T)–matching (no operation)Example•S = vintner•T = writers•vintner•wintner (Replace v with w)•wrintner (Insert r)•writner (Delete first n)•writer (Delete second n)•writers (Insert S)Example•Edit Transcript (or just transcript):–a string that describes the transformation of one string into the other•Example–RIMDMDMMI–v intner–wri t ersEdit Distance•Edit distance of strings S and T–The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T–Levenshtein distance, Levenshtein appears to have been the first to define this concept•Optimal transcript–An edit transcript of S and T that has the minimum number of edit operations–cooptimal transcriptsAlignment•A global alignment of strings S and T is obtained–by inserting spaces (dashes) into S and T•they should have the same number of characters (including dashes) at the end–then placing two strings over each other matching one character (or dash) in S with a unique character (or dash) in T–Note ALL positions in both S and T are involvedAlignments and Edit transcripts•Example Alignment–v-intner-–wri-t-ers•Alignments and edit transcripts are interrelated–edit transcript: emphasizes process •the specific mutational events–alignment: emphasizes product •the relationship between the two strings–Alignments are often easier to work with and visualize•also generalize better to more than 2 stringsEdit Distance Problem•Input–2 strings S and T•Task–Output edit distance of S and T–Output optimal edit transcript–Output optimal alignment•Solution method–Dynamic ProgrammingDefinition of D(i,j)•Let D(i,j) be the edit distance of S[1..i] and T[1..j]–The edit distance of the first i characters of S with the first j characters of T–Let |S| = n, |T| = m•D(n,m) = edit distance of S and T•We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= mRecurrence Relation•Base Case:–For 0 <= i <= n, D(i,0) = i–For 0 <= j <= m, D(0,j) = j•Recursive Case:–0 < i <= n, 0 < j <= m–D(i,j) = min•D(i-1,j) + 1 (what does this mean?)•D(i,j-1) + 1 (what does this mean?)•D(i-1,j-1) + d(i,j) (what does this mean?)–d(i,j) = 0 if S(i) = T(j) and is 1 otherwiseWhat the various cases mean•D(i,j) = min–D(i-1,j) + 1: •Align S[1..i-1] with T[1..j] optimally•Match S(i) with a dash in T–D(i,j-1) + 1•Align S[1..i] with T[1..j-1] optimally•Match a dash in S with T(j)–D(i-1,j-1) + d(i,j)•Align S[1..i-1] with T[1..j-1] optimally•Match S(i) with T(j)Computing D(i,j) valuesD(i,j) w r i t e r s0 1 2 3 4 5 6 70v 1i 2n 3t 4n 5e 6r 7Initialization: Base CaseD(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1i 2 2n 3 3t 4 4n 5 5e 6 6r 7 7Row i=1D(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1 1 2 3 4 5 6 7i 2 2n 3 3t 4 4n 5 5e 6 6r 7 7Entry i=2, j=2D(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1 1 2 3 4 5 6 7i 2 2 2 ?n 3 3t 4 4n 5 5e 6 6r 7 7Entry i=2, j=3D(i,j) w r i t e r s0 1 2 3 4 5 6 70 0 1 2 3 4 5 6 7v 1 1 1 2 3 4 5 6 7i 2 2 2 2 ?n 3 3t 4 4n 5 5e 6 6r 7 7Calculation methodologies•Location of edit distance–D(n,m)•Example was to calculate row by row•Can also calculate column by column•Can also use antidiagonals•Key is to build from upper left cornerTraceback•Using table to construct optimal transcript•Pointers in cell D(i,j)–Set a pointer from cell (i,j) to•cell (i, j-1) if D(i,j) = D(i, j-1) + 1•cell (i-1,j) if D(i,j) = D(i-1,j) + 1•cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)–Follow path of pointers from (n,m) back to (0,0)What the pointers mean•horizontal pointer: cell (i,j) to cell (i, j-1) –Align T(j) with a space in S–Insert T(j) into S•vertical pointer: cell (i,j) to cell (i-1, j)–Align S(i) with a space in T–Delete S(i) from S•diagonal pointer: cell (i,j) to cell (i-1, j-1)–Align S(i) with T(j)–Replace S(i) with T(j)Table and transcripts•The pointers represent all optimal transcripts•Theorem: –Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.–Conversely, any optimal transcript is specified by such a path. –The correspondence between paths and transcripts is one to one.Running Time•Initialization of table–O(n+m)•Calculating table and pointers–O(nm)•Traceback for one optimal transcript or optimal alignment–O(n+m)Operation-Weight Edit Distance•Consider S and T•We can assign weights to the various operations–insertion/deletion of a character: cost d–substitution (replacement) of a character: cost r–matching: cost e–Previous case: d = r = 1, e = 0Modified Recurrence Relation•Base Case:–For 0 <= i <= n, D(i,0) = i d–For 0 <= j <= m, D(0,j) = j d•Recursive Case:–0 < i <= n, 0 < j <= m–D(i,j) = min•D(i-1,j) + d•D(i,j-1) + d•D(i-1,j-1) + d(i,j)–d(i,j) = e if S(i) = T(j) and is r otherwiseAlphabet-Weight Edit Distance•Define weight of each possible substitution–r(a,b) where a is being replaced by b for all a,b in the alphabet–For example, with DNA, maybe r(A,T) > r(A,G)–Likewise, I(a) may vary by character•Operation-weight edit distance is a special case of this variation•Weighted edit distance refers to this


View Full Document

MSU CSE 830 - Inexact Matching of Strings

Download Inexact Matching of Strings
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Inexact Matching of Strings and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Inexact Matching of Strings 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?