Unformatted text preview:

Minimum&Edit&Distance&!Defini'on!of!Minimum!Edit!Distance!!Dan!Jurafsky!How&similar&are&two&strings?&• Spell!correc'on!• The!user!typed!“graffe”!Which!is!closest?!!• graf!• graB!• grail!• giraffe!• Computa'onal!Biology!• Align!two!sequences!of!nucleo'des!• Resul'ng!alignment:!• Also!for!Machine!Transla'on,!Informa'on!Extrac'on,!Speech!Recogni'on!AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGACDan!Jurafsky!Edit&Distance&• The!minimum!edit!distance!between!two!strings!• Is!the!minimum!number!of!edi'ng!opera'ons!• Inser'on!• Dele'on!• Subs'tu'on!• Needed!to!transform!one!into!the!other!Dan!Jurafsky!Minimum&Edit&Distance&• Two!strings!and!their!alignment:!Dan!Jurafsky!Minimum&Edit&Distance&• If!each!opera'on!has!cost!of!1!• Distance!between!these!is!5!• If!subs'tu'ons!cost!2!(Levenshtein)!• Distance!between!them!is!8!Dan!Jurafsky!Alignment&in&Computa9onal&Biology&• Given!a!sequence!of!bases!!• An!alignment:!• Given!two!sequences,!align!each!leXer!to!a!leXer!or!gap!-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGACDan!Jurafsky!Other&uses&of&Edit&Distance&in&NLP&• Evalua'ng!Machine!Transla'on!and!speech!recogni'on!R Spokesman confirms senior government adviser was shot!H Spokesman said the senior adviser was shot dead! S I D I!• Named!En'ty!Extrac'on!and!En'ty!Coreference!• IBM!Inc.!announced!today!• IBM!profits!• Stanford!President!John!Hennessy!announced!yesterday!• for!Stanford!University!President!John!Hennessy!!Dan!Jurafsky!How&to&find&the&Min&Edit&Distance?&• Searching!for!a!path!(sequence!of!edits)!from!the!start!string!to!the!final!string:!• Ini9al&state:!the!word!we’re!transforming!• Operators:!insert,!delete,!subs'tute!• Goal&state:!!the!word!we’re!trying!to!get!to!• Path&cost:!what!we!want!to!minimize:!the!number!of!edits!8Dan!Jurafsky!Minimum&Edit&as&Search&• But!the!space!of!all!edit!sequences!is!huge!!• We!can’t!afford!to!navigate!naïvely!• Lots!of!dis'nct!paths!wind!up!at!the!same!state.!• We!don’t!have!to!keep!track!of!all!of!them!• Just!the!shortest!path!to!each!of!those!revisted!states.!9Dan!Jurafsky!Defining&Min&Edit&Distance&• For!two!strings!• X!of!length!n!!• Y!of!length!m#• We!define!D(i,j)!• the!edit!distance!between!X[1..i]!and!Y[1..j]!!• i.e.,!the!first!i!characters!of!X!and!the!first!j!characters!of!Y!• The!edit!distance!between!X!and!Y!is!thus!D(n,m)!Minimum&Edit&Distance&!Defini'on!of!Minimum!Edit!Distance!!Minimum&Edit&Distance&!Compu'ng!Minimum!Edit!Distance!!Dan!Jurafsky!Dynamic&Programming&for&Minimum&Edit&Distance&• Dynamic&programming:!A!tabular!computa'on!of!D(n,m)&• Solving!problems!by!combining!solu'ons!to!subproblems.!• BoXomfup!• We!compute!D(i,j)!for!small!i,j##• And!compute!larger!D(i,j)!based!on!previously!computed!smaller!values!• i.e.,!compute!D(i,j)!for!all!i!(0!<!i!<!n)!!and#j#(0!<!j!<!m)!Dan!Jurafsky!Defining&Min&Edit&Distance&(Levenshtein)&• Ini'aliza'on!D(i,0) = i!D(0,j) = j#• Recurrence!Rela'on:#For each i = 1…M!! For each j = 1…N# D(i-1,j) + 1! D(i,j)= min D(i,j-1) + 1! D(i-1,j-1) + 2; if X(i) ≠ Y(j) ! 0; if X(i) = Y(j)!• Termina'on:#D(N,M) is distance !!Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N The&Edit&Distance&Table&Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N The&Edit&Distance&Table&Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Edit&Distance&Dan!Jurafsky!N 9 8 9 10 11 12 11 10 9 8 O 8 7 8 9 10 11 10 9 8 9 I 7 6 7 8 9 10 9 8 9 10 T 6 5 6 7 8 9 8 9 10 11 N 5 4 5 6 7 8 9 10 11 10 E 4 3 4 5 6 7 8 9 10 9 T 3 4 5 6 7 8 7 8 9 8 N 2 3 4 5 6 7 8 7 8 7 I 1 2 3 4 5 6 7 6 7 8 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N The&Edit&Distance&Table&Minimum&Edit&Distance&!Compu'ng!Minimum!Edit!Distance!!Minimum&Edit&Distance&!Backtrace!for!Compu'ng!Alignments!!Dan!Jurafsky!Compu9ng&alignments&• Edit!distance!isn’t!sufficient!• We!oBen!need!to!align!each!character!of!the!two!strings!to!each!other!• We!do!this!by!keeping!a!“backtrace”!• Every!'me!we!enter!a!cell,!remember!where!we!came!from!• When!we!reach!the!end,!!• Trace!back!the!path!from!the!upper!right!corner!to!read!off!the!alignment!Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Edit&Distance&Dan!Jurafsky!MinEdit&with&Backtrace&Dan!Jurafsky!Adding&Backtrace&to&Minimum&Edit&Distance&• Base!condi'ons:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Termina'on:!D(i,0) = i D(0,j) = j D(N,M) is distance #• Recurrence!Rela'on:#For each i = 1…M!! For each j = 1…N# D(i-1,j) + 1! D(i,j)= min D(i,j-1) + 1! D(i-1,j-1) + 2; if X(i) ≠ Y(j) ! 0; if X(i) = Y(j)! LEFT! ptr(i,j)= DOWN! DIAG!!inser'on!dele'on!subs'tu'on!inser'on!dele'on!subs'tu'on!Dan!Jurafsky!The&Distance&Matrix&Slide!adapted!from!Serafim!Batzoglou!y0yMx0xNEvery!nonfdecreasing!path!!!from!(0,0)!to!(M,!N)!!!corresponds!to!!an!alignment!!of!the!two!sequences!!AnoptimalalignmentiscomposedofoptimalsubalignmentsDan!Jurafsky!Result&of&Backtrace&• Two!strings!and!their!alignment:!Dan!Jurafsky!Performance&• Time:!!! !!O(nm)!• Space:!!! !!O(nm)!• Backtrace!!! !!O(n+m)!Minimum&Edit&Distance&!Backtrace!for!Compu'ng!Alignments!!Minimum&Edit&Distance&!Weighted!Minimum!Edit!Distance!!Dan!Jurafsky!Weighted&Edit&Distance&• Why!would!we!add!weights!to!the!computa'on?!•


View Full Document

Stanford CS 124 - LECTURE NOTES

Download LECTURE NOTES
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view LECTURE NOTES and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view LECTURE NOTES 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?