Minimum&Edit&Distance&!Defini'on!of!Minimum!Edit!Distance!!Dan!Jurafsky!How&similar&are&two&strings?&• Spell!correc'on!• The!user!typed!“graffe”!Which!is!closest?!!• graf!• graB!• grail!• giraffe!• Computa'onal!Biology!• Align!two!sequences!of!nucleo'des!• Resul'ng!alignment:!• Also!for!Machine!Transla'on,!Informa'on!Extrac'on,!Speech!Recogni'on!AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGACDan!Jurafsky!Edit&Distance&• The!minimum!edit!distance!between!two!strings!• Is!the!minimum!number!of!edi'ng!opera'ons!• Inser'on!• Dele'on!• Subs'tu'on!• Needed!to!transform!one!into!the!other!Dan!Jurafsky!Minimum&Edit&Distance&• Two!strings!and!their!alignment:!Dan!Jurafsky!Minimum&Edit&Distance&• If!each!opera'on!has!cost!of!1!• Distance!between!these!is!5!• If!subs'tu'ons!cost!2!(Levenshtein)!• Distance!between!them!is!8!Dan!Jurafsky!Alignment&in&Computa9onal&Biology&• Given!a!sequence!of!bases!!• An!alignment:!• Given!two!sequences,!align!each!leXer!to!a!leXer!or!gap!-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGACDan!Jurafsky!Other&uses&of&Edit&Distance&in&NLP&• Evalua'ng!Machine!Transla'on!and!speech!recogni'on!R Spokesman confirms senior government adviser was shot!H Spokesman said the senior adviser was shot dead! S I D I!• Named!En'ty!Extrac'on!and!En'ty!Coreference!• IBM!Inc.!announced!today!• IBM!profits!• Stanford!President!John!Hennessy!announced!yesterday!• for!Stanford!University!President!John!Hennessy!!Dan!Jurafsky!How&to&find&the&Min&Edit&Distance?&• Searching!for!a!path!(sequence!of!edits)!from!the!start!string!to!the!final!string:!• Ini9al&state:!the!word!we’re!transforming!• Operators:!insert,!delete,!subs'tute!• Goal&state:!!the!word!we’re!trying!to!get!to!• Path&cost:!what!we!want!to!minimize:!the!number!of!edits!8Dan!Jurafsky!Minimum&Edit&as&Search&• But!the!space!of!all!edit!sequences!is!huge!!• We!can’t!afford!to!navigate!naïvely!• Lots!of!dis'nct!paths!wind!up!at!the!same!state.!• We!don’t!have!to!keep!track!of!all!of!them!• Just!the!shortest!path!to!each!of!those!revisted!states.!9Dan!Jurafsky!Defining&Min&Edit&Distance&• For!two!strings!• X!of!length!n!!• Y!of!length!m#• We!define!D(i,j)!• the!edit!distance!between!X[1..i]!and!Y[1..j]!!• i.e.,!the!first!i!characters!of!X!and!the!first!j!characters!of!Y!• The!edit!distance!between!X!and!Y!is!thus!D(n,m)!Minimum&Edit&Distance&!Defini'on!of!Minimum!Edit!Distance!!Minimum&Edit&Distance&!Compu'ng!Minimum!Edit!Distance!!Dan!Jurafsky!Dynamic&Programming&for&Minimum&Edit&Distance&• Dynamic&programming:!A!tabular!computa'on!of!D(n,m)&• Solving!problems!by!combining!solu'ons!to!subproblems.!• BoXomfup!• We!compute!D(i,j)!for!small!i,j##• And!compute!larger!D(i,j)!based!on!previously!computed!smaller!values!• i.e.,!compute!D(i,j)!for!all!i!(0!<!i!<!n)!!and#j#(0!<!j!<!m)!Dan!Jurafsky!Defining&Min&Edit&Distance&(Levenshtein)&• Ini'aliza'on!D(i,0) = i!D(0,j) = j#• Recurrence!Rela'on:#For each i = 1…M!! For each j = 1…N# D(i-1,j) + 1! D(i,j)= min D(i,j-1) + 1! D(i-1,j-1) + 2; if X(i) ≠ Y(j) ! 0; if X(i) = Y(j)!• Termina'on:#D(N,M) is distance !!Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N The&Edit&Distance&Table&Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N The&Edit&Distance&Table&Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Edit&Distance&Dan!Jurafsky!N 9 8 9 10 11 12 11 10 9 8 O 8 7 8 9 10 11 10 9 8 9 I 7 6 7 8 9 10 9 8 9 10 T 6 5 6 7 8 9 8 9 10 11 N 5 4 5 6 7 8 9 10 11 10 E 4 3 4 5 6 7 8 9 10 9 T 3 4 5 6 7 8 7 8 9 8 N 2 3 4 5 6 7 8 7 8 7 I 1 2 3 4 5 6 7 6 7 8 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N The&Edit&Distance&Table&Minimum&Edit&Distance&!Compu'ng!Minimum!Edit!Distance!!Minimum&Edit&Distance&!Backtrace!for!Compu'ng!Alignments!!Dan!Jurafsky!Compu9ng&alignments&• Edit!distance!isn’t!sufficient!• We!oBen!need!to!align!each!character!of!the!two!strings!to!each!other!• We!do!this!by!keeping!a!“backtrace”!• Every!'me!we!enter!a!cell,!remember!where!we!came!from!• When!we!reach!the!end,!!• Trace!back!the!path!from!the!upper!right!corner!to!read!off!the!alignment!Dan!Jurafsky!N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Edit&Distance&Dan!Jurafsky!MinEdit&with&Backtrace&Dan!Jurafsky!Adding&Backtrace&to&Minimum&Edit&Distance&• Base!condi'ons:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Termina'on:!D(i,0) = i D(0,j) = j D(N,M) is distance #• Recurrence!Rela'on:#For each i = 1…M!! For each j = 1…N# D(i-1,j) + 1! D(i,j)= min D(i,j-1) + 1! D(i-1,j-1) + 2; if X(i) ≠ Y(j) ! 0; if X(i) = Y(j)! LEFT! ptr(i,j)= DOWN! DIAG!!inser'on!dele'on!subs'tu'on!inser'on!dele'on!subs'tu'on!Dan!Jurafsky!The&Distance&Matrix&Slide!adapted!from!Serafim!Batzoglou!y0yMx0xNEvery!nonfdecreasing!path!!!from!(0,0)!to!(M,!N)!!!corresponds!to!!an!alignment!!of!the!two!sequences!!AnoptimalalignmentiscomposedofoptimalsubalignmentsDan!Jurafsky!Result&of&Backtrace&• Two!strings!and!their!alignment:!Dan!Jurafsky!Performance&• Time:!!! !!O(nm)!• Space:!!! !!O(nm)!• Backtrace!!! !!O(n+m)!Minimum&Edit&Distance&!Backtrace!for!Compu'ng!Alignments!!Minimum&Edit&Distance&!Weighted!Minimum!Edit!Distance!!Dan!Jurafsky!Weighted&Edit&Distance&• Why!would!we!add!weights!to!the!computa'on?!•
View Full Document