String MatchingWhether a candidate x is a factor of textDefinitions• Length[x] = |x|• |text| >> |x|• Alphabet A–Binary, A ={0,1}– Decimal, A ={0,1,2..,9}– English Letters, A ={a,b,c..,z}– DNA bases, A ={A,G,C,T}Naïve String MatchingNaïve String Matching• O(n-m+1) in the worst case: matches at very end• Random x and test means algorithm is efficient• Information from one candidate shift is not exploited in subsequent candidate shiftBoyer Moore Algorithm• Resembles Naïve String Matching• Except matching is done right-to-left• Shifts are done in larger increments• Good suffix heuristic• Bad character heuristicBoyer-MooreBad CharacterGood SuffixBoyer-Moore principle• Good Suffix heuristic and Bad Character heuristic operate independently and in parallel• After mismatch is detected, each proposes an amount by which s can be safely increased without missing a valid shift• Larger of these shifts is selectedBoyer-Moore Implementation• Good Suffix Function–G(x) is a table that for eeach suffix gives the second rightmost occurrence in x– For “estimates” s has value 2, es has value 1• Last occurrence Function– Table containing every letter of alphabet and position of its rightmost occurrence in x– For “estimates”, a=6, b=c=d=0, e=8, etc.Edit Distance• Not clear whether “abbccc” is closer to “aabbcc” or to “abbcccb”• Edit distance is the number of fundamental operations required to transform one string to the otherCost MatrixC[i, j] = cost of transforming x to yEdit Distance (excused, exhausted) = C[i = 7, j = 9] = 3String Matching with ErrorsString Matching with Don’t
View Full Document