1/19/2011 CSE842, Spring 2011, MSU 1CSE 842Natural Language ProcessingLecture 3: Part-of-Speech Tagging1/19/2011 CSE842, Spring 2011, MSU 2AnnouncementHomework 1 will be posted under ~cse842/Homework/hw1 later today. - Due: Feb. 7- Written part at the beginning of the class- Programming part at 11:59pm on the due date via “handin”1/19/2011 CSE842, Spring 2011, MSU 3Minimum Edit Distance• Is the minimum number of editing operations needed to transform one into the other–Insertion–Deletion– Substitution• Many applications in string comparison/alignment, e.g., spell checking, WER in SR, machine translation, bioinformatics, etc. 1/19/2011 CSE842, Spring 2011, MSU 4Minimum Edit Distance• If each operation has cost of 1– Distance between these is 5• If substitutions cost 2 (Levenshtein)– Distance between them is 81/19/2011 CSE842, Spring 2011, MSU 5Minimum Edit DistanceOne possible path1/19/2011 CSE842, Spring 2011, MSU 6Defining Min Edit Distance• For two strings S1of len n, S2of len m– distance(i,j) or D(i,j)• means the edit distance of S1[1..i] and S2[1..j]• i.e., the minimum number of edit operations need to transform the first i characters of S1into the first jcharacters of S2• The edit distance of S1, S2is D(n,m)• We compute D(n,m) by computing D(i,j) for all i(0 ≤ i ≤ n) and j (0 ≤ j ≤ m)• Note the index associated with the source/target string: first is source and second is the target1/19/2011 61/19/2011 CSE842, Spring 2011, MSU 7Defining Min Edit Distance• Base conditions:–D(i,0) = i /* deletion cost*/–D(0,j) = j /* insertion cost*/– Recurrence Relation:D(i-1,j) + 1 /* cost for deletion*/–D(i,j) = min D(i,j-1) + 1 /* cost for insertion*/D(i-1,j-1) + 2; if S1(i) ≠ S2(j) 0; if S1(i) = S2(j)/* cost for substitution */1/19/2011 71/19/2011 CSE842, Spring 2011, MSU 8Dynamic Programming• A tabular computation of D(n,m)• Bottom-up– Compute D(i,j) for smaller i,j– Increase i, j to computer D(i,j) using previously computed values based on smaller indexes. 1/19/2011 81/19/2011 CSE842, Spring 2011, MSU 9N9O8I7T6N5E4T3N2I1# 0 123456789# EXECUT I ONThe Edit Distance Tablesourcetarget1/19/2011 CSE842, Spring 2011, MSU 10N9O8I7T6N5E4T3N2I1# 0 123456789# EXE CUT I ON1/19/2011 CSE842, Spring 2011, MSU 11Acquiring the alignment• We can keep track of a “backtrace”• Every time we enter a cell, remember where we come from• Then when we reach the end, we can trace back from the upper right corner to get an alignment1/19/2011 CSE842, Spring 2011, MSU 12Adding Backtrace to MinEdit• Base conditions:–D(i,0) = i–D(0,j) = j• Recurrence Relation:D(i-1,j) + 1–D(i,j) = min D(i,j-1) + 1D(i-1,j-1) + 1; if S1(i) ≠ S2(j) 0; if S1(i) = S2(j)DOWNLEFTDIAGONAL1/19/2011 CSE842, Spring 2011, MSU 13N9O8I7T6N5E4T3N2I12# 0 123456789# EXE CUT I ONdeletioninsertionsubstitution1/19/2011 CSE842, Spring 2011, MSU 14N9O8I7T6N5E4T3N2I 123# 0 123456789# EXE CUT I ON1/19/2011 CSE842, Spring 2011, MSU 15N9O8I7T6N5E4T3N2I 123# 0 123456789# EXE CUT I ON1/19/2011 CSE842, Spring 2011, MSU 16N989101112111098O878910111098 9I 767891098 910T6567898 91011N545678 9 101110E43456789109T345 6787898N2 3 45678787I 1 234567678# 0 123456789# EXECUT I ON1/19/2011 CSE842, Spring 2011, MSU 17MinEdit with Backtrace1/19/2011 171/19/2011 CSE842, Spring 2011, MSU 18MinEdit with Backtrace1/19/2011 18subinssubdel1/19/2011 CSE842, Spring 2011, MSU 19Performance•Time: O(nm)• Space: O(nm)•Backtrace: O(n+m)1/19/2011 CSE842, Spring 2011, MSU 20Part of Speech (POS) Tagging•POS• Tagsets• Tagging algorithms–Rule-based– Transformation-based learning– HMM Viterbi Algorithm1/19/2011 CSE842, Spring 2011, MSU 21Part of Speech• Eight basic POS:– Noun, verb, pronoun, preposition, adjective, conjunction, article, adverb– Based on morphological and syntactic function• Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...• Lots of debate within linguistics about the number, nature, and universality of these– We’ll completely ignore this debate.1/19/2011 CSE842, Spring 2011, MSU 22Open and Closed Classes• Closed class: a small fixed membership – Prepositions: of, in, by, …– Auxiliaries: may, can, will had, been, …– Pronouns: I, you, she, mine, his, them, …–Usually function words (short common words which play a role in grammar)• Open class: new ones can be created all the time– English has 4: Nouns, Verbs, Adjectives, Adverbs– Many languages have these 4, but not all!1/19/2011 CSE842, Spring 2011, MSU 23Closed Class WordsExamples:– prepositions: on, under, over, …– particles: up, down, on, off, …–determiners: a, an, the, …– pronouns: she, who, I, ..– conjunctions: and, but, or, …– auxiliary verbs: can, may should, …– numerals: one, two, three, third, …1/19/2011 CSE842, Spring 2011, MSU 24Open Class Words• Nouns– Proper nouns (East Lansing, Warton Center, Lou Ann)• English capitalizes these.– Common nouns (the rest). – Count nouns and mass nouns• Count: have plurals, get counted: goat/goats, one goat, two goats• Mass: don’t get counted (snow, salt, communism) (*two snows)• Adverbs: tend to modify things– Unfortunately, John walked home extremely slowly yesterday– Directional/locative adverbs (here,home, downhill)– Degree adverbs (extremely, very, somewhat)– Manner adverbs (slowly, carefully, delicately)•Verbs– In English, have morphological affixes (eat/eats/eaten)1/19/2011 CSE842, Spring 2011, MSU 25Prepositions from CELEX1/19/2011 CSE842, Spring 2011, MSU 26English Particles1/19/2011 CSE842, Spring 2011, MSU 27Part of Speech Tagging• What is POS tagging?• Associate with each word a lexical tag– 45 classes from Penn Treebank– 87 classes from Brown Corpus– 146 classes from C7 tagset (CLAWS system)1/19/2011 CSE842, Spring 2011, MSU 28Why is POS Tagging Useful? • It is a first step of many practical tasks. • Speech synthesis– How to pronounce “lead”?– INsult inSULT– OBject obJECT– OVERflow overFLOW– DIScount disCOUNT– CONtent conTENT•Parsing– Need to know if a word is an N or V before you can parse• Information extraction– Finding names, relations, etc.1/19/2011 CSE842, Spring 2011, MSU 29Penn Treebank• Large Corpora of 4.5 million words of American English– POS Tagged– Syntactic Bracketing• http://www.cis.upenn.edu/~treebank•
View Full Document