DOC PREVIEW
MSU CSE 842 - Natural Language Processing
Course Cse 842-
Pages 12

This preview shows page 1-2-3-4 out of 12 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1/19/2011 CSE842, Spring 2011, MSU 1CSE 842Natural Language ProcessingLecture 3: Part-of-Speech Tagging1/19/2011 CSE842, Spring 2011, MSU 2AnnouncementHomework 1 will be posted under ~cse842/Homework/hw1 later today. - Due: Feb. 7- Written part at the beginning of the class- Programming part at 11:59pm on the due date via “handin”1/19/2011 CSE842, Spring 2011, MSU 3Minimum Edit Distance• Is the minimum number of editing operations needed to transform one into the other–Insertion–Deletion– Substitution• Many applications in string comparison/alignment, e.g., spell checking, WER in SR, machine translation, bioinformatics, etc. 1/19/2011 CSE842, Spring 2011, MSU 4Minimum Edit Distance• If each operation has cost of 1– Distance between these is 5• If substitutions cost 2 (Levenshtein)– Distance between them is 81/19/2011 CSE842, Spring 2011, MSU 5Minimum Edit DistanceOne possible path1/19/2011 CSE842, Spring 2011, MSU 6Defining Min Edit Distance• For two strings S1of len n, S2of len m– distance(i,j) or D(i,j)• means the edit distance of S1[1..i] and S2[1..j]• i.e., the minimum number of edit operations need to transform the first i characters of S1into the first jcharacters of S2• The edit distance of S1, S2is D(n,m)• We compute D(n,m) by computing D(i,j) for all i(0 ≤ i ≤ n) and j (0 ≤ j ≤ m)• Note the index associated with the source/target string: first is source and second is the target1/19/2011 61/19/2011 CSE842, Spring 2011, MSU 7Defining Min Edit Distance• Base conditions:–D(i,0) = i /* deletion cost*/–D(0,j) = j /* insertion cost*/– Recurrence Relation:D(i-1,j) + 1 /* cost for deletion*/–D(i,j) = min D(i,j-1) + 1 /* cost for insertion*/D(i-1,j-1) + 2; if S1(i) ≠ S2(j) 0; if S1(i) = S2(j)/* cost for substitution */1/19/2011 71/19/2011 CSE842, Spring 2011, MSU 8Dynamic Programming• A tabular computation of D(n,m)• Bottom-up– Compute D(i,j) for smaller i,j– Increase i, j to computer D(i,j) using previously computed values based on smaller indexes. 1/19/2011 81/19/2011 CSE842, Spring 2011, MSU 9N9O8I7T6N5E4T3N2I1# 0 123456789# EXECUT I ONThe Edit Distance Tablesourcetarget1/19/2011 CSE842, Spring 2011, MSU 10N9O8I7T6N5E4T3N2I1# 0 123456789# EXE CUT I ON1/19/2011 CSE842, Spring 2011, MSU 11Acquiring the alignment• We can keep track of a “backtrace”• Every time we enter a cell, remember where we come from• Then when we reach the end, we can trace back from the upper right corner to get an alignment1/19/2011 CSE842, Spring 2011, MSU 12Adding Backtrace to MinEdit• Base conditions:–D(i,0) = i–D(0,j) = j• Recurrence Relation:D(i-1,j) + 1–D(i,j) = min D(i,j-1) + 1D(i-1,j-1) + 1; if S1(i) ≠ S2(j) 0; if S1(i) = S2(j)DOWNLEFTDIAGONAL1/19/2011 CSE842, Spring 2011, MSU 13N9O8I7T6N5E4T3N2I12# 0 123456789# EXE CUT I ONdeletioninsertionsubstitution1/19/2011 CSE842, Spring 2011, MSU 14N9O8I7T6N5E4T3N2I 123# 0 123456789# EXE CUT I ON1/19/2011 CSE842, Spring 2011, MSU 15N9O8I7T6N5E4T3N2I 123# 0 123456789# EXE CUT I ON1/19/2011 CSE842, Spring 2011, MSU 16N989101112111098O878910111098 9I 767891098 910T6567898 91011N545678 9 101110E43456789109T345 6787898N2 3 45678787I 1 234567678# 0 123456789# EXECUT I ON1/19/2011 CSE842, Spring 2011, MSU 17MinEdit with Backtrace1/19/2011 171/19/2011 CSE842, Spring 2011, MSU 18MinEdit with Backtrace1/19/2011 18subinssubdel1/19/2011 CSE842, Spring 2011, MSU 19Performance•Time: O(nm)• Space: O(nm)•Backtrace: O(n+m)1/19/2011 CSE842, Spring 2011, MSU 20Part of Speech (POS) Tagging•POS• Tagsets• Tagging algorithms–Rule-based– Transformation-based learning– HMM Viterbi Algorithm1/19/2011 CSE842, Spring 2011, MSU 21Part of Speech• Eight basic POS:– Noun, verb, pronoun, preposition, adjective, conjunction, article, adverb– Based on morphological and syntactic function• Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...• Lots of debate within linguistics about the number, nature, and universality of these– We’ll completely ignore this debate.1/19/2011 CSE842, Spring 2011, MSU 22Open and Closed Classes• Closed class: a small fixed membership – Prepositions: of, in, by, …– Auxiliaries: may, can, will had, been, …– Pronouns: I, you, she, mine, his, them, …–Usually function words (short common words which play a role in grammar)• Open class: new ones can be created all the time– English has 4: Nouns, Verbs, Adjectives, Adverbs– Many languages have these 4, but not all!1/19/2011 CSE842, Spring 2011, MSU 23Closed Class WordsExamples:– prepositions: on, under, over, …– particles: up, down, on, off, …–determiners: a, an, the, …– pronouns: she, who, I, ..– conjunctions: and, but, or, …– auxiliary verbs: can, may should, …– numerals: one, two, three, third, …1/19/2011 CSE842, Spring 2011, MSU 24Open Class Words• Nouns– Proper nouns (East Lansing, Warton Center, Lou Ann)• English capitalizes these.– Common nouns (the rest). – Count nouns and mass nouns• Count: have plurals, get counted: goat/goats, one goat, two goats• Mass: don’t get counted (snow, salt, communism) (*two snows)• Adverbs: tend to modify things– Unfortunately, John walked home extremely slowly yesterday– Directional/locative adverbs (here,home, downhill)– Degree adverbs (extremely, very, somewhat)– Manner adverbs (slowly, carefully, delicately)•Verbs– In English, have morphological affixes (eat/eats/eaten)1/19/2011 CSE842, Spring 2011, MSU 25Prepositions from CELEX1/19/2011 CSE842, Spring 2011, MSU 26English Particles1/19/2011 CSE842, Spring 2011, MSU 27Part of Speech Tagging• What is POS tagging?• Associate with each word a lexical tag– 45 classes from Penn Treebank– 87 classes from Brown Corpus– 146 classes from C7 tagset (CLAWS system)1/19/2011 CSE842, Spring 2011, MSU 28Why is POS Tagging Useful? • It is a first step of many practical tasks. • Speech synthesis– How to pronounce “lead”?– INsult inSULT– OBject obJECT– OVERflow overFLOW– DIScount disCOUNT– CONtent conTENT•Parsing– Need to know if a word is an N or V before you can parse• Information extraction– Finding names, relations, etc.1/19/2011 CSE842, Spring 2011, MSU 29Penn Treebank• Large Corpora of 4.5 million words of American English– POS Tagged– Syntactic Bracketing• http://www.cis.upenn.edu/~treebank•


View Full Document

MSU CSE 842 - Natural Language Processing

Course: Cse 842-
Pages: 12
Download Natural Language Processing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Natural Language Processing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Natural Language Processing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?