week 1-2

Introduction

Text preprocessing: clean and regularize text data.
Text parsing: recognize meaningful units and structures of text.

Today, many NLP (Natural Language Processing) tools can perform text preprocessing and parsing automatically with reasonably high accuracy on well-understood text data (e.g., news articles).

Learning Goals
• What are the widely used text preprocessing & parsing steps?
• What are the core ideas of the NLP methods for text preprocessing & parsing?
• When would existing methods be generalizable to new data?

A Typical Text Preprocessing and Parsing Pipeline

raw text: O'Neal averaged 15.2 points, 9.2 rebounds and 1.0 assists per game.
→ sentence segmentation, tokenization
tokens: O'Neal averaged 15.2 points , 9.2 rebounds and 1.0 assists per game .
→ case folding
o’neal averaged 15.2 points , 9.2 rebounds and 1.0 assists per game .
→ stop words removal
o’neal averaged 15.2 points 9.2 rebounds 1.0 assists per game
→ lemmatization
o’neal average 15.2 point 9.2 rebound 1.0 assist per game
→ part-of-speech tagging, chunking, named-entity recognition

A Typical Text Preprocessing and Parsing Pipeline

raw text: O'Neal averaged 15.2 points, 9.2 rebounds and 1.0 assists per game.
→ sentence segmentation, tokenization
tokens: O'Neal averaged 15.2 points , 9.2 rebounds and 1.0 assists per game .
→ part-of-speech (POS) tagging
POS: CD VBD CD NNS , CD NNS CC CD NNS IN NN .
→ chunking, named entity recognition
Noun phrases: NP - NP - NP - NP - NP -
Entities: PERSON - - - - - - - - - - - -

Sentence Segmentation & Tokenization

Computers store text data as just a sequence of characters.

Sentence segmentation: segment a text into sentences
• Rules + exceptions
• Is a punctuation mark a sentence separator or part of a token? e.g., Mr. Dr. Yahoo!

Tokenization: chunk a text into “tokens” (the smallest unit of analysis in text mining)
• A token can be a word, a number, a punctuation mark, etc.
• Not as simple as chunking text by whitespace and other non-alphabetical symbols
  • Apostrophe? e.g., Shaquille O’Neal
  • Comma? e.g., 1,600 feet high
  • Hyphen? e.g., C-3PO, R2-D2
  • [email protected], 608-263-2900, jiepu_jiang
• No set of rules is smart enough to cover all cases

Case-folding: lowercasing everything

Case-folding is widely applied in text information systems
• e.g., Web search engines return the same results for “SMART” and “smart”
• It helps regularize words in text (e.g., words capitalized at the beginning of a sentence)

Sometimes letter case may be informative, e.g.,
• Will Smith
• the US health care system
• He is ABSOLUTELY a genius

Stop words removal

Stop words
• Words that can be ignored in text analysis, e.g., when counting word frequencies
• Usually not very informative for representing the topics of texts
  • (but usually very helpful for understanding the structures of texts)
• Usually have very high frequencies
• Remove or not? Depends on the needs and the text analytics methods

An example list of stop words (from Lucene)
• a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
• Lucene is a widely used text retrieval system; it removes stop words because they are not helpful for keyword search

Lemmatization & Stemming

Purpose
• To group words with the same root or lemma
• Plural → singular, verbs in different tenses, adjectives & adverbs, etc.
• Example: “cats” and “cat”; “search”, “searches”, “searching”

Methods
• Rule-based: defines and applies a set of rules (e.g., suffix stripping)
• Dictionary-based: looks words up in a dictionary, so it can handle exceptions

Porter Stemming

Rule-based, a list of suffix-stripping rules
• Just some examples
  • -sses → -ss, e.g., caresses → caress
  • -ies → -i, e.g., ponies → poni
  • remove -s, e.g., cats → cat
  • -eed → -ee, e.g., agreed → agree
  • remove -ed, e.g., plastered → plaster
  • remove -ing, e.g., motoring → motor
  • -ational → -ate, e.g., relational → relate
  • -tional → -tion, e.g., conditional → condition
• Iterative: organization → organize → organ
• Cannot handle exceptions
• Sometimes hard to interpret (as the outputs are stems, which may not be words)

Krovetz Stemming: Rule + Dictionary

by Robert Krovetz
• R. Krovetz. Viewing morphology as an inference process. SIGIR 1993.

Use of a dictionary to handle exceptions
• A large dictionary of “head words”, e.g., lists of country names and nationalities, proper nouns, etc.
• If a term is a head word, do not stem it
  • policy ≠ police, gravity ≠ grave, and marbled ≠ marble
• If it appears as an entry, convert it to the headword
• Otherwise, fall back to a Porter-like rule-based approach

Stems generated by Krovetz stemming are actual words

Porter and Krovetz Stemming

Original          Porter         Krovetz
communities       commun         community
generated         gener          generate
significantly     significantli  significant
successfully      successfulli   successful
additionally      addition       additional
relatives         rel            relative
internationally   internation    international
importantly       importantli    important
laos              lao            laos
computers         comput         computer
proceeds          proce          proceeds
contents          content        contents
safer             safer          safe

Examples of stemming “errors”

Overstemming

Original          Porter         Krovetz
organization      organ          organization
organ             organ          organ
heading           head           heading
head              head           head

Understemming

Original          Porter         Krovetz
european          european       europe
europe            europ          europe
urgency           urgenc         urgent
urgent            urgent         urgent

Part of Speech (POS) Tagging

• A part of speech is a category of words that have similar grammatical properties.
  • e.g., noun, pronoun, verb, adjective, etc.
• POS tagging annotates each word in a sentence with a part-of-speech marker.
• The most common POS tagset used today is the Penn Treebank POS tagset
  • 36 POS tags and some other tags for punctuation and currency symbols
  • Fine-grained categories
• Lowest level of syntactic analysis.
• Useful for subsequent parsing such as chunking and named entity recognition.

John saw the saw and decided to take it to the table.

Word token:  John  saw  the  saw  and  decided  to  take  it   to  the  table
POS:         NNP   VBD  DT   NN   CC   VBD      TO  VB    PRP  IN  DT   NN

