Introduction
• Text preprocessing: clean and regularize text data.
• Text parsing: recognize meaningful units and structures in text.
• Today, many NLP (Natural Language Processing) tools can perform text preprocessing and parsing automatically with reasonably high accuracy on well-understood text data (e.g., news articles).

Learning Goals
• What are the widely used text preprocessing & parsing steps?
• What are the core ideas of the NLP methods for text preprocessing & parsing?
• When would existing methods generalize to new data?

A Typical Text Preprocessing and Parsing Pipeline
• raw text: O'Neal averaged 15.2 points, 9.2 rebounds and 1.0 assists per game.
• after sentence segmentation and tokenization (tokens): O'Neal averaged 15.2 points , 9.2 rebounds and 1.0 assists per game .
• after case folding: o'neal averaged 15.2 points , 9.2 rebounds and 1.0 assists per game .
• after stop words removal: o'neal averaged 15.2 points 9.2 rebounds 1.0 assists per game
• after lemmatization: o'neal average 15.2 point 9.2 rebound 1.0 assist per game
• followed by part-of-speech tagging, chunking, and named-entity recognition:

Tokens:       O'Neal  averaged  15.2  points  ,  9.2  rebounds  and  1.0  assists  per  game  .
POS tags:     NNP     VBD       CD    NNS     ,  CD   NNS       CC   CD   NNS      IN   NN    .
Noun phrases: [O'Neal], [15.2 points], [9.2 rebounds], [1.0 assists], [game]
Entities:     O'Neal → PERSON

Sentence Segmentation & Tokenization
• Computers store text data as just a sequence of characters …
• Sentence segmentation: split a text into sentences
  • rules + exceptions
  • Is a period a sentence separator or part of a token? e.g., Mr., Dr., Yahoo!
• Tokenization: chunk a text into "tokens" (the smallest unit of analysis in text mining)
  • A token can be a word, a number, a punctuation mark, etc.
  • Not as simple as splitting text on whitespace and other non-alphabetic symbols …
    • Apostrophe? e.g., Shaquille O'Neal
    • Comma? 1,600 feet high
    • Hyphen? C-3PO, R2-D2
    • [email protected], 608-263-2900, jiepu_jiang
  • No rules are smart enough to cover all cases …
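Below is a minimal sketch of sentence segmentation and tokenization in Python with the NLTK library; the library choice, the second sentence, and the exact token output are illustrative assumptions, not part of the lecture slides.

```python
# Minimal sketch: sentence segmentation and tokenization with NLTK.
# Assumption: NLTK is installed and its "punkt" tokenizer models have been
# downloaded via nltk.download("punkt").
import nltk

text = ("O'Neal averaged 15.2 points, 9.2 rebounds and 1.0 assists per game. "
        "Mr. Smith watched every game.")

# Sentence segmentation: statistical/rule-based splitting that handles many
# abbreviation exceptions (e.g., "Mr."), though not every edge case.
sentences = nltk.sent_tokenize(text)

# Tokenization: smarter than splitting on whitespace -- punctuation becomes its
# own token, while numbers like "15.2" and names like "O'Neal" stay intact.
for sentence in sentences:
    print(nltk.word_tokenize(sentence))
# Expected (roughly): ["O'Neal", 'averaged', '15.2', 'points', ',', '9.2',
#   'rebounds', 'and', '1.0', 'assists', 'per', 'game', '.']
#   ['Mr.', 'Smith', 'watched', 'every', 'game', '.']
```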
Case Folding: Lowercasing Everything
• Case folding is widely applied in many text information systems …
  • e.g., Web search engines return the same results for "SMART" and "smart"
  • It helps regularize words in text (e.g., words capitalized at the beginning of a sentence)
• Sometimes letter case is informative, e.g.,
  • Will Smith
  • the US health care system
  • He is ABSOLUTELY a genius

Stop Words Removal
• Stop words: words that can be ignored in text analysis, e.g., when counting word frequencies
  • Usually not very informative for representing the topics of texts
  • (but usually very helpful for understanding the structures of texts)
  • Usually have very high frequencies
• Remove or not? Depends on the needs and the text analytics methods …
• An example list of stop words (from Lucene):
  a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
• Lucene is a widely used text retrieval system -- it removes stop words because they are not helpful for keyword search
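Continuing the sketch above, case folding and stop word removal can be layered on top of the token list; NLTK's English stop word list is used here as an illustrative stand-in for the Lucene list shown above.

```python
# Minimal sketch: case folding and stop word removal.
# Assumption: nltk.download("stopwords") has been run; NLTK's English stop
# word list stands in for the Lucene list shown above.
from nltk.corpus import stopwords

tokens = ["O'Neal", 'averaged', '15.2', 'points', ',', '9.2',
          'rebounds', 'and', '1.0', 'assists', 'per', 'game', '.']

# Case folding: lowercase every token.
lowered = [t.lower() for t in tokens]

# Stop word removal: drop stop words, and here also punctuation-only tokens.
stop_words = set(stopwords.words('english'))
content_tokens = [t for t in lowered
                  if t not in stop_words and any(c.isalnum() for c in t)]
print(content_tokens)
# Expected (roughly): ["o'neal", 'averaged', '15.2', 'points', '9.2',
#   'rebounds', '1.0', 'assists', 'per', 'game']
```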
Lemmatization & Stemming
• Purpose: to group words with the same root or lemma
  • plural → singular, verbs in different tenses, adjectives & adverbs, etc.
  • Example: "cats" and "cat"; "search", "searches", "searching"
• Methods
  • Rule-based: define and apply a set of rules (e.g., suffix stripping)
  • Dictionary-based: e.g., can handle exceptions

Porter Stemming
• Rule-based: a list of suffix-stripping rules, e.g.,
  • -sses → -ss, e.g., caresses → caress
  • -ies → -i, e.g., ponies → poni
  • remove -s, e.g., cats → cat
  • -eed → -ee, e.g., agreed → agree
  • remove -ed, e.g., plastered → plaster
  • remove -ing, e.g., motoring → motor
  • -ational → -ate, e.g., relational → relate
  • -tional → -tion, e.g., conditional → condition
• Iterative: organization → organize → organ
• Cannot handle exceptions
• Sometimes hard to interpret (the outputs are stems, which may not be actual words)

Krovetz Stemming: Rules + Dictionary
• By Robert Krovetz
  • R. Krovetz. Viewing morphology as an inference process. SIGIR 1993.
• Uses a dictionary to handle exceptions
  • A large dictionary of "head words", e.g., lists of country names and nationalities, proper nouns, etc.
  • If a term is a head word, do not stem it, e.g., policy ≠ police, gravity ≠ grave, marbled ≠ marble
  • If a term appears as a dictionary entry, convert it to the head word
  • Otherwise, fall back to a Porter-like rule-based approach
• Stems generated by Krovetz stemming are actual words

Porter and Krovetz Stemming
Original         Porter         Krovetz
communities      commun         community
generated        gener          generate
significantly    significantli  significant
successfully     successfulli   successful
additionally     addition       additional
relatives        rel            relative
internationally  internation    international
importantly      importantli    important
laos             lao            laos
computers        comput         computer
proceeds         proce          proceeds
contents         content        contents
safer            safer          safe

Examples of Stemming "Errors"
Overstemming:
Original      Porter  Krovetz
organization  organ   organization
organ         organ   organ
heading       head    heading
head          head    head

Understemming:
Original  Porter    Krovetz
european  european  europe
europe    europ     europe
urgency   urgenc    urgent
urgent    urgent    urgent
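The sketch below contrasts rule-based Porter stemming with a dictionary-based lemmatizer in NLTK; NLTK does not ship a Krovetz stemmer, so the WordNet lemmatizer serves only as an illustrative stand-in for the dictionary-based idea, and the printed forms may differ slightly by NLTK version.

```python
# Minimal sketch: rule-based stemming vs. dictionary-based lemmatization.
# Assumption: nltk.download("wordnet") has been run for the lemmatizer data;
# the WordNet lemmatizer is a stand-in for Krovetz-style dictionary lookups.
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['communities', 'relatives', 'organization', 'organ', 'cats']:
    print(f"{word:15s} porter={porter.stem(word):15s} "
          f"wordnet={lemmatizer.lemmatize(word, pos='n')}")
# Porter strips suffixes by rule (e.g., communities -> commun), so its outputs
# may not be real words; the dictionary-based lemmatizer returns actual words
# (e.g., communities -> community) and leaves head words such as "organ" alone.
```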
Part-of-Speech (POS) Tagging
• A part of speech is a category of words that have similar grammatical properties, e.g., noun, pronoun, verb, adjective, etc.
• POS tagging annotates each word in a sentence with a part-of-speech marker.
• The most common POS tagset used today is the Penn Treebank tagset
  • 36 POS tags plus some additional tags for punctuation and currency symbols
  • Fine-grained categories
• The lowest level of syntactic analysis
• Useful for subsequent parsing steps such as chunking and named entity recognition

Example: John saw the saw and decided to take it to the table.

Word token: John  saw  the  saw  and  decided  to  take  it   to  the  table
POS tag:    NNP   VBD  DT   NN   CC   VBD      TO  VB    PRP  IN  DT   NN
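A minimal sketch of POS tagging with Penn Treebank tags and, building on it, chunking/named-entity recognition with NLTK; the specific pretrained models named here are illustrative assumptions rather than part of the lecture.

```python
# Minimal sketch: Penn Treebank POS tagging plus named-entity chunking.
# Assumption: nltk.download(...) has been run for "punkt",
# "averaged_perceptron_tagger", "maxent_ne_chunker", and "words".
import nltk

tokens = nltk.word_tokenize(
    "O'Neal averaged 15.2 points, 9.2 rebounds and 1.0 assists per game.")

# POS tagging: one Penn Treebank tag per token, e.g., ("averaged", "VBD").
tagged = nltk.pos_tag(tokens)
print(tagged)

# Chunking / named-entity recognition over the POS-tagged tokens: groups
# tokens into a shallow tree whose subtrees carry labels such as PERSON.
print(nltk.ne_chunk(tagged))
```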