Slide 1: Introduction to NLP Tools
09/23/2003

Slide 2: Motivation
• Machine Translation
  – From English to French
• What’s needed?

Slide 3: Motivation Cont’d (1)
• Syntactic parser
• Part-Of-Speech Tagger
  – Example: NP -> adj noun
• Morphological Analyzer
  – Example: “tools” -> “tool”
  – Example: “Who is he?” -> “Who is he ?”
• Semantic Analyzer
  – Word sense disambiguation (“wash dishes”)
  – Choose the correct translation

Slide 4: Motivation Cont’d (2)
• Lexicons
  – The information about the word: How many senses? What are the possible translations of the word?
• Corpus
  – Useful for learning a tool
  – Useful for evaluation

Slide 5: Outline
• Lexicons
• Text corpora
• Morphological tools
• Part-Of-Speech (POS) taggers
• Syntactic parsers
• Semantic knowledge bases and semantic parser
• Speech tools

Slide 6: Lexicons
• Definition
  – A repository for words
• Lexicons in LDC (Linguistic Data Consortium)
  – Creating and sharing linguistic resources: data, tools and standards
• CELEX
• WordNet

Slide 7: CELEX
• Dutch Center for Lexical Information
• Lexical databases of English, Dutch and German
• 21,000 nouns, 8,000 adjectives and 6,000 verbs
• English:
  – English Orthography, Lemmas
  – English Phonology, Lemmas
  – English Morphology, Lemmas
  – English Syntax, Lemmas
  – English Frequency, Lemmas
  – English Orthography, Wordforms
  – English Phonology, Wordforms
  – English Morphology, Wordforms
  – English Frequency, Wordforms
  – English Corpus Types
  – English Frequency, Syllables

Slide 8: WordNet
• A database of lexical relations
• Inspired by current psycholinguistic theories of human lexical memory
• Synset: a set of synonyms, representing one underlying lexical concept
  – Example:
    • fool {chump, fish, fool, gull, mark, patsy, fall guy, sucker, schlemiel, shlemiel, soft touch, mug}
• Relations link the synsets: Hypernym, Has-Member, Member-Of, Antonym, etc.

Slide 9: WordNet Cont’d
• Example

    pu-erh.cs.utexas.edu$ wn bike -partn

    Part Meronyms of noun bike

    2 senses of bike

    Sense 1
    motorcycle, bike
        HAS PART: mudguard, splashguard

    Sense 2
    bicycle, bike, wheel
        HAS PART: bicycle seat, saddle
        HAS PART: bicycle wheel
        HAS PART: chain
        HAS PART: coaster brake
        HAS PART: handlebar
        HAS PART: mudguard, splashguard
        HAS PART: pedal, treadle, foot lever
        HAS PART: sprocket, sprocket wheel

• Example

    pu-erh.cs.utexas.edu$ wn bike

    Information available for noun bike
        -hypen            Hypernyms
        -hypon, -treen    Hyponyms & Hyponym Tree
        -synsn            Synonyms (ordered by frequency)
        -partn            Has Part Meronyms
        -meron            All Meronyms
        -famln            Familiarity & Polysemy Count
        -coorn            Coordinate Sisters
        -simsn            Synonyms (grouped by similarity of meaning)
        -hmern            Hierarchical Meronyms
        -grepn            List of Compound Words
        -over             Overview of Senses

    Information available for verb bike
        -hypev            Hypernyms
        -hypov, -treev    Hyponyms & Hyponym Tree
        -synsv            Synonyms (ordered by frequency)
        -famlv            Familiarity & Polysemy Count
        -framv            Verb Frames
        -simsv            Synonyms (grouped by similarity of meaning)
        -grepv            List of Compound Words
        -over             Overview of Senses

Slide 10: Corpus
• Definition
  – Collections of text and speech
• LDC
• Penn Treebank
• DSO
• Hansard

Slide 11: Some of the Top Corpora from LDC
• TIPSTER
  – Information Retrieval, Data Extraction datasets
  – TIPSTER project, TREC project
• TIMIT Acoustic-Phonetic Continuous Speech Corpus
  – A corpus of read speech designed to provide speech data for the acquisition of acoustic-phonetic knowledge
  – Useful for the development and evaluation of automatic speech recognition systems
• ECI (European Corpus Initiative Multilingual Corpus)
  – Multilingual electronic text corpus
• NTIMIT
  – A phonetically balanced, continuous-speech, telephone-bandwidth speech database

Slide 12: Penn Treebank
• A collection of corpora
• Tagged with POS, syntactic roles, predicate/argument structure, dysfluency annotation
• How are they made?
  – Hand correction of the output of an errorful automatic process
• 3 million words
  – 1 million words tagged with predicate/argument structure for extracting semantic knowledge

Slide 13: Penn Treebank Cont’d
• Corpora
  – Wall Street Journal
  – ATIS (Air Travel Information System)
  – Brown Corpus
  – IBM Manual Sentences
  – Library of America Texts: Mark Twain, Henry Adams, Herman Melville ...
  – MUC-3 Messages
• Example:

    ( (S (NP-SBJ Rally 's)
         (VP operates and franchises
             (NP (NP (QP about 160) fast-food restaurants)
                 (PP-LOC throughout (NP the U.S))))

    Seeking/VBG to/TO block/VB [ the/DT investors/NNS ] from/IN buying/VBG [ more/JJR shares/NNS ] ./.

Slide 14: DSO
• Word Sense Corpus
  – Contains sentences in which about 192,800 word occurrences have been tagged with WordNet senses
  – Taken from the Brown corpus and the Wall Street Journal corpus
  – 121 nouns and 70 verbs

Slide 15: Hansard
• Official records (Hansards) of the 36th Canadian Parliament, in both English and French
• 1.3 million pairs of aligned sentences of English and French
  – Example
    • Comme il est 14 h 30, la Chambre s'ajourne jusqu'à lundi prochain, à 11 heures, conformément au paragraphe 24(1) du Règlement.
    • It being 2.30 p.m., the House stands adjourned until Monday next at 11 a.m., pursuant to Standing Order 24(1).
• Useful for Machine Translation

Slide 16: Morphological Tools
• PC-KIMMO
  – A two-level morphological parser
• Porter Stemmer
• Penn Treebank Tokenizer
  – Separates a document into words
  – “dog?” -> “dog ?”

Slide 17: Porter Stemmer
• Simple algorithm that uses a set of cascaded rewrite rules
  – Example
    • ATIONAL -> ATE (relational -> relate)
• Stem:
  – The main morpheme of the word, supplying the main meaning
• Fast
• Used very widely in Information Retrieval
  – Run the stemmer on the keywords and on the words in the documents

Slide 18: Part-Of-Speech (POS) Taggers
• Part-Of-Speech: noun, verb, pronoun, etc.
• Brill’s Tagger
• HMM Tagger
• MXPOST

Slide 19: Brill’s Tagger
• Transformation-Based Learning (TBL) tagger
• /projects/nlp/brill-pos-tagger
• First labels every word with its most-likely tag
• Then uses learned TBL rules to correct mistakes
  – Example:
    • Change NN to VB when the previous tag is
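
The two steps on the Brill’s Tagger slide (label every word with its most-likely tag, then apply learned rules to correct mistakes) can be sketched roughly as follows. This is a minimal illustration, not the code under /projects/nlp/brill-pos-tagger: the tiny lexicon, the `Rule` class, the function names, and in particular the trigger tag `TO` in the single rule are all assumptions chosen for the sketch. A real TBL tagger learns both the lexicon and the rules from a tagged corpus.

```python
from dataclasses import dataclass

# Hypothetical most-likely-tag lexicon (assumption for this sketch);
# a real tagger would estimate it from a tagged corpus such as the Penn Treebank.
MOST_LIKELY_TAG = {
    "investors": "NNS",
    "want": "VBP",
    "to": "TO",
    "block": "NN",
    "the": "DT",
    "sale": "NN",
}


@dataclass(frozen=True)
class Rule:
    """Change from_tag to to_tag when the previous word carries prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


# One illustrative rule. The trigger tag "TO" is an assumption of this sketch.
RULES = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]


def initial_tags(words, default="NN"):
    """Step 1: give every word its most-likely tag (unknown words get `default`)."""
    return [MOST_LIKELY_TAG.get(w.lower(), default) for w in words]


def apply_rules(tags, rules):
    """Step 2: apply each rule in order, scanning left to right.

    Changes take effect immediately, so a later position sees earlier edits.
    """
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


def tag(words):
    """Run both steps and pair each word with its final tag."""
    return list(zip(words, apply_rules(initial_tags(words), RULES)))


if __name__ == "__main__":
    sentence = "investors want to block the sale".split()
    for word, pos in tag(sentence):
        print(f"{word}/{pos}")
```

In this sketch the rule list is ordered, which mirrors the idea that TBL rules are applied in the sequence in which they were learned; only the previous-tag context is modelled here, whereas learned rules can in general look at other nearby tags and words.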