MSU CSE 842 - CSE 842 Natural Language Processing


4/11/2011 CSE842, Spring 2011, MSU

CSE 842 Natural Language Processing
Lecture 22: Information Extraction and Question Answering

Today's Topic
• Information Extraction
• Information Retrieval
• Question Answering

What is Information Extraction

Information Extraction (IE)
• Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
• Transform unstructured information in a corpus of documents or web pages into a structured database.
• Applied to different types of text:
– Newspaper articles
– Scientific articles
– Web pages
– Newsgroup messages
– Classified ads
– Medical literature

A Bare Bone Extraction System
• Tokenization
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis

Flesh for the Bones
• Stages: Tokenization; Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis
• Components added to the stages: Word Segmentation; Part of Speech and Word Sense Tagging; Parsing; NER, Coreference; Merge Partial Results

An Example
Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

Some basic problems
• Named entity recognition (NER)
• Coreference resolution (last lecture)
• Relation detection and classification
• Event detection and classification
• Temporal analysis

Named Entity Recognition
• Named entities are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities.
• NER is the task of processing a text and identifying named entities.

Named Entity Recognition
Why is Named Entity Recognition difficult?
• Names are too numerous to include in dictionaries
• Variations: e.g. John Smith, Mr Smith, John
• Changing constantly: new names introduce unknown words
• Ambiguity: for some proper nouns it is hard to determine the category

Named Entity Recognition
Example of common categorical ambiguities: [table not preserved in this preview]

Example
Delimit the named entities in a text and tag them with NE categories:
– entity names - ENAMEX
– temporal expressions - TIMEX
– number expressions - NUMEX
Subcategories of tags
– captured by an SGML tag attribute called TYPE

• Original text:
The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million
• Tagged text:
The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew <NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during <TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million

Maximum Entropy for NER (Borthwick et al., 1998)
• Outcomes: N classes for MUC7 (N = 7)
– Person Name, Company Name, Date
• For a particular class x: x_start, x_continue, x_end, x_unique
• 4N+1 tags (29 tags)

[Jerry      Lee        Lewis     flew    to      Paris]
 Per_start  Per_cont.  Per_end   other   other   loc_unique

Types of Features
• Binary features
– Token properties which are either on or off for a given token (e.g., All-caps, 2-digit-number, only-digits, initial-cap)
– Overlapping allowed

\[
f(a,b) =
\begin{cases}
1 & \text{if } \mathrm{current\_token\_capitalized}(b) = \mathrm{true} \text{ and } a = \mathrm{location\_start} \\
0 & \text{else}
\end{cases}
\]

Types of Features
• Lexical features
– Lexical lookup for words in the context of the current token
– The lexicon is built automatically (just build a vocabulary V as "all words w: c(w) > 2")

\[
f(a,b) =
\begin{cases}
1 & \text{if } \mathrm{Lexical\_View}(\mathrm{token}_{-1}(b)) = \text{"Mr"} \text{ and } a = \mathrm{person\_unique} \\
0 & \text{else}
\end{cases}
\]

Types of Features
• Section features
– The current section of the article: title, text body
• Dictionary features
– Multi-word entries of pre-classified NE words (e.g., Michigan State University)
• External systems feature: use other taggers

\[
f(a,b) =
\begin{cases}
1 & \text{if } \mathrm{SystemA}(\mathrm{token}_{0}(b)) = \mathrm{person\_unique} \text{ and } a = \mathrm{person\_unique} \\
0 & \text{else}
\end{cases}
\]

Feature Selection
• Put all possible features from the classes to be included in the model into a feature pool
– Lexical features for the range $w_{-2}, \ldots, w_0, \ldots, w_2$ with vocabulary size V give 5*(V+1)*29 lexical features.
• Select all features which fire at least three times on the training corpus
• Features which predict the tag other have to fire six times to be included
• Lexical features which activate on $w_{-2}$ and $w_2$ are excluded if they predict other

Evaluation Metrics
• Precision and Recall:
– Precision: correct answers / answers produced
– Recall: correct answers / total possible correct answers
• F-measure:

\[
F = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}
\]

– where β is a parameter representing the relative importance of P and R.

Evaluation
• F-measure: 97.12% (used three other systems as external features)
• Human performance: 96.95-97.60%

NER as Sequence Labeling

Common Features

Sequential Classification

CONLL 2003 Shared Task
• http://cnts.uia.ac.be/conll2003/ner

Relation Detection and Classification
• Two subtasks:
– Detecting the presence of a relation
– If so, classifying the relation

Relation Detection and Classification
A model-based view of the relations and entities

Common Features used for Classification
During classification of the <American Airlines, Tim Wagner> tuple

Information Retrieval
[Diagram: Query String, Document corpus → IR System → Ranked Documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
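The 4N+1 tag scheme from the "Maximum Entropy for NER" slide can be sketched in a few lines. This is only an illustration, not the system from the paper. The slide names three of the seven classes (Person Name, Company Name, Date), so the seven class names below are an assumption based on the MUC-7 entity types, and the helpers `build_tagset` and `tag_tokens` are hypothetical names.

```python
# Assumed class inventory: the slide lists only three of the N = 7 classes;
# these seven names follow the MUC-7 entity types and are an assumption here.
CLASSES = ["person", "organization", "location", "date", "time", "money", "percent"]
POSITIONS = ["start", "continue", "end", "unique"]


def build_tagset(classes):
    """Return the 4N + 1 tags: four position tags per class plus 'other'."""
    tags = [f"{c}_{p}" for c in classes for p in POSITIONS]
    tags.append("other")
    return tags


def tag_tokens(tokens, spans):
    """Label tokens given entity spans as (start, end_exclusive, class)."""
    tags = ["other"] * len(tokens)
    for start, end, cls in spans:
        if end - start == 1:
            # A one-token entity gets the x_unique tag.
            tags[start] = f"{cls}_unique"
        else:
            tags[start] = f"{cls}_start"
            for i in range(start + 1, end - 1):
                tags[i] = f"{cls}_continue"
            tags[end - 1] = f"{cls}_end"
    return tags


TAGSET = build_tagset(CLASSES)
tokens = "Jerry Lee Lewis flew to Paris".split()
example_tags = tag_tokens(tokens, [(0, 3, "person"), (5, 6, "location")])

for token, tag in zip(tokens, example_tags):
    print(f"{token}\t{tag}")
```

With seven classes the tag set has 4*7 + 1 = 29 entries, matching the count on the slide, and the example sentence receives the same start/continue/end/other/unique pattern shown there.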

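The feature functions f(a, b) on the "Types of Features" slides pair a candidate tag a with a property of the context b. A minimal sketch follows, assuming the context is a plain dictionary holding the token list and the current index. That representation and all function names are hypothetical, not taken from the slides. The last helper reproduces the 5*(V+1)*29 count from the "Feature Selection" slide.

```python
def capitalized_location_start(tag, context):
    """Binary feature: current token is capitalized and the tag is location_start."""
    token = context["tokens"][context["index"]]
    return 1 if token[:1].isupper() and tag == "location_start" else 0


def prev_mr_person_unique(tag, context):
    """Lexical feature: previous token is "Mr" and the tag is person_unique."""
    i = context["index"]
    previous = context["tokens"][i - 1] if i > 0 else None
    return 1 if previous == "Mr" and tag == "person_unique" else 0


def lexical_pool_size(vocab_size, window=5, num_tags=29):
    """Size of the lexical feature pool: window * (V + 1) * number of tags."""
    return window * (vocab_size + 1) * num_tags


# Hypothetical contexts used only for illustration.
paris = {"tokens": ["flew", "to", "Paris"], "index": 2}
smith = {"tokens": ["Mr", "Smith"], "index": 1}

print(capitalized_location_start("location_start", paris))
print(prev_mr_person_unique("person_unique", smith))
print(lexical_pool_size(1000))
```

Each function fires only for one specific (tag, property) pair, which is why the pool grows with the product of window positions, vocabulary entries, and tags.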

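The definitions on the "Evaluation Metrics" slide translate directly into code. The sketch below assumes the standard weighted form F = (β² + 1)PR / (β²P + R), which is the usual reading of the slide's formula. The function name and the counts in the usage line are hypothetical.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Compute precision, recall, and the weighted F-measure.

    correct:  number of correct answers produced by the system
    produced: total number of answers the system produced
    possible: total number of possible correct answers
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_measure = (beta ** 2 + 1) * precision * recall / denominator
    return precision, recall, f_measure


# Hypothetical counts: 8 correct answers out of 10 produced, 16 possible.
p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
print(p, r, f)
```

With β = 1 this reduces to the harmonic mean of precision and recall.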