Stanford CS 124 - Information Extraction and Named Entity Recognition


Information Extraction and Named Entity Recognition
Introducing the tasks: getting simple structured information out of text
Christopher Manning

Information Extraction
• Information extraction (IE) systems:
  • Find and understand limited relevant parts of texts
  • Gather information from many pieces of text
  • Produce a structured representation of the relevant information:
    • relations (in the database sense), a.k.a. a knowledge base
• Goals:
  1. Organize information so that it is useful to people
  2. Put information in a semantically precise form that allows further inferences to be made by computer algorithms

Information Extraction (IE)
• IE systems extract clear, factual information
  • Roughly: who did what to whom, when?
• E.g., gathering earnings, profits, board members, headquarters, etc. from company reports:
  • "The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia."
  • headquarters("BHP Billiton Limited", "Melbourne, Australia")
• E.g., learning drug-gene product interactions from the medical research literature

Low-level information extraction
• Is now available – and, I think, popular – in applications like Apple or Google mail, and in web indexing
• Often seems to be based on regular expressions and name lists (a minimal sketch follows below)
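The slides describe this style of low-level extraction only in passing, so the following is just a rough sketch of how regular expressions plus a small name list (gazetteer) might pull a few fields out of ad-like text in Python. The field names, patterns, GAZETTEER entries, and sample text are illustrative assumptions, not something taken from the lecture.

import re

# A tiny "name list" (gazetteer) of known place names; purely illustrative.
GAZETTEER = {"Melbourne", "Parramatta", "Maddington"}

# Hand-written regular expressions for a few low-level fields.
PATTERNS = {
    "date":  re.compile(r"\b(?:January|February|March|April|May|June|July|"
                        r"August|September|October|November|December)\s+\d{1,2},\s+\d{4}\b"),
    "price": re.compile(r"\$\d{1,3}(?:,\d{3})*K?\b"),    # e.g. $89,000 or $155K
    "phone": re.compile(r"\b\d{4}\s\d{3}\s\d{3}\b"),      # e.g. 0418 958 996
}

def extract(text):
    """Return a dict mapping each field name to the list of strings it matched."""
    results = {field: pattern.findall(text) for field, pattern in PATTERNS.items()}
    # Name-list lookup: keep any gazetteer entry that appears in the text.
    results["place"] = [name for name in GAZETTEER if name.lower() in text.lower()]
    return results

if __name__ == "__main__":
    ad = ("MADDINGTON $89,000 OPEN 1.00 - 1.45 U 11/10 BERTRAM ST "
          "Owner moved to Melbourne. Brian Hazelden 0418 958 996. March 02, 1998")
    print(extract(ad))

On this sample the date, price, phone number, and the gazetteer places that occur in the ad are picked out; anything the patterns or the list do not cover is simply missed, which is the basic limitation of this low-level approach.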
Why is IE hard on the web?
• [Annotated screenshot of a product web page; the callouts read "A book, not a toy", "Need this price", and "Title": the item, its title, and the relevant price have to be picked out of the surrounding page content.]

How is IE useful? Classified Advertisements (Real Estate)
Background:
• Plain text advertisements
• Lowest common denominator: the only thing that 70+ newspapers using many different publishing systems can all handle

Example ad:
<ADNUM>2067206v1</ADNUM>
<DATE>March 02, 1998</DATE>
<ADTITLE>MADDINGTON $89,000</ADTITLE>
<ADTEXT>OPEN 1.00 - 1.45<BR>
U 11 / 10 BERTRAM ST<BR>
NEW TO MARKET Beautiful<BR>
3 brm freestanding<BR>
villa, close to shops & bus<BR>
Owner moved to Melbourne<BR>
ideally suit 1st home buyer,<BR>
investor & 55 and over.<BR>
Brian Hazelden 0418 958 996<BR>
R WHITE LEEMING 9332 3477</ADTEXT>

Why doesn't text search (IR) work?
What you search for in real estate advertisements:
• Town/suburb. You might think this is easy, but:
  • Real estate agents: Coldwell Banker, Mosman
  • Phrases: "Only 45 minutes from Parramatta"
  • Multiple property ads have different suburbs in one ad
• Money: you want a range, not a textual match
  • Multiple amounts: was $155K, now $145K
  • Variations: offers in the high 700s [but not rents for $270]
• Bedrooms: similar issues: br, bdr, beds, B/R

Named Entity Recognition (NER)
• A very important sub-task: find and classify names in text, for example:
  • "The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply."
  • The slides highlight the entity classes Person, Date, Location, and Organization in this passage.

Named Entity Recognition (NER)
• The uses:
  • Named entities can be indexed, linked off, etc.
  • Sentiment can be attributed to companies or products
  • A lot of IE relations are associations between named entities
  • For question answering, answers are often named entities
• Concretely:
  • Many web pages tag various entities, with links to bio or topic pages, etc.
    • Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, …
  • Apple/Google/Microsoft/… smart recognizers for document content

Information Extraction and Named Entity Recognition
Evaluation of Named Entity Recognition: Precision, Recall, and the F measure; their extension to sequences

The 2-by-2 contingency table

                  correct   not correct
  selected        tp        fp
  not selected    fn        tn

Precision and recall
• Precision: % of selected items that are correct, i.e. P = tp / (tp + fp)
• Recall: % of correct items that are selected, i.e. R = tp / (tp + fn)

A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (a weighted harmonic mean):

  F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R),  where β² = (1 − α)/α

• The harmonic mean is a very conservative average; see IIR §8.3
• People usually use the balanced F1 measure
  • i.e., with β = 1 (that is, α = ½): F1 = 2PR / (P + R)

Quiz question
• What is the F1 when P = 40% and R = 40%?

Quiz question
• What is the F1 when P = 75% and R = 25%?

The Named Entity Recognition Task
Task: predict entities in a text

  Foreign    ORG
  Ministry   ORG
  spokesman  O
  Shen       PER
  Guofang    PER
  told       O
  Reuters    ORG
  :          :

  Standard evaluation is per entity, not per token.

Precision/Recall/F1 for IE/NER
• Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
• The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common); the sketch below illustrates one such case.
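The slides state that the standard evaluation is per entity rather than per token but do not show the computation, so here is a minimal sketch, assuming entities are encoded as (start, end, type) spans over tokens. The prf1 helper and the example spans are assumptions for illustration, not the official CoNLL-style scorer.

def prf1(gold, predicted):
    """Entity-level precision, recall, and balanced F1.

    Entities are (start, end, type) tuples; an entity counts as correct
    only if both its boundaries and its type match a gold entity exactly.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)      # entities predicted exactly right
    fp = len(predicted - gold)      # predicted entities not in the gold standard
    fn = len(gold - predicted)      # gold entities that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Token positions in "Foreign Ministry spokesman Shen Guofang told Reuters ..."
gold = [(0, 2, "ORG"), (3, 5, "PER"), (6, 7, "ORG")]   # Foreign Ministry, Shen Guofang, Reuters
pred = [(1, 2, "ORG"), (3, 5, "PER"), (6, 7, "ORG")]   # boundary error: only "Ministry" was marked
print(prf1(gold, pred))

In this example the single boundary error is charged as both a false positive and a false negative, so precision, recall, and F1 all come out to 2/3, even though a per-token count would have judged most of the tokens correct.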

