Columbia COMS W4705 - Information Extraction


Information Extraction
CS4705

Information Extraction (IE) - Task
• Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech

Named Entity Tagger
• Identify types and boundaries of named entities
◦ For example:
  Alexander Mackenzie, (January 28, 1822 ‐ April 17, 1892), a building contractor and writer, was the second Prime Minister of Canada from ….
  -> <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> ‐ <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….

IE for Template Filling
Relation Detection
• Given a set of documents and a domain of interest, fill a table of required fields.
• For example: number of car accidents per vehicle type and number of casualties in the accidents.

IE for Question Answering
• Q: When was Gandhi born?
  A: October 2, 1869
• Q: Where was Bill Clinton educated?
  A: Georgetown University in Washington, D.C.
• Q: What was the education of Yassir Arafat?
  A: Civil Engineering
• Q: What is the religion of Noam Chomsky?
  A: Jewish

Approaches
• Statistical sequence labeling
• Supervised
• Semi-supervised and bootstrapping

Approach for NER
  <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> ‐ <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
• Statistical sequence labeling techniques can be used, similar to POS tagging
◦ Word-by-word sequence labeling
◦ Examples of features:
  POS tags
  Syntactic constituents
  Shape features
  Presence in a named entity list

Supervised Approach for relation detection
• Given a corpus of annotated relations between entities, train two classifiers:
◦ A binary classifier: given a span of text and two entities -> decide if there is a relationship between these two entities
• Features
◦ Types of the two named entities
◦ Bag of words
◦ POS of the words in between
• Example:
◦ A rented SUV went out of control on Sunday, causing the death of seven people in Brooklyn
◦ Relation: Type = Accident, Vehicle Type = SUV, casualty = 7, weather = ?
• Pros and Cons?

Pattern Matching
• How can we come up with these patterns?
• Manually?
◦ Task and domain-specific
◦ Tedious, time consuming, not scalable

Semi-supervised approach
AutoSlog-TS (Riloff 1996)
• MUC-4 task: extract information about terrorist events in Latin America
• Two corpora:
◦ Domain-dependent corpus that contains relevant information
◦ A set of irrelevant documents
• Algorithm:
  1. Using heuristics, all patterns are extracted from both corpora. For example:
     Rule: <Subj> passive-verb
     <Subj> was murdered
     <Subj> was called
  2. Pattern Ranking: the output patterns are then ranked by the frequency of their occurrences in corpus1/corpus2
  3. Filter out the patterns by hand

Task 12: (DARPA – GALE year 2)
Produce a biography of [person]
  1. Name(s), aliases:
  2. *Date of Birth or Current Age:
  3. *Date of Death:
  4. *Place of Birth:
  5. *Place of Death:
  6. Cause of Death:
  7. Religion (Affiliations):
  8. Known locations and dates:
  9. Last known address:
  10. Previous domiciles:
  11. Ethnic or tribal affiliations:
  12. Immediate family members:
  13. Native Language spoken:
  14. Secondary Languages spoken:
  15. Physical Characteristics:
  16. Passport number and country of issue:
  17. Professional positions:
  18. Education:
  19. Party or other organization affiliations:
  20. Publications (titles and dates):

Biography – two approaches
• To obtain high precision, we handle each slot independently using bootstrapping to learn IE patterns.
• To improve the recall, we utilize a biographical sentence classifier

Pattern Matching for Relation Detection
• Patterns:
◦ “[CAR_TYPE] went out of control on [TIMEX], causing the death of [NUM] people”
◦ “[PERSON] was born in [GPE]”
◦ “[PERSON] was graduated from [FAC]”
◦ “[PERSON] was killed by <X>”
• Matching Techniques
◦ Exact matching
  Pros and Cons?
◦ Flexible matching (e.g., [X] was .* killed .* by [Y])
  Pros and Cons?
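The difference between exact and flexible matching on the "[X] was killed by [Y]" pattern can be illustrated with a small Python sketch. Everything here is a hypothetical simplification: runs of capitalized words stand in for tagged entities (a real system would use the output of a named entity tagger), and the names ENTITY, EXACT, FLEXIBLE and extract are invented for this example.

```python
import re

# Assumption: a run of capitalized tokens stands in for a tagged entity.
ENTITY = r"(?P<{name}>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"

# Exact matching: the words between the two slots must appear verbatim.
EXACT = re.compile(
    ENTITY.format(name="victim") + r" was killed by " + ENTITY.format(name="agent")
)

# Flexible matching: arbitrary material may appear around "killed",
# in the spirit of "[X] was .* killed .* by [Y]".
FLEXIBLE = re.compile(
    ENTITY.format(name="victim")
    + r" was\b.*?\bkilled\b.*?\bby "
    + ENTITY.format(name="agent")
)


def extract(pattern, sentence):
    """Return (victim, agent) if the pattern matches, else None."""
    match = pattern.search(sentence)
    return (match.group("victim"), match.group("agent")) if match else None


plain = "John Smith was killed by Mary Jones"
wordy = "John Smith was reportedly killed last night by Mary Jones"
negated = "John Smith was not killed by Mary Jones"

print(extract(EXACT, plain))       # exact pattern fires on the canonical wording
print(extract(EXACT, wordy))       # exact pattern misses the paraphrase (recall loss)
print(extract(FLEXIBLE, wordy))    # flexible pattern recovers it
print(extract(FLEXIBLE, negated))  # flexible pattern also fires on a negation (precision loss)
```

In this toy setting the exact pattern favors precision and the flexible pattern favors recall, which is one way to approach the "Pros and Cons?" question on the slide.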

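The pattern-ranking step of the AutoSlog-TS slide can be sketched in the same way. This follows the slide's simplified description (frequency in the relevant corpus divided by frequency in the irrelevant corpus), not necessarily the exact scoring used in the original system. The regex-based extractor, the add-one smoothing, the function names and the toy corpora are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a crude stand-in for the "<Subj> passive-verb" heuristic,
# matching "was" followed by a word ending in "ed".
PASSIVE = re.compile(r"\bwas (\w+ed)\b")


def extract_patterns(docs):
    """Count candidate '<Subj> was VERB' patterns across a list of documents."""
    counts = Counter()
    for doc in docs:
        for verb in PASSIVE.findall(doc.lower()):
            counts[f"<Subj> was {verb}"] += 1
    return counts


def rank_patterns(relevant_docs, irrelevant_docs, smoothing=1.0):
    """Rank patterns by relevant frequency over (irrelevant frequency + smoothing)."""
    rel = extract_patterns(relevant_docs)
    irr = extract_patterns(irrelevant_docs)
    # The smoothing term (an assumption) avoids division by zero for
    # patterns that never occur in the irrelevant corpus.
    scores = {p: rel[p] / (irr[p] + smoothing) for p in rel.keys() | irr.keys()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpora, not taken from MUC-4.
relevant = [
    "The mayor was murdered in the capital.",
    "A judge was murdered yesterday.",
    "The attack was called a massacre.",
]
irrelevant = [
    "The meeting was called off.",
    "The player was called for a foul.",
    "The bridge was opened in May.",
]

ranked = rank_patterns(relevant, irrelevant)
for pattern, score in ranked:
    print(f"{score:.2f}  {pattern}")
```

On these toy corpora "<Subj> was murdered" ranks above "<Subj> was called", because the latter is also frequent in the irrelevant documents. The final step on the slide, filtering the ranked list by hand, is left to a human reviewer.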
