Empirical Methods in Information Extraction
By Claire Cardie
Presentation by Dusty Sargent

Background
- Domain-specific task that differs from the more general problems studied so far
- Summarizes the important points in a text with respect to a target topic
- Structures information for storage in a database

Background (cont'd)
- MUC (Message Understanding Conference) evaluates systems
- Provides answer keys and texts for a particular topic
- Recall = (# of correct slot fillers in output template) / (# of slot fillers in answer key)
- Precision = (# of correct slot fillers in output template) / (# of slot fillers in output template)
- Has been used in practical applications
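To make the two formulas above concrete, here is a minimal scoring sketch. Representing templates as lists of (slot, filler) pairs is an assumption made for illustration; real MUC scoring also handles partial credit and optional slots.

```python
# Minimal sketch of the MUC-style recall/precision defined above.
# Templates are modeled as lists of (slot, filler) pairs, which is an
# assumption for illustration only.

def score_template(output_slots, key_slots):
    """Return (recall, precision) over slot fillers."""
    correct = sum(1 for pair in output_slots if pair in key_slots)
    recall = correct / len(key_slots) if key_slots else 0.0
    precision = correct / len(output_slots) if output_slots else 0.0
    return recall, precision

# The system recovered one of the two key fillers and added a wrong one.
key = [("damaged-object", "two mobile homes"), ("time", "7:15pm")]
out = [("damaged-object", "two mobile homes"), ("time", "8:00pm")]
print(score_template(out, key))  # (0.5, 0.5)
```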
Applications
- Summarize medical records (test results, diagnoses, symptoms, etc.)
- Extract information about terrorist activities from radio or television broadcasts
- Keep records of corporate mergers and acquisitions
- Build knowledge bases from information found on websites
- Create job listings from web-based classified ads, job-search sites, and newsgroups

Performance
- State-of-the-art systems reach 50% recall and 70% precision on complicated extraction problems
- Can reach 90% precision and recall on the easiest extraction tasks
- The human error rate is also high for information extraction
- The best systems have only twice the error rate of human experts trained for the same task
- Still a lot of room for improvement
- Development is time-consuming, and the cause of errors is difficult to determine

Architecture
- Traditional NLP approach with full syntactic and semantic analysis of the input text
- Less common: a simple approach with keyword matching and little linguistic analysis

Architecture (cont'd)
- Tagging and tokenization: divide the input into sentences and words, part-of-speech tag, and disambiguate word senses
- Sentence analysis: partial parse and tag with respect to semantic roles
- Extraction: identify relevant entities and the relations between them, specific to the domain
- Merging: coreference resolution between extracted entities and events
- Template generation: map extracted information into a domain-specific output format
- (A skeletal version of this pipeline appears in the first sketch after these slides.)

Corpus-based Learning
- Used for the underlying tasks of information extraction
- Can apply to the preliminary stages of the architecture
- Difficult to find enough training data for all the levels of analysis required
- Expensive to retrain the system for each domain to which it must be applied
- Standard NLP learning techniques are difficult to apply to the later stages: learning extraction patterns, coreference resolution, template generation
- A new training corpus is needed for each task; difficult to learn general patterns from answer keys

Learning Extraction Patterns
- Use general pattern-matching techniques for the extraction phase
- Acquire good extraction patterns from a training corpus with empirical methods
- Similar to the Candidate Elimination Algorithm
- Extraction patterns are ordered from general to specific; a balance between the two is needed
- General patterns are needed so that a pattern applies to more than one case
- Patterns must be specific enough that they do not apply in the wrong context

AutoSlog
- One of the earliest systems for learning extraction patterns, by Lehnert and Riloff (1992-1993)
- Learns "concept nodes": domain-specific semantic frames with a maximum of one slot per frame
- Concept nodes are used with the CIRCUS parser for the final extraction task

Concept Node Definition
- Concept: the concept to be extracted, e.g. Damaged-Object
- Trigger: the word that activates the pattern
- Position: the syntactic position where the concept is likely to be found in the sentence
- Constraints: constraints on the argument at "Position" necessary for extraction to occur; can be hard or soft
- Enabling Conditions: constraints on the linguistic context of the trigger word

Example Application
- Example: "...the twister occurred at approximately 7:15pm and destroyed two mobile homes."
- The concept is Damaged-Object
- The concept node is activated by the trigger word "destroyed"
- Enabling Condition: "destroyed" occurs in the active voice
- Position: direct object of the verb "destroyed"
- Constraints: the direct object of "destroyed" must be a physical object
- Result: "two mobile homes" is extracted to fill the Damaged-Object slot of the concept node

Concept Node Algorithm
- Concept nodes are applied during the partial-parsing phase of the extraction system
- When a trigger word is encountered, check its enabling conditions
- If they are met, extract the phrase in the appropriate position
- Test the phrase against the constraints
- If the constraints are met, label the phrase as an instance of the concept type
- (This check-extract-test sequence is worked through in a sketch after these slides.)

Learning Concept Nodes
- The learning algorithm is specific to the domain
- Requires training text with noun phrases annotated with their concept type, or uses answer keys
- Uses a partial parse and a small set of linguistic patterns to help learn concept nodes
- A newer version, AutoSlog-TS, only needs to be given texts marked as relevant or irrelevant to the domain of the extraction task

Learning Algorithm
- Find a sentence in the training data in which a target noun phrase occurs
- Parse the sentence with the partial parser
- Apply the list of linguistic patterns in order
- If a linguistic pattern applies to the sentence, create a concept node definition from the appropriate elements of the sentence (sketched in code after these slides)

Learning Example
- "Witnesses confirm that the twister occurred without warning at approximately 7:15pm and destroyed two mobile homes (Damaged_Object)."
- The target noun phrase is "two mobile homes", marked in the training corpus as an instance of the concept Damaged_Object, or found in the Damaged_Object field of the answer key
- Step 1: find the above sentence, in which the target noun phrase occurs, in the training corpus
- Step 2: the parser determines that "two mobile homes" was the direct object of the active verb "destroyed" in the third clause
- Step 3: match the third clause to the following linguistic pattern: <active-voice-verb> followed by <target-np> = <direct-object>
- Step 4: generate the concept node seen previously from the matched constituents, context, concept type, and semantic class

AutoSlog-TS
- The improved version needs only relevant and irrelevant texts as training data
- Adapts AutoSlog to use statistical techniques
- Nearly matches the performance of AutoSlog on the MUC-4 extraction task, using a fraction of the human effort
- Scans the corpus once and generates an extraction pattern for every noun phrase
- Scans again and ranks the extraction patterns according to some ranking function (one candidate is sketched after these slides)

PALKA
- Learns
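The Architecture slides describe a five-stage pipeline. Below is a skeletal, runnable rendering of that flow; every stage body is a trivial placeholder (an assumption for illustration), kept only to show what each stage consumes and produces.

```python
# Skeletal pipeline mirroring the five Architecture stages. A real
# system would use a POS tagger, a partial parser, learned extraction
# patterns, and a coreference resolver at the corresponding steps.

def tokenize_and_tag(text):                 # 1. tagging and tokenization
    return [s.split() for s in text.split(". ") if s]

def partial_parse(tokens):                  # 2. sentence analysis
    return {"words": tokens}                #    (a real parse adds semantic roles)

def extract_mentions(clause, triggers):     # 3. extraction
    return [w for w in clause["words"] if w in triggers]

def merge(mentions):                        # 4. merging (coreference resolution)
    return sorted(set(mentions))

def generate_template(entities):            # 5. template generation
    return {"extracted": entities}

def pipeline(text, triggers):
    clauses = [partial_parse(t) for t in tokenize_and_tag(text)]
    mentions = [m for c in clauses for m in extract_mentions(c, triggers)]
    return generate_template(merge(mentions))

print(pipeline("The twister destroyed two mobile homes. It also injured three people",
               {"destroyed", "injured"}))
# {'extracted': ['destroyed', 'injured']}
```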
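To make the Concept Node Definition, Example Application, and Concept Node Algorithm slides concrete, here is a minimal sketch of a concept node firing on the twister sentence. The dataclass layout, the clause dictionary, and the semantic-lexicon lookup are hypothetical stand-ins, not CIRCUS's actual representation, which is richer (soft constraints, semantic-class hierarchies, fuller enabling conditions).

```python
# Minimal sketch of an AutoSlog-style concept node and its activation,
# following the Damaged-Object example in the slides. All names here
# are hypothetical stand-ins for CIRCUS internals.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ConceptNode:
    concept: str      # concept to extract, e.g. "Damaged-Object"
    trigger: str      # word that activates the pattern
    position: str     # syntactic position of the filler
    constraint: str   # semantic class required of the filler
    enabling: str     # required voice of the trigger verb (simplified)

# Stand-in for a semantic lexicon lookup.
SEMANTIC_CLASS = {"two mobile homes": "physical-object"}

def apply_node(node: ConceptNode, clause: dict) -> Optional[tuple]:
    """Trigger -> enabling conditions -> extract at position -> test
    constraints, as in the Concept Node Algorithm slide."""
    if node.trigger not in clause["words"]:
        return None                                # no trigger word
    if clause["voice"] != node.enabling:
        return None                                # enabling conditions fail
    phrase = clause.get(node.position)             # phrase at the position
    if phrase and SEMANTIC_CLASS.get(phrase) == node.constraint:
        return (node.concept, phrase)              # constraints met: extract
    return None

damaged_object = ConceptNode("Damaged-Object", "destroyed",
                             "direct-object", "physical-object", "active")
clause = {"words": ["the", "twister", "destroyed", "two", "mobile", "homes"],
          "voice": "active", "direct-object": "two mobile homes"}
print(apply_node(damaged_object, clause))
# ('Damaged-Object', 'two mobile homes')
```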
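The Learning Algorithm and Learning Example slides reduce to: parse, try the ordered pattern list, and build a node definition from the first match. Below is a sketch under the same simplified clause representation as above; the two-entry pattern list is an assumption, since AutoSlog's actual list covers many more syntactic configurations.

```python
# Sketch of AutoSlog's concept-node generation step. The pattern list
# and the flat clause dicts are simplifying assumptions.

PATTERNS = [
    # (pattern name, test over a parsed clause and the target NP's role)
    ("<subject> passive-verb",
     lambda c, role: c["voice"] == "passive" and role == "subject"),
    ("active-verb <direct-object>",
     lambda c, role: c["voice"] == "active" and role == "direct-object"),
]

def learn_concept_node(clause, target_role, concept, semantic_class):
    """Apply the ordered pattern list; build a concept node definition
    from the first pattern that matches (Learning Algorithm, step 4)."""
    for name, test in PATTERNS:
        if test(clause, target_role):
            return {"concept": concept,
                    "trigger": clause["verb"],     # e.g. "destroyed"
                    "position": target_role,
                    "constraint": semantic_class,
                    "enabling": clause["voice"],
                    "learned-from": name}
    return None

# The Learning Example: "two mobile homes" is the direct object of the
# active verb "destroyed", annotated as Damaged_Object in the key.
clause = {"verb": "destroyed", "voice": "active"}
print(learn_concept_node(clause, "direct-object",
                         "Damaged_Object", "physical-object"))
```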
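For the AutoSlog-TS second pass, the slides say only "some ranking function". The relevance-rate times log2(frequency) score below is the one commonly associated with AutoSlog-TS in the literature; treating it as the intended function, and the counts layout, are assumptions.

```python
# Sketch of the AutoSlog-TS ranking pass: score every generated pattern
# so that patterns firing often, and mostly in relevant texts, rank first.

import math

def rank_patterns(counts):
    """counts maps pattern -> (hits in relevant texts, total hits)."""
    scored = []
    for pattern, (relevant_hits, total_hits) in counts.items():
        if total_hits == 0:
            continue
        relevance_rate = relevant_hits / total_hits
        scored.append((relevance_rate * math.log2(total_hits), pattern))
    return [pattern for _, pattern in sorted(scored, reverse=True)]

counts = {
    "<subject> exploded": (18, 20),      # frequent, mostly in relevant texts
    "saw <direct-object>": (30, 400),    # frequent but unselective
    "<subject> was razed": (2, 2),       # selective but rare
}
print(rank_patterns(counts))
# ['<subject> exploded', '<subject> was razed', 'saw <direct-object>']
```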

