CS276B: Web Search and Mining, Winter 2005
Lecture 8: Information Extraction II
Feb 1, 2005
(Includes slides borrowed from BBN, David Blei, Nick Kushmerick, Andrew McCallum, and Ray Mooney)

Information Extraction

Example: extracting the Author, Title, and Journal fields from the headers and references of a paper, e.g.:
Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, pages 237-285, May 1996.

Plan for IE

First class:
- Introduction to the IE problem
- Wrappers and wrapper induction
- Traditional NLP-based IE: the MUC competitions

Today:
- Pattern-learning systems: Rapier
- Probabilistic sequence models: HMMs

Learning for IE

- Writing accurate patterns for each slot for each domain (e.g., each web site) requires laborious software engineering.
- The alternative is to use machine learning:
  - Build a training set of documents paired with human-produced filled extraction templates.
  - Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Automatic Pattern-Learning Systems

Pros:
- Portable across domains
- Tend to have broad coverage
- Robust in the face of degraded input
- Automatically find appropriate statistical patterns
- System knowledge is not needed by those who supply the domain knowledge

Cons:
- Annotated training data, and lots of it, is needed
- Not necessarily better or cheaper than a hand-built solution

Examples: AutoSlog (Riloff et al.), WHISK (Soderland, UMass), Rapier (Mooney et al., UTexas), Ciravegna (Sheffield); these learn lexico-syntactic patterns from templates.
(Architecture diagram: a Trainer takes language input plus answers and produces a Model; a Decoder applies the Model to new language input to produce answers.)

Rapier [Califf & Mooney, AAAI-99]

- Rapier learns three regex-style patterns for each slot: a pre-filler pattern, a filler pattern, and a post-filler pattern.
- It is one of several recent trainable IE systems that incorporate linguistic constraints.
- (See also: SIFT [Miller et al., MUC-7]; SRV [Freitag, AAAI-98]; WHISK [Soderland, MLJ-99].)

RAPIER rules for extracting "transaction price":
- "…paid $11M for the company…"
- "…sold to the bank for an undisclosed amount…"
- "…paid Honeywell an undisclosed price…"

Part-of-Speech Tags & Semantic Classes

- Part of speech: the syntactic role of a specific word: noun (nn), proper noun (nnp), adjective (jj), adverb (rb), determiner (dt), verb (vb), "." ("."), …
- NLP offers well-known algorithms for automatically assigning POS tags to English, French, Japanese, … (>95% accuracy).
- Semantic classes: synonyms or other related words
  - "Price" class: price, cost, amount, …
  - "Month" class: January, February, March, …, December
  - "US State" class: Alaska, Alabama, …, Washington, Wyoming
- WordNet: a large online thesaurus containing (among other things) semantic classes.

Rapier Rule Matching Example

"…sold to the bank for an undisclosed amount…"
POS tags: vb pr det nn pr det jj nn
Semantic class: "amount" belongs to the price class

"…paid Honeywell an undisclosed price…"
POS tags: vb nnp det jj nn
Semantic class: "price" belongs to the price class

Rapier Rules: Details

A Rapier rule consists of a pre-filler pattern, a filler pattern, and a post-filler pattern, where:
- pattern := subpattern+
- subpattern := constraint+
A constraint is one of:
- Word: an exact word that must be present
- Tag: the matched word must have the given POS tag
- Class: the semantic class of the matched word
- Disjunction can be specified with "{…}"
- List of length N: between 0 and N words satisfying the other constraints

Rapier's Learning Algorithm

Input: a set of training examples (a list of documents annotated with "extract this substring")
Output: a set of rules
Init: Rules = a rule that exactly matches each training example
Repeat several times:
- Seed: select M examples randomly and generate the K most-accurate, maximally general filler-only rules (pre-filler = post-filler = "true").
- Grow: for N = 1, 2, 3, …, try to improve the K best rules by adding N words of pre-filler or post-filler context.
- Keep: Rules = Rules ∪ {the best of the K rules} − {subsumed rules}

Learning Example (One Iteration)

Two training examples:
- "… located in Atlanta, Georgia…"
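The rule-matching semantics sketched on the "Rapier rule matching example" and "Rapier rules: details" slides can be illustrated in code. This is a simplified invention, not Rapier's actual implementation: each constraint here matches exactly one token (real Rapier also supports bounded-length list elements), and the semantic-class table and all names are made up for the example.

```python
# A minimal sketch of Rapier-style rule matching (simplified: one token
# per constraint). Names and the tiny class table are illustrative only.

SEMANTIC_CLASSES = {"price": {"price", "cost", "amount"}}

def constraint_matches(constraint, token, tag):
    """A constraint may restrict the exact word, its POS tag, or its
    semantic class; sets model the {...} disjunctions from the slides."""
    if "words" in constraint and token.lower() not in constraint["words"]:
        return False
    if "tags" in constraint and tag not in constraint["tags"]:
        return False
    if "class" in constraint:
        if token.lower() not in SEMANTIC_CLASSES[constraint["class"]]:
            return False
    return True

def match_rule(rule, tokens, tags):
    """Return the (start, end) token span of the filler if the rule's
    pre-filler, filler, and post-filler patterns match in sequence."""
    pattern = rule["pre"] + rule["filler"] + rule["post"]
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        pairs = zip(tokens[i:i + n], tags[i:i + n])
        if all(constraint_matches(c, w, t)
               for c, (w, t) in zip(pattern, pairs)):
            start = i + len(rule["pre"])
            return (start, start + len(rule["filler"]))
    return None

# Hypothetical rule for an amount filler: a preposition pre-filler, then
# a <det> <jj> <nn in the price class> filler, with an empty post-filler.
rule = {"pre":    [{"tags": {"pr"}}],
        "filler": [{"tags": {"det"}}, {"tags": {"jj"}},
                   {"tags": {"nn"}, "class": "price"}],
        "post":   []}

tokens = "sold to the bank for an undisclosed amount".split()
tags = ["vb", "pr", "det", "nn", "pr", "det", "jj", "nn"]
span = match_rule(rule, tokens, tags)  # matches "an undisclosed amount"
```

Note how the tag constraints alone would also accept "to the bank …", but the price-class constraint on the final noun rejects "bank"; this is exactly the extra discriminating power the semantic-class slide motivates.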
- "… offices in Kansas City, Missouri…"
Init produces maximally specific rules (high precision, low recall); Seed produces maximally general rules (low precision, high recall); Grow arrives at an appropriately general rule (high precision, high recall).

Statistical Generative Models

- The previous discussion examined systems that use explicit extraction patterns/rules.
- Sequence models are statistical models of whole token sequences that, in effect, label subsequences.
- The best-known case is the generative hidden Markov model (HMM).

Pros:
- Well-understood underlying statistical models make it easy to use a wide range of tools from statistical decision theory.
- Portable, broad coverage, robust, good recall.

Cons:
- The range of usable features and patterns may be limited.
- Not necessarily as good for complex multi-slot patterns.

Name Extraction via HMMs

(Pipeline diagram: speech → speech recognition → text → extractor, which uses NE models to produce entities: locations, persons, organizations. A training program estimates the NE models from training sentences paired with answers.)

Example sentence: "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."

An easy but successful HMM application:
- Prior to 1997, no learning approach was competitive with hand-built rule systems.
- Since 1997, statistical approaches (BBN (Bikel et al.
1997), NYU, MITRE, CMU/JustSystems) have achieved state-of-the-art performance.

Applying HMMs to IE

- Document ⇒ generated by a stochastic process modelled by an HMM
- Token ⇒ word
- State ⇒ the "reason/explanation" for a given token:
  - A "Background" state emits tokens like "the", "said", …
  - A "Money" state emits tokens like "million", "euro", …
  - An "Organization" state emits tokens like "university", "company", …
- Extraction: via the Viterbi algorithm, a dynamic-programming technique for efficiently computing the most likely sequence of states that generated a document.

HMM Formalism

An HMM consists of:
- states s1, s2, … (including a special start state s1 and a special end state sn)
- a token alphabet a1, a2, …
- state transition probabilities P(si | sj)
- token emission probabilities P(ai | sj)
Widely used in many
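The extraction step described under "Applying HMMs to IE" can be sketched with a toy two-state model (Background and Money states) and a log-space Viterbi decoder. All probabilities, state names, and vocabulary below are invented for illustration; a real system such as BBN's would use many more states and learned parameters.

```python
# A minimal Viterbi decoder for the HMM formalism above: transition
# probabilities P(s_i | s_j) and emission probabilities P(a_i | s_j).
# All numbers here are toy values invented for illustration.
import math

STATES = ["Background", "Money"]
START = {"Background": 0.9, "Money": 0.1}
TRANS = {"Background": {"Background": 0.8, "Money": 0.2},
         "Money":      {"Background": 0.4, "Money": 0.6}}
EMIT = {"Background": {"paid": 0.2, "for": 0.2, "the": 0.2, "company": 0.2},
        "Money":      {"$": 0.3, "11": 0.3, "million": 0.3}}
FLOOR = 1e-6  # small floor probability for unseen tokens

def viterbi(tokens):
    """Most likely state sequence for a token sequence (log-space DP)."""
    def e(s, w):
        return math.log(EMIT[s].get(w, FLOOR))
    # Initialize with start-state and first-token emission scores.
    score = {s: math.log(START[s]) + e(s, tokens[0]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t+1
    for w in tokens[1:]:
        prev, new = {}, {}
        for s in STATES:
            best = max(STATES,
                       key=lambda p: score[p] + math.log(TRANS[p][s]))
            prev[s] = best
            new[s] = score[best] + math.log(TRANS[best][s]) + e(s, w)
        back.append(prev)
        score = new
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return list(reversed(path))

labels = viterbi(["paid", "$", "11", "million", "for", "the", "company"])
# "$ 11 million" is decoded as a run of Money states, the rest Background.
```

Extraction then reduces to reading off maximal runs of non-Background states, which is why the slides call the state the "reason/explanation" for each token.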