Lecture 16: Information Extraction
Oct. 26, 2007
ChengXiang Zhai
(Most slides are from Eugene Agichtein's and William Cohen's tutorials)
CS511 Advanced Database Management Systems

The Value of Text Data
- "Unstructured" text data is the primary form of human-generated information
  - Blogs, web pages, news, scientific literature, online reviews, ...
- Semi-structured data (database-generated): see Prof. Bing Liu's KDD webinar, http://www.cs.uic.edu/~liub/WCM-Refs.html
- The techniques discussed here are complementary to structured object extraction methods
- Need to extract structured information to effectively manage, search, and mine the data
- Information Extraction: a mature but active research area
  - Intersection of Computational Linguistics, Machine Learning, Data Mining, Databases, and Information Retrieval
  - Traditional focus on accuracy of extraction
  - Recently, attention paid to scalability

Example: Answering Queries Over Text
Text:
  For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...
Extracted table PEOPLE:
  Name               Title     Organization
  Bill Gates         CEO       Microsoft
  Bill Veghte        VP        Microsoft
  Richard Stallman   Founder   Free Soft
Query:
  Select Name From PEOPLE Where Organization = 'Microsoft'
Result:
  Bill Gates
  Bill Veghte
(from William Cohen's IE tutorial, 2003)

IE History: Pre-Web
- Mostly news articles
  - De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
  - Message Understanding Conference (MUC), DARPA ['87-'95]; TIPSTER ['92-'96]
- Early work dominated by hand-built models
  - E.g., SRI's FASTUS, hand-built FSMs
  - But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98]

IE History: Web
- AAAI '94 Spring Symposium on "Software Agents"
  - Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
- Tom Mitchell's WebKB, '96
  - Build KBs from the Web
- Wrapper Induction
  - Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], ...
  - Citeseer; Cora; FlipDog; contEd courses, corpInfo, ...
- WebFountain (IBM), KnowItAll (University of Washington)

IE History: Other Domains
- Biology
  - Gene/protein entity extraction
  - Protein/protein fact interaction
  - Automated curation/integration of databases
    - At CMU: SLIF (Murphy et al.), subcellular information from images + text in journal articles
    - At UIUC: BeeSpace (http://www.beespace.uiuc.edu)
- Email
  - EPCA, PAL, RADAR, CALO: intelligent office assistant that "understands" some part of email
    - At CMU: web site update requests, office-space requests, calendar scheduling requests, social network analysis of email

Landscape of IE Tasks (1/4): Degree of Formatting
- Text paragraphs without formatting
  - Example: Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
- Grammatical sentences and some formatting and links
- Non-grammatical snippets, rich formatting and links
- Tables

Landscape of IE Tasks (2/4): Intended Breadth of Coverage
- Web site specific (formatting): Amazon.com book pages
- Genre specific (layout): resumes
- Wide, non-specific (language): university names

Landscape of IE Tasks (3/4): Complexity
E.g., word patterns:
- Closed set: U.S. states
  - He was born in Alabama...
  - The big Wyoming sky...
- Regular set: U.S. phone numbers
  - Phone: (413) 545-1323
  - The CALD main office can be reached at 412-268-1299
- Complex pattern: U.S. postal addresses
  - University of Arkansas, P.O. Box 140, Hope, AR 71802
  - Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
- Ambiguous patterns, needing context and many sources of evidence: person names
  - ...was among the six houses sold by Hope Feldman that year.
  - Pawel Opalinski, Software Engineer at WhizBang Labs.

Landscape of IE Tasks (4/4): Single Field/Record
Text:
  Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
- Single entity ("named entity" extraction)
  - Person: Jack Welch
  - Person: Jeffrey Immelt
  - Location: Connecticut
- Binary relationship
  - Relation: Person-Title; Person: Jack Welch; Title: CEO
  - Relation: Company-Location; Company: General Electric; Location: Connecticut
- N-ary record
  - Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Landscape of IE Techniques (1/1): Models
Running example: "Abraham Lincoln was born in Kentucky."
- Lexicons: is a candidate a member of a list (Alabama, Alaska, ..., Wisconsin, Wyoming)?
- Classify pre-segmented candidates: a classifier decides which class each candidate belongs to
- Sliding window: a classifier decides which class each window belongs to; try alternate window sizes
- Boundary models: classifiers mark BEGIN and END positions
- Finite state machines: most likely state sequence?
- Context-free grammars: most likely parse? (tags NNP NNP V V P NP; constituents NP, PP, VP, S)
Any of these models can be used to capture words, formatting, or both.

Hand-Coded Methods
- Easy to construct in some cases
  - e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
- Intuitive to debug and maintain
  - Especially if written in a "high-level" language
  - Example [IBM Avatar]: ContactPattern <- RegularExpression(Email.body, "can be reached at")
- Can incorporate domain knowledge
- Scalability issues:
  - Labor-intensive to create
  - Highly domain-specific
  - Often corpus-specific
  - Rule matches can be expensive

Machine Learning Methods
- Can work well when lots of training data is easy to construct
- Can capture complex patterns that are hard to encode with hand-crafted rules
  - e.g., determine whether a review is positive or negative
  - extract long, complex gene names
  - Non-local dependencies
  - Example: The human T-cell leukemia/lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD binding
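
Illustration (not from the original slides): the hand-coded approach described above, applied to the "closed set" (U.S. states) and "regular set" (U.S. phone numbers) examples from the task landscape, might be sketched as follows. The regular expression, the four-entry lexicon, and the names PHONE_RE, STATE_LEXICON, extract_phones, and extract_states are all assumptions made for this sketch; a real system would need a full lexicon and a more careful pattern.

```python
import re

# Hypothetical hand-written rule (an assumption, not from the lecture):
# an optional parenthesized area code, then 3 digits and 4 digits,
# each optionally separated by a space, dot, or hyphen.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b")

# Tiny sample lexicon; only the four states named on the slides.
STATE_LEXICON = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}
STATE_RE = re.compile(
    r"\b(" + "|".join(sorted(re.escape(s) for s in STATE_LEXICON)) + r")\b"
)


def extract_phones(text):
    """Return phone numbers found in text, normalized to DDD-DDD-DDDD."""
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]


def extract_states(text):
    """Return lexicon entries that occur in text as whole words."""
    return STATE_RE.findall(text)


if __name__ == "__main__":
    samples = [
        "Phone: (413) 545-1323",
        "The CALD main office can be reached at 412-268-1299",
        "He was born in Alabama",
        "The big Wyoming sky",
    ]
    for s in samples:
        print(s, "->", extract_phones(s), extract_states(s))
```

Rules like these are quick to write and easy to debug, but they show the scalability issues listed under hand-coded methods: each new format or domain needs another pattern, and a plain lexicon lookup cannot resolve ambiguous cases such as "Hope" as a town versus "Hope Feldman" as a person.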