CS276B
Text Information Retrieval, Mining, and Exploitation
Lecture 6
Information Extraction I
Jan 28, 2003
(includes slides borrowed from Oren Etzioni, Andrew McCallum, Nick Kushmerick, BBN, and Ray Mooney)

Slide outline:
  Product information
  Product info
  It’s difficult because of textual inconsistency: digital cameras
  Classified Advertisements (Real Estate)
  Why doesn’t text search (IR) work?
  Extracting Job Openings from the Web
  Knowledge Extraction Vision
  Task: Information Extraction
  Other applications of IE Systems
  What about XML?
  Task: Wrapper Induction
  Amazon Book Description
  Extracted Book Template
  Template Types
  Wrappers: Simple Extraction Patterns
  Simple Template Extraction
  Pre-Specified Filler Extraction
  Wrapper tool-kits
  Wrapper induction
  Wrapper induction: Delimiter-based extraction
  Learning LR wrappers
  LR: Finding r1
  LR: Finding l1, l2 and r2
  A problem with LR wrappers
  One (of many) solutions: HLRT
  More sophisticated wrappers
  Boosted wrapper induction
  BWI: The basic idea
  Natural Language Processing
  Three generations of IE systems
  Trainable IE systems
  MUC: the genesis of IE
  Grep++ = Cascaded grepping
  Rule-based Extraction Examples
  Evaluating IE Accuracy
  MUC Information Extraction: State of the Art c. 1997
  Summary and prelude
  Good Basic IE References

Product information

Product info
- CNET markets this information
- How do they get most of it?
  - Phone calls
  - Typing.

It’s difficult because of textual inconsistency: digital cameras
- Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
- Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million
- Image sensor Total Pixels: Approx. 2.11 million-pixel
- Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)
- CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560 [V] )
- Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] )
- Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] )
These all came off the same manufacturer’s website!!
And this is a very technical domain. Try sofa beds.

Classified Advertisements (Real Estate)
Background:
- Advertisements are plain text
- Lowest common denominator: the only thing that 70+ newspapers with 20+ publishing systems can all handle

<ADNUM>2067206v1</ADNUM>
<DATE>March 02, 1998</DATE>
<ADTITLE>MADDINGTON $89,000</ADTITLE>
<ADTEXT>OPEN 1.00 - 1.45<BR>U 11 / 10 BERTRAM ST<BR> NEW TO MARKET Beautiful<BR>3 brm freestanding<BR>villa, close to shops & bus<BR>Owner moved to Melbourne<BR> ideally suit 1st home buyer,<BR> investor & 55 and over.<BR>Brian Hazelden 0418 958 996<BR> R WHITE LEEMING 9332 3477</ADTEXT>

Why doesn’t text search (IR) work?
What you search for in real estate advertisements:
- Suburbs. You might think this is easy, but:
  - Real estate agents: Coldwell Banker, Mosman
  - Phrases: Only 45 minutes from Parramatta
  - Multiple property ads have different suburbs
- Money: you want a range, not a textual match
  - Multiple amounts: was $155K, now $145K
  - Variations: offers in the high 700s [but not rents for $270]
- Bedrooms: similar issues (br, bdr, beds, B/R)

Extracting Job Openings from the Web
foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.html
  OtherCompanyJobs: foodscience.com-Job1

Knowledge Extraction Vision
Multi-dimensional Meta-data Extraction
[Figure labels: month axis J F M A M J J A; Meta-Data; India Bombing; NY Times, Andhra Bhoomi, Dinamani, Dainik Jagran]
EMPLOYEE / EMPLOYER Relationships:
- Jan Clesius works for Clesius Enterprises
- Bill Young works for InterMedia Inc.
COMPANY / LOCATION Relationships:
- Clesius Enterprises is in New York, NY
- InterMedia Inc. is in Boston, MA
Components:
- Topic Discovery
- Concept Indexing
- Thread Creation
- Term Translation
- Document Translation
- Story Segmentation
- Entity Extraction
- Fact Extraction

Task: Information Extraction
Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources
- Identify specific pieces of information in an unstructured or semi-structured textual document.
- Transform this unstructured information into structured relations in a database/ontology.
Suppositions:
- A lot of information that could be represented in a structured, semantically clear format isn’t.
- It may be costly, not desired, or not in one’s control (screen scraping) to change this.

Other applications of IE Systems
- Job resumes: BurningGlass, Mohomine
- Seminar announcements
- Continuing education course info from the web
- Molecular biology information from MEDLINE, e.g., extracting gene-drug interactions from biomed texts
- Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, test results
- Gathering earnings, profits, board members, etc. [corporate information] from the web and company reports
- Verification of construction industry specification documents (are the quantities correct/reasonable?)
- Extraction of political/economic/business changes from newspaper articles

What about XML?
Don’t XML, RDF, OIL, SHOE, DAML, XSchema, … obviate the need for information extraction?!
Yes:
- IE is sometimes used to “reverse engineer” HTML database interfaces; extraction would be much simpler if XML were exported instead of HTML.
- Ontology-aware editors will make it easier to enrich content with metadata.
No:
- Terabytes of legacy HTML.
- Data consumers are forced to accept the ontological decisions of data providers (e.g., <NAME>John Smith</NAME> vs. <NAME first="John" last="Smith"/>).
- Will you annotate every email you send? Every memo you write? Every photograph you scan?

Task: Wrapper Induction
Wrapper Induction
- Sometimes, the relations are structural.
  - Web pages generated by a database.
  - Tables, lists, etc.
- Wrapper induction usually targets regular relations that can be expressed by the structure of the document:
  - the item in bold in the 3rd column of the table is the price
- Handcoding a wrapper in Perl isn’t very viable
  - sites are numerous, and their surface structure mutates rapidly (around 10% failures each month)
- Wrapper induction techniques can also learn:
  - If there is a page about a research project X and there is a link near the word ‘people’ to a page that is about a person Y then Y is a member of the
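
The structural rule quoted on this slide (“the item in bold in the 3rd column of the table is the price”) can be sketched as a hand-coded wrapper. This is only a minimal illustration, assuming well-formed HTML with explicit <tr>, <td>, and <b> tags; the class name BoldInThirdColumn, the helper extract_prices, and the sample page are hypothetical and do not come from the lecture. It uses Python’s standard html.parser rather than the Perl the slide mentions.

```python
from html.parser import HTMLParser


class BoldInThirdColumn(HTMLParser):
    """Hypothetical hand-coded wrapper: bold text in the Nth <td> of each row."""

    def __init__(self, column=3):
        super().__init__()
        self.column = column    # 1-based index of the column to extract from
        self.col = 0            # index of the most recently opened cell in this row
        self.in_cell = False    # currently inside a <td>?
        self.in_bold = False    # currently inside <b> within the target cell?
        self.prices = []        # one extracted string per matching <b> element
        self._buf = []          # text fragments of the <b> being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.col = 0                      # new row: restart column count
        elif tag == "td":
            self.col += 1
            self.in_cell = True
        elif tag == "b" and self.in_cell and self.col == self.column:
            self.in_bold = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
            self.in_bold = False              # drop an unclosed <b> at cell end
        elif tag == "b" and self.in_bold:
            self.prices.append("".join(self._buf).strip())
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self._buf.append(data)


def extract_prices(html):
    """Run the wrapper over one page and return the extracted strings."""
    parser = BoldInThirdColumn()
    parser.feed(html)
    parser.close()
    return parser.prices


# Invented sample page, for illustration only.
PAGE = """
<table>
  <tr><td>Camera A</td><td>3.2 MP</td><td><b>$399</b> (was $449)</td></tr>
  <tr><td>Camera B</td><td><b>2.1 MP</b></td><td><b>$249</b></td></tr>
</table>
"""

print(extract_prices(PAGE))
```

If the site moved the price to another column or switched from <b> to a styled <span>, this wrapper would silently return nothing, which is the brittleness the slide attributes to hand-coded wrappers and the motivation for learning them instead.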