CS276B
Text Information Retrieval, Mining, and Exploitation
Lecture 6
Information Extraction I
Jan 28, 2003
(includes slides borrowed from Oren Etzioni, Andrew McCallum, Nick Kushmerick, BBN, and Ray Mooney)

Slide titles:
- Product information
- Product info
- It’s difficult because of textual inconsistency: digital cameras
- Classified Advertisements (Real Estate)
- Why doesn’t text search (IR) work?
- Extracting Job Openings from the Web
- Knowledge Extraction Vision
- Task: Information Extraction
- Other applications of IE Systems
- What about XML?
- Task: Wrapper Induction
- Amazon Book Description
- Extracted Book Template
- Template Types
- Wrappers: Simple Extraction Patterns
- Simple Template Extraction
- Pre-Specified Filler Extraction
- Wrapper tool-kits
- Wrapper induction
- Wrapper induction: Delimiter-based extraction
- Learning LR wrappers
- LR: Finding r1
- LR: Finding l1, l2 and r2
- A problem with LR wrappers
- One (of many) solutions: HLRT
- More sophisticated wrappers
- Boosted wrapper induction
- BWI: The basic idea
- Natural Language Processing
- Three generations of IE systems
- Trainable IE systems
- MUC: the genesis of IE
- Grep++ = Cascaded grepping
- Rule-based Extraction Examples
- Evaluating IE Accuracy
- MUC Information Extraction: State of the Art c. 1997
- Summary and prelude
- Good Basic IE References

Product information

Product info
- CNET markets this information
- How do they get most of it?
  - Phone calls
  - Typing.

It’s difficult because of textual inconsistency: digital cameras
- Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
- Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million
- Image sensor Total Pixels: Approx. 2.11 million-pixel
- Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)
- CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560 [V] )
- Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] )
- Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] )
These all came off the same manufacturer’s website!!
And this is a very technical domain. Try sofa beds.

Classified Advertisements (Real Estate)
Background:
- Advertisements are plain text
- Lowest common denominator: only thing that 70+ newspapers with 20+ publishing systems can all handle

<ADNUM>2067206v1</ADNUM>
<DATE>March 02, 1998</DATE>
<ADTITLE>MADDINGTON $89,000</ADTITLE>
<ADTEXT>OPEN 1.00 - 1.45<BR>U 11 / 10 BERTRAM ST<BR> NEW TO MARKET Beautiful<BR>3 brm freestanding<BR>villa, close to shops & bus<BR>Owner moved to Melbourne<BR> ideally suit 1st home buyer,<BR> investor & 55 and over.<BR>Brian Hazelden 0418 958 996<BR> R WHITE LEEMING 9332 3477</ADTEXT>

Why doesn’t text search (IR) work?
What you search for in real estate advertisements:
- Suburbs. You might think easy, but:
  - Real estate agents: Coldwell Banker, Mosman
  - Phrases: Only 45 minutes from Parramatta
  - Multiple property ads have different suburbs
- Money: want a range not a textual match
  - Multiple amounts: was $155K, now $145K
  - Variations: offers in the high 700s [but not rents for $270]
- Bedrooms: similar issues (br, bdr, beds, B/R)

Extracting Job Openings from the Web
foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.html
  OtherCompanyJobs: foodscience.com-Job1

Knowledge Extraction Vision
Multi-dimensional Meta-data Extraction
[Figure labels: month axis J F M A M J J A; Meta-Data; India Bombing; NY Times, Andhra Bhoomi, Dinamani, Dainik Jagran]
EMPLOYEE / EMPLOYER Relationships:
- Jan Clesius works for Clesius Enterprises
- Bill Young works for InterMedia Inc.
COMPANY / LOCATION Relationships:
- Clesius Enterprises is in New York, NY
- InterMedia Inc. is in Boston, MA
Processing steps named in the figure:
- Topic Discovery
- Concept Indexing
- Thread Creation
- Term Translation
- Document Translation
- Story Segmentation
- Entity Extraction
- Fact Extraction

Task: Information Extraction
Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources
- Identify specific pieces of information in an unstructured or semi-structured textual document.
- Transform this unstructured information into structured relations in a database/ontology.
Suppositions:
- A lot of information that could be represented in a structured, semantically clear format isn’t
- It may be costly, not desired, or not in one’s control (screen scraping) to change this.

Other applications of IE Systems
- Job resumes: BurningGlass, Mohomine
- Seminar announcements
- Continuing education courses info from the web
- Molecular biology information from MEDLINE, e.g., extracting gene-drug interactions from biomed texts
- Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, test results.
- Gathering earnings, profits, board members, etc. [corporate information] from web, company reports
- Verification of construction industry specifications documents (are the quantities correct/reasonable?)
- Extraction of political/economic/business changes from newspaper articles

What about XML?
Don’t XML, RDF, OIL, SHOE, DAML, XSchema, … obviate the need for information extraction?!??!
Yes:
- IE is sometimes used to “reverse engineer” HTML database interfaces; extraction would be much simpler if XML were exported instead of HTML.
- Ontology-aware editors will make it easier to enrich content with metadata.
No:
- Terabytes of legacy HTML.
- Data consumers forced to accept ontological decisions of data providers (e.g., <NAME>John Smith</NAME> vs. <NAME first="John" last="Smith"/>).
- Will you annotate every email you send? Every memo you write? Every photograph you scan?

Task: Wrapper Induction
Wrapper Induction
- Sometimes, the relations are structural.
  - Web pages generated by a database.
  - Tables, lists, etc.
- Wrapper induction usually targets regular relations that can be expressed by the structure of the document:
  - the item in bold in the 3rd column of the table is the price
- Handcoding a wrapper in Perl isn’t very viable
  - sites are numerous, and their surface structure mutates rapidly (around 10% failures each month)
- Wrapper induction techniques can also learn:
  - If there is a page about a research project X and there is a link near the word ‘people’ to a page that is about a person Y then Y is a member of the
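
The structural rule quoted in the wrapper induction slide ("the item in bold in the 3rd column of the table is the price") can be sketched as a small hand-coded wrapper. This is only an illustration in Python rather than Perl, using the standard-library html.parser module. The sample page, the class name BoldThirdColumnWrapper, and the helper extract_prices are hypothetical and do not come from the lecture.

```python
from html.parser import HTMLParser


class BoldThirdColumnWrapper(HTMLParser):
    """Hypothetical hand-coded wrapper: collect bold text in one table column."""

    def __init__(self, column=3):
        super().__init__()
        self.column = column   # 1-based index of the target column
        self.cell = 0          # index of the current <td> within the current row
        self.in_cell = False   # True while inside a <td>
        self.bold_depth = 0    # > 0 while inside <b>/<strong> in the target cell
        self.prices = []       # one string per bold run found in the target column

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell = 0
        elif tag == "td":
            self.cell += 1
            self.in_cell = True
        elif tag in ("b", "strong") and self.in_cell and self.cell == self.column:
            if self.bold_depth == 0:
                self.prices.append("")  # start a new extracted value
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
            self.bold_depth = 0
        elif tag in ("b", "strong") and self.bold_depth:
            self.bold_depth -= 1

    def handle_data(self, data):
        if self.bold_depth:
            self.prices[-1] += data


def extract_prices(html, column=3):
    """Run the wrapper over an HTML string and return the stripped bold values."""
    wrapper = BoldThirdColumnWrapper(column)
    wrapper.feed(html)
    wrapper.close()
    return [p.strip() for p in wrapper.prices]


# Made-up page standing in for a database-generated product listing.
PAGE = """
<table>
  <tr><td>Camera A</td><td>3.3 megapixel</td><td><b>$499</b> list</td></tr>
  <tr><td>Camera B</td><td>2.1 megapixel</td><td><b>$299</b> list</td></tr>
</table>
"""

print(extract_prices(PAGE))  # → ['$499', '$299']
```

A rule like this returns an empty list as soon as a site moves the price to a different column or drops the bold markup, which is the kind of breakage the slide attributes to rapidly mutating surface structure.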


Stanford CS 276B - Information Extraction I
