Stanford CS 276B - Text Mining I

Course: CS 276B - Text Information Retrieval, Mining, and Exploitation

1. CS276B Web Search and Mining
   Lecture 10
   Text Mining I
   Feb 8, 2005
   (includes slides borrowed from Marti Hearst)

2. Text Mining
   - Today
     - Introduction
     - Lexicon construction
     - Topic Detection and Tracking
   - Future
     - Two more text mining lectures
     - Question Answering
     - Summarization
     - …and more

3. The business opportunity in text mining…
   [Chart: unstructured vs. structured data by data volume and market cap; axis scale 0 to 100]

4. Corporate Knowledge "Ore"
   - Email
   - Insurance claims
   - News articles
   - Web pages
   - Patent portfolios
   - IRC
   - Scientific articles
   - Customer complaint letters
   - Contracts
   - Transcripts of phone calls with customers
   - Technical documents
   Stuff not very accessible via standard data-mining

5. Text Knowledge Extraction Tasks
   - Small Stuff. Useful nuggets of information that a user wants:
     - Question Answering
     - Information Extraction (DB filling)
     - Thesaurus Generation
   - Big Stuff. Overviews:
     - Summary Extraction (documents or collections)
     - Categorization (documents)
     - Clustering (collections)
   - Text Data Mining: Interesting unknown correlations that one can discover

6. Text Mining
   - The foundation of most commercial "text mining" products is all the stuff we have already covered:
     - Information Retrieval engine
     - Web spider/search
     - Text classification
     - Text clustering
     - Named entity recognition
     - Information extraction (only sometimes)
   - Is this text mining? What else is needed?

7. One tool: Question Answering
   - Goal: Use Encyclopedia/other source to answer "Trivial Pursuit-style" factoid questions
   - Example: "What famed English site is found on Salisbury Plain?"
   - Method:
     - Heuristics about question type: who, when, where
     - Match up noun phrases within and across documents (much use of named entities)
     - Coreference is a classic IE problem too!
   - More focused response to user need than standard vector space IR
   - Murax, Kupiec, SIGIR 1993; huge amount of recent work

8. Another tool: Summarizing
   - High-level summary or survey of all main points?
   - How to summarize a collection?
   - Example: sentence extraction from a single document (Kupiec et al. 1995; much subsequent work)
     - Start with training set, allows evaluation
     - Create heuristics to identify important sentences: position, IR score, particular discourse cues
     - Classification function estimates the probability a given sentence is included in the abstract
     - 42% average precision

9. IBM Text Miner terminology: Example of Vocabulary found
   - Certificate of deposit
   - CMOs
   - Commercial bank
   - Commercial paper
   - Commercial Union Assurance
   - Commodity Futures Trading Commission
   - Consul Restaurant
   - Convertible bond
   - Credit facility
   - Credit line
   - Debt security
   - Debtor country
   - Detroit Edison
   - Digital Equipment
   - Dollars of debt
   - End-March
   - Enserch
   - Equity warrant
   - Eurodollar
   - …

10. What is Text Data Mining?
   - People's first thought: Make it easier to find things on the Web.
     - But this is information retrieval!
   - The metaphor of extracting ore from rock:
     - Does make sense for extracting documents of interest from a huge pile.
     - But does not reflect notions of DM in practice. Rather:
       - finding patterns across large collections
       - discovering heretofore unknown information

11. Real Text DM
   - What would finding a pattern across a large text collection really look like?
   - Discovering heretofore unknown information is not what we usually do with text.
     - (If it weren't known, it could not have been written by someone!)
   - However, there is a field whose goal is to learn about patterns in text for its own sake …
   - Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

12. Definitions of Text Mining
   - Text mining mainly is about somehow extracting the information and knowledge from text;
   - 2 definitions:
     - Any operation related to gathering and analyzing text from external sources for business intelligence purposes;
     - Discovery of knowledge previously unknown to the user in text;
   - Text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers and to discover relationships between related facts that span wide domains of inquiry.

13. True Text Data Mining: Don Swanson's Medical Work
   - Given
     - medical titles and abstracts
     - a problem (incurable rare disease)
     - some medical expertise
   - find causal links among titles
     - symptoms
     - drugs
     - results
   - E.g.: Magnesium deficiency related to migraine
     - This was found by extracting features from medical literature on migraines and nutrition

14. Swanson Example (1991)
   - Problem: Migraine headaches (M)
     - Stress is associated with migraines;
     - Stress can lead to a loss of magnesium;
     - Calcium channel blockers prevent some migraines;
     - Magnesium is a natural calcium channel blocker;
     - Spreading cortical depression (SCD) is implicated in some migraines;
     - High levels of magnesium inhibit SCD;
     - Migraine patients have high platelet aggregability;
     - Magnesium can suppress platelet aggregability.
   - All extracted from medical journal titles

15. Swanson's TDM
   - Two of his hypotheses have received some experimental verification.
   - His technique
     - Only partially automated
     - Required medical expertise
   - Few people are working on this kind of information aggregation problem.

16. Gathering Evidence
   [Diagram labels: migraine, magnesium, stress, CCB, PA, SCD, All Nutrition Research, All Migraine Research]

17. Or maybe it was already known?

18. Lexicon Construction

19. What is a Lexicon?
   - A database of the vocabulary of a particular domain (or a language)
   - More than a list of words/phrases
   - Usually some linguistic information
     - Morphology (manag- e/es/ing/ed → manage)
     - Syntactic patterns (transitivity etc.)
   - Often some semantic information
     - Is-a hierarchy
     - Synonymy
     - Numbers convert to normal form: Four → 4
     - Dates convert to normal form
     - Alternative names convert to explicit form
       - Mr. Carr, Tyler, Presenter → Tyler Carr

20. Lexica in Text Mining
   - Many text mining tasks require named entity recognition.
   - Named entity recognition requires a lexicon in most cases.
   - Example 1: Question answering
     - Where is Mount Everest?
     - A list of geographic locations increases accuracy
   - Example 2: Information extraction
     - Consider scraping book data from amazon.com
     - Template contains field "publisher"
     - A list of publishers increases accuracy
   - Manual construction is expensive: 1000s of person hours!
   - Sometimes an unstructured
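
As a rough illustration of the two lexicon slides, the sketch below shows a lexicon used in two ways: normalizing terms to a canonical form (number words, alternative names) and spotting known locations in a question. Everything in it is an assumption made for this example and not part of the lecture: the table names NUMBER_WORDS, ALIASES and LOCATIONS, their tiny hand-written contents, and the helpers normalize and find_locations. The alias table also assumes that "Mr. Carr", "Tyler" and "Presenter" are three alternative names for the same person.

```python
# Hypothetical toy lexicon. A real lexicon would hold thousands of entries
# and would normally not be written by hand.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4"}
ALIASES = {
    "mr. carr": "Tyler Carr",
    "tyler": "Tyler Carr",
    "presenter": "Tyler Carr",
}
LOCATIONS = {"mount everest", "salisbury plain"}


def normalize(term):
    """Return the canonical form of a term, or the stripped term if unknown."""
    key = term.strip().lower()
    if key in NUMBER_WORDS:
        return NUMBER_WORDS[key]
    if key in ALIASES:
        return ALIASES[key]
    return term.strip()


def find_locations(text):
    """Return lexicon locations mentioned in text, in order of appearance."""
    lowered = text.lower()
    hits = []
    for location in LOCATIONS:
        position = lowered.find(location)
        if position != -1:
            hits.append((position, location))
    # Sort by position so the result does not depend on set iteration order.
    return [location for _, location in sorted(hits)]


if __name__ == "__main__":
    print(normalize("Four"))
    print(normalize("Presenter"))
    print(find_locations("Where is Mount Everest?"))
```

Plain substring lookup is the simplest possible matcher. It ignores tokenization, ambiguity and spelling variants, which is part of why the slides treat lexicon construction as a problem in its own right.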

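Returning to the Swanson example (slide 14), the linking idea can be sketched as a search for intermediate concepts: terms that co-occur with migraine in some titles and with magnesium in others, even though no single title mentions both. This is only a toy under stated assumptions and not Swanson's actual procedure, which the slides describe as partially automated and dependent on medical expertise. The TITLES list paraphrases the statements on the slide, and CONCEPTS, concepts_in, build_neighbours and linking_concepts are hypothetical names introduced here.

```python
from collections import defaultdict

# Toy "titles" paraphrasing the statements on the Swanson example slide.
TITLES = [
    "stress is associated with migraine",
    "stress can lead to a loss of magnesium",
    "calcium channel blockers prevent some migraine",
    "magnesium is a natural calcium channel blocker",
    "spreading cortical depression is implicated in some migraine",
    "high levels of magnesium inhibit spreading cortical depression",
    "migraine patients have high platelet aggregability",
    "magnesium can suppress platelet aggregability",
]

# Hypothetical concept vocabulary; a real system would need a medical lexicon.
CONCEPTS = [
    "migraine",
    "magnesium",
    "stress",
    "calcium channel blocker",
    "spreading cortical depression",
    "platelet aggregability",
]


def concepts_in(title):
    """Return the set of known concepts that appear as substrings of a title."""
    return {concept for concept in CONCEPTS if concept in title}


def build_neighbours(titles):
    """Map each concept to the concepts it shares at least one title with."""
    neighbours = defaultdict(set)
    for title in titles:
        found = concepts_in(title)
        for concept in found:
            neighbours[concept] |= found - {concept}
    return neighbours


def linking_concepts(titles, a, c):
    """Return concepts that co-occur with both a and c, sorted alphabetically."""
    neighbours = build_neighbours(titles)
    return sorted((neighbours[a] & neighbours[c]) - {a, c})


if __name__ == "__main__":
    for concept in linking_concepts(TITLES, "migraine", "magnesium"):
        print(concept)
```

On this toy input the two target concepts never share a title, yet four intermediate concepts connect them. Co-occurrence alone says nothing about causality, so judging which links are plausible still needs an expert.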