1. CS276B: Web Search and Mining
Lecture 10: Text Mining I
Feb 8, 2005
(includes slides borrowed from Marti Hearst)

2. Text Mining
- Today
  - Introduction
  - Lexicon construction
  - Topic Detection and Tracking
- Future
  - Two more text mining lectures
  - Question Answering
  - Summarization
  - …and more

3. The business opportunity in text mining…
[Figure: chart comparing unstructured and structured data by data volume and market cap; axis from 0 to 100]

4. Corporate Knowledge "Ore"
- Email
- Insurance claims
- News articles
- Web pages
- Patent portfolios
- IRC
- Scientific articles
- Customer complaint letters
- Contracts
- Transcripts of phone calls with customers
- Technical documents
Stuff not very accessible via standard data-mining

5. Text Knowledge Extraction Tasks
- Small Stuff. Useful nuggets of information that a user wants:
  - Question Answering
  - Information Extraction (DB filling)
  - Thesaurus Generation
- Big Stuff. Overviews:
  - Summary Extraction (documents or collections)
  - Categorization (documents)
  - Clustering (collections)
- Text Data Mining: Interesting unknown correlations that one can discover

6. Text Mining
- The foundation of most commercial "text mining" products is all the stuff we have already covered:
  - Information Retrieval engine
  - Web spider/search
  - Text classification
  - Text clustering
  - Named entity recognition
  - Information extraction (only sometimes)
- Is this text mining? What else is needed?

7. One tool: Question Answering
- Goal: Use Encyclopedia/other source to answer "Trivial Pursuit-style" factoid questions
- Example: "What famed English site is found on Salisbury Plain?"
- Method:
  - Heuristics about question type: who, when, where
  - Match up noun phrases within and across documents (much use of named entities)
    - Coreference is a classic IE problem too!
- More focused response to user need than standard vector space IR
- Murax, Kupiec, SIGIR 1993; huge amount of recent work

8. Another tool: Summarizing
- High-level summary or survey of all main points?
- How to summarize a collection?
- Example: sentence extraction from a single document (Kupiec et al. 1995; much subsequent work)
  - Start with training set, allows evaluation
  - Create heuristics to identify important sentences: position, IR score, particular discourse cues
  - Classification function estimates the probability a given sentence is included in the abstract
  - 42% average precision

9. IBM Text Miner terminology: Example of Vocabulary found
- Certificate of deposit
- CMOs
- Commercial bank
- Commercial paper
- Commercial Union Assurance
- Commodity Futures Trading Commission
- Consul Restaurant
- Convertible bond
- Credit facility
- Credit line
- Debt security
- Debtor country
- Detroit Edison
- Digital Equipment
- Dollars of debt
- End-March
- Enserch
- Equity warrant
- Eurodollar
- …

10. What is Text Data Mining?
- People's first thought: Make it easier to find things on the Web.
  - But this is information retrieval!
- The metaphor of extracting ore from rock:
  - Does make sense for extracting documents of interest from a huge pile.
  - But does not reflect notions of DM in practice. Rather:
    - finding patterns across large collections
    - discovering heretofore unknown information

11. Real Text DM
- What would finding a pattern across a large text collection really look like?
- Discovering heretofore unknown information is not what we usually do with text.
  - (If it weren't known, it could not have been written by someone!)
- However, there is a field whose goal is to learn about patterns in text for its own sake …
- Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

12. Definitions of Text Mining
- Text mining is mainly about somehow extracting the information and knowledge from text;
- 2 definitions:
  - Any operation related to gathering and analyzing text from external sources for business intelligence purposes;
  - Discovery of knowledge previously unknown to the user in text;
- Text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers and to discover relationships between related facts that span wide domains of inquiry.

13. True Text Data Mining: Don Swanson's Medical Work
- Given
  - medical titles and abstracts
  - a problem (incurable rare disease)
  - some medical expertise
- find causal links among
  - titles
  - symptoms
  - drugs
  - results
- E.g.: Magnesium deficiency related to migraine
  - This was found by extracting features from medical literature on migraines and nutrition

14. Swanson Example (1991)
- Problem: Migraine headaches (M)
  - Stress is associated with migraines;
  - Stress can lead to a loss of magnesium;
  - Calcium channel blockers prevent some migraines;
  - Magnesium is a natural calcium channel blocker;
  - Spreading cortical depression (SCD) is implicated in some migraines;
  - High levels of magnesium inhibit SCD;
  - Migraine patients have high platelet aggregability;
  - Magnesium can suppress platelet aggregability.
- All extracted from medical journal titles

15. Swanson's TDM
- Two of his hypotheses have received some experimental verification.
- His technique
  - Only partially automated
  - Required medical expertise
- Few people are working on this kind of information aggregation problem.

16. Gathering Evidence
[Figure: diagram with labels migraine, magnesium, stress, CCB, PA, SCD, All Nutrition Research, All Migraine Research]

17. Or maybe it was already known?

18. Lexicon Construction

19. What is a Lexicon?
- A database of the vocabulary of a particular domain (or a language)
- More than a list of words/phrases
- Usually some linguistic information
  - Morphology (manag- e/es/ing/ed → manage)
  - Syntactic patterns (transitivity etc)
- Often some semantic information
  - Is-a hierarchy
  - Synonymy
  - Numbers convert to normal form: Four → 4
  - Dates convert to normal form
  - Alternative names convert to explicit form
    - Mr. Carr, Tyler, Presenter → Tyler Carr

20. Lexica in Text Mining
- Many text mining tasks require named entity recognition.
- Named entity recognition requires a lexicon in most cases.
- Example 1: Question answering
  - Where is Mount Everest?
  - A list of geographic locations increases accuracy
- Example 2: Information extraction
  - Consider scraping book data from amazon.com
  - Template contains field "publisher"
  - A list of publishers increases accuracy
- Manual construction is expensive: 1000s of person hours!
- Sometimes an unstructured
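
As a rough illustration of how a lexicon supports the tasks on slides 19 and 20, the sketch below normalizes number words, maps alternative names to an explicit form, and looks up known locations in a question. The tiny dictionaries and the function names (`normalize_number`, `resolve_alias`, `find_locations`) are hypothetical, made up for this example. It also assumes that "Mr. Carr", "Tyler", and "Presenter" are each alternative names for Tyler Carr, which is one reading of the slide.

```python
import re

# Hypothetical toy lexicon; a real one would hold many thousands of entries.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
LOCATIONS = {"mount everest", "salisbury plain"}
ALIASES = {
    "mr. carr": "Tyler Carr",
    "tyler": "Tyler Carr",
    "presenter": "Tyler Carr",
}


def normalize_number(token):
    """Map a number word to digits (Four -> 4); return other tokens unchanged."""
    return NUMBER_WORDS.get(token.lower(), token)


def resolve_alias(mention):
    """Map an alternative name to its explicit form; return unknown mentions unchanged."""
    return ALIASES.get(mention.lower(), mention)


def find_locations(text):
    """Return the lexicon locations that occur in the text as whole words."""
    lowered = text.lower()
    hits = [
        loc for loc in LOCATIONS
        if re.search(r"\b" + re.escape(loc) + r"\b", lowered)
    ]
    return sorted(hits)


print(normalize_number("Four"))
print(resolve_alias("Mr. Carr"))
print(find_locations("Where is Mount Everest?"))
```

Exact string lookup like this is the simplest possible use of a lexicon. It shows why coverage matters: any location or alias missing from the lists is silently passed through, which is the cost the slide attributes to manual construction.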
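
The Swanson example on slide 14 can be read as a search for intermediate terms: concepts that co-occur with migraine in one literature and with magnesium in another. The sketch below illustrates that reading only. It is not Swanson's actual procedure, which the slides describe as only partially automated and dependent on medical expertise. The function name `linking_terms` is hypothetical, and the term sets are toy data written by hand from the statements on the slide, plus one invented distractor ("bone density").

```python
def linking_terms(lit_a, lit_c, term_a, term_c):
    """Return terms that co-occur with term_a in lit_a and with term_c in lit_c.

    Each literature is a list of term sets, one set per title.
    """
    def cooccurring(literature, target):
        found = set()
        for terms in literature:
            if target in terms:
                found |= terms - {target}
        return found

    return cooccurring(lit_a, term_a) & cooccurring(lit_c, term_c)


# Toy data: hand-built term sets, not real journal titles.
migraine_titles = [
    {"migraine", "stress"},
    {"migraine", "calcium channel blocker"},
    {"migraine", "spreading cortical depression"},
    {"migraine", "platelet aggregability"},
]
nutrition_titles = [
    {"magnesium", "stress"},
    {"magnesium", "calcium channel blocker"},
    {"magnesium", "spreading cortical depression"},
    {"magnesium", "platelet aggregability"},
    {"magnesium", "bone density"},  # invented distractor with no migraine link
]

links = linking_terms(migraine_titles, nutrition_titles, "migraine", "magnesium")
print(sorted(links))
```

Set intersection only proposes candidate links. Judging whether a shared term reflects a plausible causal chain is the step that still required expertise in Swanson's work.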