
CS490D: Introduction to Data Mining
Prof. Chris Clifton
March 26, 2004
Text Mining

Data Mining in Text
• Association search in text corpuses provides suggestive information
  – Groups of related entities
  – Clusters that identify topics
• Flexibility is crucial
  – Describe what an interesting pattern would look like
  – What causes items to be considered associated: same document, sequential associations, ?
  – Choice of techniques to rank the results
• Integrate with Information Retrieval systems
  – Common base preprocessing (e.g. Natural Language processing)
  – Need IR system to explore/understand text mining results

Why Text is Hard
• Lack of structure
  – Hard to preselect only data relevant to questions asked
  – Lots of irrelevant “data” (words that don’t correspond to interesting concepts)
• Errors in information
  – Misleading/wrong information in text
  – Synonyms/homonyms: concept identification hard
  – Difficult to parse meaning
    I believe X is a key player vs. I doubt X is a key player
• Sheer volume of “patterns”
  – Need ability to focus on user needs
• Consequence for results:
  – False associations
  – Vague, dull associations

What About Existing Products?
Data Mining Tools
• Designed for particular types of analysis on structured data
  – Structure of data helps define known relationship
  – Small, inflexible set of “pattern templates”
• Text is “free flow of ideas”, tough to capture precise meaning
  – Many patterns exist that aren’t relevant to problem
• Experiments with COTS products on tagged text corpuses demonstrate these problems
  – “Discovery overload”: many irrelevant patterns, density of actionable items too low
  – Lack of integration with Information Retrieval systems makes further exploration/understanding of results difficult

What About Existing Products?
“Text Mining” Information Retrieval Tools
• “Text Mining” is (mis?)used to mean information retrieval
  – IBM TextMiner (now called “IBM Text Search Engine”)
    http://www.ibm.com/software/data/iminer/fortext/ibm_tse.html
  – DataSet
    http://www.ds-dataset.com/default.htm
• These are Information Retrieval products
  – Goal is to get the right document
• May use data mining technology (clustering, association)
  – Used to improve retrieval, not discover associations among concepts
• No capability to discover patterns among concepts in the documents
• May incorporate technologies such as concept extraction that ease integration with a Knowledge Discovery in Text system

What About Existing Products?
Concept Visualization
• Goal: Visualize concepts in a corpus
  – SemioMap
    http://www.semio.com/
  – SPIRE
    http://www.pnl.gov/Statistics/research/spire.html
  – Aptex Convectis
    http://www.aptex.com/products-convectis.htm
• High-level concept visualization
  – Good for major trends, patterns
• Find concepts related to a particular query
  – Helps find patterns if you know some of the instances of the pattern
• Hard to visualize “rare event” patterns

What About Existing Products?
Corpus-Specific Text Mining
• Some “Knowledge Discovery in Text” products
  – Technology Watch (patent office)
    http://www.ibm.com/solutions/businessintelligence/textmining/techwatch.htm
  – TextSmart (survey responses)
    http://www.spss.com/textsmart
• Provide limited types of analyses
  – Fixed “questions” to be answered
  – Primarily high-level (similar to concept visualization)
• Domain-specific
  – Designed for specific corpus and task
  – Substantial development to extend to new domain or corpus

What About Existing Products?
Text Mining Tools
• True “Text Mining” just beginning to come to market
  – Associations: ClearForest
    http://www.clearforest.com
  – Semantic Networks: Megaputer’s TextAnalyst™
    http://www.megaputer.com/taintro.html
  – IBM Intelligent Miner for Text (toolkit)
    http://www.ibm.com/software/data/iminer/fortext
• Currently limited capabilities (but improving)
  – Further research needed
  – Directed research will ensure the right problems are solved
• Major Problem: Flood of Information
  – Analyzing results as bad as reading the documents

Scenario: Find Active Leaders in a Region
• Goal: Identify people to negotiate with prior to relief effort
  – Want “general picture” of a region
  – No expert that already knows the situation is available
• Problems:
  – No clear “central authority”; problems are regional
  – Many claim power/control, few have it for long
  – Must include all key players in a region
• Solution: Find key players over time
  – Who is key today?
  – Past players (may make a comeback)

Example: Association Rules in News Stories
• Goal: Find related (competing or cooperating) players in regions
• Simple association rules (any associated concepts) give too many results
• Flexible search for associations allows us to specify what we want: gives fewer, more appropriate results

Person1             Person2           Support
Natalie Allen       Linden Soles      117
Leon Harris         Joie Chen         53
Ron Goldman         Nicole Simpson    19
. . .
Mobutu Sese Seko    Laurent Kabila    10

Person1             Person2           Place       Support
Mobutu Sese Seko    Laurent Kabila    Kinshasa    7

Conventional Data Mining System Architecture
[Diagram components: Data Repository, Data Mining Tool, Patterns]

Using Conventional Tools: Text Mining System Architecture
[Diagram components: Text Corpus, Repository, Concept Extraction, Association Rule Product]
Goal: Find Cooperating/Combating Leaders in a territory

Person1             Person2           Support
Natalie Allen       Linden Soles      117
Leon Harris         Joie Chen         53
Ron Goldman         Nicole Simpson    19
. . .
Mobutu Sese Seko    Laurent Kabila    10

Too Many Results

Flexible Text Mining System Architecture
[Diagram components: Text Corpus, Repository, Concept Extraction, Query Flocks, DBMS]
Pattern Description: Person1 and Person2 at Place

Person1             Person2           Place       Support
Mobutu Sese Seko    Laurent Kabila    Kinshasa    7

Flexible: Adapts to new tasks
Text Mining System Architecture
[Diagram components: Text Corpus, Repository, Concept Extraction, Query Flocks, DBMS]
Pattern Description: Person1 then Person2 at Place in 1 month

Person1             Person2           Place       Avg. Days
Mobutu Sese Seko    Laurent Kabila    Kinshasa    12

Example of Flexible Association Search
Broadcast News Navigator Concept Correlation Tool
[Screenshot of a query form with these fields:]
• News Source (default all)
• Broadcast Dates: All Dates / From Date, To Date / Last Week / Last Month / Last Year
• Find correlations between [Person | Location | Organization], [Person | Location | Organization], and [Person | Location | Organization], reported in connection with the same ___, within ___ days, having a minimum of ___ co-occurrences. Rank results by ___.
• News sources listed: 60 Minutes, ABC News Nightline, ABC World News, British Foreign, CBS Evening News, CNN Early Edition, CNN Early Prime, CNN
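The three pattern descriptions on the slides (any two people together, Person1 and Person2 at Place, Person1 then Person2 at Place within a month) can be sketched as simple counting over extracted concepts. This is only a minimal illustration, not the Query Flocks system named on the slides. The story records, the field names (`date`, `persons`, `places`), the function names, and the toy corpus with placeholder names A to D and X, Y are all assumptions made for the example. The reading of "then ... in 1 month" as "mentioned in a later story, at the same place, at most 30 days apart" is likewise an assumption.

```python
from collections import Counter, defaultdict
from datetime import date
from itertools import combinations

# Hypothetical toy corpus: each story lists the concepts a (not shown)
# concept-extraction step is assumed to have found in it.
STORIES = [
    {"date": date(2004, 1, 1), "persons": {"A", "B"}, "places": {"X"}},
    {"date": date(2004, 1, 5), "persons": {"A", "B"}, "places": {"X"}},
    {"date": date(2004, 1, 9), "persons": {"A", "B", "C"}, "places": {"Y"}},
    {"date": date(2004, 1, 12), "persons": {"C", "D"}, "places": set()},
]


def pair_support(stories, min_support=1):
    """Simple association: count stories mentioning both persons."""
    counts = Counter()
    for story in stories:
        # sorted() gives each unordered pair one canonical key
        for p1, p2 in combinations(sorted(story["persons"]), 2):
            counts[(p1, p2)] += 1
    return {k: n for k, n in counts.items() if n >= min_support}


def pair_place_support(stories, min_support=1):
    """Pattern 'Person1 and Person2 at Place': same story, same place."""
    counts = Counter()
    for story in stories:
        for p1, p2 in combinations(sorted(story["persons"]), 2):
            for place in story["places"]:
                counts[(p1, p2, place)] += 1
    return {k: n for k, n in counts.items() if n >= min_support}


def sequential_support(stories, max_days=30):
    """Pattern 'Person1 then Person2 at Place' within max_days.

    Returns {(p1, p2, place): (occurrences, average gap in days)}.
    """
    gaps = defaultdict(list)
    for first in stories:
        for second in stories:
            gap = (second["date"] - first["date"]).days
            if not 0 < gap <= max_days:
                continue  # keep only strictly later stories inside the window
            for place in first["places"] & second["places"]:
                for p1 in first["persons"]:
                    for p2 in second["persons"]:
                        if p1 != p2:
                            gaps[(p1, p2, place)].append(gap)
    return {k: (len(v), sum(v) / len(v)) for k, v in gaps.items()}


if __name__ == "__main__":
    print(pair_support(STORIES, min_support=2))
    print(pair_place_support(STORIES, min_support=2))
    print(sequential_support(STORIES))
```

Adding the place to the pattern narrows the result set, which is the effect the slides describe: on the toy corpus the pair (A, B) co-occurs in three stories, but only two of those also mention place X. Changing the pattern means changing which tuples are counted, while the extracted concepts stay the same.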



Purdue CS 490D - Lecture notes
