UNCC MBAD 6201 - Text Mining

Outline: Text Mining; Definition; Data Mining vs. Text Mining; Disputation; Confusion; Two Foundations; Information Retrieval; Application I; Text Clustering; Application II; Global vs. Local Analysis; Application III; Co-citation; Co-citation Analysis Steps; The Raw Co-citation Frequency Matrix; Result; Artificial Intelligence; Self-Organizing Map (SOM); SOM Application; Folksonomy; Natural Language Processing; WordNet; Glossary; An Example; Reference

Text Mining
Presenter: Hanmei Fan

Definition
• Text mining is also known as Text Data Mining (TDM) and Knowledge Discovery in Textual Databases (KDT). [1]
• A process of identifying novel information from a collection of texts (also known as a corpus). [2]

Data Mining vs. Text Mining
• Data Mining
– Processes data directly
– Identifies causal relationships
– Works on structured numeric transaction data residing in a relational data warehouse
• Text Mining
– Requires linguistic processing or natural language processing (NLP)
– Discovers heretofore unknown information [2]
– Applications deal with much more diverse and eclectic collections of systems and formats [4]

Disputation
• Hearst: non-novel, novel.
– Text mining is not a simple extension of data mining applied to unstructured databases.
– Text mining is the process of mining precious nuggets of ore from a mountain of otherwise worthless rock.
• Kroeze: non-novel, semi-novel, and novel
– Non-novel: data/information retrieval
– Semi-novel: knowledge discovery (standard data mining, metadata mining, and standard text mining)
– Novel: intelligent text mining

Confusion
• Is text mining the same as information extraction? No!
• Information Extraction (IE)
– Extracts facts about pre-specified entities, events, or relationships from unrestricted text sources.
– No novelty: only information that is already present is extracted.

Two Foundations
• Information Retrieval (IR)
• Artificial Intelligence (AI)

Information Retrieval
• The science of searching for
– Information in documents
– Documents themselves
– Metadata which describe documents
– Text, sound, images, or data within databases: relational stand-alone databases or hypertext networked databases such as the Internet or intranets.

Information Retrieval
• Gerard Salton
[Figure: Functional overview of IR]

Application I
• Semi-novel
• Text clustering: group similar documents for further examination -> create thematic overviews of text collections
• Issues:
– The information need is vague.
– Even if a topic were available, the words used to describe it may not be known to the user.
– The words used to describe a topic may not be those used to discuss the topic and may thus fail to appear in articles of interest.
– Even if some words used in discussion of the topic were available, documents may fail to use precisely those words. [5]

Text Clustering

Application II
• Semi-novel
• Automatically generating term associations to aid in query expansion
– Word mismatch
• Clustering
• Global / local analysis [7]

Global vs. Local Analysis
• Global Analysis
– Expensive
– Clustering based on all documents
– Provides a thesaurus-like resource
• Local Analysis
– Cost-efficient
– Clustering based on documents returned from the previous query
– Only provides a small test collection

Application III
• Semi-novel
• Using co-citation analysis to find general topics within a collection or to identify central web pages

Co-citation
• Bibliographic co-citation is a popular similarity measure used to establish a subject similarity between two items.
• E.g. the basic idea of Google’s algorithm
[Diagram with nodes A, B, and C]

Co-citation Analysis Steps
• Selection of the core set of items for the study.
• Retrieval of co-citation frequency information for the core set.
• Compilation of the raw co-citation frequency matrix.
• Correlation analysis to convert the raw frequencies into correlation coefficients.
• Multivariate analysis of the correlation matrix, using principal components analysis, cluster analysis, or multidimensional scaling techniques.
• Interpretation of the resulting "map" and validation.

The Raw Co-citation Frequency Matrix

Result

Artificial Intelligence
• Artificial intelligence (AI) is a branch of computer science and engineering that deals with intelligent behavior, learning, and adaptation in machines.

Self-Organizing Map (SOM)
• One category of neural network models.
• Neighboring cells in a neural network compete in their activities by means of mutual lateral interactions, and develop adaptively into specific detectors of different signal patterns.
• Here, learning is called competitive, unsupervised, or self-organizing. [6]

Self-Organizing Map (SOM)
• An example of how a SOM works:
– Each node has two vectors: an input vector and a weight vector (location).
– Input vector: (Red, Green, Blue)
– Red (255, 0, 0), Green (0, 255, 0), Blue (0, 0, 255)
– Randomize the weight vectors of the nodes in the map.
– Use the Euclidean distance to find the node whose weight vector differs least from the input vector (the Best Matching Unit, or winner).
– Pull the winner's neighbors closer to the input vector.
– Repeat for a large number of cycles.

Self-Organizing Map (SOM)
• In attempting to devise neural network models for linguistic representation, the difficulty is how to find metric distance relations between symbolic items.

SOM Application
• Web Analysis
– Problem: directory-based search engines such as Yahoo! analyze, index, and categorize web content manually.
– Solution:
  • High-precision noun phrase indexing was performed on each page.
  • A vector space model of noun phrases and their associated weights was used to represent each page.
  • All pages were categorized by a SOM clustering program. [4]

Folksonomy
• A folksonomy is an Internet-based information retrieval methodology consisting of collaboratively generated, open-ended labels that categorize content such as Web pages, online photographs, and Web links.
• Example: http://del.icio.us

Folksonomy
• Benefits:
– Lower content categorization costs
– Responds quickly to changes and innovations
– The capacity of its tags to describe the "aboutness" of an Internet resource

Folksonomy
• Lack of standards: polysemy, synonymy
• Meta noise: inaccurate or irrelevant metadata

Natural Language Processing
• Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages.
• Statistical natural language processing uses stochastic, probabilistic and statistical methods to resolve some of the
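
The co-citation analysis steps listed earlier (compile the raw co-citation frequency matrix, then convert the raw frequencies into correlation coefficients) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the citing papers `p1` to `p5`, the core set `A` to `D`, the function names, and the choice to leave the matrix diagonal at zero. The slides do not show their own data or say how the diagonal was treated.

```python
from itertools import combinations
from math import sqrt

# Hypothetical data: each citing paper maps to the core-set items it cites.
citing_papers = {
    "p1": {"A", "B"},
    "p2": {"A", "B", "C"},
    "p3": {"B", "C"},
    "p4": {"A", "B"},
    "p5": {"C", "D"},
}
core_set = ["A", "B", "C", "D"]


def cocitation_matrix(papers, items):
    """Count, for each pair of items, how many papers cite both of them."""
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for cited in papers.values():
        for a, b in combinations(sorted(cited & set(items)), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1  # the matrix is symmetric
    return matrix


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_matrix(raw):
    """Correlate the co-citation profiles (rows) of every pair of items."""
    return [[pearson(row_i, row_j) for row_j in raw] for row_i in raw]


raw = cocitation_matrix(citing_papers, core_set)
corr = correlation_matrix(raw)

for item, row in zip(core_set, raw):
    print(item, row)
```

The correlation matrix `corr` would then be the input to the multivariate step (principal components analysis, cluster analysis, or multidimensional scaling), which is not sketched here.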

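The SOM example from the slides (randomize the weight vectors, find the Best Matching Unit by Euclidean distance, pull neighbors closer, repeat) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the presenter's implementation: the 4x4 grid, the linearly shrinking learning rate and radius, the Gaussian neighborhood, and the function names `train_som` and `best_matching_unit` are all illustrative choices. The colours are scaled from 0-255 to 0-1 for convenience.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is nearest to x (Euclidean)."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(inputs, rows=4, cols=4, cycles=500, seed=0):
    """Train a small rectangular SOM and return {grid position: weight vector}."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Randomize the weight vectors of the nodes in the map.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for step in range(cycles):
        # Assumed schedule: learning rate and radius shrink linearly over time.
        decay = 1.0 - step / cycles
        rate = 0.5 * decay
        radius = (max(rows, cols) / 2) * decay
        x = rng.choice(inputs)
        bmu = best_matching_unit(weights, x)
        # Pull the winner and its grid neighbours closer to the input vector.
        for node, w in weights.items():
            grid_dist = math.dist(node, bmu)
            if grid_dist <= radius:
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (x[i] - w[i])
    return weights


# Red, green, and blue input vectors, scaled from 0-255 to 0-1.
colours = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
som = train_som(colours)

for colour in colours:
    print(colour, "->", best_matching_unit(som, colour))
```

After training, printing the winner for each colour shows which grid cell has come to represent it. The exact positions depend on the seed and on the assumed schedule.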
