UIC CS 583 - CS583-textMining

Outline:
- Chapter 7: Text mining
- Text mining
- Information Retrieval (IR)
- Information Retrieval
- Text Processing
- Stop words
- Stemming
- Basic stemming methods
- Frequency counts
- Vector Space Representation
- Vector Space and Document Similarity
- Query formats
- An Example
- An Example (cont.)
- Relevance judgment for IR
- Precision and Recall
- Precision and Recall (cont)
- Relationship of R and P
- P-R diagram
- Alternative measures
- Web Search as a huge IR system
- Different search engines
- Vector Space Based Document Classification
- Classification in Vector space
- Test doc = Government
- Rocchio Classification Method
- Rocchio Classification
- Naïve Bayesian Classifier
- Naïve Bayesian Classifier (multinomial model)
- k Nearest Neighbor Classification
- Example
- Example: k=6 (6NN)
- Linear classifiers: Binary Classification
- Linear programming / Perceptron
- Linear Classifiers (cont.)
- Which hyperplane?
- Support Vector Machine (SVM)
- Optimal hyperplane
- A Geometrical Interpretation
- SVM formulation: separable case
- Non-separable case: Soft margin SVM
- Illustration: Non-separable case
- Extension to Non-linear Decision surface
- Kernel Trick
- Comments of SVM
- Document clustering
- Summary

Chapter 7: Text mining
UIC - CS 594 Bing Liu

Text mining
- Text mining refers to data mining using text documents as data.
- There are many special techniques for pre-processing text documents to make them suitable for mining.
- Most of these techniques are from the field of “Information Retrieval”.

Information Retrieval (IR)
- Conceptually, information retrieval (IR) is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
- Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit.
- Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
- IR has become a center of focus in the Web era.

Information Retrieval
[Diagram labels: User, Information, Search/select, Info. Needs, Queries, Stored Information]
- Translating info. needs to queries
- Matching queries to stored information
- Query result evaluation: does the information found match the user’s information needs?

Text Processing
- Word (token) extraction
- Stop words
- Stemming
- Frequency counts

Stop words
- Many of the most frequently used words in English are worthless in IR and text mining; these words are called stop words.
  - the, of, and, to, ….
  - Typically about 400 to 500 such words
  - For an application, an additional domain-specific stop word list may be constructed
- Why do we need to remove stop words?
  - Reduce indexing (or data) file size
    - stop words account for 20-30% of total word counts.
  - Improve efficiency
    - stop words are not useful for searching or text mining
    - stop words always have a large number of hits

Stemming
- Techniques used to find out the root/stem of a word, e.g.:
    user, users, used, using -> stem: use
    engineering, engineered, engineer -> stem: engineer
- Usefulness
  - improving effectiveness of IR and text mining
    - matching similar words
  - reducing indexing size
    - combining words with the same roots may reduce indexing size by as much as 40-50%.

Basic stemming methods
- remove ending
  - if a word ends with a consonant other than s, followed by an s, then delete s.
  - if a word ends in es, drop the s.
  - if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
  - if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  - …...
- transform words
  - if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”

Frequency counts
- Count the number of times a word occurred in a document.
- Count the number of documents in a collection that contain a word.
- Use occurrence frequencies to indicate the relative importance of a word in a document.
  - if a word appears often in a document, the document likely “deals with” subjects related to the word.

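The text-processing steps above (token extraction, stop word removal, stemming, frequency counts) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the tiny stop word list, and the order in which the suffix rules are tried are not from the slides, and the elided rules ("…...") are not filled in. Note that these basic rules alone reduce "used" to "us" rather than "use", so a practical stemmer needs more than this.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; the slides mention
# that real lists typically hold about 400 to 500 words).
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}

VOWELS = set("aeiou")


def tokenize(text):
    """Extract lowercase word tokens from raw text."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Apply the basic suffix rules; the first matching rule wins (assumed order)."""
    # Transform rule: "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Word ends in "es": drop the s.
    if word.endswith("es"):
        return word[:-1]
    # Consonant other than s, followed by s: delete the s.
    if (len(word) >= 2 and word.endswith("s")
            and word[-2] not in VOWELS and word[-2] != "s"):
        return word[:-1]
    # Word ends in "ing": delete it unless one letter or "th" would remain.
    if word.endswith("ing"):
        rest = word[:-3]
        return rest if len(rest) > 1 and rest != "th" else word
    # "ed" preceded by a consonant: delete it unless one letter would remain.
    if word.endswith("ed") and len(word) >= 3 and word[-3] not in VOWELS:
        rest = word[:-2]
        if len(rest) > 1:
            return rest
    return word


def term_frequencies(text):
    """Count how often each stemmed, non-stop-word term occurs in one document."""
    return Counter(stem(t) for t in tokenize(text) if t not in STOP_WORDS)


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(term_frequencies(doc)))
    return df
```

For instance, `term_frequencies("The users of the engineering system")` counts the terms `user`, `engineer`, and `system` once each, since "the" and "of" are dropped as stop words.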
Vector Space Representation
- A document is represented as a vector: (W1, W2, … … , Wn)
- Binary:
  - Wi = 1 if the corresponding term i (often a word) is in the document
  - Wi = 0 if the term i is not in the document
- TF (Term Frequency):
  - Wi = tfi, where tfi is the number of times term i occurred in the document
- TF*IDF (Inverse Document Frequency):
  - Wi = tfi*idfi = tfi*log(N/dfi), where dfi is the number of documents that contain term i, and N is the total number of documents in the collection.

Vector Space and Document Similarity
- Each indexing term is a dimension. An indexing term is normally a word.
- Each document is a vector
  - Di = (ti1, ti2, ti3, ti4, ..., tin)
  - Dj = (tj1, tj2, tj3, tj4, ..., tjn)
- Document similarity is defined as

$$\mathrm{Similarity}(D_i, D_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{\sqrt{\sum_{k=1}^{n} t_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} t_{jk}^{2}}}$$

Query formats
- A query is a representation of the user’s information needs
  - Normally a list of words.
- Query as a simple question in natural language
  - The system translates the question into executable queries
- Query as a document
  - “Find similar documents like this one”
  - The system defines what the similarity is

An Example
- A document space is defined by three terms: hardware, software, users
- A set of documents is defined as:
    A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
    A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
    A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
- If the query is “hardware and software”, what documents should be retrieved?

An Example (cont.)
- In Boolean query matching:
  - documents A4, A7 will be retrieved (“AND”)
  - retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
- In similarity matching (cosine): q=(1, 1, 0)
  - S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
  - S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
  - S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
  - Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}

Relevance judgment for IR
- A measurement of the outcome of a search or retrieval
- The judgment on what should or should not be retrieved.
- There is no simple answer to what is relevant and what is not relevant: need human users.
  - difficult to define
  - subjective
  - depending on knowledge, needs, time, etc.
- The central concept of

