ICS 278: Data Mining
Lecture 12: Text Mining

Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Text Mining
•Information Retrieval
•Text Classification
•Text Clustering
•Information Extraction

Text Classification
•Text classification has many applications
  –Spam email detection
  –Automated tagging of streams of news articles, e.g., Google News
  –Automated creation of Web-page taxonomies
•Data Representation
  –"Bag of words" most commonly used: either counts or binary
  –Can also use "phrases" for commonly occurring combinations of words
•Classification Methods
  –Naïve Bayes widely used (e.g., for spam email)
    •Fast and reasonably accurate
  –Support vector machines (SVMs)
    •Typically the most accurate method in research studies
    •But more complex computationally
  –Logistic regression (regularized)
    •Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)

Trimming the Vocabulary
•Stopword removal:
  –Remove "non-content" words
    •very frequent "stop words" such as "the", "and", ...
  –Remove very rare words, e.g., those that occur only a few times in 100k documents
  –Can remove 30% or more of the original unique words
•Stemming:
  –Reduce all variants of a word to a single term
  –E.g., {draw, drawing, drawings} -> "draw"
  –Porter stemming algorithm (1980)
    •relies on a preconstructed suffix list with associated rules
    •e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE: BINARIZATION => BINARIZE
•This still often leaves p ~ O(10^4) terms => a very high-dimensional classification problem! (A small code sketch of these trimming steps follows.)
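A minimal sketch of these trimming steps in plain Python. The stopword list, the min_count threshold, and the single IZATION -> IZE rule are illustrative assumptions; a real system would use a full stopword list and the complete Porter rule set.

from collections import Counter

STOPWORDS = {"the", "and", "a", "of", "to", "in"}    # tiny illustrative stopword list
VOWELS = set("aeiou")

def porter_like_stem(word):
    # One Porter-style rule from the slide: if the suffix is IZATION and the
    # prefix contains a vowel followed by a consonant, replace it with IZE.
    if word.endswith("ization"):
        prefix = word[:-len("ization")]
        if any(a in VOWELS and b not in VOWELS for a, b in zip(prefix, prefix[1:])):
            return prefix + "ize"                    # binarization -> binarize
    return word

def trim_vocabulary(tokenized_docs, min_count=2):
    # Count each term over the whole corpus, drop stopwords and very rare
    # terms, then stem whatever survives.
    counts = Counter(t for doc in tokenized_docs for t in doc)
    keep = {t for t, c in counts.items() if c >= min_count and t not in STOPWORDS}
    return [[porter_like_stem(t) for t in doc if t in keep] for doc in tokenized_docs]

docs = [["the", "binarization", "of", "images"],
        ["binarization", "and", "thresholding", "of", "images"]]
print(trim_vocabulary(docs))    # [['binarize', 'images'], ['binarize', 'images']]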
Classification Issues
•Typically many features, p ~ O(10^4) terms
•Consider n sample points in p dimensions
  –Binary labels => 2^n possible labelings (or dichotomies)
  –A labeling is linearly separable if we can separate the labels with a hyperplane
  –Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable:
      f(n, p) = 1                                   for n <= p + 1
      f(n, p) = (2 / 2^n) Σ_{i=0}^{p} C(n-1, i)     for n > p + 1

Classifying Term Vectors
•Typically multiple different words may be helpful in classifying a particular class, e.g.,
  –Class = "finance"
  –Words = "stocks", "return", "interest", "rate", etc.
  –Thus, classifiers that combine multiple features often do well, e.g.,
    •Naïve Bayes, logistic regression, SVMs
  –Classifiers based on single features (e.g., trees) do less well
•Linear classifiers often perform well in high dimensions
  –In many cases there are fewer documents in the training data than dimensions,
    •i.e., n < p => the training data are linearly separable
  –So again, naïve Bayes, logistic regression, and linear SVMs are all useful
  –The question becomes: which linear discriminant to select?

Probabilistic "Generative" Classifiers
•Model p(x | ck) for each class and perform classification via Bayes' rule:
      c = arg max { p(ck | x) } = arg max { p(x | ck) p(ck) }
•How to model p(x | ck)?
  –p(x | ck) = probability of a "bag of words" x given a class ck
  –Two commonly used approaches (for text):
    •Naïve Bayes: treat each term xj as being conditionally independent, given ck
    •Multinomial: model a document with N words as N tosses of a p-sided die
  –Other models are possible but less common,
    •e.g., model word order by using a Markov chain for p(x | ck)

Naïve Bayes Classifier for Text
•Naïve Bayes classifier = conditional independence model
  –Assumes conditional independence of the terms given the class:
      p(x | ck) = Π_j p(xj | ck)
  –Note that we model each term xj as a discrete random variable
  –Binary terms (Bernoulli):
      p(x | ck) = Π_{j: xj=1} p(xj = 1 | ck) · Π_{j: xj=0} p(xj = 0 | ck)
  –Non-binary terms (counts):
      p(x | ck) = Π_j p(xj = k | ck)
    •Can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p(xj = k | ck) distributions

Multinomial Classifier for Text
•Multinomial classification model
  –Assume that the data are generated by a p-sided die (multinomial model):
      p(x | ck) = p(Nx | ck) · Π_j p(xj | ck)^nj
    where Nx = number of terms (total count) in document x, and nj = number of times term j occurs in the document
  –p(Nx | ck) = probability a document has length Nx, e.g., a Poisson model
    •Can be dropped if thought not to be class dependent
  –Here we have a single random variable for each class, and the p(xj = i | ck) probabilities sum to 1 over i (i.e., a multinomial model)
  –Probabilities are typically only defined and evaluated for i = 1, 2, 3, ...
  –But "zero counts" could also be modeled if desired
    •This would be equivalent to a Naïve Bayes model with a geometric distribution on counts

Comparing Naïve Bayes and Multinomial models
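A minimal sketch contrasting the two models using scikit-learn, assuming a made-up four-document corpus and labels. BernoulliNB sees only which terms occur in a document (the binary Naïve Bayes view), while MultinomialNB also uses how often each term occurs.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus and class labels, purely illustrative.
docs = ["stocks and interest rates rise",
        "interest rate cut boosts stocks",
        "team wins the final game",
        "player scores in the last game"]
labels = ["finance", "finance", "sports", "sports"]

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)             # bag-of-words counts (multinomial view)
X_binary = (X_counts > 0).astype(int)          # term presence/absence (Bernoulli view)

multinomial = MultinomialNB().fit(X_counts, labels)
bernoulli = BernoulliNB().fit(X_binary, labels)

x_new = vec.transform(["interest rates and stocks"])
print(multinomial.predict(x_new))                    # e.g., ['finance']
print(bernoulli.predict((x_new > 0).astype(int)))    # e.g., ['finance']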

