CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 13, 10/6/2011

Text classification
- First: Naïve Bayes. Simple, fast, low training and testing cost.
- Then: K Nearest Neighbor classification. Simple, can easily leverage an inverted index; high variance, non-linear.
- Today: Linear classifiers (a very quick tour), SVMs, some empirical evaluation and comparison, and text-specific issues in classification.

Where we are
- Classification and naïve Bayes: Chapter 13
- Vector space classification: Chapter 14
- Machine learning: Chapter 15

K Nearest Neighbors Classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number of documents i in N that belong to c.
- Estimate P(c|d) as i/k.
- Choose as the class argmax_c P(c|d), i.e. the majority class.
(A short code sketch of this decision rule appears below, after the real estate example.)

Example: k = 6 (6NN)
[Figure: training documents labeled Government, Science, and Arts plotted around a test document. What is P(science | test document)?]

Nearest Neighbor with an Inverted Index
- Naively, finding the nearest neighbors requires a linear search through the |D| documents in the collection.
- But if cosine is the similarity metric, then determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query against a database of training documents.
- So just use standard vector space inverted index methods to find the k nearest neighbors (a sketch of this retrieval-style scoring also appears below).
- What are the caveats to this?

kNN: Discussion
- No feature selection necessary.
- Scales well with a large number of classes: you don't need to train n classifiers for n classes.
- Scores can be hard to convert to probabilities.
- No training necessary. Sort of... you still need to figure out tf-idf, stemming, stop lists, etc. All of that requires tuning, which really is training.

Classes in a Vector Space
[Figure: documents from the Government, Science, and Arts classes plotted in the vector space.]

Test Document = Government
[Figure: the same vector space with a test document falling in the Government region.]
Learning to classify is often viewed as a way to directly or indirectly learn those decision boundaries.

Bias vs. Variance: Choosing the correct model capacity
[Figure slide.]

kNN vs. Naive Bayes
- Bias/variance tradeoff: variance goes with capacity, bias with generalization.
- kNN has high variance and low bias (it effectively has infinite memory).
- NB has low variance and high bias.
- Consider: is an object a tree?
  - Too much capacity/variance, low bias: a botanist who memorizes every tree will always say "no" to a new object (e.g., because it has a different number of leaves).
  - Not enough capacity/variance, high bias: a lazy botanist says "yes" if the object is green.

Linear Classifiers
Methods that attempt to separate data into classes by learning a linear separator in the space representing the objects. Unlike kNN, these methods explicitly seek a generalization (a representation of a separator) in that space. This is not a characterization of the classes, though (à la naïve Bayes); these methods seek to characterize a way to separate the classes.

Example
Suppose you had collected data concerning the relationship between the use of vague adjectives in real estate ads and whether the house subsequently sold for more or less than the asking price (Levitt and Dubner, 2005), and by how much. Consider "cute" or "charming" vs. "stainless" or "granite". You might end up with a table like...
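Before moving on, here is a minimal sketch of the kNN decision rule from the slides above: rank the training documents by cosine similarity to the test document, keep the top k, estimate P(c|d) as the fraction of those neighbors in class c, and return the argmax. The dict-based term-weight vectors and the cosine helper are my own assumptions about representation, not anything prescribed by the lecture.

```python
# kNN classification sketch: P(c|d) estimated as (neighbors in c) / k.
import math
from collections import Counter

def cosine(u, v):
    # u, v: dicts mapping term -> weight
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def knn_classify(test_vec, training, k=6):
    # training: list of (vector, class_label) pairs
    neighbors = sorted(training, key=lambda ex: cosine(test_vec, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)   # i = number of neighbors in class c
    probs = {c: i / k for c, i in votes.items()}       # P(c|d) estimated as i / k
    return max(probs, key=probs.get), probs            # argmax_c P(c|d), i.e. the majority class
```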
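And a rough sketch of the inverted-index shortcut: build postings from the training vectors, then score only the training documents that share at least one term with the test document, term at a time, instead of scanning the whole collection. The length-normalized term-weight dicts are again an assumed representation; with that assumption the accumulated dot products are cosines.

```python
# Inverted-index scoring: the test document is used as a query over the training set.
from collections import defaultdict

def build_index(training):
    # training: list of (doc_id, normalized term->weight dict) pairs
    index = defaultdict(list)                      # term -> postings list of (doc_id, weight)
    for doc_id, vec in training:
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def k_nearest(test_vec, index, k=6):
    scores = defaultdict(float)
    for term, q_weight in test_vec.items():        # only the "query" terms matter
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight  # accumulate cosine contributions
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]                              # note: documents sharing no term are never scored
```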
Classification Example
[Figure: the real estate data plotted.] Clearly, hot properties are not associated with vague adjectives.

Linear Regression Example
[Figure: a fitted line through the data.]

Regression Example
- Definition of a line: y = mx + b, with slope m and intercept b.
- Here: $$$ = w_0 + w_1 * Num_Adjectives = 16550 + (-4900) * Num_Adjectives.
- What if you had more features? In general, y = w_0 + w_1*x_1 + w_2*x_2 + ... + w_n*x_n.

Learning
- How do we learn the weights (the slope and intercept in our case)?
- Search through the space of weights for the values that optimize some goodness metric.
- In this case, the sum of the squared differences between the training examples and the predicted values (a least-squares sketch appears below).

Regression to Classification
Regression maps numbers (features) to numbers, but we're interested in mapping features to discrete categories... Let's think first about the binary case.

Regression to Classification
For the regression case, the line we learned is used to compute a value. But given a set of +/- examples, we could just as easily have searched for a line that best separates the space into two regions, above and below the line. Points above are + and points below are -. If we move beyond 2 dimensions (features), then we have a hyperplane instead of a line.

Regression to Classification
Training in this case is a little different. We're not learning to produce a number; we're trying to best separate the points. That is, the y values are 0/1 (one for each class) and the features are weighted. Find the set of weights that best separates the training examples. The simplest answer is to find a hyperplane that minimizes the number of misclassifications; in the best case, it places all the points of one class on one side (see the perceptron sketch below).

Break
Quiz average was 34.

ML Course at Stanford

Problems
- There may be an infinite number of such separators. Which one should we choose?
- There may be no separator that can perfectly distinguish the 2 classes. What then?
- What do you do if you have more than 2 classes?

Problem 1: Which Hyperplane?
Most methods find a separating hyperplane, but not necessarily an optimal one. E.g., perceptrons, ...
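A minimal sketch of the fit described on the "Regression Example" and "Learning" slides: ordinary least squares for the one-feature line $$$ = w_0 + w_1 * Num_Adjectives, i.e. the intercept and slope that minimize the sum of squared differences between the training examples and the predicted values. The tiny data set here is purely hypothetical (it is not the table from the lecture); it is only there to keep the example runnable.

```python
# Least-squares fit of a one-feature line y = w0 + w1 * x.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept makes the line pass through the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Hypothetical examples: (number of vague adjectives, $ over/under the asking price)
num_adjectives = [0, 1, 2, 3, 4]
price_delta    = [16000, 12000, 6000, 1000, -2500]
w0, w1 = fit_line(num_adjectives, price_delta)
predict = lambda x: w0 + w1 * x   # e.g. predict(2) gives the expected $ difference for 2 vague adjectives
```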
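Finally, a small sketch of one classic way to search for a separating hyperplane, the perceptron mentioned on the last slide: whenever a training example falls on the wrong side of w·x + b = 0, nudge the weights toward it. The +1/-1 labels and plain list feature vectors are assumptions of this sketch; when the data are separable this finds some separating hyperplane, not necessarily an optimal one.

```python
# Perceptron sketch: learn a hyperplane w.x + b = 0 that separates +1 from -1 examples.
def perceptron(examples, epochs=100, lr=1.0):
    # examples: list of (feature_vector, label) pairs with label in {+1, -1}
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                        # misclassified: wrong side of the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                             # training data perfectly separated
            break
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```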