CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 14, 10/11/2011
(CSCI5417-IR, 10/17/11)

Today 10/11
  Finish up classification
  Clustering
    Flat clustering
    Hierarchical clustering

SVM Summary
  Support vector machines (SVMs)
    Choose the hyperplane based on support vectors
    Support vector = "critical" point close to the decision boundary
    Degree-1 SVMs are just fancy linear classifiers
  Best-performing text classifier
    But there are cheaper methods that perform about as well as SVMs, such as logistic regression (MaxEnt)
  Partly popular due to the availability of SVMlight
    SVMlight is accurate and fast, and free (for research)
    Also libSVM, TinySVM, Weka

The Real World
  P. Jackson and I. Moulinier, Natural Language Processing for Online Applications:
    "There is no question concerning the commercial value of being able to classify documents automatically by content. There are myriad potential applications of such a capability for corporate Intranets, government departments, and Internet publishers."
    "Understanding the data is one of the keys to successful categorization, yet this is an area in which most categorization tool vendors are extremely weak. Many of the 'one size fits all' tools on the market have not been tested on a wide range of content types."

The Real World
  Gee, I'm building a text classifier for real now! What should I do?
  How much training data do you have?
    None
    Very little
    Quite a lot
    A huge amount, and it's growing

Manually written rules
  No training data, but adequate domain expertise? Go with hand-written rules:
    If (wheat or grain) and not (whole or bread)
      then categorize as GRAIN
  In practice, rules get a lot bigger than this
    Can also be phrased using tf or tf.idf weights
  With careful crafting (human tuning on development data), performance is high:
    Construe: 94% recall, 84% precision over 675 categories (Hayes and Weinstein 1990)
  But the amount of work required is huge
    Estimate two days per class, plus ongoing maintenance

Very little data?
  If you're just doing supervised classification, you should stick to something with high bias
    There are theoretical results that naïve Bayes should do well in such circumstances (Ng and Jordan 2002, NIPS)
  An interesting research approach is to explore semi-supervised training methods
    Bootstrapping, EM over unlabeled documents, ...
  The practical answer is to get more labeled data as soon as you can
    How can you insert yourself into a process where humans will be willing to label data for you?

A reasonable amount of data?
  Perfect: use an SVM
  But if you are using a supervised ML approach, you should probably be prepared with the "hybrid" solution
    Users like to hack, and management likes to be able to implement quick fixes immediately
    Hackers like regular expressions

A huge amount of data?
  This is great in theory for doing accurate classification...
  But it could easily mean that expensive methods like SVMs (training time) or kNN (testing time) are quite impractical
  Naïve Bayes can come back into its own again!
    As can other methods with linear training/test complexity, such as regularized logistic regression

How many categories?
  A few (well-separated) ones? Easy!
  A zillion closely related ones?
    Library of Congress classifications, MeSH terms, Reuters...
    Quickly gets difficult!
    Evaluation is tricky

How can one tweak performance?
  Aim to exploit any domain-specific useful features that give special meanings or that zone the data
    An author byline, mail headers, titles, zones in texts
  Aim to collapse things that would be treated as different but shouldn't be
    Part numbers, chemical formulas, gene/protein names, dates, etc.

Do "hacks" help?
  You bet! You can get a lot of value by differentially weighting contributions from different document zones:
    Upweighting title words helps (Cohen and Singer 1996)
      Doubling the weight on title words is a good rule of thumb
    Upweighting the first sentence of each paragraph helps (Murata 1999)
    Upweighting sentences that contain title words helps (Ko et al. 2002)

Measuring Classification Figures of Merit
  Not just accuracy; in the real world, there are economic measures. Your choices are:
    Do no classification
      That has a cost (hard to compute)
    Do it all manually
      Has an easy-to-compute cost if you are doing it that way now
    Do it all with an automatic classifier
      Mistakes have a cost
    Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
  Commonly the last method is the most cost-efficient and is the one adopted

A common problem: Concept Drift
  Categories change over time
  Example: "president of the united states"
    1999: "clinton" is a great feature
    2002: "clinton" is a bad feature
  One measure of a text classification system is how well it protects against concept drift
    Can favor simpler models like naïve Bayes
  Feature selection can be bad at protecting against concept drift
What is Clustering?
  Clustering: the process of grouping a set of objects into classes of similar objects
  It is the most common form of unsupervised learning
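The lecture outline lists flat clustering as the first clustering topic; k-means is the standard flat method. As a minimal sketch (not from the slides), here is k-means on toy 2-D points with a naive deterministic initialization; real text clustering would use tf.idf vectors and typically cosine similarity rather than 2-D Euclidean distance:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    # Naive deterministic init: spread initial centroids across the input order.
    # (Real k-means uses random restarts or k-means++ seeding.)
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: group points by nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each non-empty cluster's centroid to its mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return clusters, centroids

# Two well-separated groups of 2-D points.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
clusters, cents = kmeans(pts, k=2)
```

On this toy data the algorithm converges immediately to the two obvious groups; for documents, the same assign/update loop applies with each document a weighted term vector.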
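The outline also lists hierarchical clustering. As a minimal sketch (not from the slides), here is naive single-link agglomerative clustering, which starts with every point in its own cluster and repeatedly merges the closest pair until k clusters remain:

```python
def single_link(points, k):
    """Naive agglomerative clustering sketch. Single-link: the distance
    between two clusters is the minimum pairwise point distance."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist2(a, b)
                               for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        # ...and merge them.
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
result = single_link(pts, k=2)
```

Recording the sequence of merges (rather than stopping at k) yields the full dendrogram; this brute-force version is O(n^3) and is only meant to make the merge loop concrete.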