CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 14, 10/11/2011

Outline:
Today 10/11; SVM Summary; The Real World; Slide 5; Manually written rules; Very little data?; A reasonable amount of data?; A huge amount of data?; How many categories?; How can one tweak performance?; Do "hacks" help?; Measuring Classification Figures of Merit; A common problem: Concept Drift; The Concept Drift Problem; What is Clustering?; Clustering in IR; For improving ad hoc search; For better navigation of search results; Issues for clustering; Clustering Algorithms; Partitioning Algorithms; K-Means; K-Means Algorithm; K-Means Example (K=2); Termination conditions; Efficiency: Medoid as Cluster Representative; Evaluation of clustering; Approaches to evaluating; Anecdotal evaluation; User inspection; Ground truth comparison; External Evaluation of Cluster Quality; Purity example; Utility viewpoint; Misc. Clustering Topics; Term vs. document space; Feature selection; Clustering people; Labeling clusters; How to Label Clusters; Labeling; Hierarchical Clustering; Slide 44; Hierarchical Clustering algorithms; Hierarchical -> Partition; Hierarchical Agglomerative Clustering (HAC); Slide 48; "Closest pair" in Clustering; Single Link Agglomerative Clustering; Single Link Example; Complete Link Agglomerative Clustering; Complete Link Example

Today 10/11
- Finish up classification
- Clustering
  - Flat clustering
  - Hierarchical clustering

01/14/19 CSCI 5417 - IR 2

SVM Summary
- Support vector machines (SVM): choose the hyperplane based on support vectors
  - Support vector = "critical" point close to the decision boundary
  - Degree-1 SVMs are just fancy linear classifiers
- Best-performing text classifier
  - But there are cheaper methods that perform about as well as SVM, such as logistic regression (MaxEnt)
- Partly popular due to the availability of SVMlight
  - SVMlight is accurate and fast, and free for research
  - Also libSVM, tinySVM, Weka

The Real World
P. Jackson and I.
Moulinier, Natural Language Processing for Online Applications:
- "There is no question concerning the commercial value of being able to classify documents automatically by content. There are myriad potential applications of such a capability for corporate Intranets, government departments, and Internet publishers."
- "Understanding the data is one of the keys to successful categorization, yet this is an area in which most categorization tool vendors are extremely weak. Many of the 'one size fits all' tools on the market have not been tested on a wide range of content types."

The Real World
- Gee, I'm building a text classifier for real now! What should I do?
- How much training data do you have?
  - None
  - Very little
  - Quite a lot
  - A huge amount, and it's growing

Manually written rules
- No training data, but adequate domain expertise? Go with hand-written rules:
  - If (wheat or grain) and not (whole or bread) then categorize as grain
- In practice, rules get a lot bigger than this
  - Can also be phrased using tf or tf.idf weights
- With careful crafting (human tuning on development data), performance is high:
  - Construe: 94% recall, 84% precision over 675 categories (Hayes and Weinstein 1990)
- Amount of work required is huge
  - Estimate two days per class, plus ongoing maintenance

Very little data?
- If you're just doing supervised classification, you should stick to something with high bias
  - There are theoretical results that naïve Bayes should do well in such circumstances (Ng and Jordan 2002 NIPS)
- An interesting research approach is to explore semi-supervised training methods
  - Bootstrapping, EM over unlabeled documents, ...
- The practical answer is to get more labeled data as soon as you can
  - How can you insert yourself into a process where humans will be willing to label data for you?

A reasonable amount of data?
- Perfect, use an SVM
- But if you are using a supervised ML approach, you should probably be prepared with the "hybrid"
solution:
  - Users like to hack, and management likes to be able to implement quick fixes immediately
  - Hackers like regular expressions

A huge amount of data?
- This is great in theory for doing accurate classification...
- But it could easily mean that expensive methods like SVMs (training time) or kNN (testing time) are quite impractical
- Naïve Bayes can come back into its own again!
  - Or other methods with linear training/test complexity, like regularized logistic regression

How many categories?
- A few (well-separated) ones? Easy!
- A zillion closely related ones?
  - Library of Congress classifications, MeSH terms, Reuters...
  - Quickly gets difficult!
  - Evaluation is tricky

How can one tweak performance?
- Aim to exploit any domain-specific useful features that give special meanings or that zone the data
  - An author byline, mail headers, titles, zones in texts
- Aim to collapse things that would be treated as different but shouldn't be
  - Part numbers, chemical formulas, gene/protein names, dates, etc.

Do "hacks" help?
- You bet!
- You can get a lot of value by differentially weighting contributions from different document zones:
  - Upweighting title words helps (Cohen & Singer 1996); doubling the weight on title words is a good rule of thumb
  - Upweighting the first sentence of each paragraph helps (Murata, 1999)
  - Upweighting sentences that contain title words helps (Ko et al., 2002)

Measuring Classification Figures of Merit
- Not just accuracy; in the real world, there are economic measures. Your choices are:
  - Do no classification: that has a cost (hard to compute)
  - Do it all manually: has an easy-to-compute cost if you're doing it that way now
  - Do it all with an automatic classifier: mistakes have a cost
  - Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
- Commonly the last method is the most cost-efficient and is adopted

A common problem: Concept
Drift
- Categories change over time
- Example: "president of the united states"
  - 1999: "clinton" is a great feature
  - 2002: "clinton" is a bad feature
- One measure of a text classification system is how well it protects against concept drift
  - Can favor simpler models like Naïve Bayes
  - Feature selection can be bad at protecting against drift
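The drift example above can be turned into a tiny experiment. The sketch below is illustrative only (hypothetical documents and a minimal multinomial naïve Bayes, not code from the lecture): train on 1999-era labeled documents, then evaluate on 2002-era documents, where "clinton" no longer signals the "president" class.

```python
# Toy concept-drift experiment (hypothetical data): a naive Bayes model
# trained on 1999-era documents is evaluated on 2002-era documents, where
# "clinton" has stopped being a reliable feature for the "president" class.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        for t in tokens:
            word_counts[label][t] += 1
            vocab.add(t)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    """Return the most probable class under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / total)          # log prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[c][t] + 1) / denom)  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Hypothetical 1999-era training data: "clinton" marks the president class.
train_1999 = [
    ("clinton white house speech".split(), "president"),
    ("clinton administration policy".split(), "president"),
    ("wheat grain harvest report".split(), "not_president"),
    ("grain export prices".split(), "not_president"),
]
# Hypothetical 2002-era test data: "clinton" now appears outside the class.
test_2002 = [
    ("bush white house speech".split(), "president"),
    ("clinton book tour".split(), "not_president"),
]

model = train_nb(train_1999)
acc = sum(classify(model, toks) == lab
          for toks, lab in test_2002) / len(test_2002)
# The drifted "clinton" document is misclassified, so accuracy here is 0.5.
```

Retraining on fresh labels or favoring a simpler, easily refreshed model (as the slide suggests) is the usual defense; a feature-selection step that locked in "clinton" during 1999 would make the drift worse.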