CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 14, 10/11/2011

Outline:
Today 10/11; SVM Summary; The Real World; Slide 5; Manually written rules; Very little data?; A reasonable amount of data?; A huge amount of data?; How many categories?; How can one tweak performance?; Do "hacks" help?; Measuring Classification Figures of Merit; A common problem: Concept Drift; The Concept Drift Problem; What is Clustering?; Clustering in IR; For improving ad hoc search; For better navigation of search results; Issues for clustering; Clustering Algorithms; Partitioning Algorithms; K-Means; K-Means Algorithm; K-Means Example (K=2); Termination conditions; Efficiency: Medoid as Cluster Representative; Evaluation of clustering; Approaches to evaluating; Anecdotal evaluation; User inspection; Ground truth comparison; External Evaluation of Cluster Quality; Purity example; Utility viewpoint; Misc. Clustering Topics; Term vs. document space; Feature selection; Clustering people; Labeling clusters; How to Label Clusters; Labeling; Hierarchical Clustering; Slide 44; Hierarchical Clustering algorithms; Hierarchical -> Partition; Hierarchical Agglomerative Clustering (HAC); Slide 48; "Closest pair" in Clustering; Single Link Agglomerative Clustering; Single Link Example; Complete Link Agglomerative Clustering; Complete Link Example

Today 10/11
- Finish up classification
- Clustering
  - Flat clustering
  - Hierarchical clustering

01/14/19 CSCI 5417 - IR 2

SVM Summary
- Support vector machines (SVM): choose the hyperplane based on support vectors
  - Support vector = "critical" point close to the decision boundary
  - Degree-1 SVMs are just fancy linear classifiers
- Best-performing text classifier
  - But there are cheaper methods that perform about as well as SVM, such as logistic regression (MaxEnt)
- Partly popular due to the availability of SVMlight
  - SVMlight is accurate and fast, and free for research
  - Also libSVM, tinySVM, Weka

The Real World
P. Jackson and I.
Moulinier, Natural Language Processing for Online Applications:
- "There is no question concerning the commercial value of being able to classify documents automatically by content. There are myriad potential applications of such a capability for corporate Intranets, government departments, and Internet publishers."
- "Understanding the data is one of the keys to successful categorization, yet this is an area in which most categorization tool vendors are extremely weak. Many of the 'one size fits all' tools on the market have not been tested on a wide range of content types."

The Real World
- Gee, I'm building a text classifier for real now! What should I do?
- How much training data do you have?
  - None
  - Very little
  - Quite a lot
  - A huge amount, and it's growing

Manually written rules
- No training data, but adequate domain expertise? Go with hand-written rules:
  - If (wheat or grain) and not (whole or bread) then categorize as grain
- In practice, rules get a lot bigger than this
  - Can also be phrased using tf or tf.idf weights
- With careful crafting (human tuning on development data), performance is high:
  - Construe: 94% recall, 84% precision over 675 categories (Hayes and Weinstein 1990)
- Amount of work required is huge
  - Estimate two days per class, plus ongoing maintenance

Very little data?
- If you're just doing supervised classification, you should stick to something with high bias
  - There are theoretical results that naïve Bayes should do well in such circumstances (Ng and Jordan 2002 NIPS)
- An interesting research approach is to explore semi-supervised training methods
  - Bootstrapping, EM over unlabeled documents, ...
- The practical answer is to get more labeled data as soon as you can
  - How can you insert yourself into a process where humans will be willing to label data for you?

A reasonable amount of data?
- Perfect, use an SVM
- But if you are using a supervised ML approach, you should probably be prepared with the "hybrid"
solution:
  - Users like to hack, and management likes to be able to implement quick fixes immediately
  - Hackers like regular expressions

A huge amount of data?
- This is great in theory for doing accurate classification...
- But it could easily mean that expensive methods like SVMs (training time) or kNN (testing time) are quite impractical
- Naïve Bayes can come back into its own again!
  - Or other methods with linear training/test complexity, like regularized logistic regression

How many categories?
- A few (well-separated) ones? Easy!
- A zillion closely related ones?
  - Library of Congress classifications, MeSH terms, Reuters...
  - Quickly gets difficult!
  - Evaluation is tricky

How can one tweak performance?
- Aim to exploit any domain-specific useful features that give special meanings or that zone the data
  - An author byline, mail headers, titles, zones in texts
- Aim to collapse things that would be treated as different but shouldn't be
  - Part numbers, chemical formulas, gene/protein names, dates, etc.

Do "hacks" help?
- You bet!
- You can get a lot of value by differentially weighting contributions from different document zones:
  - Upweighting title words helps (Cohen & Singer 1996); doubling the weight on title words is a good rule of thumb
  - Upweighting the first sentence of each paragraph helps (Murata, 1999)
  - Upweighting sentences that contain title words helps (Ko et al., 2002)

Measuring Classification Figures of Merit
- Not just accuracy; in the real world, there are economic measures. Your choices are:
  - Do no classification: that has a cost (hard to compute)
  - Do it all manually: has an easy-to-compute cost if you're doing it that way now
  - Do it all with an automatic classifier: mistakes have a cost
  - Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
- Commonly the last method is the most cost-efficient and is adopted

A common problem: Concept
Drift
- Categories change over time
- Example: "president of the united states"
  - 1999: "clinton" is a great feature
  - 2002: "clinton" is a bad feature
- One measure of a text classification system is how well it protects against concept drift
  - Can favor simpler models like Naïve Bayes
  - Feature selection can be bad at protecting against drift
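The drift example above can be turned into a tiny experiment. The sketch below is illustrative only (hypothetical documents and a minimal multinomial naïve Bayes, not code from the lecture): train on 1999-era labeled documents, then evaluate on 2002-era documents, where "clinton" no longer signals the "president" class.

```python
# Toy concept-drift experiment (hypothetical data): a naive Bayes model
# trained on 1999-era documents is evaluated on 2002-era documents, where
# "clinton" has stopped being a reliable feature for the "president" class.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        for t in tokens:
            word_counts[label][t] += 1
            vocab.add(t)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    """Return the most probable class under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / total)          # log prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[c][t] + 1) / denom)  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Hypothetical 1999-era training data: "clinton" marks the president class.
train_1999 = [
    ("clinton white house speech".split(), "president"),
    ("clinton administration policy".split(), "president"),
    ("wheat grain harvest report".split(), "not_president"),
    ("grain export prices".split(), "not_president"),
]
# Hypothetical 2002-era test data: "clinton" now appears outside the class.
test_2002 = [
    ("bush white house speech".split(), "president"),
    ("clinton book tour".split(), "not_president"),
]

model = train_nb(train_1999)
acc = sum(classify(model, toks) == lab
          for toks, lab in test_2002) / len(test_2002)
# The drifted "clinton" document is misclassified, so accuracy here is 0.5.
```

Retraining on fresh labels or favoring a simpler, easily refreshed model (as the slide suggests) is the usual defense; a feature-selection step that locked in "clinton" during 1999 would make the drift worse.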