CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 14, 10/11/2011
(CSCI5417-IR, 10/17/11)

Today 10/11
  Finish up classification
  Clustering
    Flat clustering
    Hierarchical clustering

SVM Summary
  Support vector machines (SVMs)
    Choose the hyperplane based on support vectors
    Support vector = "critical" point close to the decision boundary
    Degree-1 SVMs are just fancy linear classifiers
  Best-performing text classifier
    But there are cheaper methods that perform about as well as SVMs, such as logistic regression (MaxEnt)
  Partly popular due to the availability of SVMlight
    SVMlight is accurate and fast, and free (for research)
    Also libSVM, TinySVM, Weka

The Real World
  P. Jackson and I. Moulinier, Natural Language Processing for Online Applications:
    "There is no question concerning the commercial value of being able to classify documents automatically by content. There are myriad potential applications of such a capability for corporate Intranets, government departments, and Internet publishers."
    "Understanding the data is one of the keys to successful categorization, yet this is an area in which most categorization tool vendors are extremely weak. Many of the 'one size fits all' tools on the market have not been tested on a wide range of content types."

The Real World
  Gee, I'm building a text classifier for real now! What should I do?
  How much training data do you have?
    None
    Very little
    Quite a lot
    A huge amount, and it's growing

Manually written rules
  No training data, but adequate domain expertise? Go with hand-written rules:
    If (wheat or grain) and not (whole or bread)
      then categorize as GRAIN
  In practice, rules get a lot bigger than this
    Can also be phrased using tf or tf.idf weights
  With careful crafting (human tuning on development data), performance is high:
    Construe: 94% recall, 84% precision over 675 categories (Hayes and Weinstein 1990)
  But the amount of work required is huge
    Estimate two days per class, plus ongoing maintenance

Very little data?
  If you're just doing supervised classification, you should stick to something with high bias
    There are theoretical results that naïve Bayes should do well in such circumstances (Ng and Jordan 2002, NIPS)
  An interesting research approach is to explore semi-supervised training methods
    Bootstrapping, EM over unlabeled documents, ...
  The practical answer is to get more labeled data as soon as you can
    How can you insert yourself into a process where humans will be willing to label data for you?

A reasonable amount of data?
  Perfect: use an SVM
  But if you are using a supervised ML approach, you should probably be prepared with the "hybrid" solution
    Users like to hack, and management likes to be able to implement quick fixes immediately
    Hackers like regular expressions

A huge amount of data?
  This is great in theory for doing accurate classification...
  But it could easily mean that expensive methods like SVMs (training time) or kNN (testing time) are quite impractical
  Naïve Bayes can come back into its own again!
    As can other methods with linear training/test complexity, such as regularized logistic regression

How many categories?
  A few (well-separated) ones? Easy!
  A zillion closely related ones?
    Library of Congress classifications, MeSH terms, Reuters...
    Quickly gets difficult!
    Evaluation is tricky

How can one tweak performance?
  Aim to exploit any domain-specific useful features that give special meanings or that zone the data
    An author byline, mail headers, titles, zones in texts
  Aim to collapse things that would be treated as different but shouldn't be
    Part numbers, chemical formulas, gene/protein names, dates, etc.

Do "hacks" help?
  You bet! You can get a lot of value by differentially weighting contributions from different document zones:
    Upweighting title words helps (Cohen and Singer 1996)
      Doubling the weight on title words is a good rule of thumb
    Upweighting the first sentence of each paragraph helps (Murata 1999)
    Upweighting sentences that contain title words helps (Ko et al. 2002)

Measuring Classification Figures of Merit
  Not just accuracy; in the real world, there are economic measures. Your choices are:
    Do no classification
      That has a cost (hard to compute)
    Do it all manually
      Has an easy-to-compute cost if you are doing it that way now
    Do it all with an automatic classifier
      Mistakes have a cost
    Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
  Commonly the last method is the most cost-efficient and is the one adopted

A common problem: Concept Drift
  Categories change over time
  Example: "president of the united states"
    1999: "clinton" is a great feature
    2002: "clinton" is a bad feature
  One measure of a text classification system is how well it protects against concept drift
    Can favor simpler models like naïve Bayes
  Feature selection can be bad at protecting against concept drift
What is Clustering?
  Clustering: the process of grouping a set of objects into classes of similar objects
  It is the most common form of unsupervised learning
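The lecture outline lists flat clustering as the first clustering topic; k-means is the standard flat method. As a minimal sketch (not from the slides), here is k-means on toy 2-D points with a naive deterministic initialization; real text clustering would use tf.idf vectors and typically cosine similarity rather than 2-D Euclidean distance:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    # Naive deterministic init: spread initial centroids across the input order.
    # (Real k-means uses random restarts or k-means++ seeding.)
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: group points by nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each non-empty cluster's centroid to its mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return clusters, centroids

# Two well-separated groups of 2-D points.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
clusters, cents = kmeans(pts, k=2)
```

On this toy data the algorithm converges immediately to the two obvious groups; for documents, the same assign/update loop applies with each document a weighted term vector.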
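The outline also lists hierarchical clustering. As a minimal sketch (not from the slides), here is naive single-link agglomerative clustering, which starts with every point in its own cluster and repeatedly merges the closest pair until k clusters remain:

```python
def single_link(points, k):
    """Naive agglomerative clustering sketch. Single-link: the distance
    between two clusters is the minimum pairwise point distance."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist2(a, b)
                               for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        # ...and merge them.
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
result = single_link(pts, k=2)
```

Recording the sequence of merges (rather than stopping at k) yields the full dendrogram; this brute-force version is O(n^3) and is only meant to make the merge loop concrete.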