CMU CS 10601 - Recitation

Oznur Tastan
10601 Machine Learning
Recitation 3
Sep 16 2009

Slide titles:
Outline
Text classification
Text classification spam filtering
Text Classification: Examples
Representing text for classification
Representing text: a list of words
‘Bag of words’ representation of text
Bag of words representation
Bag of words
Multinomial distribution
Multinomial distribution and bag of words
Conjugate distribution
Dirichlet distribution
Pseudo Count and prior
Generative model
Polynomial Curve Fitting
Sum-of-Squares Error Function
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Which of the predicted curves is better?
What do we really want?
Example
General strategy
Test set method
How good is the prediction?
Train test set split
More data is better
Train/test set split
Cross Validation
LOOCV (Leave-one-out Cross Validation)
K-fold cross validation
Model Selection
References

Outline
• A text classification example
  - Multinomial distribution
  - Dirichlet distribution
• Model selection
  - Miro will be continuing in that topic

Text classification example

Text classification
• We are not into classification yet.
• For the sake of example, I’ll briefly go over what it is.
Classification task: You have an input x, and you decide which label y it has from some fixed set of labels y1,...,yk.

Text classification: spam filtering
Input: document D
Output: the predicted class y from {y1,...,yk}
Spam filtering: Classify email as ‘Spam’, ‘Other’.
P(Y = spam | X)

Text classification
Input: document x
Output: the predicted class y, where y is from {y1,...,yk}
What other text classification applications can you think of?
Text classification examples:
• Classify email as ‘Spam’, ‘Other’.
• Classify web pages as ‘Student’, ‘Faculty’, ‘Other’.
• Classify news stories into topics: ‘Sports’, ‘Politics’, ...
• Classify business names by industry.
• Classify movie reviews as ‘Favorable’, ‘Unfavorable’, ‘Neutral’.
… and many more.

Text Classification: Examples
Classify shipment articles into one of 93 categories. An example category: ‘wheat’.

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
Maize Mar 48.0, total 48.0 (nil).
Sorghum nil (nil)
Oilseed export registrations were:
Sunflowerseed total 15.0 (7.9)
Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for sub-products, as follows....

Representing text for classification
How would you represent the document?

Representing text: a list of words
argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …
Common refinements: removing stopwords, stemming, collapsing multiple occurrences of a word into one, ….

‘Bag of words’ representation of text
word          frequency
grain(s)      3
oilseed(s)    2
total         3
wheat         1
maize         1
soybean       1
tonnes        1
...           ...
Bag of words representation: Represent text as a vector of word frequencies.

Bag of words representation
A collection of documents is represented as a matrix with one row per document i and one column per word j:
Frequency(i, j) = frequency of word j in document i

Bag of words
What simplifying assumption are we making?
We assumed word order is not important.
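The bag-of-words representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the recitation's own code: the function names `tokenize`, `bag_of_words`, and `frequency_matrix` are hypothetical, the tokenizer is a simple assumption (lowercase, split on non-alphanumeric characters), and no stemming or stopword removal is applied, so ‘grain’ and ‘grains’ are counted separately here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Map each word to its frequency in the text; word order is discarded."""
    return Counter(tokenize(text))


def frequency_matrix(documents):
    """Build the document-by-word matrix: matrix[i][j] = frequency of word j in document i."""
    bags = [bag_of_words(doc) for doc in documents]
    # The vocabulary is every distinct word across the collection, in sorted order.
    vocabulary = sorted(set().union(*bags))
    # A Counter returns 0 for words that are absent from a document.
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix


documents = [
    "Argentine grain board figures show crop registrations of grains",
    "Oilseed export registrations were: sunflowerseed total, soybean total",
]
vocabulary, matrix = frequency_matrix(documents)
for word, column in zip(vocabulary, zip(*matrix)):
    print(word, column)
```

Because only counts are kept, two documents containing the same words in a different order get identical vectors, which is exactly the simplifying assumption stated above.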

