
CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 17
10/25/2011

Today
• Finish topic model intro
• Start on web search

What if?
• What if we just have the documents but no class assignments?
• But assume we do have knowledge about the number of classes involved
• Can we still use probabilistic models? In particular, can we use naïve Bayes?
• Yes, via EM (Expectation Maximization)

EM
1. Given some model, like NB, make up some class assignments randomly.
2. Use those assignments to generate the model parameters P(class) and P(word|class).
3. Use those model parameters to re-classify the training data.
4. Go to 2.

Naïve Bayes Example (EM)
Doc   Category
D1    ?
D2    ?
D3    ?
D4    ?
D5    ?

Naïve Bayes Example (EM)
Doc   Category
D1    Sports
D2    Politics
D3    Sports
D4    Politics
D5    Sports

Doc                       Category
{China, soccer}           Sports
{Japan, baseball}         Politics
{baseball, trade}         Sports
{China, trade}            Politics
{Japan, Japan, exports}   Sports

Sports (.6)        Politics (.4)
baseball  2/13     baseball  2/10
China     2/13     China     2/10
exports   2/13     exports   1/10
Japan     3/13     Japan     2/10
soccer    2/13     soccer    1/10
trade     2/13     trade     2/10

Naïve Bayes Example (EM)
• Use these counts to reassess the class membership for D1 to D5. Reassign them to new classes. Recompute the tables and priors.
• Repeat until happy

Topics
Doc                       Category
{China, soccer}           Sports
{Japan, baseball}         Sports
{baseball, trade}         Sports
{China, trade}            Politics
{Japan, Japan, exports}   Politics
What’s the deal with trade?

Topics
Doc                             Category
{China_1, soccer_2}             Sports
{Japan_1, baseball_2}           Sports
{baseball_2, trade_2}           Sports
{China_1, trade_1}              Politics
{Japan_1, Japan_1, exports_1}   Politics
{basketball_2, strike_3}

Topics
• So let’s propose that instead of assigning documents to classes, we assign each word token in each document to a class (topic).
• Then we can get some new probabilities to associate with words, topics and documents
  • Distribution of topics in a doc
  • Distribution of topics overall
  • Association of words with topics

11/11/11 CSCI 5417 - IR 10

Topics
• Example. A document like {basketball_2, strike_3} can be said to be .5 about topic 2, .5 about topic 3, and 0 about the rest of the possible topics (we may want to worry about smoothing later).
• For a collection as a whole we can get a topic distribution (prior) by counting the word tokens tagged with a particular topic and dividing by the total number of tagged tokens.

Problem
• With “normal” text classification the training data associates a document with one or more topics.
• Now we need to associate topics with the (content) words in each document
• This is a semantic tagging task, not unlike part-of-speech tagging and word-sense tagging
• It’s hard, slow and expensive to do right

Topic modeling
• Do it without the human tagging
  • Given a set of documents
  • And a fixed number of topics (given)
  • Find the statistics that we need

Graphical Models Notation: Take 2
[Figure: naïve Bayes drawn two ways. Plate form: Category and w_i, with a plate labeled n. Unrolled form: Category and w_1, w_2, w_3, w_4, …, w_n.]

Unsupervised NB
• Now suppose that Cat isn’t observed
  • That is, we don’t have category labels for each document
• Then we need to learn two distributions:
  • P(Cat)
  • P(w|Cat)
• How do we do this?
  • We might use EM
  • Alternative: Bayesian methods
[Figure: plate diagram with nodes Category and w_i, plate n.]

Bayesian document categorization
[Figure: plate diagram. Labels: Cat, w_1, n_D, D, θ, α, φ, β. Annotations: priors, P(w|Cat), P(Cat).]

Latent Dirichlet Allocation: Topic Models
(Blei, Ng, & Jordan, 2001; 2003)
• $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$: distribution over topics for each document
• $z_i \sim \mathrm{Discrete}(\theta^{(d)})$: topic assignment for each word
• $\phi^{(j)} \sim \mathrm{Dirichlet}(\beta)$: distribution over words for each topic
• $w_i \sim \mathrm{Discrete}(\phi^{(z_i)})$: word generated from assigned topic
• $\alpha$, $\beta$: Dirichlet priors
[Figure: plate diagram over α, θ^(d), z_i, w_i, φ^(j), β with plates N_d, D and T.]

Given That
• What could you do with it?
  • Browsing/exploring a collection and individual documents is the basic task

Visualize the topics

Visualize documents

Break

Brief History of Web Search
• Early keyword-based engines
  • Altavista, Excite, Infoseek, Inktomi, Lycos ca. 1995-1997
• Sponsored search ranking:
  • WWWW (Colorado/McBryan) -> Goto.com (morphed into Overture.com → Yahoo! ???)
  • Your search ranking depended on how much you paid
  • Auction for keywords: casino was an expensive keyword!

Brief history
• 1998+: Link-based ranking introduced by Google
  • Perception was that it represented a fundamental improvement over existing systems
  • Great user experience in search of a business model
• Meanwhile Goto/Overture’s annual revenues were nearing $1 billion
• Google adds paid-placement “ads” to the side, distinct from search results
• 2003: Yahoo follows suit
  • acquires Overture (for paid placement)
  • and Inktomi (for search)

Web search basics
[Figure: diagram with surviving labels “The Web” and “Ad indexes”, alongside a results page for the query “miele”:]

Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

Miele, Inc -- Anything else is a compromise
At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ...
www.miele.com/ - 20k - Cached - Similar pages

Miele
Welcome to Miele, the home of the very best appliances and kitchens in the world.
www.miele.co.uk/ - 3k - Cached - Similar pages

Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this page ]
Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit ...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes.
www.miele.de/ - 10k - Cached - Similar pages

Herzlich willkommen bei Miele Österreich - [ Translate this page ]
Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ...
www.miele.at/ - 3k - Cached - Similar pages

Sponsored Links

CG Appliance Express
Discount Appliances (650) 756-3931
Same Day Certified Installation
www.cgappliance.com
San Francisco-Oakland-San Jose, CA

Miele Vacuum Cleaners
Miele Vacuums- Complete Selection
Free Shipping!
www.vacuums.com
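
Going back to the EM slides from the first half of the lecture: the four-step loop (make up assignments, estimate P(class) and P(word|class), re-classify, repeat) can be sketched in a few lines of Python. This is only a sketch under some assumptions. It uses hard assignments (each document goes to its single best class) rather than fractional ones, and it uses add-one smoothing for P(word|class), which is what the 13 and 10 denominators in the example table appear to reflect (7 Sports tokens + 6 vocabulary words, 4 Politics tokens + 6). The function names estimate, classify and hard_em are made up for illustration.

```python
import math
import random
from collections import Counter

# The five toy documents from the naive Bayes example slides.
DOCS = [
    ["China", "soccer"],
    ["Japan", "baseball"],
    ["baseball", "trade"],
    ["China", "trade"],
    ["Japan", "Japan", "exports"],
]
CLASSES = ["Sports", "Politics"]
VOCAB = sorted({w for d in DOCS for w in d})


def estimate(docs, labels):
    """Step 2: turn class assignments into P(class) and P(word|class)."""
    priors, word_probs = {}, {}
    for c in CLASSES:
        members = [d for d, lab in zip(docs, labels) if lab == c]
        priors[c] = len(members) / len(docs)
        counts = Counter(w for d in members for w in d)
        # Assumption: add-one smoothing over the vocabulary.
        total = sum(counts.values()) + len(VOCAB)
        word_probs[c] = {w: (counts[w] + 1) / total for w in VOCAB}
    return priors, word_probs


def classify(doc, priors, word_probs):
    """Step 3: pick the class with the highest log P(class) + sum log P(w|class)."""
    def score(c):
        if priors[c] == 0:  # a class with no documents cannot win
            return float("-inf")
        return math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
    return max(CLASSES, key=score)


def hard_em(docs, labels, max_iters=20):
    """Steps 2-4: re-estimate and re-classify until the labels stop changing."""
    for _ in range(max_iters):
        priors, word_probs = estimate(docs, labels)
        new_labels = [classify(d, priors, word_probs) for d in docs]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, estimate(docs, labels)


if __name__ == "__main__":
    # Step 1: make up some class assignments randomly.
    start = [random.choice(CLASSES) for _ in DOCS]
    final, (priors, word_probs) = hard_em(DOCS, start)
    print(start, "->", final)
    print(priors)
```

Calling estimate(DOCS, ["Sports", "Politics", "Sports", "Politics", "Sports"]) gives the priors and word probabilities shown in the example table. Different random starting assignments can settle on different final labelings, which is one reason "repeat until happy" is not the whole story.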

