Data-Intensive Information Processing Applications
Session 4: Text Retrieval Algorithms
Jimmy Lin, University of Maryland
Tuesday, February 23, 2010

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us for details.
(Slide image source: Wikipedia, "Japanese rock garden")

Today's Agenda
- Introduction to information retrieval
- Basics of indexing and retrieval
- Inverted indexing in MapReduce
- Retrieval at scale

First, Nomenclature
- Information retrieval (IR)
  - Focus on textual information (i.e., text/document retrieval)
  - Other possibilities include image, video, music, ...
- What do we search?
  - Generically, "collections"; less frequently used, "corpora"
- What do we find?
  - Generically, "documents", even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.

The Information Retrieval Cycle
Source selection (resource) -> query formulation (query) -> search (results) -> selection (documents) -> examination (information) -> delivery. Along the way: system discovery, vocabulary discovery, concept discovery, document discovery, and source reselection.

The Central Problem in Search
The searcher expresses concepts as query terms ("tragic love story"); the author expresses concepts as document terms ("fateful, star-crossed romance"). Do these represent the same concepts?

Abstract IR Architecture
Queries (online) and documents (offline) each pass through a representation function, yielding a query representation and document representations. Document representations are stored in an index; a comparison function matches the query representation against the index to produce hits.

How do we represent text?
- Remember: computers don't "understand" anything!
- Bag of words
  - Treat all the words in a document as index terms
  - Assign a "weight" to each term based on importance (or, in the simplest case, presence/absence of the word)
  - Disregard order, structure, meaning, etc. of the words
  - Simple, yet effective!
- Assumptions
  - Term occurrence is independent
  - Document relevance is independent
  - "Words" are well defined

What's a word?
(The slide's tokenization examples survive only in fragments, e.g. "1982" and "2005-06-25".)

Sample Document
"McDonald's slims down spuds": fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
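The bag-of-words representation described above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the crude lowercase-and-split tokenizer is an assumption.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Crude tokenizer (an assumption): lowercase, then keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Order, structure, and meaning are discarded; only term counts remain.
    return Counter(tokens)

counts = bag_of_words("one fish, two fish")
print(counts["fish"])  # 2
```

In the simplest presence/absence variant, each weight would just be clamped to 1.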
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday, as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

Bag of Words (for the document above)
14 McDonald's; 12 fat; 11 fries; 8 new; 7 french; 6 company, said, nutrition; 5 food, oil, percent, reduce, taste, Tuesday

Counting Words
Documents -> (case folding, tokenization, stopword removal, stemming) -> Bag of Words -> Inverted Index. (Syntax, semantics, word knowledge, etc.)

Boolean Retrieval
- Users express queries as a Boolean expression
  - AND, OR, NOT
  - Can be arbitrarily nested
- Retrieval is based on the notion of sets
  - Any given query divides the collection into two sets: retrieved and not retrieved
  - Pure Boolean systems do not define an ordering of the results

Inverted Index: Boolean Retrieval
Doc 1: "one fish, two fish"; Doc 2: "red fish, blue fish"; Doc 3: "cat in the hat"; Doc 4: "green eggs and ham"

  blue  -> 2
  cat   -> 3
  egg   -> 4
  fish  -> 1, 2
  green -> 4
  ham   -> 4
  hat   -> 3
  one   -> 1
  red   -> 2
  two   -> 1

Boolean Retrieval
To execute a Boolean query:
- Build the query syntax tree, e.g. (blue AND fish) OR ham
- For each clause, look up its postings.
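A minimal sketch of this process, building the postings lists and evaluating (blue AND fish) OR ham with set operations. This is illustrative Python, not the course's Hadoop implementation; it skips the stopword removal and stemming described above (so "eggs" would not collapse to "egg"), and the doc-id scheme is an assumption.

```python
def build_index(docs):
    """Map each term to a sorted postings list of document ids."""
    index = {}
    for doc_id, text in sorted(docs.items()):
        # set() deduplicates terms within a document (Boolean: presence only)
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

docs = {1: "one fish two fish", 2: "red fish blue fish",
        3: "cat in the hat", 4: "green eggs and ham"}
index = build_index(docs)

# (blue AND fish) OR ham, as set intersection and union over postings
hits = (set(index["blue"]) & set(index["fish"])) | set(index["ham"])
print(sorted(hits))  # [2, 4]
```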
- Postings for this query: blue -> 2; fish -> 1, 2; ham -> 4
- Traverse the postings and apply the Boolean operators
- Efficiency analysis
  - Postings traversal is linear (assuming sorted postings)
  - Start with the shortest postings list first

Strengths and Weaknesses
- Strengths
  - Precise, if you know the right strategies
  - Precise, if you have an idea of what you're looking for
  - Implementations are fast and efficient
- Weaknesses
  - Users must learn Boolean logic
  - Boolean logic is insufficient to capture the richness of language
  - No control over the size of the result set: either too many hits or none
  - When do you stop reading? All documents in the result set are considered "equally good"
  - What about partial matches? Documents that don't quite match the query may be useful also

Ranked Retrieval
- Order documents by how likely they are to be relevant to the information need
  - Estimate relevance(q, d_i)
  - Sort documents by relevance
  - Display sorted results
- User model
  - Present hits one screen at a time, best results first
  - At any point, users can decide to stop looking
- How do we estimate relevance?
  - Assume a document is relevant if it has a lot of query terms
  - Replace relevance(q, d_i) with sim(q, d_i)
  - Compute the similarity of vector representations

Vector Space Model
(Diagram: documents d1 ... d5 as vectors in a space spanned by terms t1, t2, t3.)
Assumption: documents that are "close together" in vector space "talk about" the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").

Similarity Metric
Use the "angle" between the vectors:

  cos(theta) = (d_j . d_k) / (|d_j| |d_k|)

  sim(d_j, d_k) = (d_j . d_k) / (|d_j| |d_k|)
                = sum_{i=1..n} w_{i,j} w_{i,k} / ( sqrt(sum_{i=1..n} w_{i,j}^2) * sqrt(sum_{i=1..n} w_{i,k}^2) )

Or, more generally, inner products:

  sim(d_j, d_k) = d_j . d_k = sum_{i=1..n} w_{i,j} w_{i,k}

Term Weighting
- Term weights consist of two components
  - Local: how important is the term in this document?
  - Global: how important is the term in the collection?
- Here's the intuition
  - Terms that appear often in a document should get high weights
  - Terms that appear in many documents should get low weights
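The cosine similarity formula above translates almost directly into code. A hedged sketch: representing each document vector as a term -> weight dict is an implementation choice, not something from the slides.

```python
import math

def cosine(d_j, d_k):
    """sim(d_j, d_k) = sum_i w_ij * w_ik / (|d_j| * |d_k|)."""
    # Numerator: inner product over terms present in d_j
    dot = sum(w * d_k.get(term, 0.0) for term, w in d_j.items())
    # Denominator: Euclidean norms of the two weight vectors
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    norm_k = math.sqrt(sum(w * w for w in d_k.values()))
    if norm_j == 0.0 or norm_k == 0.0:
        return 0.0  # guard: an empty vector has no direction
    return dot / (norm_j * norm_k)

# Identical vectors point in the same direction; disjoint ones are orthogonal.
print(round(cosine({"fish": 2.0, "one": 1.0}, {"fish": 2.0, "one": 1.0}), 6))  # 1.0
print(cosine({"cat": 1.0}, {"ham": 1.0}))  # 0.0
```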
- How do we capture this mathematically?
  - Term frequency (local)
  - Inverse document frequency (global)

TF.IDF Term Weighting

  w_{i,j} = tf_{i,j} * log(N / n_i)

where
  w_{i,j}  = weight assigned to term i in document j
  tf_{i,j} = number of occurrences of term i in document j
  N        = number of documents in the entire collection
  n_i      = number of documents with term i

Inverted Index: TF.IDF
Doc 1: "one fish, two fish"; Doc 2: "red fish, blue fish"; Doc 3: "cat in the hat"; Doc 4: "green eggs and ham"

  tf: blue -> doc 2 (tf 1); cat -> doc 3 (tf 1); egg -> ...
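The weighting formula above can be sketched over the toy collection. Illustrative Python only; log base 10 is an assumption, since the slide's formula does not specify a base, and the whitespace tokenizer again skips stopword removal and stemming.

```python
import math
from collections import Counter

def tf_idf_index(docs):
    """Compute w_ij = tf_ij * log(N / n_i) for every term/document pair."""
    n = len(docs)  # N: number of documents in the collection
    tfs = {doc_id: Counter(text.lower().split())
           for doc_id, text in docs.items()}
    # n_i: number of documents containing term i
    df = Counter(term for counts in tfs.values() for term in counts)
    return {doc_id: {term: tf * math.log10(n / df[term])
                     for term, tf in counts.items()}
            for doc_id, counts in tfs.items()}

docs = {1: "one fish two fish", 2: "red fish blue fish",
        3: "cat in the hat", 4: "green eggs and ham"}
weights = tf_idf_index(docs)
# "fish" occurs twice in Doc 1 and appears in 2 of 4 documents:
print(round(weights[1]["fish"], 4))  # 2 * log10(4/2) = 0.6021
```

Note how the global component works: a term like "fish" that appears in many documents is pushed toward zero, while terms unique to one document keep the full log10(4/1) factor.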
