
CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 9, 9/20/2011

Today 9/20
- Where we are
- MapReduce/Hadoop
- Probabilistic IR
  - Language models
  - LM for ad hoc retrieval

Where we are...
- Basics of ad hoc retrieval
  - Indexing
  - Term weighting/scoring
    - Cosine
  - Evaluation
- Document classification
- Clustering
- Information extraction
- Sentiment/Opinion mining

But First: Back to Distributed Indexing
[Diagram: a Master assigns input splits to Parsers; each Parser writes its output partitioned into a-f, g-p, and q-z segments; the Master then assigns each partition range to an Inverter, which writes the postings for that range.]

Huh?
- That was supposed to be an explanation of MapReduce (Hadoop)...
- Maybe not so much...
- Here's another try

MapReduce
- MapReduce is a distributed programming framework intended to facilitate applications that are
  - Data intensive
  - Parallelizable in a certain sense
  - In a commodity-cluster environment
- MapReduce is the original internal Google model
- Hadoop is the open source version

Inspirations
- MapReduce elegantly and efficiently combines inspirations from a variety of sources, including
  - Functional programming
  - Key/value association lists
  - Unix pipes

Functional Programming
- The focus is on side-effect-free specifications of input/output mappings
- There are various idioms, but map and reduce are two central ones...
- Mapping refers to applying an identical function to each element of a list and constructing a list of the outputs
- Reducing refers to receiving the elements of a list and aggregating them according to some function

Python Map/Reduce
- Say you wanted to compute the simple sum of squares of a list of numbers, $\sum_{i=0}^{n} w_i^2$ (Python 2 shown):

    >>> z
    [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> z2 = map(lambda x: x**2, z)
    >>> z2
    [1, 4, 9, 16, 25, 36, 49, 64, 81]
    >>> reduce(lambda x, y: x + y, z2)
    285
    >>> reduce(lambda x, y: x + y, map(lambda x: x**2, z))
    285

Association Lists (key/value)
- The notion of association lists goes way back to early Lisp/AI programming. The basic idea is to view problems in terms of sets of key/value pairs.
- Most major languages now provide first-class support for this notion (usually via hashes on keys)
- We've seen this a lot this semester
  - Tokens and term-ids
  - Terms and document ids
  - Terms and posting lists
  - Doc-ids and tf-idf values
  - Etc.

MapReduce
- MapReduce combines these ideas in the following way
  - There are two phases of processing, mapping and reducing. Each phase consists of multiple identical copies of map and reduce methods.
  - Map methods take individual key/value pairs as input and return some function of those pairs, producing a new key/value pair
  - Reduce methods take key/<list of values> pairs as input and return some aggregate function of the values as an answer
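To make the two phases concrete, here is a minimal single-machine sketch of that flow in plain Python (Python 3). It is illustrative only, not Hadoop's actual API: the driver name run_mapreduce and its signature are invented for this sketch, and map_fn/reduce_fn are generator functions in the style of the Dumbo examples a few slides below.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: run the user's map method on every input key/value pair.
        mapped = []
        for key, value in inputs:
            mapped.extend(map_fn(key, value))

        # Shuffle: sort by key and collate values into key/<list of values> pairs.
        groups = defaultdict(list)
        for key, value in mapped:
            groups[key].append(value)

        # Reduce phase: run the user's reduce method on each collated pair.
        results = []
        for key, values in sorted(groups.items()):
            results.extend(reduce_fn(key, values))
        return results

The word-count map and reduce defined below plug straight into this driver: run_mapreduce([("doc1", "a b a")], map, reduce) returns [("a", 2), ("b", 1)]. The real framework performs the same three steps, but with the identical map and reduce copies spread across a cluster.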
Map Phase
[Diagram: a stream of key/value pairs is split across parallel map tasks; each map task emits key'/value' pairs.]

Reduce Phase
[Diagram: the key'/value' pairs are distributed by key, then sorted and collated into key'/<value' list> pairs; parallel reduce tasks consume these and emit key''/value'' pairs.]

Example
- Simple example used in all the tutorials
- Get the counts of each word type across a bunch of docs
- Let's assume each doc is a big long string
- For map
  - Input: filenames are keys; content strings are values
  - Output: term tokens are keys; values are 1's
- For reduce
  - Input: term tokens are keys; 1's are values
  - Output: term types are keys; summed counts are values

Dumbo Example

    def map(docid, contents):
        # Keys are filenames; emit a (term, 1) pair for every token.
        for term in contents.split():
            yield term, 1

    def reduce(term, counts):
        # Keys are terms; sum the collated 1's into a total count.
        total = 0
        for count in counts:
            total = total + count
        yield term, total

Hidden Infrastructure
- Partitioning the incoming data
  - Hadoop has default methods
    - By file, given a bunch of files: <filename, contents>
    - By line, given a file full of lines: <line #, line>
- Sorting/collating the mapped key/values
- Moving the data among the nodes
  - Distributed file system
  - Don't move the data; just assign mappers/reducers to nodes

Example 2
- Given our normal postings: term -> list of (doc-id, tf) tuples
- Generate the vector length normalization for each document in the index: $\sqrt{\sum_{t \in d} w_{t,d}^2}$
- Map
  - Input: terms are keys; posting lists are values
  - Output: doc-ids are keys; squared weights are values
- Reduce
  - Input: doc-ids are keys; lists of squared weights are values
  - Output: doc-ids are keys; square roots of the summed weights are the values

Example 2

    import math

    def map(term, postings):
        # Emit (doc-id, squared weight) for each posting of this term.
        for post in postings:
            yield post.docID(), post.weight() ** 2

    def reduce(docID, sqWeights):
        # Sum the squared weights for this doc, then take the square root.
        total = 0
        for weight in sqWeights:
            total = total + weight
        yield docID, math.sqrt(total)

Break
- Thursday we'll start discussion of projects. So come to class ready to say something about projects.
- Part 2 of the HW is due next Thursday
  - Feel free to mail me updated results (R-precisions) as you get them...

Probabilistic Approaches
- The following is a mix of Chapters 12 and 13.
- Only the material from Chapter 12 will be on the quiz.

An Alternative
- The basic vector space model uses a geometric metaphor/framework for the ad hoc retrieval problem
  - One dimension for each word in the vocab
  - Weights are usually tf-idf based
- An alternative is to use a probabilistic approach
- So we'll take a short detour into probabilistic language modeling

Using Language Models for ad hoc Retrieval
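As a preview of where that detour is headed, here is a minimal sketch (Python 3) of the query-likelihood idea: rank each document by the probability its own unigram language model assigns to the query. The smoothing choice (linear interpolation with the collection model), the names lm_score and coll_counts, and the value lam=0.5 are assumptions for this sketch, not details from the slides.

    from collections import Counter

    def lm_score(query_terms, doc_terms, coll_counts, coll_len, lam=0.5):
        # P(q|d) = product over query terms t of
        #   lam * P_mle(t | document) + (1 - lam) * P_mle(t | collection)
        doc_counts = Counter(doc_terms)
        doc_len = len(doc_terms)
        score = 1.0
        for t in query_terms:
            p_doc = doc_counts[t] / doc_len if doc_len else 0.0
            p_coll = coll_counts[t] / coll_len
            score *= lam * p_doc + (1 - lam) * p_coll
        return score

Without the collection term, any document missing even one query term would score zero; interpolating with the collection model is one standard way around that.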

