CU-Boulder CSCI 5417 - Lecture Notes - D2334139

Home> Schools> University of Colorado at Boulder> Computer Science (CSCI) > CSCI 5417> Lecture Notes

CU-Boulder CSCI 5417 - Lecture Notes

School name University of Colorado at Boulder

Course Csci 5417- Information Retrieval Systems

Pages 15

Download Save

Unformatted text preview:

1 CSCI 5417 Information Retrieval Systems Jim Martin!Lecture 9 9/20/2011 9/22/11 CSCI5417-IR 2Today 9/20  Where we are  MapReduce/Hadoop  Probabilistic IR  Language models  LM for ad hoc retrieval2 9/22/11 CSCI5417-IR 3Where we are...  Basics of ad hoc retrieval  Indexing  Term weighting/scoring  Cosine  Evaluation  Document classification  Clustering  Information extraction  Sentiment/Opinion mining 9/22/11 CSCI7000-IR 4But First: Back to Distributed Indexing splits Parser Parser Parser Master a-f g-p q-z a-f g-p q-z a-f g-p q-z Inverter Inverter Inverter Postings a-f g-p q-z assign assign3 Huh?  That was supposed to be an explanation of MapReduce (Hadoop)...  Maybe not so much...  Here’s another try MapReduce  MapReduce is a distributed programming framework that is intended to facilitate applications that are  Data intensive  Parallelizable in a certain sense  In a commodity-cluster environment  MapReduce is the original internal Google model  Hadoop is the open source version4 Inspirations  MapReduce elegantly and efficiently combines inspirations from a variety of sources, including  Functional programming  Key/value association lists  Unix pipes Functional Programming  The focus is on side-effect free specifications of input/output mappings  There are various idioms, but map and reduce are two central ones...  Mapping refers to applying an identical function to each of the elements of a list and constructing a list of the outputs  Reducing refers to receiving the elements of a list and aggregating the elements according to some function.5 Python Map/Reduce  Say you wanted to compute simple sum of squares of a list of numbers € wi2i=0n∑>>> z [1, 2, 3, 4, 5, 6, 7, 8, 9] >>> z2 = map(lambda x: x**2, l) >>> z2 [1, 4, 9, 16, 25, 36, 49, 64, 81] >>> reduce(lambda x,y: x+y, z2) 285 >>> reduce(lambda x,y: x+y, map(lambda x: x**2, z)) 285 Association Lists (key/value)  The notion of association lists goes way back to early lisp/ai programming. The basic idea is to try to view problems in terms of sets of key/value pairs.  Most major languages now provide first-class support for this notion (usually via hashes on keys)  We’ve seen this a lot this semester  Tokens and term-ids  Terms and document ids  Terms and posting lists  Docids and tf/idf values  Etc.6 MapReduce  MapReduce combines these ideas in the following way  There are two phases of processing mapping and reducing. Each phase consists of multiple identical copies of map and reduce methods  Map methods take individual key/value pairs as input and return some function of those pairs to produce a new key/value pair  Reduce methods take key/<list of values> pairs as input, and return some aggregate function of the values as an answer. Map Phase map map map map map map Key/value Key/value Key/value Key/value Key/value Key/value Key/value Key/value Key/value Key/value Key/value Key/value Key/value Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’/value’7 Reduce Phase Distribute by keys Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’/value’ Key’’/value’’ reduce Sort by keys and collate values Key’/<value’ list> Key’/<value’ list> Key’/<value’ list> Key’/<value’ list> reduce reduce reduce reduce reduce Key’’/value’’ Key’’/value’’ Key’’/value’’ Key’’/value’’ Key’’/value’’ Example  Simple example used in all the tutorials  Get the counts of each word type across a bunch of docs  Let’s assume each doc is a big long string For map Input: Filenames are keys; content string is values Output: Term tokens are keys; values are 1’s For reduce Input: Terms tokens are keys, 1’s are values Output: Term types are keys, summed counts are values8 Dumbo Example def map(docid, contents): for term in contents.split(): yield term, 1 def reduce(term, counts): sum = 0 for count in counts: sum = sum + count yield term, sum Key Value Key Value Hidden Infrastructure  Partitioning the incoming data  Hadoop has default methods  By file, given a bunch of files  <filename, contents>  By line, given a file full of lines  <line #, line>  Sorting/collating the mapped key/values  Moving the data among the nodes  Distributed file system  Don’t move the data; just assign mappers/reducers to nodes9 Example 2  Given our normal postings  term -> list of (doc-id, tf) tuples  Generate the vector length normalization for each document in the index € wt,d2t∈d∑• Map • Input: terms are keys, posting lists are values • Output: doc-ids are keys, squared weights are values • Reduce • Input: doc-ids are keys, list of squared weights are values • Output: doc-ids are keys, square root of the summed weights are the values Example 2 def map(term, postings): for post in postings: yield post.docID(), post.weight() ** 2 def reduce(docID, sqWeights): sum = 0 for weight in sqWeights: sum = sum + weight yield docID, math.sqrt(sum)10 9/22/11 CSCI5417-IR 19Break  Thursday we’ll start discussion of projects. So come to class ready to say something about projects.  Part 2 of the HW is due next Thursday  Feel free to mail me updated results (R-precisions) as you get them... Probabilistic pproaches.  The following is a mix of chapters 12 and 13.  Only the material from 12 will be on the quiz 9/22/11 CSCI5417-IR 2011 An Alternative  Basic vector space model uses a geometric metaphor/framework for the ad hoc retrieval problem  One dimension for each word in the vocab  Weights are usually tf-idf based  An alternative is to use a probabilistic approach  So we’ll take a short detour into probabilistic language modeling 9/22/11 CSCI5417-IR 2122"Using"Language"Models"for"ad"hoc"Retrieval""

View Full Document


School:
Email:
New Password:
Confirm Password:

CU-Boulder CSCI 5417 - Lecture Notes

Sign up for free to view:

Please select your school