CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 9, 9/20/2011

Slide outline: Today 9/20; Where we are...; But First: Back to Distributed Indexing; Huh?; MapReduce; Inspirations; Functional Programming; Python Map/Reduce; Association Lists (key/value); Slide 11; Map Phase; Reduce Phase; Example; Dumbo Example; Hidden Infrastructure; Example 2; Slide 18; Break; Probabilistic Approaches; An Alternative; Slide 22; Slide 23; Stochastic Language Models; Slide 25; Slide 26; So... LMs for ad hoc Retrieval; Unigram and higher-order models; Next time

Today 9/20
• Where we are
• MapReduce/Hadoop
• Probabilistic IR
• Language models
• LM for ad hoc retrieval

Where we are...
• Basics of ad hoc retrieval
  • Indexing
  • Term weighting/scoring (cosine)
  • Evaluation
• Document classification
• Clustering
• Information extraction
• Sentiment/opinion mining

But First: Back to Distributed Indexing
[Diagram: a master node assigns input splits to parser machines; each parser partitions its output by term range (a-f, g-p, q-z); inverter machines each collect one term partition and write the final postings.]

Huh?
That was supposed to be an explanation of MapReduce (Hadoop)... Maybe not so much... Here's another try.

MapReduce
MapReduce is a distributed programming framework intended to facilitate applications that are
• data intensive
• parallelizable in a certain sense
• run in a commodity-cluster environment
MapReduce is the original internal Google model; Hadoop is the open source version.

Inspirations
MapReduce elegantly and efficiently combines inspirations from a variety of sources, including
• functional programming
• key/value association lists
• Unix pipes

Functional Programming
The focus is on side-effect-free specifications of input/output mappings. There are various idioms, but map and reduce are two central ones:
• Mapping refers to applying an identical function to each element of a list and constructing a list of the outputs.
• Reducing refers to receiving the elements of a list and aggregating them according to some function.

Python Map/Reduce
Say you wanted to compute the simple sum of squares of a list of numbers, $\sum_{i=0}^{n} w_i^2$:

    >>> z
    [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> z2 = map(lambda x: x**2, z)
    >>> z2
    [1, 4, 9, 16, 25, 36, 49, 64, 81]
    >>> reduce(lambda x, y: x + y, z2)
    285
    >>> reduce(lambda x, y: x + y, map(lambda x: x**2, z))
    285

(This is Python 2; in Python 3, map returns a lazy iterator and reduce must be imported from functools.)

Association Lists (key/value)
The notion of association lists goes way back to early Lisp/AI programming. The basic idea is to view problems in terms of sets of key/value pairs. Most major languages now provide first-class support for this notion (usually via hashes on keys). We've seen this a lot this semester:
• tokens and term-ids
• terms and document-ids
• terms and postings lists
• doc-ids and tf-idf values
• etc.

MapReduce
MapReduce combines these ideas in the following way. There are two phases of processing, mapping and reducing, and each phase consists of multiple identical copies of map and reduce methods.
• Map methods take individual key/value pairs as input and return some function of those pairs, producing a new key/value pair.
• Reduce methods take key/<list of values> pairs as input and return some aggregate function of the values as an answer.

Map Phase
[Diagram: a bank of parallel map tasks; each consumes incoming key/value pairs and emits transformed key'/value' pairs.]

Reduce Phase
[Diagram: the key'/value' pairs are distributed by key, sorted, and collated into key'/<value' list> pairs; parallel reduce tasks then aggregate each list into final key''/value'' pairs.]
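To make that dataflow concrete, here is a minimal single-machine sketch of the map / distribute-by-key / reduce pipeline. This is a toy illustration, not Hadoop's API; run_mapreduce and its parameter names are our own. The word-count map and reduce functions on the next slide plug directly into it.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Toy MapReduce driver: inputs is a list of (key, value) pairs.
        # Map phase: apply map_fn to every input pair, gathering all
        # of the emitted (key', value') pairs.
        intermediate = []
        for key, value in inputs:
            intermediate.extend(map_fn(key, value))

        # Shuffle: distribute by key', collating values into lists.
        groups = defaultdict(list)
        for key, value in intermediate:
            groups[key].append(value)

        # Reduce phase: aggregate each key'/<value' list> pair.
        results = []
        for key in sorted(groups):
            results.extend(reduce_fn(key, groups[key]))
        return results

In the real framework each of the three stages runs in parallel across many machines; the sketch only shows the logical flow.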
Example
A simple example used in all the tutorials: get the counts of each word type across a bunch of docs. Let's assume each doc is one big long string.
• For map
  • Input: filenames are keys; content strings are values
  • Output: term tokens are keys; 1's are values
• For reduce
  • Input: term tokens are keys; 1's are values
  • Output: term types are keys; summed counts are values

Dumbo Example

    def map(docid, contents):
        # emit a (term, 1) key/value pair for every token in the document
        for term in contents.split():
            yield term, 1

    def reduce(term, counts):
        # sum the 1's collated for this term
        total = 0
        for count in counts:
            total = total + count
        yield term, total

Hidden Infrastructure
• Partitioning the incoming data (Hadoop has default methods)
  • by file, given a bunch of files: <filename, contents>
  • by line, given a file full of lines: <line #, line>
• Sorting/collating the mapped key/values
• Moving the data among the nodes
  • distributed file system
  • don't move the data; just assign mappers/reducers to nodes

Example 2
Given our normal postings (term -> list of (doc-id, tf) tuples), generate the vector length normalization for each document in the index: $\sqrt{\sum_{t \in d} w_{t,d}^2}$
• Map
  • Input: terms are keys; postings lists are values
  • Output: doc-ids are keys; squared weights are values
• Reduce
  • Input: doc-ids are keys; lists of squared weights are values
  • Output: doc-ids are keys; square roots of the summed weights are the values

    def map(term, postings):
        # emit a (doc-id, squared weight) pair for each posting
        # (squared here, to match the map output described above)
        for post in postings:
            yield post.docID(), post.weight() ** 2
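The preview ends before the matching reduce. A minimal sketch consistent with the Reduce description above (same Dumbo-style conventions; the names are ours, not from the original slides) might look like this:

    import math

    def reduce(docid, squared_weights):
        # sum the squared weights collated for this doc-id and emit
        # the square root: the document's vector length normalization
        total = 0.0
        for w2 in squared_weights:
            total = total + w2
        yield docid, math.sqrt(total)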

