CSCI 5417: Information Retrieval Systems
Jim Martin
Lecture 9, 9/20/2011

01/14/19 CSCI 5417 - IR 2

Today 9/20
- Where we are
- MapReduce/Hadoop
- Probabilistic IR
  - Language models
  - LM for ad hoc retrieval

Where we are...
- Basics of ad hoc retrieval
  - Indexing
  - Term weighting/scoring
    - Cosine
  - Evaluation
- Document classification
- Clustering
- Information extraction
- Sentiment/Opinion mining

But First: Back to Distributed Indexing
[Diagram: a master assigns input splits to parsers; each parser writes a-f, g-p, and q-z partitions, which are routed to inverters that build the a-f, g-p, and q-z postings.]

Huh?
- That was supposed to be an explanation of MapReduce (Hadoop)...
- Maybe not so much... Here's another try

MapReduce
- MapReduce is a distributed programming framework intended to facilitate applications that are
  - Data intensive
  - Parallelizable in a certain sense
  - Run in a commodity-cluster environment
- MapReduce is Google's original internal model
  - Hadoop is the open-source version

Inspirations
- MapReduce elegantly and efficiently combines inspirations from a variety of sources, including
  - Functional programming
  - Key/value association lists
  - Unix pipes

Functional Programming
- The focus is on side-effect-free specifications of input/output mappings
- There are various idioms, but map and reduce are two central ones
  - Mapping refers to applying an identical function to each element of a list and constructing a list of the outputs
  - Reducing refers to receiving the elements of a list and aggregating them according to some function

Python Map/Reduce
Say you wanted to compute the simple sum of squares of a list of numbers:

    ∑_{i=0}^{n} w_i^2

    >>> z
    [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> z2 = map(lambda x: x**2, z)
    >>> z2
    [1, 4, 9, 16, 25, 36, 49, 64, 81]
    >>> reduce(lambda x,y: x+y, z2)
    285
    >>> reduce(lambda x,y: x+y, map(lambda x: x**2, z))
    285

Association Lists (key/value)
- The notion of association lists goes back to early Lisp/AI programming. The basic idea is to view problems in terms of sets of key/value pairs.
- Most major languages now provide first-class support for this notion (usually via hashes on keys)
- We've seen this a lot this semester
  - Tokens and term-ids
  - Terms and document-ids
  - Terms and posting lists
  - Doc-ids and tf-idf values
  - Etc.

MapReduce
- MapReduce combines these ideas in the following way
  - There are two phases of processing: mapping and reducing. Each phase consists of multiple identical copies of map and reduce methods
  - Map methods take individual key/value pairs as input and return some function of those pairs as a new key/value pair
  - Reduce methods take key/<list of values> pairs as input, and return some aggregate function of the values as an answer

Map Phase
[Diagram: many incoming key/value pairs are handed in parallel to identical map methods, each emitting new key'/value' pairs.]

Reduce Phase
[Diagram: the mapped key'/value' pairs are distributed by key, then sorted and collated into key'/<value' list> pairs; identical reduce methods consume these and emit key''/value'' pairs.]
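The interpreter session on the Python Map/Reduce slide is Python 2. In Python 3, map returns a lazy iterator and reduce has moved to the functools module, so a minimal updated version of the same sum-of-squares computation looks like this:

```python
from functools import reduce

# Sum of squares of [1..9] in map/reduce style.
# In Python 3, map() is lazy, so wrap it in list() to display it.
z = list(range(1, 10))
z2 = list(map(lambda x: x ** 2, z))
total = reduce(lambda x, y: x + y, z2)
print(total)  # 285
```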
Example
- Simple example used in all the tutorials
  - Get the counts of each word type across a bunch of docs
  - Let's assume each doc is a big long string
- For map
  - Input: filenames are keys; content strings are values
  - Output: term tokens are keys; values are 1's
- For reduce
  - Input: term tokens are keys; 1's are values
  - Output: term types are keys; summed counts are values

Dumbo Example

    def map(docid, contents):        # key: docid, value: contents
        for term in contents.split():
            yield term, 1

    def reduce(term, counts):        # key: term, values: counts
        sum = 0
        for count in counts:
            sum = sum + count
        yield term, sum

Hidden Infrastructure
- Partitioning the incoming data
  - Hadoop has default methods
    - By file, given a bunch of files: <filename, contents>
    - By line, given a file full of lines: <line #, line>
- Sorting/collating the mapped key/values
- Moving the data among the nodes
  - Distributed file system
  - Don't move the data; just assign mappers/reducers to nodes

Example 2
- Given our normal postings: term -> list of (doc-id, tf) tuples
- Generate the vector length normalization for each document in the index:

    sqrt( ∑_{t ∈ d} w_{t,d}^2 )

- Map
  - Input: terms are keys; posting lists are values
  - Output: doc-ids are keys; squared weights are values
- Reduce
  - Input: doc-ids are keys; lists of squared weights are values
  - Output: doc-ids are keys; square roots of the summed weights are the values

    def map(term, postings):
        for post in postings:
            yield post.docID(), post.weight()
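The map code for Example 2 is where this excerpt breaks off, and the matching reduce method isn't shown. Here is a hedged, self-contained sketch of the whole job: the Posting class is a hypothetical stand-in for posting objects with docID() and weight() methods, and the squaring in the mapper and the square root in the reducer follow the Map/Reduce bullets above rather than code shown on the slide.

```python
import math

class Posting:
    """Hypothetical stand-in for a posting carrying a doc-id and a term weight."""
    def __init__(self, docid, weight):
        self._docid, self._weight = docid, weight
    def docID(self):
        return self._docid
    def weight(self):
        return self._weight

def map(term, postings):
    # Emit (doc-id, squared weight) for every posting in the term's list.
    for post in postings:
        yield post.docID(), post.weight() ** 2

def reduce(docid, squared_weights):
    # Vector length: square root of the summed squared term weights.
    yield docid, math.sqrt(sum(squared_weights))
```

With a toy index where document d1 has weights 3 and 4 on two terms, the reducer yields (d1, 5.0).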
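None of the plumbing in these slides runs outside the framework, but the Dumbo-style word count can be traced end to end with a small single-process harness. run_mapreduce below is an illustrative helper, not a Hadoop or Dumbo API; it plays the role of the hidden infrastructure by running the mappers, collating the intermediate pairs by key, and handing each key's value list to a reducer.

```python
from collections import defaultdict

def map_wordcount(docid, contents):
    # Mapper from the Dumbo example: one (term, 1) pair per token.
    for term in contents.split():
        yield term, 1

def reduce_wordcount(term, counts):
    # Reducer from the Dumbo example: sum the 1's for each term type.
    yield term, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    # "Hidden infrastructure": map everything, sort/collate the
    # intermediate pairs by key, then reduce each key's value list.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):
            groups[k].append(v)
    results = {}
    for key in sorted(groups):
        for k, v in reducer(key, groups[key]):
            results[k] = v
    return results

print(run_mapreduce([("d1", "a b a"), ("d2", "b c")],
                    map_wordcount, reduce_wordcount))
# {'a': 2, 'b': 2, 'c': 1}
```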