CSCI 5417: Information Retrieval Systems
Jim Martin
Lecture 10, 9/22/2011

Today 9/22
- Finish LM-based IR
  - Language models in general
  - Smoothing
  - LM for ad hoc retrieval performance
- Project brainstorming

An Alternative to the VS Model
- The basic vector space model uses a geometric metaphor/framework for the ad hoc retrieval problem: one dimension for each word in the vocabulary, with weights that are usually tf-idf based.
- An alternative is to use a probabilistic approach, so we'll take a short detour into probabilistic language modeling.

In General
When you propose a probabilistic approach to a problem like this, you need to specify three things:
1. Exactly what you want the model to be
2. How you will acquire the parameters of that model
3. How you will use the model operationally

Where we are
- In the LM approach to IR, we attempt to model the query generation process: think of a query as being generated from a model derived from a document (or documents).
- We then rank documents by the probability that the query would be observed as a random sample from the respective document model. That is, we rank according to P(q|d).
- Next: how do we compute P(q|d)?

Stochastic Language Models
A language model M assigns a probability to strings in the language (commonly, all strings over the alphabet Σ), generating each word in turn. E.g., a unigram model:

  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02
  ...

To score s = "the man likes the woman", multiply the per-word probabilities:

  P(s|M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

Stochastic Language Models (continued)
Different models assign different probabilities to the same string (for example, a query):

  term       M1       M2
  the        0.2      0.2
  class      0.01     0.0001
  sayst      0.0001   0.03
  pleaseth   0.0001   0.02
  yon        0.0001   0.1
  maiden     0.0005   0.01
  woman      0.01     0.0001

For s = "maiden class pleaseth yon the":

  P(s|M1) = 0.0005 × 0.01 × 0.0001 × 0.0001 × 0.2
  P(s|M2) = 0.01 × 0.0001 × 0.02 × 0.1 × 0.2

so P(s|M2) > P(s|M1).

Unigram and higher-order models
- Unigram language models: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4). Easy. Effective!
- Bigram (generally, n-gram) language models: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
- Other language models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.

Using Language Models for ad hoc Retrieval
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q), which by Bayes' rule is

  P(d|q) = P(q|d) P(d) / P(q)

- P(q) is the same for all documents, so ignore it.
- P(d) is the prior, often treated as the same for all d. But we can give a higher prior to "high-quality" documents: PageRank, click-through, social tags, etc.
- P(q|d) is the probability of q given d.
So to rank documents according to relevance to q, rank according to P(q|d).

How to compute P(q|d)
We make the same conditional independence assumption as for Naive Bayes; this kind of conditional independence assumption is often called a Markov model:

  P(q|Md) = ∏_{k=1..|q|} P(tk|Md)

(|q|: the length of q; tk: the token occurring at position k in q.) This is equivalent to

  P(q|Md) = ∏_{distinct t in q} P(t|Md)^tf(t,q)

(tf(t,q): the term frequency, i.e., the number of occurrences, of t in q.)

Parameter estimation
Where do the parameters P(t|Md) come from? Start with simple counts, the maximum likelihood estimates:

  P(t|Md) = tf(t,d) / |d|

(|d|: the length of document d; tf(t,d): the number of occurrences of term t in document d.)
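Before looking at what goes wrong with these raw counts, here is a minimal sketch of maximum-likelihood query-likelihood scoring. It is not from the lecture; the function and variable names (mle_query_likelihood, doc_tokens, etc.) are invented for illustration.

```python
from collections import Counter

def mle_query_likelihood(query_tokens, doc_tokens):
    """P(q|Md) under a maximum-likelihood unigram model:
    the product over query tokens of tf(t,d) / |d|."""
    tf = Counter(doc_tokens)      # tf(t,d): term counts within the document
    doc_len = len(doc_tokens)     # |d|: document length in tokens
    p = 1.0
    for t in query_tokens:
        p *= tf[t] / doc_len      # Counter returns 0 for unseen terms
    return p

doc = "jackson was one of the most talented entertainers of all time".split()
print(mle_query_likelihood("jackson entertainers".split(), doc))  # (1/11)*(1/11) > 0
print(mle_query_likelihood("michael jackson".split(), doc))      # 0.0: "michael" unseen
```

The second call illustrates the problem taken up next: a single unseen query term zeroes out the entire product.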
This"would"give"a"single!term!the"power"to"eliminate"an"otherwise"relevant"document." For"example,"for"query"" “Michael"Jackson"top"hits”""a"document"about"“Jackson"top"songs”"(but"not"using"the"word"“hits”)"would"have"P(t|Md)"="0."–"That’s"bad."9/22/11 CSCI5417-IR 1314"Smoothing"" Key"intui;on:"A"nonYoccurring"term"is"possible"(even"though"it"didn’t"occur)."That"is"it’s"probability"shouldn’t"be"zero" If"it"isn’t"zero"what"should"it"be?""Remember"that"we’re"developing"LMs"for"each"document"in"a"collec;on."" but"no"more"likely"than"would"be"expected"by"chance"in"the"collec;on." Nota;on:"Mc:"the"collec;on"model;"cft:"the"number"of"occurrences"of"t"in"the"collec;on;""""""""""""""" """""":"the"total"number"of"tokens"in"the"collec;on."8 Smoothing Fall back on using the probability of that term in the collection as whole. Nota;on:"Mc:"the"collec;on"model;"cft:"the"number"of"occurrences"of"t"in"the"collec;on;""""""""""""""""""""":"the"total"number"of"tokens"in"the"collec;on." We"will"use"""""""""""""""""to"“smooth”"P(t|d)"away"from"zero.""9/22/11 CSCI5417-IR 1516"Mixture"model" P(t|d)"="λP(t|Md)"+"(1"Y"λ)P(t|Mc)" Mixes"the"probability"from"the"document"with"the"general"collec;on"frequency"of"the"word" If"a"term"in"query"occurs"in"a"do cument"we"combine"the"two"scores"with"differing"weights" If"a"term"doesn’t"occur"then"its"just"the"second"factor" The"P"of"the"term"in"the"collec;on"discounted"by"(1"–"λ)"9 Smoothing High"value"of"λ:"“conjunc;veYlike”"search"–"tends"to"retrieve"documents"containing"all"query"words." Low"value"of"λ:"more"disjunc;ve,"best"for"long"queries" Correctly"sekng"λ!is"very"important"for"good"performance."9/22/11 CSCI5417-IR 1718"Mixture"model:"Summary" What"we"model:"The"user"has"a"document"in"mind"and"generates"the"query"from"this"document." The"equa;on"represents"the"pro babi li ty"that"the"document"that"the"user"had"in"mind"was"in"fact"this"one."10 19"Example"" Collec;on:"d1"and"d2! d1!:"Jackson"was"one"of"the"most"talented"entertainers"of"all";me" d2:"Michael"Jackson"anointed"himself"King"of"Pop"" Query"q:"Michael"Jackson"" Use"mixture"model"with"λ!="1/2"
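Worked by hand (with 18 tokens in the collection, cf(michael) = 1 and cf(jackson) = 2), the same numbers fall out: P(q|d1) = ((0/11 + 1/18)/2) × ((1/11 + 2/18)/2) ≈ 0.0028 and P(q|d2) = ((1/7 + 1/18)/2) × ((1/7 + 2/18)/2) ≈ 0.0126, so d2 is ranked first. That matches intuition: smoothing keeps d1 in play even though it never mentions "michael", but the document model still favors d2, which contains both query terms.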