CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 10, 9/22/2011

Today 9/22
- Finish LM-based IR
  - Language models in general
  - Smoothing
  - LM for ad hoc retrieval performance
- Project brainstorming

An Alternative to the VS Model
- The basic vector space model uses a geometric metaphor/framework for the ad hoc retrieval problem
  - One dimension for each word in the vocabulary
  - Weights are usually tf-idf based
- An alternative is to use a probabilistic approach
- So we'll take a short detour into probabilistic language modeling

In General
- When you propose a probabilistic approach to a problem like this, you need to specify three things:
  1. Exactly what you want the model to be
  2. How you will acquire the parameters of that model
  3. How you will use the model operationally

Where we are
- In the LM approach to IR, we attempt to model the query generation process.
- Think of a query as being generated from a model derived from a document (or documents).
- We then rank documents by the probability that the query would be observed as a random sample from the respective document model.
- That is, we rank according to P(q|d).
- Next: how do we compute P(q|d)?

Stochastic Language Models
- Model the probability of generating strings (each word in turn) in the language (commonly, all strings over the alphabet ∑). E.g., a unigram model M:

      the    0.2
      a      0.1
      man    0.01
      woman  0.01
      said   0.03
      likes  0.02
      ...

  For s = "the man likes the woman", multiply the per-word probabilities:

      P(s|M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

Stochastic Language Models (continued)
- Model the probability of generating any string (for example, a query). Compare two models:

      term       M1       M2
      the        0.2      0.2
      class      0.01     0.0001
      sayst      0.0001   0.03
      pleaseth   0.0001   0.02
      yon        0.0001   0.1
      maiden     0.0005   0.01
      woman      0.01     0.0001

  For s = "maiden class pleaseth yon the":

      P(s|M1) = 0.0005 × 0.01 × 0.0001 × 0.0001 × 0.2
      P(s|M2) = 0.01 × 0.0001 × 0.02 × 0.1 × 0.2

      P(s|M2) > P(s|M1)

How to compute P(q|d)
- We make a conditional independence assumption over query terms; this kind of assumption is often called a Markov model:

      P(q|Md) = ∏(k = 1 to |q|) P(tk|Md)

  (|q|: length of q; tk: the token occurring at position k in q)
- This is equivalent to:

      P(q|Md) = ∏(distinct t in q) P(t|Md)^tf(t,q)

  (tf(t,q): term frequency, i.e., # of occurrences, of t in q)

Unigram and higher-order models
- Unigram language models: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
- Bigram (generally, n-gram) language models: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
- Other language models
  - Grammar-based models (PCFGs), etc.
  - Probably not the first thing to try in IR
- Unigram models: Easy. Effective!
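The scoring on the two Stochastic Language Models slides is just a product of per-term probabilities, which is easy to make concrete. Below is a minimal Python sketch using the toy model from the first of those slides; the function and variable names are illustrative, not from the lecture.

```python
# Toy unigram model from the slide (term -> probability).
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
           "said": 0.03, "likes": 0.02}

def p_string(s, model):
    """P(s|M): multiply the unigram probability of each token in turn."""
    p = 1.0
    for token in s.split():
        p *= model.get(token, 0.0)  # an unseen term contributes probability 0
    return p

print(p_string("the man likes the woman", model_m))
# 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-08
```

Note the `model.get(token, 0.0)` default: any term outside the model zeroes out the entire product. That is exactly the zero-count problem the smoothing slides below set out to fix.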
10"Using"Language"Models"for"ad"hoc"Retrieval"" Each"document"is"treated"as"(the"basis"fo r)"a"lan guage"model." Given"a"query"q! Rank"documents"based"on"P(d|q)"via" P(q)""is"the"same"for"all"documents,"so"ignore" P(d)""is"the"prior"–"oMen"treated"as"the"same"for"all"d"" But"we"can"give"a"higher"prior"to"“ hi ghYqu ali ty”"documents" PageRank,"click"through,""social"tags,"etc." P(q|d)"is"the"probability"of"q"given"d."" So"to"rank"documents"according"to"relevance"to"q,"rank"according"to"P(q|d)"6 11"How"to"compute"P(q|d)"" We"will"make"the"same"condi;onal"independence"assump;on"as"for"Naive"Bayes."""(|q|:"length"ofr"q;"tk":"the"token"occurring"at"posi;on"k"in"q)" This"is"equivalent"to:" Ot,q:"term"frequency"(#"occurrences)"of"t"in"q!12"Parameter"es;ma;on" Where"do"the"parameters"P(t|Md)."come"from?" Start"with"simple"counts"(maximum"likelihood"es;mates)"""|d|:"length"of"document"d;"""Ot,d":"#"occurrences"of"term"t"in"docu ment"d"7 Problem: Zero counts  A"single"term"t!with"P(t|Md)"="0"will"make""this""""""""""""""""""""""""""""""""""""""zero." This"would"give"a"single!term!the"power"to"eliminate"an"otherwise"relevant"document." For"example,"for"query"" “Michael"Jackson"top"hits”""a"document"about"“Jackson"top"songs”"(but"not"using"the"word"“hits”)"would"have"P(t|Md)"="0."–"That’s"bad."9/22/11 CSCI5417-IR 1314"Smoothing"" Key"intui;on:"A"nonYoccurring"term"is"possible"(even"though"it"didn’t"occur)."That"is"it’s"probability"shouldn’t"be"zero" If"it"isn’t"zero"what"should"it"be?""Remember"that"we’re"developing"LMs"for"each"document"in"a"collec;on."" but"no"more"likely"than"would"be"expected"by"chance"in"the"collec;on." Nota;on:"Mc:"the"collec;on"model;"cft:"the"number"of"occurrences"of"t"in"the"collec;on;""""""""""""""" """""":"the"total"number"of"tokens"in"the"collec;on."8 Smoothing  Fall back on using the probability of that term in the collection as whole.  Nota;on:"Mc:"the"collec;on"model;"cft:"the"number"of"occurrences"of"t"in"the"collec;on;""""""""""""""""""""":"the"total"number"of"tokens"in"the"collec;on." We"will"use"""""""""""""""""to"“smooth”"P(t|d)"away"from"zero.""9/22/11 CSCI5417-IR 1516"Mixture"model" P(t|d)"="λP(t|Md)"+"(1"Y"λ)P(t|Mc)" Mixes"the"probability"from"the"document"with"the"general"collec;on"frequency"of"the"word" If"a"term"in"query"occurs"in"a"do cument"we"combine"the"two"scores"with"differing"weights" If"a"term"doesn’t"occur"then"its"just"the"second"factor" The"P"of"the"term"in"the"collec;on"discounted"by"(1"–"λ)"9 Smoothing  High"value"of"λ:"“conjunc;veYlike”"search"–"tends"to"retrieve"documents"containing"all"query"words." Low"value"of"λ:"more"disjunc;ve,"best"for"long"queries" Correctly"sekng"λ!is"very"important"for"good"performance."9/22/11 CSCI5417-IR 1718"Mixture"model:"Summary" What"we"model:"The"user"has"a"document"in"mind"and"generates"the"query"from"this"document." The"equa;on"represents"the"pro babi li ty"that"the"document"that"the"user"had"in"mind"was"in"fact"this"one."10 19"Example"" Collec;on:"d1"and"d2! d1!:"Jackson"was"one"of"the"most"talented"entertainers"of"all";me" d2:"Michael"Jackson"anointed"himself"King"of"Pop"" Query"q:"Michael"Jackson"" Use"mixture"model"with"λ!="1/2"

