CSCI 5417: Information Retrieval Systems
Jim Martin
Lecture 10, 9/22/2011

Today 9/22
- Finish LM-based IR
  - Language models in general
  - Smoothing
  - LM for ad hoc retrieval performance
- Project brainstorming

An Alternative to the VS Model
- The basic vector space model uses a geometric metaphor/framework for the ad hoc retrieval problem: one dimension for each word in the vocabulary, with weights that are usually tf-idf based.
- An alternative is to use a probabilistic approach, so we'll take a short detour into probabilistic language modeling.

In General
When you propose a probabilistic approach to a problem like this, you need to specify three things:
1. Exactly what you want the model to be
2. How you will acquire the parameters of that model
3. How you will use the model operationally

Where we are
- In the LM approach to IR, we attempt to model the query generation process: think of a query as being generated from a model derived from a document (or documents).
- We then rank documents by the probability that the query would be observed as a random sample from the respective document model. That is, we rank according to P(q|d).
- Next: how do we compute P(q|d)?

Stochastic Language Models
A language model M assigns a probability to strings in the language (commonly, all strings over the alphabet Σ), generating each word in turn. E.g., a unigram model:

  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02
  ...

To score s = "the man likes the woman", multiply the per-word probabilities:

  P(s|M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

Stochastic Language Models (continued)
Different models assign different probabilities to the same string (for example, a query):

  term       M1       M2
  the        0.2      0.2
  class      0.01     0.0001
  sayst      0.0001   0.03
  pleaseth   0.0001   0.02
  yon        0.0001   0.1
  maiden     0.0005   0.01
  woman      0.01     0.0001

For s = "maiden class pleaseth yon the":

  P(s|M1) = 0.0005 × 0.01 × 0.0001 × 0.0001 × 0.2
  P(s|M2) = 0.01 × 0.0001 × 0.02 × 0.1 × 0.2

so P(s|M2) > P(s|M1).

Unigram and higher-order models
- Unigram language models: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4). Easy. Effective!
- Bigram (generally, n-gram) language models: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
- Other language models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.

Using Language Models for ad hoc Retrieval
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q), which by Bayes' rule is

  P(d|q) = P(q|d) P(d) / P(q)

- P(q) is the same for all documents, so ignore it.
- P(d) is the prior, often treated as the same for all d. But we can give a higher prior to "high-quality" documents: PageRank, click-through, social tags, etc.
- P(q|d) is the probability of q given d.
So to rank documents according to relevance to q, rank according to P(q|d).

How to compute P(q|d)
We make the same conditional independence assumption as for Naive Bayes; this kind of conditional independence assumption is often called a Markov model:

  P(q|Md) = ∏_{k=1..|q|} P(tk|Md)

(|q|: the length of q; tk: the token occurring at position k in q.) This is equivalent to

  P(q|Md) = ∏_{distinct t in q} P(t|Md)^tf(t,q)

(tf(t,q): the term frequency, i.e., the number of occurrences, of t in q.)

Parameter estimation
Where do the parameters P(t|Md) come from? Start with simple counts, the maximum likelihood estimates:

  P(t|Md) = tf(t,d) / |d|

(|d|: the length of document d; tf(t,d): the number of occurrences of term t in document d.)
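Before looking at what goes wrong with these raw counts, here is a minimal sketch of maximum-likelihood query-likelihood scoring. It is not from the lecture; the function and variable names (mle_query_likelihood, doc_tokens, etc.) are invented for illustration.

```python
from collections import Counter

def mle_query_likelihood(query_tokens, doc_tokens):
    """P(q|Md) under a maximum-likelihood unigram model:
    the product over query tokens of tf(t,d) / |d|."""
    tf = Counter(doc_tokens)      # tf(t,d): term counts within the document
    doc_len = len(doc_tokens)     # |d|: document length in tokens
    p = 1.0
    for t in query_tokens:
        p *= tf[t] / doc_len      # Counter returns 0 for unseen terms
    return p

doc = "jackson was one of the most talented entertainers of all time".split()
print(mle_query_likelihood("jackson entertainers".split(), doc))  # (1/11)*(1/11) > 0
print(mle_query_likelihood("michael jackson".split(), doc))      # 0.0: "michael" unseen
```

The second call illustrates the problem taken up next: a single unseen query term zeroes out the entire product.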
This"would"give"a"single!term!the"power"to"eliminate"an"otherwise"relevant"document." For"example,"for"query"" “Michael"Jackson"top"hits”""a"document"about"“Jackson"top"songs”"(but"not"using"the"word"“hits”)"would"have"P(t|Md)"="0."–"That’s"bad."9/22/11 CSCI5417-IR 1314"Smoothing"" Key"intui;on:"A"nonYoccurring"term"is"possible"(even"though"it"didn’t"occur)."That"is"it’s"probability"shouldn’t"be"zero" If"it"isn’t"zero"what"should"it"be?""Remember"that"we’re"developing"LMs"for"each"document"in"a"collec;on."" but"no"more"likely"than"would"be"expected"by"chance"in"the"collec;on." Nota;on:"Mc:"the"collec;on"model;"cft:"the"number"of"occurrences"of"t"in"the"collec;on;""""""""""""""" """""":"the"total"number"of"tokens"in"the"collec;on."8 Smoothing Fall back on using the probability of that term in the collection as whole. Nota;on:"Mc:"the"collec;on"model;"cft:"the"number"of"occurrences"of"t"in"the"collec;on;""""""""""""""""""""":"the"total"number"of"tokens"in"the"collec;on." We"will"use"""""""""""""""""to"“smooth”"P(t|d)"away"from"zero.""9/22/11 CSCI5417-IR 1516"Mixture"model" P(t|d)"="λP(t|Md)"+"(1"Y"λ)P(t|Mc)" Mixes"the"probability"from"the"document"with"the"general"collec;on"frequency"of"the"word" If"a"term"in"query"occurs"in"a"do cument"we"combine"the"two"scores"with"differing"weights" If"a"term"doesn’t"occur"then"its"just"the"second"factor" The"P"of"the"term"in"the"collec;on"discounted"by"(1"–"λ)"9 Smoothing High"value"of"λ:"“conjunc;veYlike”"search"–"tends"to"retrieve"documents"containing"all"query"words." Low"value"of"λ:"more"disjunc;ve,"best"for"long"queries" Correctly"sekng"λ!is"very"important"for"good"performance."9/22/11 CSCI5417-IR 1718"Mixture"model:"Summary" What"we"model:"The"user"has"a"document"in"mind"and"generates"the"query"from"this"document." The"equa;on"represents"the"pro babi li ty"that"the"document"that"the"user"had"in"mind"was"in"fact"this"one."10 19"Example"" Collec;on:"d1"and"d2! d1!:"Jackson"was"one"of"the"most"talented"entertainers"of"all";me" d2:"Michael"Jackson"anointed"himself"King"of"Pop"" Query"q:"Michael"Jackson"" Use"mixture"model"with"λ!="1/2"
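Worked by hand (with 18 tokens in the collection, cf(michael) = 1 and cf(jackson) = 2), the same numbers fall out: P(q|d1) = ((0/11 + 1/18)/2) × ((1/11 + 2/18)/2) ≈ 0.0028 and P(q|d2) = ((1/7 + 1/18)/2) × ((1/7 + 2/18)/2) ≈ 0.0126, so d2 is ranked first. That matches intuition: smoothing keeps d1 in play even though it never mentions "michael", but the document model still favors d2, which contains both query terms.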