Contents:
Information Retrieval: The Challenge (1)
Information Retrieval: The Challenge (2)
Information Retrieval Assumptions (1)
Information Retrieval Assumption (2)
Information Retrieval Assumption (3)
Boolean Queries (1)
Boolean Queries (2)
Boolean Queries (3)
Beyond the Boolean Boondoggle (1)
Beyond the Boolean Boondoggle (2)
The Vector Space Model (1)
The Vector Space Model (2)
The Vector Space Model (3)
Refinements to VSM (2)
Refinements to VSM (3)
Refinements to VSM (4)
Evaluating Information Retrieval (1)
Evaluating Information Retrieval (2)
Query Expansion (1)
Query Expansion (2)
Relevance Feedback
Term Weighting Methods (1)
Term Weighting Methods (2)
Efficient Implementations of VSM (1)
Efficient Implementations of VSM (3)
Generalized Vector Space Model (1)
Generalized Vector Space Model (2)
GVSM, How it Works (1)
GVSM, How it Works (2)
A Critique of Pure Relevance (1)
A Critique of Pure Relevance (2)
Maximal Marginal Relevance (1)
Maximal Marginal Relevance (2)
Maximal Marginal Relevance (MMR) (3)
Maximal Marginal Relevance (MMR) (4)
Document Summarization in a Nutshell (1)
Summarization as Passage Retrieval (1)

15-381 Artificial Intelligence
Information Retrieval (How to Power a Search Engine)
Jaime Carbonell
20 September 2001

Topics Covered:
• "Bag of Words" Hypothesis
• Vector Space Model & Cosine Similarity
• Query Expansion Methods

Information Retrieval: The Challenge (1)

Text DB includes:
(1) Rainfall measurements in the Sahara continue to show a steady decline starting from the first measurements in 1961. In 1996 only 12mm of rain were recorded in upper Sudan, and 1mm in Southern Algiers...
(2) Dan Marino states that professional football risks losing the number one position in the hearts of fans across this land. Declines in TV audience ratings are cited...
(3) Alarming reductions in precipitation in desert regions are blamed for desert encroachment of previously fertile farmland in Northern Africa. Scientists measured both yearly precipitation and groundwater levels...

Information Retrieval: The Challenge (2)

User query states:
"Decline in rainfall and impact on farms near Sahara"

Challenges:
• How to retrieve (1) and (3) and not (2)?
• How to rank (3) as best?
• How to cope with no shared words?

Information Retrieval Assumptions (1)

Basic IR task:
• There exists a document collection {Dj}
• User enters an ad hoc query Q
• Q correctly states the user's interest
• User wants the {Di} ⊂ {Dj} most relevant to Q

"Shared Bag of Words" assumption:
Every query = {wi}
Every document = {wk}
...where wi and wk belong to the same vocabulary Σ
All syntax is irrelevant (e.g. word order)
All document structure is irrelevant
All meta-information is irrelevant (e.g. author, source, genre)
=> Words suffice for relevance assessment

Information Retrieval Assumption (2)

Information Retrieval Assumption (3)

Retrieval by shared words:
If Q and Dj share some wi, then Relevant(Q, Dj)
If Q and Dj share all wi, then Relevant(Q, Dj)
If Q and Dj share over K% of wi, then Relevant(Q, Dj)

Boolean Queries (1)

Industrial use of Silver
Q: silver
R: "The Count's silver anniversary..."
   "Even the crash of '87 had a silver lining..."
   "The Lone Ranger lived on in syndication..."
   "Silver dropped to a new low in London..."
   ...
Q: silver AND photography
R: "Posters of Tonto and the Lone Ranger..."
   "The Queen's Silver Anniversary photos..."
   ...

Boolean Queries (2)

Q: (silver AND (NOT anniversary)
           AND (NOT lining)
           AND emulsion)
   OR (AgI AND crystal AND photography)
R: "Silver Iodide Crystals in Photography..."
   "The emulsion was worth its weight in silver..."
   ...

Boolean Queries (3)

Boolean queries are:
a) easy to implement
b) confusing to compose
c) seldom used (except by librarians)
d) prone to low recall
e) all of the above

Beyond the Boolean Boondoggle (1)

Desiderata (1)
• Query must be natural for all users
• Sentence, phrase, or word(s)
• No AND's, OR's, NOT's, ...
• No parentheses (no structure)
• System focuses on important words
• Q: I want laser printers now

Beyond the Boolean Boondoggle (2)

Desiderata (2)
• Find what I mean, not just what I say
Q: cheap car insurance
(pAND
  (pOR "cheap" [1.0] "inexpensive" [0.9] "discount" [0.5])
  (pOR "car" [1.0] "auto" [0.8] "automobile" [0.9] "vehicle" [0.5])
  (pOR "insurance" [1.0] "policy" [0.3]))

The Vector Space Model (1)

Let Σ = [w1, w2, ... wn]
Let Dj = [c(w1, Dj), c(w2, Dj), ... c(wn, Dj)]
Let Q = [c(w1, Q), c(w2, Q), ... c(wn, Q)]

The Vector Space Model (2)

Initial Definition of Similarity:
SI(Q, Dj) = Q . Dj
Normalized Definition of Similarity:
SN(Q, Dj) = (Q . Dj) / (|Q| x |Dj|) = cos(Q, Dj)

The Vector Space Model (3)

Relevance Ranking:
If SN(Q, Di) > SN(Q, Dj), then Di is more relevant than Dj to Q
Retrieve(k, Q, {Dj}) = Argmax_k [cos(Q, Dj)], Dj in {Dj}

Refinements to VSM (2)

Stop-Word Elimination
• Discard articles, auxiliaries, prepositions, ...: typically the 100-300 most frequent small words
• Reduces document length by 30-40%
• Retrieval accuracy improves slightly (5-10%)

Refinements to VSM (3)

Proximity Phrases
• E.g.: "air force" => air_force
• Found by high mutual information:
  p(w1 w2) >> p(w1)p(w2)
  p(w1 & w2 in k-window) >> p(w1 in k-window) p(w2 in same k-window)
• Retrieval accuracy improves slightly (5-10%)
• Too many phrases => inefficiency

Refinements to VSM (4)

Words => Terms
• term = word | stemmed word | phrase
• Use exactly the same VSM method on terms (vs. words)

Evaluating Information Retrieval (1)

Contingency table:

                relevant    not-relevant
retrieved          a             b
not retrieved      c             d

precision = |retrieved & relevant| / |all retrieved docs|
recall    = |retrieved & relevant| / |all relevant docs|

Evaluating Information Retrieval (2)

P = a/(a+b)    R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R (false negative rate)
F/A = b/(a+b+c+d) (false positive rate)

Query Expansion (1)

Observations:
• Longer queries often yield better results
• User's vocabulary may differ from document vocabulary
  Q: how to avoid heart disease
  D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
• Longer queries give more chances to match document vocabulary, helping recall

Query Expansion (2)

Bridging the Gap
• Human query expansion (user or expert)
• Thesaurus-based expansion
  - Seldom works in practice (unfocused)
• Relevance feedback
  - Widens a thin bridge over the vocabulary gap
  - Adds words from document space to the query
• Pseudo-relevance feedback
• Local context analysis

Relevance Feedback

Rocchio Formula
Q' = F[Q, D_ret]
F = weighted vector sum, such as:
W(t, Q') = αW(t, Q) + βW(t, D_rel) - γW(t, D_irr)

Term Weighting Methods (1)

Salton's Tf*IDf
Tf = term frequency in a document
Df = document frequency of the term = # of documents in which the term appears
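The bag-of-words vector space model and cosine ranking described above can be sketched in a few lines of Python. The tokenizer, function names, and toy documents below are illustrative, not from the lecture:

```python
import math
from collections import Counter

def bag_of_words(text):
    # "Bag of words": lowercase tokens; word order and structure are discarded.
    return Counter(text.lower().split())

def cosine(q, d):
    # SN(Q, Dj) = (Q . Dj) / (|Q| x |Dj|)
    dot = sum(q[t] * d.get(t, 0) for t in q)
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def retrieve(k, query, docs):
    # Retrieve(k, Q, {Dj}) = top-k documents ranked by cos(Q, Dj).
    q = bag_of_words(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]
```

Note that this sketch exhibits exactly the weakness the slides raise: a query with no shared words scores zero against a relevant document, which is what query expansion and GVSM address.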
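The evaluation metrics follow directly from the four contingency-table cells; the function below (name and dict layout are mine) just transcribes the slide's formulas:

```python
def ir_metrics(a, b, c, d):
    # a: retrieved & relevant     b: retrieved & not relevant
    # c: relevant but missed      d: correctly not retrieved
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "precision":   precision,                    # P = a/(a+b)
        "recall":      recall,                       # R = a/(a+c)
        "accuracy":    (a + d) / (a + b + c + d),    # (a+d)/total
        "f1":          2 * precision * recall / (precision + recall),
        "miss":        c / (a + c),                  # 1 - R (false negatives)
        "false_alarm": b / (a + b + c + d),          # F/A (false positives)
    }
```

For example, retrieving 10 documents of which 8 are relevant, while missing 2 relevant ones in a 100-document collection, gives P = R = 0.8 but accuracy 0.96, which is why accuracy alone is a poor IR metric.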
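The Rocchio update is vector arithmetic over term weights. A minimal sketch, assuming relevant/irrelevant document vectors are averaged before weighting; the default α, β, γ values are conventional choices, not from the lecture:

```python
from collections import Counter

def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # W(t, Q') = alpha*W(t, Q) + beta*avg W(t, D_rel) - gamma*avg W(t, D_irr)
    q_new = Counter()
    for t, w in query.items():
        q_new[t] += alpha * w
    for docs, coeff in ((relevant, beta), (irrelevant, -gamma)):
        if not docs:
            continue
        for d in docs:
            for t, w in d.items():
                q_new[t] += coeff * w / len(docs)
    # Terms driven negative are conventionally dropped from the new query.
    return {t: w for t, w in q_new.items() if w > 0}
```

The effect is exactly the "bridge over the vocabulary gap" described above: terms from judged-relevant documents (e.g. "precipitation") enter the query even though the user never typed them.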
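The Tf*IDf slide is cut off in this preview; a common form of Salton's weighting is W(t, d) = Tf(t, d) * log(N / Df(t)), and the sketch below uses that standard form, which should be checked against the full handout:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document,
    # using W(t, d) = Tf(t, d) * log(N / Df(t))  (a standard Tf*IDf variant).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))       # Df(t): number of documents containing t
    weighted = []
    for doc in docs:
        tf = Counter(doc)         # Tf(t, d): term frequency within the document
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted
```

A term occurring in every document gets weight 0 (log 1), so ubiquitous words are automatically discounted, complementing the stop-word elimination described earlier.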



CMU CS 15381 - Handouts
