CORNELL CS 630 - Information Retrieval: Retrieval Models

CS630 Representing and Accessing Digital Information
Information Retrieval: Retrieval Models
Thorsten Joachims, Cornell University
Based on slides from Jamie Callan and Claire Cardie

Information Retrieval
• Basics
• Data Structures and Access
• Indexing and Preprocessing
• Retrieval Models

Basic IR Processes
[Diagram: an information need is represented as a query and documents are represented as indexed objects; the two representations are compared to produce retrieved objects, and evaluation/feedback closes the loop.]

What is a Retrieval Model?
• A model is an abstract representation of a process
  – Used to study properties, draw conclusions, make predictions
  – Quality of the conclusions depends on how closely the model represents reality
• A retrieval model describes the human and computational processes involved in ad-hoc retrieval
  – Example: models human information-seeking behavior
  – Example: models how documents are ranked computationally
  – Components: users, information needs, queries, documents, relevance assessments, …
  – Retrieval models have a notion of relevance, explicitly or implicitly

Major Retrieval Models
• Boolean
• Vector space
• Citation analysis models
• Usage analysis models (later in semester)
• Probabilistic models (partially covered in text classification)

Types of Retrieval Models: Exact Match vs. Best Match Retrieval
• Exact match
  – Query specifies precise retrieval criteria
  – Every document either matches or fails to match the query
  – Result is a set of documents
    • Usually in no particular order w.r.t. relevance
    • Often in reverse-chronological order
• Best match
  – Query describes retrieval criteria for desired documents
  – Every document matches the query to some degree
  – Result is a ranked list of documents, "best" first

Overview
• Boolean (exact match)
• Vector space (best match)
  – Basic vector space
  – Extended Boolean model
  – Latent semantic indexing (LSI)
• Citation analysis models (best match)
  – Hubs & authorities
  – PageRank
• Usage analysis models (best match)
  – Direct Hit
  – Ranking SVM
• Probabilistic models (best match)
  – Basic probabilistic model
  – Bayesian inference networks
  – Language models
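To make the set-versus-ranked-list distinction concrete, here is a minimal Python sketch over a toy in-memory collection. The docs dictionary, the helper names, and the term-overlap score are assumptions of this sketch, not something given in the slides.

    # Toy collection; an exact-match model returns a set, a best-match model a ranking.
    docs = {
        1: "machine learning for text retrieval",
        2: "boolean retrieval of legal documents",
        3: "ranked retrieval with term weights",
    }

    def exact_match(query_terms, docs):
        """Return the set of documents containing every query term (Boolean AND)."""
        return {d for d, text in docs.items()
                if all(t in text.split() for t in query_terms)}

    def best_match(query_terms, docs):
        """Return a ranked list of documents, scored here by simple term overlap."""
        scores = {d: sum(t in text.split() for t in query_terms)
                  for d, text in docs.items()}
        return sorted((d for d, s in scores.items() if s > 0),
                      key=lambda d: scores[d], reverse=True)

    print(exact_match(["retrieval", "machine"], docs))  # {1} -- an unordered set
    print(best_match(["retrieval", "machine"], docs))   # [1, 2, 3] -- best first (2 and 3 tie)

Note that the exact-match result drops documents 2 and 3 entirely because they miss one query term, while the best-match result still returns them, just lower in the ranking.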
Exact Match vs. Best Match Retrieval
• Best-match models are usually more accurate/effective
  – Good documents appear at the top of the rankings
  – Good documents often don't exactly match the query
    • Query may be too strict
    • Document didn't match user expectations
• Exact match is still prevalent in some markets
  – Installed base
  – Efficient
  – Sufficient for some tasks
  – Web "advanced search"

Unranked Boolean Retrieval Model
• Most common exact match model
• Model
  – Retrieve documents iff they satisfy a Boolean expression
    • Query specifies precise relevance criteria
  – Documents returned in no particular order
• Operators
  – Logical operators: AND, OR, AND-NOT (BUT)
  – Distance operators: near, sentence, paragraph, …
  – String matching operators: wildcard
  – Field operators: date, author, title
• The unranked Boolean model is not the same as Boolean queries

Example: Boolean Query
((((professional OR elite) NEAR/1 competitive NEAR/1 eating) OR (competit* NEAR/1 eat*)) AND (FIELD date 7/4/2002) AND-NOT (weight NEAR/1 loss))
• Studies show that people are not good at creating Boolean queries
  – People overestimate the quality of the queries they create
  – Queries are too strict: few relevant documents found
  – Queries are too loose: too many documents found (but few relevant)

Implementation Details
• Query subtrees can be evaluated in parallel
  – Use multiple processes
  – Reduce I/O wait time
• Query optimization is very important
  – Order query terms by term frequency
  – "Fail early" for intersection operators such as AND and proximity
  – Example: computer (6%) AND diagnosis (2%) AND medicine (2%) AND disease (2%)

Boolean Query Optimization
• Goal: lower the average cost of evaluating a query
• Example: computer (6%) AND (diagnosis (6%) OR medicine (8%) OR disease (8%))
[Diagram: the query tree has an AND node whose children are COMPUTER (6%) and an OR node over DIAGNOSIS (6%), MEDICINE (8%), and DISEASE (8%); the frequencies of the AND and OR nodes are unknown (?%).]
• A sketch of this frequency-ordered, fail-early evaluation appears after the Boolean summary below.

Unranked Boolean: WESTLAW
• Large commercial system
• Serves legal and professional markets
  – Legal: court cases, statutes, regulations, …
  – Public records
  – News: newspapers, magazines, journals, …
  – Financial: stock quotes, SEC materials, financial analyses
• Total collection size: 5-7 terabytes
• 700,000 users
• In operation since 1974
• Best-match and free-text queries added in 1992

Unranked Boolean: WESTLAW
• Boolean operators
• Proximity operators
  – Phrases: "Cornell University"
  – Word proximity: language /3 technology
  – Same sentence (/s) or paragraph (/p): Kobayashi /s "hot dog"
• Restrictions: Date (After 1990 & Before 2002)
• Query expansion
  – Wildcard: K*ashi
  – Automatic expansion of plurals and possessives
• Document structure (fields): Title
• Citations: Cites (Salton) & Date (After 1998)

Unranked Boolean: WESTLAW
• Queries are typically developed incrementally
  – Implicit relevance feedback
    V1: machine AND learning
    V2: (machine AND learning) OR (neural AND networks) OR (decision AND tree)
    V3: (machine AND learning) OR (neural AND networks) OR (decision AND tree) AND (C4.5 OR Ripper OR EM)
• Queries are complex
  – Proximity operators used often
  – NOT is rare
• Queries are long (9-10 words, on average)

Unranked Boolean: Summary
• Advantages
  – Very efficient
  – Predictable, easy to explain
  – Structured queries
  – Works well when the searcher knows exactly what is wanted
• Disadvantages
  – Difficult to create good Boolean queries
    • Difficulty increases with the size of the collection
  – Precision and recall usually have a strong inverse correlation
  – Predictability of results causes people to overestimate recall
  – Documents that are "close" are not retrieved
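The slides above evaluate Boolean queries over an inverted index and order AND operands by term frequency so that an intersection can fail early. The following is a minimal Python sketch of that idea, assuming set-valued postings lists; the postings dictionary and its document frequencies are made up for illustration.

    # Minimal sketch of unranked Boolean evaluation with the
    # "rarest terms first / fail early" optimization described above.
    # The postings dictionary (term -> set of doc ids) is a made-up toy example.
    postings = {
        "computer":  {1, 2, 4, 7, 9, 12},   # relatively frequent term
        "diagnosis": {2, 7},                # rare term
        "medicine":  {2, 5, 7},
        "disease":   {2, 7, 8},
    }

    def boolean_and(terms, postings):
        """Intersect postings lists, rarest first, stopping as soon as the result is empty."""
        ordered = sorted(terms, key=lambda t: len(postings.get(t, set())))
        result = set(postings.get(ordered[0], set()))
        for term in ordered[1:]:
            result &= postings.get(term, set())
            if not result:      # fail early: no document can satisfy the AND
                break
        return result           # an unordered set, as in the unranked Boolean model

    def boolean_or(terms, postings):
        """Union of postings lists (every matching document is kept)."""
        result = set()
        for term in terms:
            result |= postings.get(term, set())
        return result

    # computer AND diagnosis AND medicine AND disease
    print(boolean_and(["computer", "diagnosis", "medicine", "disease"], postings))  # {2, 7}

    # computer AND (diagnosis OR medicine OR disease)
    sub = boolean_or(["diagnosis", "medicine", "disease"], postings)
    print(postings["computer"] & sub)                                               # {2, 7}

Sorting the AND operands by document frequency lets the rare terms drive the intersection, so the long postings list of the frequent term is touched as little as possible, and the loop can stop the moment the running result becomes empty.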
Term Weights: A Brief Introduction
• The words of a text are not equally indicative of its meaning
  "Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north. Scientists think that butterflies may use other cues, such as the earth's magnetic field, but we have a lot to learn about monarchs' sense of direction."
• Important: butterflies, monarchs, scientists, direction, compass
• Unimportant: most, think, kind, sky, determine, cues, …
• Term weights reflect the (estimated) importance of each term
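The slides stop at the idea that term weights should reflect a term's estimated importance. One standard way to make this concrete, which anticipates the vector-space material, is tf-idf weighting: weight a term by its frequency in the document, discounted by how many documents contain it. The sketch below is illustrative only; the toy corpus and the exact formula (raw tf times log(N/df)) are assumptions, not a scheme taken from these slides.

    import math
    from collections import Counter

    # Toy corpus: topical words such as "butterflies" or "compass" should receive
    # higher weights than words that occur everywhere, such as "think" or "the".
    corpus = [
        "scientists think that butterflies use the sun as a compass",
        "most people think the weather will determine the harvest",
        "some scientists think monarch butterflies use the magnetic field for direction",
    ]

    def tf_idf_weights(doc, corpus):
        """Weight each term of one document by tf(t, d) * log(N / df(t))."""
        n_docs = len(corpus)
        tf = Counter(doc.split())
        weights = {}
        for term, freq in tf.items():
            df = sum(term in d.split() for d in corpus)  # document frequency of the term
            weights[term] = freq * math.log(n_docs / df)
        return weights

    w = tf_idf_weights(corpus[0], corpus)
    print(round(w["compass"], 2))      # 1.1  -- occurs in only one document
    print(round(w["butterflies"], 2))  # 0.41 -- occurs in two of the three documents
    print(round(w["think"], 2))        # 0.0  -- occurs in every document, so it carries no weight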

