Brandeis CS 101A - Introduction to Information Retrieval


Introduction to Information Retrieval
CS 101a
Fall 2007
James Pustejovsky

Different Motivations & Modes in IR
• The User Task
  – Browsing
  – Retrieval
• e.g.:
  – Find a procedure for installing an Epson wireless printer
  – Find a place to stay in St. John in USVI
  – Find info on cycling in Boston
  – Find research papers on lexical semantics of Korean Causative Verbs

Basic IR system
[Diagram labels: Information need or query; Documents; IR system; Documents or references to documents which are (hopefully) relevant]

Data Retrieval vs Info Retrieval
                DR               IR
Matching        Exact            Partial, best
Items wanted    Matching         Relevant
Queries         Precise          Imprecise
Information     Data, numeric    Nat. Lang.

A Formal Characterization of IR Models
• An IR model is a quadruple
  – [D, Q, F, R(qi,dj)]
  – Where
    • D is a set of logical views of documents
    • Q is a set of logical views of queries
    • F is a framework for modeling documents, queries and their relationships
    • R(qi,dj) is a ranking function which rates document dj according to query qi

Indexing in our model of IR
[Diagram labels: Information Problem; Representation - search formulation; Query (qi); Information Objects; Representation - indexing; Surrogate (D); Comparison]
[Diagram labels, from full text to index terms: document; structure recognition; accents, spacing etc.; stopwords; noun groups; stemming; automatic or manual indexing; structure; full text; index terms; text + structure; text]

Automatic Indexing
• Choose from the terms in a document those which are most indicative of its content.
  – contrast with full-text retrieval
• For non-Boolean retrieval, include weights with terms (more later).

Normalizing terms
• Should numbers, units (“mph”), etc. be included?
• Should “traffic” and “Traffic” be one term?
• Should “compute”, “computer”, “computation”, “computerization” be all one term?
  – Stemming is the process of removing suffixes so that these are all mapped to “comput”

Term weighting
Zipf's Law: If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation: r(w) * f(w) = c
This suggests that some terms are more effective than others in retrieval.
In particular relative frequency is a useful measure that identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency.
Term weights are functions that are used to quantify these concepts.

Word frequency characteristics
Zipf’s Law: rank * frequency ≈ constant
[Plot: vertical axis 0 to 1200, horizontal axis 1 to 28; axis labels not recoverable]

Term Frequency Concept
A term that appears many times within a document is likely to be more important than a term that appears only once.

Statistical Indexing - Basis
• Frequent words are important content representation words.
• except content-free “function” words like the, and, or, but, of, in, it, he, …
• middle-frequency words are the best for indexing documents. (Why?)

Basic Indexing Strategy
1 list the unique words in the documents
2 remove stopwords (about 250 for English)
3 stem remaining words (improves recall)
4 assign as index terms either
  A - all resulting terms or
  B - all but very rare terms (they won’t retrieve much)
  C - terms that are most frequent in the doc.
  D - terms weighted highly by other measures

Notes
• There are standard stopword lists for English.
• 4A and 4B don’t give term weights.
• Removing rare words (4B) was an old idea
  – but it reduces exhaustivity and can affect recall and precision.
  – It may be acceptable to remove words whose total frequency is 1.

Surrogate/Query Comparison
[Diagram labels: Information Problem; Representation - search formulation; Query (qi); Information Objects; Representation - indexing; Surrogate (D); Comparison]

IR Models
• Set Theoretic Models
  – Boolean
  – Fuzzy
  – Extended Boolean
• Vector Models (Algebraic)
• Probabilistic Models (probabilistic)
• Others (e.g., neural networks)

Traditional IR Design

Boolean Model for IR
• Based on Boolean Logic (Algebra of Sets).
• Fundamental principles established by George Boole in the 1850’s
• Deals with set membership and operations on sets
• Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)

Query Languages
• A way to express the query (formal expression of the information need)
• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)

Simple query language: Boolean
• Terms + Connectors
  – terms
    • words
    • normalized (stemmed) words
    • phrases
    • thesaurus terms
  – connectors
    • AND
    • OR
    • NOT

Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)

Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations satisfies this statement:
    [column positions of the x marks not recoverable]
    • Cat x x x x
    • Dog x x x x x
    • Collar x x x x
    • Leash x x x x

Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:
    [column positions of the x marks not recoverable]
    • Cat x x
    • Dog x x
    • Collar x x
    • Leash x x

Boolean Queries
– Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  • (a AND b AND c AND d)
– Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a

Boolean Searching
Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete
[Venn diagram labels: Cracks; Beams; Width measurement; Prestressed concrete]
Relaxed Query:
(C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

Boolean Logic
[Venn diagram: three term sets t1, t2, t3 containing documents D1 to D11 and divided into eight regions m1 to m8. Each region is labeled as a combination of t1, t2, t3; the marks distinguishing the eight combinations were lost in extraction.]

Precedence Ordering
• In what order do we evaluate the components of the Boolean expression?
  – Parentheses get done first
    • (a or b) and (c or d)
    • (a or (b and c) or d)
  – Usually start from the left and work right (in case of ties)
  – Usually (if there are no parentheses)
    • NOT before AND
    • AND before OR

Pseudo-Boolean Queries
• A new notation, from web search
  +cat dog +collar leash
  These are prefix operators
• Does not mean the same thing as AND/OR!
  + means “mandatory, must be in document”
  - means “cannot be in the document”
• Phrases:
  “stray cat” AND “frayed collar”
  is equivalent to
  +“stray cat” +“frayed collar”

Result Sets
• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire
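
The Boolean model described in the slides treats each term as the set of documents that contain it, so AND, OR and NOT become set intersection, union and complement. The sketch below illustrates this on a toy collection. The document IDs and texts, the helper names (build_index, term, and_, or_, not_, pseudo_boolean) and the handling of unmarked terms in pseudo-Boolean queries are all assumptions made for illustration; they do not come from the course material.

```python
from collections import defaultdict

# Hypothetical toy collection: IDs and texts are made up for illustration.
DOCS = {
    1: "cat collar",
    2: "dog leash",
    3: "cat dog",
    4: "collar leash",
    5: "cat dog collar leash",
}


def build_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)
ALL_DOCS = set(DOCS)


def term(t):
    """Documents containing term t (empty set if the term is unknown)."""
    return INDEX.get(t.lower(), set())


def and_(*sets):
    """n-ary AND: documents present in every operand."""
    return set.intersection(*sets)


def or_(*sets):
    """n-ary OR: documents present in at least one operand."""
    return set.union(*sets)


def not_(s):
    """NOT: complement relative to the whole collection."""
    return ALL_DOCS - s


def pseudo_boolean(query):
    """Evaluate '+term' (mandatory) and '-term' (forbidden) prefixes.

    Assumption: unmarked terms only select documents when no mandatory
    term is given; a real engine would typically use them for ranking.
    """
    mandatory, forbidden, optional = [], [], []
    for token in query.split():
        if token.startswith("+"):
            mandatory.append(term(token[1:]))
        elif token.startswith("-"):
            forbidden.append(term(token[1:]))
        else:
            optional.append(term(token))
    if mandatory:
        candidates = set.intersection(*mandatory)
    elif optional:
        candidates = set.union(*optional)
    else:
        candidates = set(ALL_DOCS)
    for s in forbidden:
        candidates = candidates - s
    return candidates


# (Cat OR Dog) AND (Collar OR Leash)
result = and_(or_(term("cat"), term("dog")), or_(term("collar"), term("leash")))
print(sorted(result))  # [1, 2, 5]

# De Morgan: NOT(a) AND NOT(b) = NOT(a OR b)
a, b = term("cat"), term("dog")
print(and_(not_(a), not_(b)) == not_(or_(a, b)))  # True

print(sorted(pseudo_boolean("+cat dog +collar leash")))  # [1, 5]
```

Documents 3 and 4 fail the first query because each satisfies only one of the two parenthesized groups, which matches the "none of the following combinations work" slide.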

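Steps 1 to 4 of the Basic Indexing Strategy slide can be sketched for a single document as follows. This is a minimal illustration under stated assumptions: the stopword set is a tiny hypothetical stand-in for the roughly 250-word English lists the slides mention, crude_stem is a made-up suffix stripper (not the Porter stemmer or any standard algorithm), and min_count approximates option 4B using within-document counts rather than collection frequency.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (real lists are much longer).
STOPWORDS = {"the", "and", "or", "but", "of", "in", "it", "he", "a", "is", "to"}

# Suffixes tried in order, longest first. Illustrative only.
SUFFIXES = ("erization", "ization", "ation", "ing", "er", "es", "e", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text, min_count=1):
    """Return {stem: count} for one document.

    Step 1: extract lowercased words (case folding merges traffic/Traffic).
    Step 2: drop stopwords.
    Step 3: stem what remains.
    Step 4: keep stems occurring at least min_count times
            (min_count=1 corresponds to option 4A, all resulting terms).
    """
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    counts = Counter(crude_stem(w) for w in kept)
    return {stem: n for stem, n in counts.items() if n >= min_count}


text = "The computer and the computation of traffic in Traffic computing"
print(index_terms(text))               # {'comput': 3, 'traffic': 2}
print(index_terms(text, min_count=3))  # {'comput': 3}
```

Keeping the counts alongside the stems is a design choice that leaves room for options 4C and 4D, which need some notion of term weight.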

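The Zipf's Law relation r(w) * f(w) = c from the Term weighting slide can be checked by ranking words by frequency and multiplying each rank by its frequency. The sketch below does this for a hand-made frequency table; the function name zipf_table and the counts are hypothetical, chosen to fit the law exactly, and are not measurements from any real collection (real data only fits roughly).

```python
from collections import Counter


def zipf_table(counts):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    rows = []
    ranked = Counter(counts).most_common()  # sorted by descending frequency
    for rank, (word, freq) in enumerate(ranked, start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# Idealized, made-up counts: frequency is exactly 1200 / rank.
counts = {"the": 1200, "of": 600, "and": 400, "in": 300}

for rank, word, freq, product in zipf_table(counts):
    print(rank, word, freq, product)
# 1 the 1200 1200
# 2 of 600 1200
# 3 and 400 1200
# 4 in 300 1200
```

On a real corpus, the same function could be fed Counter(words) for a tokenized text; the products in the last column would then vary around a constant rather than equal it.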