Introduction to Information Retrieval CS 101a Fall 2007 James Pustejovsky Different Motivations Modes in IR The User Task Browsing Retrieval e g Find a procedure for installing an Epson wireless printer Find a place to stay in St John in USVI Find info on cycling in Boston Find research papers on lexical semantics of Korean Causative Verbs Basic IR system Information need or query IR system Documents Documents or references to documents which are hopefully relevant Data Retrieval vs Info Retrieval Matching DR IR Exact Partial best Items wanted Matching Relevant Queries Precise Imprecise Information Data numeric Nat Lang A Formal Characterization of IR Models An IR model is a quadruple D Q F R qi dj Where D is a set of logical views of documents Q is a set of logical views of queries F is a framework for modeling documents queries and their relationships R qi dj is a ranking function which rates document dj according to query qi Indexing in our model of IR Information Problem Representation search formulation Query qi Comparison Information Objects Representation indexing Surrogate D Full text Index terms document accents spacing etc structure text structure recognition structure stopwords noun groups stemming automatic or manual indexing text full text index terms Automatic Indexing Choose from the terms in a document those which are most indicative of its content contrast with full text retrieval For non Boolean retrieval include weights with terms more later Normalizing terms Should numbers units mph etc be included Should traffic and Traffic be one term Should compute computer computation computerization be all one term Stemming is the process of removing suffixes so that these are all mapped to comput Term weighting Zipf s Law If the words w in a collection are ranked r w by their frequency f w they roughly fit the relation r w f w c This suggests that some terms are more effective than others in retrieval In particular relative frequency is a useful measure that identifies terms that occur with substantial frequency in some documents but with relatively low overall collection frequency Term weights are functions that are used to quantify these concepts Word frequency characteristics 1200 1000 800 600 400 200 28 25 22 19 16 13 10 7 4 1 0 Zipf s Law rank frequency constant Term Frequency Concept A term that appears many times within a document is likely to be more important than a term that appears only once Statistical Indexing Basis Frequent words are important content representation words except content free function words like the and or but of in it he middle frequency words are the best for indexing documents Why Basic Indexing Strategy 1 2 3 4 list the unique words in the documents remove stopwords about 250 for English stem remaining words improves recall assign as index terms either A B C D all resulting terms or all but very rare terms they won t retrieve much terms that are most frequent in the doc terms weighted highly by other measures Notes There are standard stopword lists for English 4A and 4B don t give term weights Removing rare words 4B was an old idea but it reduces exhaustivity and can affect recall and precision It may be acceptable to remove words whose total frequency is 1 Surrogate Query Comparison Information Problem Representation search formulation Query qi Comparison Information Objects Representation indexing Surrogate D IR Models Set Theoretic Models Boolean Fuzzy Extended Boolean Vector Models Algebraic Probabilistic Models probabilistic Others e g neural networks Traditional IR Design Boolean Model for IR Based on Boolean Logic Algebra of Sets Fundamental principles established by George Boole in the 1850 s Deals with set membership and operations on sets Set membership in IR systems is usually based on whether or not a document contains a keyword term Query Languages A way to express the query formal expression of the information need Types Boolean Natural Language Stylized Natural Language Form Based GUI Simple query language Boolean Terms Connectors terms words normalized stemmed words phrases thesaurus terms connectors AND OR NOT Boolean Queries Cat Cat OR Dog Cat AND Dog Cat AND Dog Cat AND Dog OR Collar Cat AND Dog OR Collar AND Leash Cat OR Dog AND Collar OR Leash Boolean Queries Cat OR Dog AND Collar OR Leash Each of the following combinations satisfies this statement Cat x Dog Collar x Leash x x x x x x x x x x x x x x x Boolean Queries Cat OR Dog AND Collar OR Leash None of the following combinations work Cat Dog Collar Leash x x x x x x x x Boolean Queries Usually expressed as INFIX operators in IR a AND b OR c AND b NOT is UNARY PREFIX operator a AND b OR c AND NOT b AND and OR can be n ary operators a AND b AND c AND d Some rules De Morgan revisited NOT a AND NOT b NOT a OR b NOT a OR NOT b NOT a AND b NOT NOT a a Boolean Searching Cracks Formal Query cracks AND beams AND Width measurement AND Prestressed concrete Width measurement Beams Prestressed concrete Relaxed Query Relaxed Query C AND B AND P OR C AND B AND P OR C AND B AND W OR C AND B AND W OR C AND W AND P OR C AND W AND P OR B AND W AND P B AND W AND P Boolean Logic t1 D9 m3 D11 m6 D4 D5 D3 D10 m8 D2 D1 m5 m2 m1 D6 m4 m7 D8 D7 t3 t2 m1 t1 t2 t3 m2 t1 t2 t3 m3 t1 t2 t3 m4 t1 t2 t3 m5 t1 t2 t3 m6 t1 t2 t3 m7 t1 t2 t3 m8 t1 t2 t3 Precedence Ordering In what order do we evaluate the components of the Boolean expression Parenthesis get done first a or b and c or d a or b and c or d Usually start from the left and work right in case of ties Usually if there are no parentheses NOT before AND AND before OR Pseudo Boolean Queries A new notation from web search cat dog collar leash These are prefix operators Does not mean the same thing as AND OR means mandatory must be in document means cannot be in the document Phrases stray cat AND frayed collar is equivalent to stray cat frayed collar Result Sets Run a query get a result set Two choices Reformulate query run on entire collection Reformulate query run on result set Example Dialog query Redford AND Newman S1 1450 documents S1 AND Sundance S2 898 documents Faceted Boolean Query Strategy break query into facets conjunction of disjunctions a1 OR a2 OR a3 b1 OR b2 c1 OR c2 OR c3 OR c4 AND each facet expresses a topic rain forest OR jungle OR amazon medicine OR remedy OR cure Smith OR Zhou AND Ordering of Retrieved Documents Pure Boolean has no ordering In practice order chronologically order by total number of hits on query terms
View Full Document
Unlocking...