Models for IR
Adapted from lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford), and Christopher Manning (Stanford)

Outline
•Retrieval: Ad hoc vs Filtering
•Specifying an IR Model
•Automation / Machine Learning as Interpolation / Extrapolation
•Boolean Model
•Vector Model: Documents as vectors, Intuition, Desiderata for proximity, First cut, Cosine similarity, Example, Queries in the vector space model, Triangle Inequality, Summary: What's the point of using vector spaces?
•Digression: terminology

Introduction
[Figure: the user's information need is abstracted into a query, the document database is abstracted into index terms, and matching the two produces a ranked list of documents.]

Introduction
•Premise: The semantics of documents and of the user's information need can be expressed naturally through sets of index terms.
  •Unfortunately, matching at the index-term level is, in general, quite imprecise.
•Critical issue: Ranking
  •An ordering of the retrieved documents that (hopefully) reflects their relevance to the query.

•The fundamental premises regarding relevance determine an IR model:
  •common sets of index terms
  •sharing of weighted terms
  •likelihood of relevance
•The IR model (boolean, vector, probabilistic, etc.), the logical view of the documents (full text, index terms, etc.), and the user task (retrieval, browsing, etc.) are orthogonal aspects of an IR system.

Retrieval: Ad Hoc vs Filtering
•Ad hoc retrieval: the collection is of "fixed size" while queries Q1, Q2, Q3, Q4, Q5 vary.
[Figure: several distinct queries posed against a single fixed collection.]

Retrieval: Ad Hoc vs Filtering
•Filtering: a stream of documents is matched against standing user profiles.
[Figure: a document stream filtered against the profiles of User 1 and User 2, yielding the documents delivered to each user.]

Retrieval: Ad Hoc vs Filtering
•Ad hoc retrieval:
  •The document collection is relatively static while the queries vary.
  •Ranking determines relevance to the user's information need.
  •Cf. the string-matching problem where the text is given and the pattern to be searched varies.
    •E.g., use indexing techniques, suffix trees, etc.
•Filtering:
  •The queries (profiles) are relatively static while new documents are added to the collection.
  •A user profile is constructed to reflect the user's preferences.
  •Cf. the string-matching problem where the pattern is given and the text varies.
    •E.g., use automata-based techniques (a toy sketch contrasting the two modes follows below).
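To make the ad hoc vs filtering contrast concrete, here is a minimal Python sketch (my own illustration, not code from the lecture; the function names score, ad_hoc, and filtering and the toy documents are all made up). The term-overlap score is a deliberately crude stand-in for a real ranking function R(q, d); the point is only that ad hoc retrieval ranks a fixed collection for each new query, while filtering applies a fixed profile to each new document in a stream.

# Minimal illustration of ad hoc retrieval vs. filtering.
# The overlap score is a stand-in for a real ranking function R(q, d).

def score(query_terms, doc_terms):
    """Toy relevance score: number of shared index terms."""
    return len(set(query_terms) & set(doc_terms))

# Ad hoc retrieval: the collection is fixed; queries vary.
collection = {
    "d1": ["information", "retrieval", "ranking"],
    "d2": ["boolean", "model", "set", "theory"],
    "d3": ["vector", "model", "cosine", "ranking"],
}

def ad_hoc(query_terms, collection, k=2):
    """Return the top-k documents of a fixed collection for one query."""
    ranked = sorted(collection.items(),
                    key=lambda item: score(query_terms, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Filtering: the profile (standing query) is fixed; documents stream in.
def filtering(profile_terms, doc_stream, threshold=1):
    """Yield the ids of streamed documents that match the profile."""
    for doc_id, terms in doc_stream:
        if score(profile_terms, terms) >= threshold:
            yield doc_id

print(ad_hoc(["ranking", "model"], collection))       # ['d3', 'd1'] (d1 ties d2; ties keep collection order)
stream = [("n1", ["cosine", "similarity"]), ("n2", ["sports"])]
print(list(filtering(["cosine", "vector"], stream)))  # ['n1']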
Specifying an IR Model
•Structure: a quadruple [D, Q, F, R(qi, dj)]
  •D = representation of the documents
  •Q = representation of the queries
  •F = framework for modeling the representations and their relationships
    •a standard language/algebra/implementation type whose translation provides the semantics
    •evaluation w.r.t. "direct" semantics through benchmarks
  •R(qi, dj) = ranking function that associates a real number with each query-doc pair

Automation / Machine Learning as Interpolation / Extrapolation
[Figure: training examples from a benchmark determine, for a new query, the behavior of the general (learned) algorithm.]

Classic IR Models - Basic Concepts
•Each document is represented by a set of representative keywords, or index terms.
  •Index terms are meant to capture the document's main themes or semantics.
  •Usually, index terms are nouns, because nouns have meaning by themselves.
  •However, search engines assume that all words are index terms (full-text representation).

Classic IR Models - Basic Concepts
•Not all terms are equally useful for representing a document's content.
•Let ki be an index term, dj be a document, and wij be the weight associated with the pair (ki, dj).
•The weight wij quantifies the importance of the index term for describing the document's content.

Notations/Conventions
•ki is an index term
•dj is a document
•t is the total number of terms
•K = (k1, k2, …, kt) is the set of all index terms
•wij >= 0 is the weight associated with (ki, dj); wij = 0 if the term does not occur in the document
•vec(dj) = (w1j, w2j, …, wtj) is the weight vector associated with document dj
•gi(vec(dj)) = wij is the function that returns the weight associated with the pair (ki, dj)

Boolean Model

The Boolean Model
•A simple model based on set theory.
•Queries and documents are specified as boolean expressions with precise semantics.
  •E.g., q = ka ∧ (kb ∨ ¬kc)
•Terms are either present or absent; thus wij ∈ {0, 1}.

Example
•q = ka ∧ (kb ∨ ¬kc)
•vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  »Disjunctive normal form over (ka, kb, kc)
•vec(qcc) = (1,1,0)
  »A conjunctive component
•Similar/matching documents:
  •md1 = [ka ka d e] => (1,0,0)
  •md2 = [ka kb kc] => (1,1,1)
•Unmatched documents:
  •ud1 = [ka kc] => (1,0,1)
  •ud2 = [d] => (0,0,0)

Similarity/Matching function
•sim(q, dj) = 1 if vec(dj) ∈ vec(qdnf), and 0 otherwise
  »Requires coercion for accuracy (see the sketch below).
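The Boolean matching above can be written out directly. The sketch below is my own illustration, not code from the slides; the names QUERY_TERMS, Q_DNF, to_vector, and sim are assumptions chosen to mirror the notation. It encodes the DNF of q = ka ∧ (kb ∨ ¬kc) as binary vectors over (ka, kb, kc), coerces each document onto those three terms, and reproduces the md/ud outcomes of the Example slide.

# Boolean model: a document matches q iff its binary vector over the
# query terms equals one of the conjunctive components of q's DNF.

QUERY_TERMS = ("ka", "kb", "kc")

# DNF of q = ka AND (kb OR NOT kc), as binary weight vectors over (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def to_vector(doc_terms):
    """Binary weights of the query terms in the document (the coercion step)."""
    terms = set(doc_terms)
    return tuple(1 if k in terms else 0 for k in QUERY_TERMS)

def sim(doc_terms):
    """sim(q, dj) = 1 if vec(dj) is one of the components of vec(qdnf), else 0."""
    return 1 if to_vector(doc_terms) in Q_DNF else 0

docs = {
    "md1": ["ka", "ka", "d", "e"],   # -> (1,0,0), matches
    "md2": ["ka", "kb", "kc"],       # -> (1,1,1), matches
    "ud1": ["ka", "kc"],             # -> (1,0,1), no match
    "ud2": ["d"],                    # -> (0,0,0), no match
}
for name, terms in docs.items():
    print(name, to_vector(terms), sim(terms))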
Venn Diagram
•q = ka ∧ (kb ∨ ¬kc)
[Figure: Venn diagram over the sets Ka, Kb, Kc, shading the regions (1,1,1), (1,1,0), and (1,0,0) covered by the query.]

Drawbacks of the Boolean Model
•The expressive power of boolean expressions for capturing the information need and the document semantics is inadequate.
•Retrieval based on a binary decision criterion (with no partial matching) does not adequately reflect our intuitions about relevance.
•As a result:
  •The answer set contains either too few or too many documents in response to a user query.
  •There is no ranking of the documents.

Vector Model

Documents as vectors
•Not all index terms are equally useful in representing document content.
•Each document dj can be viewed as a vector of non-boolean weights, one component per term.
  •Terms are the axes of the vector space.
  •Documents are points in this vector space.
  •Even with stemming, the vector space may have 20,000+ dimensions.

Intuition
•Postulate: Documents that are "close together" in the vector space talk about the same things.
[Figure: documents d1 through d5 plotted in a space with term axes t1, t2, t3; the angles θ and φ between document vectors indicate how close they are.]

Desiderata for proximity
•If d1 is near d2, then d2 is near d1.
•If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
•No document is closer to d than d itself.

First cut
•Idea: the distance between d1 and d2 is the length of the vector |d1 - d2| (Euclidean distance).
•Why is this not a great idea?
  •We still haven't dealt with the issue of length normalization: short documents would be more similar to each other by virtue of length, not topic.
•However, we can implicitly normalize by looking at angles instead (a small numeric sketch follows below).
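Per the outline, the next slides introduce cosine similarity; the following small sketch is an assumed illustration, not lecture code, and the weight vectors d1, d2, d3 are made up. It only shows numerically why angles, unlike raw Euclidean distance, ignore document length: a document and a copy with all weights doubled are at a nonzero Euclidean distance but at angle zero (cosine 1).

import math

def euclidean(u, v):
    """Euclidean distance |u - v| between two term-weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between u and v (length-normalized similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

d1 = (2.0, 1.0, 0.0)        # term weights of a short document
d2 = (4.0, 2.0, 0.0)        # the "same" document with every weight doubled
d3 = (0.0, 1.0, 3.0)        # a document about something else

print(euclidean(d1, d2))    # about 2.24: distance says d1 and d2 differ
print(cosine(d1, d2))       # about 1.0:  the angle says they have the same topic
print(cosine(d1, d3))       # about 0.14: a genuinely different topic scores low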

