CS6322: Information Retrieval
Sanda Harabagiu
Lecture 9: Scoring and results assembly

Recap: tf-idf weighting (Ch. 6)
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  $$w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$$
- Best known weighting scheme in information retrieval
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection

Recap: Queries as vectors (Ch. 6)
- Key idea 1: Do the same for queries: represent them as vectors in the space
- Key idea 2: Rank documents according to their proximity to the query in this space
- proximity = similarity of vectors

Recap: cosine(query, document) (Ch. 6)
  $$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$
- The numerator $\vec{q}\cdot\vec{d}$ is the dot product; $\vec{q}/|\vec{q}|$ and $\vec{d}/|\vec{d}|$ are unit vectors
- cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

This lecture (Ch. 7)
- Speeding up vector space ranking
- Putting together a complete search system
- Will require learning about a number of miscellaneous topics and heuristics

Computing cosine scores (Sec. 6.3.3)

Efficient cosine ranking (Sec. 7.1)
- Find the K docs in the collection "nearest" to the query ⇒ K largest query-doc cosines.
- Efficient ranking:
  - Computing a single cosine efficiently.
  - Choosing the K largest cosine values efficiently.
  - Can we do this without computing all N cosines?
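
To make the recap formulas concrete, here is a minimal sketch of the tf-idf weight and the cosine score for sparse vectors. The function names, the dict-based vector representation, the use of base-10 logs for both factors, and the toy numbers are assumptions made for illustration; they are not taken from the slides.

```python
import math


def tf_idf_weight(tf, df, n_docs):
    """w_{t,d} = (1 + log10 tf) * log10(N / df).

    Assumption: a term that does not occur (tf == 0) gets weight 0.
    """
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


def cosine(q, d):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_q * norm_d)


# Toy collection statistics (assumed): N = 1000 docs.
# A rare term (df = 10) gets a high weight, a term in every doc gets 0.
w_rare = tf_idf_weight(tf=10, df=10, n_docs=1000)      # (1 + 1) * 2
w_common = tf_idf_weight(tf=10, df=1000, n_docs=1000)  # (1 + 1) * 0

query = {"catcher": 1.0, "rye": 1.0}
doc = {"rye": w_rare, "the": w_common}
score = cosine(query, doc)
```

Computing this for every document in the collection costs N cosine evaluations per query, which is the expense the rest of the lecture tries to avoid.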
Efficient cosine ranking (Sec. 7.1)
- What we're doing in effect: solving the K-nearest neighbor problem for a query vector
- In general, we do not know how to do this efficiently for high-dimensional spaces
- But it is solvable for short queries, and standard indexes support this well

Special case: unweighted queries (Sec. 7.1)
- No weighting on query terms
- Assume each query term occurs only once
- Then for ranking, don't need to normalize query vector
- Slight simplification of algorithm from Lecture 6

Faster cosine: unweighted query (Sec. 7.1)

Computing the K largest cosines: selection vs. sorting (Sec. 7.1)
- Typically we want to retrieve the top K docs (in the cosine ranking for the query), not to totally order all docs in the collection
- Can we pick off docs with K highest cosines?
- Let J = number of docs with nonzero cosines
- We seek the K best of these J

Use heap for selecting top K (Sec. 7.1)
- Binary tree in which each node's value > the values of children
- Takes 2J operations to construct, then each of K "winners" read off in 2 log J steps.
- For J=1M, K=100, this is about 10% of the cost of sorting.
[Figure: example heap with node values 1, .9, .3, .8, .3, .1, .1]

Bottlenecks (Sec. 7.1.1)
- Primary computational bottleneck in scoring: cosine computation
- Can we avoid all this computation?
- Yes, but may sometimes get it wrong: a doc not in the top K may creep into the list of K output docs
- Is this such a bad thing?

Cosine similarity is only a proxy (Sec. 7.1.1)
- User has a task and a query formulation
- Cosine matches docs to query
- Thus cosine is anyway a proxy for user happiness
- If we get a list of K docs "close" to the top K by cosine measure, should be ok
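
The unweighted-query scoring and the heap-based top-K selection from the Sec. 7.1 slides can be sketched together as follows. This is an illustrative sketch, not the algorithm figure from the slides: the function name, the `postings` and `doc_length` structures, and the toy data are all assumptions. Python's `heapq` is a min-heap, so scores are negated to make the largest score surface first.

```python
import heapq
from collections import defaultdict


def fast_cosine_top_k(query_terms, postings, doc_length, k):
    """Return the k best (doc_id, score) pairs for an unweighted query.

    postings:   {term: [(doc_id, weight_of_term_in_doc), ...]}  (assumed layout)
    doc_length: {doc_id: Euclidean length of the doc vector}    (assumed layout)
    """
    scores = defaultdict(float)
    # Each query term counts once with weight 1, so we only add doc weights.
    for term in set(query_terms):
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight
    # Normalize by document length; the query length is the same for every
    # doc, so skipping it does not change the ranking.
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]

    # Selection instead of sorting: heapify the J nonzero scores in O(J),
    # then pop the K winners at O(log J) each.
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)
    top = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        top.append((doc_id, -neg_score))
    return top


# Toy index (assumed data).
postings = {
    "catcher": [(1, 2.0), (3, 1.0)],
    "rye": [(1, 1.0), (2, 3.0)],
}
doc_length = {1: 3.0, 2: 4.0, 3: 2.0}
print(fast_cosine_top_k(["catcher", "rye"], postings, doc_length, k=2))
# [(1, 1.0), (2, 0.75)]
```

As a rough check of the 10% figure, assuming sorting costs about J log2 J operations: for J = 1M and K = 100 the heap needs about 2J + K · 2 log2 J ≈ 2,000,000 + 100 · 2 · 20 = 2,004,000 operations, against about 1,000,000 · 20 = 20,000,000 for a full sort.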
Generic approach (Sec. 7.1.1)
- Find a set A of contenders, with K < |A| << N
- A does not necessarily contain the top K, but has many docs from among the top K
- Return the top K docs in A
- Think of A as pruning non-contenders
- The same approach is also used for other (non-cosine) scoring functions
- Will look at several schemes following this approach

Index elimination (Sec. 7.1.2, Fig 7.1)
- Basic algorithm FastCosineScore of Fig 7.1 only considers docs containing at least one query term
- Take this further:
  - Only consider high-idf query terms
  - Only consider docs containing many query terms

High-idf query terms only (Sec. 7.1.2)
- For a query such as "catcher in the rye"
- Only accumulate scores from "catcher" and "rye"
- Intuition: "in" and "the" contribute little to the scores and so don't alter rank-ordering much
- Benefit: Postings of low-idf terms have many docs → these (many) docs get eliminated from set A of contenders

Docs containing many query terms (Sec. 7.1.2)
- Any doc with at least one query term is a candidate for the top K output list
- For multi-term queries, only compute scores for docs containing several of the query terms
- Say, at least 3 out of 4
- Imposes a "soft conjunction" on queries seen on web search engines (early Google)
- Easy to implement in postings traversal

3 of 4 query terms (Sec. 7.1.2)
Antony:    3 4 8 16 32 64 128
Brutus:    2 4 8 16 32 64 128
Caesar:    1 2 3 5 8 13 21 34
Calpurnia: 13 16 32
Scores only computed for docs 8, 16 and 32.
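
Both index-elimination heuristics above can be sketched in a few lines. The postings lists are the ones from the "3 of 4 query terms" slide; the function names, the document-frequency table, the collection size, and the idf threshold are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def high_idf_terms(query_terms, df, n_docs, min_idf):
    """Keep only query terms whose idf = log10(N / df) reaches min_idf."""
    return [
        t for t in query_terms
        if t in df and math.log10(n_docs / df[t]) >= min_idf
    ]


def docs_with_min_terms(query_terms, postings, min_terms):
    """Doc IDs that contain at least min_terms distinct query terms."""
    counts = Counter()
    for term in set(query_terms):
        for doc_id in set(postings.get(term, [])):
            counts[doc_id] += 1
    return sorted(doc for doc, c in counts.items() if c >= min_terms)


# Heuristic 1: drop low-idf terms (df values and N are assumed).
df = {"catcher": 100, "in": 900_000, "the": 1_000_000, "rye": 1_000}
kept = high_idf_terms(["catcher", "in", "the", "rye"], df,
                      n_docs=1_000_000, min_idf=1.0)
print(kept)  # ['catcher', 'rye']

# Heuristic 2: soft conjunction, using the postings from the slide.
postings = {
    "antony": [3, 4, 8, 16, 32, 64, 128],
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}
candidates = docs_with_min_terms(list(postings), postings, min_terms=3)
print(candidates)  # [8, 16, 32]
```

A production index would count matches during a merged traversal of the sorted postings rather than with a hash-based counter; the counter is used here only to keep the sketch short.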
Champion lists (Sec. 7.1.3)
- Precompute for each dictionary term t, the r docs of highest weight in t's postings
- Call this the champion list for t (aka fancy list or top docs for t)
- Note that r has to be chosen at index build time
- Thus, it's possible that r < K
- At query time, only compute scores for docs in the champion list of some query term
- Pick the K top-scoring docs from amongst these

Exercises (Sec. 7.1.3)
- How can Champion Lists be implemented in an inverted index?
- Note that the champion list has nothing to do with small docIDs

Static quality scores (Quantitative)
We want
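
Returning to the champion lists of Sec. 7.1.3, one possible way to approach the exercise is to store, next to each term's full postings, a second short list of its r highest-weight docs. The sketch below assumes postings of the form `(doc_id, weight)`; the function names and the toy data are hypothetical.

```python
import heapq


def build_champion_lists(postings, r):
    """Index build time: for each term keep the r docs of highest weight.

    postings: {term: [(doc_id, weight), ...]}  (assumed layout)
    Returns   {term: [doc_id, ...]} ordered by decreasing weight.
    """
    return {
        term: [doc_id for doc_id, _ in
               heapq.nlargest(r, plist, key=lambda p: p[1])]
        for term, plist in postings.items()
    }


def champion_candidates(query_terms, champions):
    """Query time: union of the champion lists of the query terms."""
    candidates = set()
    for term in query_terms:
        candidates.update(champions.get(term, []))
    return candidates


# Toy postings (assumed data).
postings = {
    "brutus": [(1, 0.2), (2, 0.9), (3, 0.5)],
    "caesar": [(2, 0.1), (4, 0.7), (5, 0.6)],
}
champions = build_champion_lists(postings, r=2)
print(champions)  # {'brutus': [2, 3], 'caesar': [4, 5]}
print(sorted(champion_candidates(["brutus", "caesar"], champions)))
# [2, 3, 4, 5]
```

Only the docs in the candidate set are scored, and the K best of them are returned. Because r is fixed at build time, the candidate set can hold fewer than K docs. The champions are selected by weight, not by docID, which is the point of the note on the exercise slide.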