Vector Space Model: Ranking Revisited
Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Slide outline: Recap: tf-idf weighting | Recap: Queries as vectors | Recap: cosine(query, document) | This lecture | Computing cosine scores | Efficient cosine ranking | Special case – unweighted queries | Computing the K largest cosines: selection vs. sorting | Use heap for selecting top K | Bottlenecks | Cosine similarity is only a proxy | Generic approach | Index elimination | High-idf query terms only | Docs containing many query terms | 3 of 4 query terms | Champion lists | Exercises | Static quality scores | Modeling authority | Net score | Top K by net score – fast methods | Why order postings by g(d)? | Champion lists in g(d)-ordering | High and low lists | Impact-ordered postings | Early termination | idf-ordered terms | Cluster pruning: preprocessing | Cluster pruning: query processing | Visualization | Why use random sampling | General variants | Parametric and zone indexes | Fields | Zone | Example zone indexes | Tiered indexes | Example tiered index | Query term proximity | Query parsers | Aggregate scores | Putting it all together

Recap: tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  $w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$
- Best-known weighting scheme in information retrieval.
- Increases with the number of occurrences within a document.
- Increases with the rarity of the term in the collection.

Recap: Queries as vectors
- Key idea 1: Do the same for queries: represent them as vectors in the same space.
- Key idea 2: Rank documents according to their proximity to the query in this space.
- Proximity = similarity of vectors.

Recap: cosine(query, document)
- $\cos(\vec{q},\vec{d}) = \dfrac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \dfrac{\vec{q}}{|\vec{q}|}\cdot\dfrac{\vec{d}}{|\vec{d}|} = \dfrac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$
  (the numerator is the dot product; dividing by the vector lengths turns q and d into unit vectors).
- cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

This lecture
- Speeding up vector space ranking.
- Putting together a complete search system.
- Will require learning about a number of miscellaneous topics and heuristics.

Computing cosine scores
- [Figure: the term-at-a-time cosine-scoring algorithm, referred to below as "the basic algorithm of Fig 7.1"; a code sketch of this loop appears after the next few slides.]

Efficient cosine ranking
- Find the K docs in the collection "nearest" to the query, i.e., the K largest query-doc cosines.
- Efficient ranking means:
  - computing a single cosine efficiently, and
  - choosing the K largest cosine values efficiently.
- Can we do this without computing all N cosines?

Efficient cosine ranking (cont'd)
- What we are doing, in effect, is solving the K-nearest-neighbor problem for a query vector.
- In general, we do not know how to do this efficiently for high-dimensional spaces.
- But it is solvable for short queries, and standard indexes support this well.

Special case – unweighted queries
- No weighting on query terms: assume each query term occurs only once.
- Then, for ranking, there is no need to normalize the query vector.
- A slight simplification. [Sec. 7.1]
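To make the preceding slides concrete, here is a minimal Python sketch of term-at-a-time cosine scoring with heap-based selection of the top K, in the spirit of the Fig 7.1 algorithm the slides refer to. The data layout (`postings`, `idf`, `doc_length`) and the function name are illustrative assumptions, not part of the original slides.

```python
import heapq
from collections import defaultdict

def cosine_score(query_terms, postings, idf, doc_length, k=10):
    """Term-at-a-time cosine scoring with heap selection of the top K.

    Assumed (illustrative) data layout:
      postings[t]   -> list of (doc_id, w_td) pairs, w_td = tf weight of t in d
      idf[t]        -> idf weight of term t
      doc_length[d] -> Euclidean length of doc d's weight vector
    """
    scores = defaultdict(float)               # one accumulator per candidate doc
    for t in query_terms:                     # unweighted query: each term counted once
        w_tq = idf.get(t, 0.0)
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] += w_td * w_tq
    for doc_id in scores:                     # normalize by doc length only; the query
        scores[doc_id] /= doc_length[doc_id]  # length does not affect the ranking
    # heap-based selection of the K largest scores instead of a full sort
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

With J nonzero accumulators, `heapq.nlargest` avoids fully sorting all J candidates, which is exactly the selection-versus-sorting point made on the next slide.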
Computing the K largest cosines: selection vs. sorting
- Typically we want to retrieve the top K docs (in the cosine ranking for the query), not to totally order all docs in the collection.
- Can we pick off the docs with the K highest cosines?
- Let J = the number of docs with nonzero cosines; we seek the K best of these J. [Sec. 7.1]

Use heap for selecting top K
- A heap is a binary tree in which each node's value is greater than the values of its children.
- It takes 2J operations to construct; each of the K "winners" is then read off in 2 log J steps.
- For J = 10M and K = 100, this is about 10% of the cost of sorting.
- [Figure: a small binary max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1.] [Sec. 7.1]

Bottlenecks
- The primary computational bottleneck in scoring is the cosine computation.
- Can we avoid all this computation? Yes, but we may sometimes get it wrong: a doc not in the top K may creep into the list of K output docs.
- Is this such a bad thing? [Sec. 7.1.1]

Cosine similarity is only a proxy
- The user has a task and a query formulation; cosine matches docs to the query.
- Thus cosine is in any case a proxy for user happiness.
- If we get a list of K docs "close" to the top K by the cosine measure, that should be OK. [Sec. 7.1.1]

Generic approach
- Find a set A of contenders, with K < |A| << N.
- A does not necessarily contain the top K, but it has many docs from among the top K.
- Return the top K docs in A. Think of A as pruning non-contenders.
- The same approach is also used for other (non-cosine) scoring functions.
- We will look at several schemes following this approach. [Sec. 7.1.1]

Index elimination
- The basic algorithm of Fig 7.1 considers only docs containing at least one query term.
- Take this further:
  - consider only high-idf query terms;
  - consider only docs containing many query terms. [Sec. 7.1.2]

High-idf query terms only
- For a query such as "catcher in the rye", only accumulate scores from "catcher" and "rye".
- Intuition: "in" and "the" contribute little to the scores and don't alter the rank ordering much.
- Benefit: postings of low-idf terms contain many docs, so these (many) docs get eliminated from A. [Sec. 7.1.2]

Docs containing many query terms
- Any doc with at least one query term is a candidate for the top-K output list.
- For multi-term queries, only compute scores for docs containing several of the query terms, say at least 3 out of 4.
- This imposes a "soft conjunction" on queries, as seen on web search engines (early Google).
- Easy to implement in a postings traversal (a sketch appears at the end of this excerpt). [Sec. 7.1.2]

3 of 4 query terms
- Postings lists for the four query terms:
  Antony:    3 4 8 16 32 64 128
  Brutus:    2 4 8 16 32 64 128
  Caesar:    1 2 3 5 8 13 21 34
  Calpurnia: 13 16 32
- Scores are only computed for docs 8, 16 and 32, the docs containing at least 3 of the 4 terms. [Sec. 7.1.2]

Champion lists
- Precompute, for each dictionary term t, the r docs of highest weight in t's postings.
- Call this the champion list for t (aka the fancy list or top docs for t).
- Note that r has to be chosen at index time.
- At query time, only compute scores for docs in the champion list of some query term; pick the K top-scoring docs from amongst these (a sketch also appears at the end of this excerpt). [Sec. 7.1.3]

Exercises
- How do champion lists relate to index elimination? Can they be used together?
- How can champion lists be implemented in an inverted index?
- Note that the champion list has nothing to do with small docIDs. [Sec. 7.1.3]

Static quality scores
- We want top-ranking documents to be both relevant and authoritative.
- Relevance is being modeled by cosine.
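The following is a minimal sketch of the "at least 3 of 4 query terms" pruning described above, assuming the same illustrative postings layout as the earlier sketch; only the surviving docs would then be passed to the cosine-scoring loop.

```python
from collections import Counter

def soft_conjunction_candidates(query_terms, postings, min_overlap=3):
    """Return the docs containing at least `min_overlap` of the query terms.

    With a 4-term query and min_overlap=3, this reproduces the example above:
    only docs matching 3 or 4 of the terms are ever scored.
    """
    overlap = Counter()
    for t in query_terms:
        for doc_id, _ in postings.get(t, []):
            overlap[doc_id] += 1
    return {doc_id for doc_id, n in overlap.items() if n >= min_overlap}
```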
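Similarly, here is a minimal sketch of champion lists under the same assumed postings layout: r is fixed at index time, and at query time only the union of the query terms' champion lists is scored.

```python
import heapq

def build_champion_lists(postings, r=50):
    """At index time: for each term, keep only the r postings of highest weight."""
    return {t: heapq.nlargest(r, plist, key=lambda item: item[1])
            for t, plist in postings.items()}

def champion_candidates(query_terms, champion_lists):
    """At query time: the contender set A is the union of the terms' champion lists."""
    return {doc_id
            for t in query_terms
            for doc_id, _ in champion_lists.get(t, [])}
```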