Vector Space Model: Ranking Revisited
Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Slide outline: Recap: tf-idf weighting | Recap: Queries as vectors | Recap: cosine(query, document) | This lecture | Computing cosine scores | Efficient cosine ranking | Special case – unweighted queries | Computing the K largest cosines: selection vs. sorting | Use heap for selecting top K | Bottlenecks | Cosine similarity is only a proxy | Generic approach | Index elimination | High-idf query terms only | Docs containing many query terms | 3 of 4 query terms | Champion lists | Exercises | Static quality scores | Modeling authority | Net score | Top K by net score – fast methods | Why order postings by g(d)? | Champion lists in g(d)-ordering | High and low lists | Impact-ordered postings | Early termination | idf-ordered terms | Cluster pruning: preprocessing | Cluster pruning: query processing | Visualization | Why use random sampling | General variants | Parametric and zone indexes | Fields | Zone | Example zone indexes | Tiered indexes | Example tiered index | Query term proximity | Query parsers | Aggregate scores | Putting it all together

Recap: tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  $w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$
- Best-known weighting scheme in information retrieval.
- Increases with the number of occurrences within a document.
- Increases with the rarity of the term in the collection.

Recap: Queries as vectors
- Key idea 1: Do the same for queries: represent them as vectors in the same space.
- Key idea 2: Rank documents according to their proximity to the query in this space.
- Proximity = similarity of vectors.

Recap: cosine(query, document)
- $\cos(\vec{q},\vec{d}) = \dfrac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \dfrac{\vec{q}}{|\vec{q}|}\cdot\dfrac{\vec{d}}{|\vec{d}|} = \dfrac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$
  (the numerator is the dot product; dividing by the vector lengths turns q and d into unit vectors).
- cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

This lecture
- Speeding up vector space ranking.
- Putting together a complete search system.
- Will require learning about a number of miscellaneous topics and heuristics.

Computing cosine scores
- [Figure: the term-at-a-time cosine-scoring algorithm, referred to below as "the basic algorithm of Fig 7.1"; a code sketch of this loop appears after the next few slides.]

Efficient cosine ranking
- Find the K docs in the collection "nearest" to the query, i.e., the K largest query-doc cosines.
- Efficient ranking means:
  - computing a single cosine efficiently, and
  - choosing the K largest cosine values efficiently.
- Can we do this without computing all N cosines?

Efficient cosine ranking (cont'd)
- What we are doing, in effect, is solving the K-nearest-neighbor problem for a query vector.
- In general, we do not know how to do this efficiently for high-dimensional spaces.
- But it is solvable for short queries, and standard indexes support this well.

Special case – unweighted queries
- No weighting on query terms: assume each query term occurs only once.
- Then, for ranking, there is no need to normalize the query vector.
- A slight simplification. [Sec. 7.1]
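To make the preceding slides concrete, here is a minimal Python sketch of term-at-a-time cosine scoring with heap-based selection of the top K, in the spirit of the Fig 7.1 algorithm the slides refer to. The data layout (`postings`, `idf`, `doc_length`) and the function name are illustrative assumptions, not part of the original slides.

```python
import heapq
from collections import defaultdict

def cosine_score(query_terms, postings, idf, doc_length, k=10):
    """Term-at-a-time cosine scoring with heap selection of the top K.

    Assumed (illustrative) data layout:
      postings[t]   -> list of (doc_id, w_td) pairs, w_td = tf weight of t in d
      idf[t]        -> idf weight of term t
      doc_length[d] -> Euclidean length of doc d's weight vector
    """
    scores = defaultdict(float)               # one accumulator per candidate doc
    for t in query_terms:                     # unweighted query: each term counted once
        w_tq = idf.get(t, 0.0)
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] += w_td * w_tq
    for doc_id in scores:                     # normalize by doc length only; the query
        scores[doc_id] /= doc_length[doc_id]  # length does not affect the ranking
    # heap-based selection of the K largest scores instead of a full sort
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

With J nonzero accumulators, `heapq.nlargest` avoids fully sorting all J candidates, which is exactly the selection-versus-sorting point made on the next slide.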
Computing the K largest cosines: selection vs. sorting
- Typically we want to retrieve the top K docs (in the cosine ranking for the query), not to totally order all docs in the collection.
- Can we pick off the docs with the K highest cosines?
- Let J = the number of docs with nonzero cosines; we seek the K best of these J. [Sec. 7.1]

Use heap for selecting top K
- A heap is a binary tree in which each node's value is greater than the values of its children.
- It takes 2J operations to construct; each of the K "winners" is then read off in 2 log J steps.
- For J = 10M and K = 100, this is about 10% of the cost of sorting.
- [Figure: a small binary max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1.] [Sec. 7.1]

Bottlenecks
- The primary computational bottleneck in scoring is the cosine computation.
- Can we avoid all this computation? Yes, but we may sometimes get it wrong: a doc not in the top K may creep into the list of K output docs.
- Is this such a bad thing? [Sec. 7.1.1]

Cosine similarity is only a proxy
- The user has a task and a query formulation; cosine matches docs to the query.
- Thus cosine is in any case a proxy for user happiness.
- If we get a list of K docs "close" to the top K by the cosine measure, that should be OK. [Sec. 7.1.1]

Generic approach
- Find a set A of contenders, with K < |A| << N.
- A does not necessarily contain the top K, but it has many docs from among the top K.
- Return the top K docs in A. Think of A as pruning non-contenders.
- The same approach is also used for other (non-cosine) scoring functions.
- We will look at several schemes following this approach. [Sec. 7.1.1]

Index elimination
- The basic algorithm of Fig 7.1 considers only docs containing at least one query term.
- Take this further:
  - consider only high-idf query terms;
  - consider only docs containing many query terms. [Sec. 7.1.2]

High-idf query terms only
- For a query such as "catcher in the rye", only accumulate scores from "catcher" and "rye".
- Intuition: "in" and "the" contribute little to the scores and don't alter the rank ordering much.
- Benefit: postings of low-idf terms contain many docs, so these (many) docs get eliminated from A. [Sec. 7.1.2]

Docs containing many query terms
- Any doc with at least one query term is a candidate for the top-K output list.
- For multi-term queries, only compute scores for docs containing several of the query terms, say at least 3 out of 4.
- This imposes a "soft conjunction" on queries, as seen on web search engines (early Google).
- Easy to implement in a postings traversal (a sketch appears at the end of this excerpt). [Sec. 7.1.2]

3 of 4 query terms
- Postings lists for the four query terms:
  Antony:    3 4 8 16 32 64 128
  Brutus:    2 4 8 16 32 64 128
  Caesar:    1 2 3 5 8 13 21 34
  Calpurnia: 13 16 32
- Scores are only computed for docs 8, 16 and 32, the docs containing at least 3 of the 4 terms. [Sec. 7.1.2]

Champion lists
- Precompute, for each dictionary term t, the r docs of highest weight in t's postings.
- Call this the champion list for t (aka the fancy list or top docs for t).
- Note that r has to be chosen at index time.
- At query time, only compute scores for docs in the champion list of some query term; pick the K top-scoring docs from amongst these (a sketch also appears at the end of this excerpt). [Sec. 7.1.3]

Exercises
- How do champion lists relate to index elimination? Can they be used together?
- How can champion lists be implemented in an inverted index?
- Note that the champion list has nothing to do with small docIDs. [Sec. 7.1.3]

Static quality scores
- We want top-ranking documents to be both relevant and authoritative.
- Relevance is being modeled by cosine.
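The following is a minimal sketch of the "at least 3 of 4 query terms" pruning described above, assuming the same illustrative postings layout as the earlier sketch; only the surviving docs would then be passed to the cosine-scoring loop.

```python
from collections import Counter

def soft_conjunction_candidates(query_terms, postings, min_overlap=3):
    """Return the docs containing at least `min_overlap` of the query terms.

    With a 4-term query and min_overlap=3, this reproduces the example above:
    only docs matching 3 or 4 of the terms are ever scored.
    """
    overlap = Counter()
    for t in query_terms:
        for doc_id, _ in postings.get(t, []):
            overlap[doc_id] += 1
    return {doc_id for doc_id, n in overlap.items() if n >= min_overlap}
```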
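Similarly, here is a minimal sketch of champion lists under the same assumed postings layout: r is fixed at index time, and at query time only the union of the query terms' champion lists is scored.

```python
import heapq

def build_champion_lists(postings, r=50):
    """At index time: for each term, keep only the r postings of highest weight."""
    return {t: heapq.nlargest(r, plist, key=lambda item: item[1])
            for t, plist in postings.items()}

def champion_candidates(query_terms, champion_lists):
    """At query time: the contender set A is the union of the terms' champion lists."""
    return {doc_id
            for t in query_terms
            for doc_id, _ in champion_lists.get(t, [])}
```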