Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 7: Scoring and results assembly

Outline: Lecture 6 – I introduced a bug; Recap: tf-idf weighting; Recap: Queries as vectors; Recap: cosine(query,document); This lecture; Computing cosine scores; Efficient cosine ranking; Special case – unweighted queries; Computing the K largest cosines: selection vs. sorting; Use heap for selecting top K; Bottlenecks; Cosine similarity is only a proxy; Generic approach; Index elimination; High-idf query terms only; Docs containing many query terms; 3 of 4 query terms; Champion lists; Exercises; Static quality scores; Modeling authority; Net score; Top K by net score – fast methods; Why order postings by g(d)?; Champion lists in g(d)-ordering; High and low lists; Impact-ordered postings; 1. Early termination; 2. idf-ordered terms; Cluster pruning: preprocessing; Cluster pruning: query processing; Visualization; Why use random sampling; General variants; Parametric and zone indexes; Fields; Zone; Example zone indexes; Tiered indexes; Example tiered index; Query term proximity; Query parsers; Aggregate scores; Putting it all together; Resources

Lecture 6 – I introduced a bug
In my anxiety to avoid taking the log of zero, I rewrote

    w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}

as

    w_{t,d} = \begin{cases} \log_{10}(1 + \mathrm{tf}_{t,d}) & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}

In fact this was unnecessary, since the zero case is treated specially above; net, the FIRST version above is right.

Recap: tf-idf weighting (Ch. 6)
- The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)

- Best-known weighting scheme in information retrieval
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection

Recap: Queries as vectors (Ch. 6)
- Key idea 1: do the same for queries: represent them as vectors in the same space
- Key idea 2: rank documents according to their proximity to the query in this space
- proximity = similarity of vectors

Recap: cosine(query, document) (Ch. 6)

    \cos(\vec q, \vec d) = \frac{\vec q \cdot \vec d}{|\vec q|\,|\vec d|} = \frac{\vec q}{|\vec q|} \cdot \frac{\vec d}{|\vec d|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}

- q · d is the dot product; q/|q| and d/|d| are unit vectors
- cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d

This lecture (Ch. 7)
- Speeding up vector space ranking
- Putting together a complete search system
- Will require learning about a number of miscellaneous topics and heuristics

Computing cosine scores (Sec. 6.3.3)
- (This slide shows the term-at-a-time CosineScore algorithm of Sec. 6.3.3; a sketch of it follows below.)
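A minimal Python sketch of that term-at-a-time scoring loop, assuming a toy in-memory index; the data layout (`postings` as a term-to-{doc_id: tf} dict, a precomputed `doc_length` table) and all names here are illustrative choices for this example, not the lecture's reference code:

```python
import math

def cosine_scores(query_terms, postings, doc_length, N):
    """Term-at-a-time cosine scoring over an inverted index.

    postings:   dict term -> {doc_id: tf}   (toy in-memory index, assumed)
    doc_length: dict doc_id -> Euclidean length of the doc's weight vector
    N:          number of documents in the collection
    Returns doc_id -> cosine score for every doc with a nonzero score.
    """
    scores = {}
    for t in query_terms:
        if t not in postings:
            continue
        # Unweighted query terms (a simplification discussed on a later
        # slide): the query-side weight reduces to idf.
        w_tq = math.log10(N / len(postings[t]))
        for d, tf in postings[t].items():
            w_td = 1 + math.log10(tf)   # log-tf weight; tf > 0 inside a postings list
            scores[d] = scores.get(d, 0.0) + w_td * w_tq
    # Normalize by document length; the query's length is the same for
    # every doc, so dropping it does not change the ranking.
    for d in scores:
        scores[d] /= doc_length[d]
    return scores
```

For example, on a toy three-document collection (all numbers invented):

```python
postings = {"catcher": {1: 2, 3: 1}, "rye": {1: 1, 2: 4}}
doc_length = {1: 1.5, 2: 2.0, 3: 1.0}
print(cosine_scores(["catcher", "rye"], postings, doc_length, N=3))
```

Selecting the K largest of these scores efficiently is the subject of the next slides.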
Efficient cosine ranking (Sec. 7.1)
- Find the K docs in the collection "nearest" to the query, i.e., the K largest query-doc cosines
- Efficient ranking requires:
  - computing a single cosine efficiently
  - choosing the K largest cosine values efficiently
- Can we do this without computing all N cosines?

Efficient cosine ranking (Sec. 7.1)
- What we're doing in effect: solving the K-nearest-neighbor problem for a query vector
- In general, we do not know how to do this efficiently for high-dimensional spaces
- But it is solvable for short queries, and standard indexes support this well

Special case – unweighted queries (Sec. 7.1)
- No weighting on query terms; assume each query term occurs only once
- Then, for ranking, we don't need to normalize the query vector
- A slight simplification of the algorithm from Lecture 6

Computing the K largest cosines: selection vs. sorting (Sec. 7.1)
- Typically we want to retrieve the top K docs in the cosine ranking for the query, not to totally order all docs in the collection
- Can we pick off the docs with the K highest cosines?
- Let J = the number of docs with nonzero cosines; we seek the K best of these J

Use heap for selecting top K (Sec. 7.1)
- A heap is a binary tree in which each node's value > the values of its children
- It takes 2J operations to construct; each of the K "winners" is then read off in 2 log J steps
- For J = 1M and K = 100, this is about 10% of the cost of sorting
- [Figure: a max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1]
- (A sketch of this selection follows below.)
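A minimal sketch of that heap-based selection using Python's heapq over the `scores` dict produced by the earlier cosine sketch; heapq is a min-heap, so scores are negated to pop the largest first. This is an illustration of the idea, not the lecture's code:

```python
import heapq

def top_k(scores, k):
    """Select the K highest-scoring docs without fully sorting all J entries.

    Building the heap is linear in J (the slide's ~2J operations); each of
    the K winners is then popped in O(log J) steps, versus O(J log J) for
    a full sort of the J nonzero-cosine docs.
    """
    heap = [(-score, d) for d, score in scores.items()]  # negate for max-heap behavior
    heapq.heapify(heap)
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, d = heapq.heappop(heap)
        winners.append((d, -neg_score))
    return winners
```

In practice the one-liner heapq.nlargest(k, scores.items(), key=lambda item: item[1]) implements the same idea.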
Bottlenecks (Sec. 7.1.1)
- The primary computational bottleneck in scoring is the cosine computation
- Can we avoid all this computation?
- Yes, but we may sometimes get it wrong: a doc not in the true top K may creep into the list of K output docs
- Is this such a bad thing?

Cosine similarity is only a proxy (Sec. 7.1.1)
- The user has a task and a query formulation; cosine matches docs to the query
- Thus cosine is in any case only a proxy for user happiness
- If we get a list of K docs "close" to the top K by the cosine measure, that should be OK

Generic approach (Sec. 7.1.1)
- Find a set A of contenders, with K < |A| << N
- A does not necessarily contain the top K, but has many docs from among the top K
- Return the top K docs in A
- Think of A as pruning non-contenders
- The same approach is also used for other (non-cosine) scoring functions
- We will look at several schemes following this approach

Index elimination (Sec. 7.1.2)
- The basic cosine computation algorithm only considers docs containing at least one query term
- Take this further: only consider high-idf query terms, and only consider docs containing many query terms (a sketch combining both follows at the end of this excerpt)

High-idf query terms only (Sec. 7.1.2)
- For a query such as catcher in the rye, only accumulate scores from catcher and rye
- Intuition: in and the contribute little to the scores and so don't alter the rank ordering much
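As promised above, a small sketch of the two index-elimination heuristics, reusing the toy `postings` layout from the earlier cosine example; the idf cutoff and the minimum term count are invented illustrative parameters, not values given in the lecture:

```python
import math

def candidate_docs(query_terms, postings, N, min_idf=1.0, min_terms=2):
    """Prune to a contender set A before computing any cosines.

    Step 1: drop low-idf query terms (stopword-like terms such as 'in' or
            'the' have near-zero idf and very long postings lists).
    Step 2: keep only docs containing at least min_terms of the surviving
            query terms.
    """
    high_idf = [t for t in query_terms
                if t in postings and math.log10(N / len(postings[t])) >= min_idf]
    hits = {}
    for t in high_idf:
        for d in postings[t]:
            hits[d] = hits.get(d, 0) + 1
    need = max(1, min(min_terms, len(high_idf)))  # don't demand more terms than survive
    return {d for d, count in hits.items() if count >= need}
```

Cosine scores would then be computed only for the docs in A, and the top K selected from those.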