Stanford CS 276 - Scoring and results assembly

Contents: Lecture 6 – I introduced a bug; Recap: tf-idf weighting; Recap: Queries as vectors; Recap: cosine(query,document); This lecture; Computing cosine scores; Efficient cosine ranking; Slide 9; Special case – unweighted queries; Computing the K largest cosines: selection vs. sorting; Use heap for selecting top K; Bottlenecks; Cosine similarity is only a proxy; Generic approach; Index elimination; High-idf query terms only; Docs containing many query terms; 3 of 4 query terms; Champion lists; Exercises; Static quality scores; Modeling authority; Net score; Top K by net score – fast methods; Why order postings by g(d)?; Champion lists in g(d)-ordering; High and low lists; Impact-ordered postings; 1. Early termination; 2. idf-ordered terms; Cluster pruning: preprocessing; Cluster pruning: query processing; Visualization; Why use random sampling; General variants; Slide 37; Parametric and zone indexes; Fields; Zone; Example zone indexes; Tiered indexes; Example tiered index; Query term proximity; Query parsers; Aggregate scores; Putting it all together; Resources

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 7: Scoring and results assembly

Lecture 6 – I introduced a bug

In my anxiety to avoid taking the log of zero, I rewrote

  w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}

as

  w_{t,d} = \begin{cases} \log_{10}(1 + \mathrm{tf}_{t,d}) & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}

In fact this was unnecessary, since the zero case is treated specially above; net, the FIRST version above is right.

Recap: tf-idf weighting (Ch. 6)

The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)

It is the best known weighting scheme in information retrieval. It increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.

Recap: Queries as vectors (Ch. 6)

Key idea 1: Do the same for queries: represent them as vectors in the space.
Key idea 2: Rank documents according to their proximity to the query in this space, where proximity = similarity of vectors.

Recap: cosine(query, document) (Ch. 6)

  \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2} \sqrt{\sum_{i=1}^{|V|} d_i^2}}

The numerator is the dot product; dividing by the vector lengths gives unit vectors. cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

This lecture (Ch. 7)

Speeding up vector space ranking, and putting together a complete search system. This will require learning about a number of miscellaneous topics and heuristics.

Computing cosine scores (Sec. 6.3.3)

[The slide shows the cosine-scoring algorithm of Sec. 6.3.3; the figure did not survive extraction. A sketch follows below.]
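Since the slide's algorithm figure is missing, here is a minimal Python sketch of term-at-a-time cosine scoring in the spirit of CosineScore from IIR Sec. 6.3.3. The postings layout (term mapped to a list of (doc_id, tf) pairs), the function name, and the parameter names are illustrative assumptions, not the book's exact pseudocode; the sketch also folds in the tf-idf weighting recapped above.

```python
import math

def cosine_scores(query_terms, postings, doc_lengths, N, k=10):
    """Term-at-a-time cosine scoring, a sketch after IIR Sec. 6.3.3.

    postings:    assumed dict, term -> list of (doc_id, tf) pairs
    doc_lengths: assumed dict, doc_id -> Euclidean length of d's weight vector
    N:           number of documents in the collection
    """
    scores = {}  # accumulators: doc_id -> partial dot product q . d
    for t in query_terms:
        plist = postings.get(t, [])
        if not plist:
            continue
        idf = math.log10(N / len(plist))  # df_t = length of t's postings list
        w_tq = idf  # unweighted query terms: tf in the query is taken as 1
        for doc_id, tf in plist:
            w_td = (1 + math.log10(tf)) * idf  # tf-idf weight of t in d
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    for doc_id in scores:
        # Normalize by |d| only; |q| is the same for every doc,
        # so dropping it cannot change the ranking.
        scores[doc_id] /= doc_lengths[doc_id]
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
```

The closing sorted(...)[:k] is exactly the step the following slides attack: selecting the K largest scores without totally ordering all documents.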
Efficient cosine ranking (Sec. 7.1)

Find the K docs in the collection "nearest" to the query, i.e., the K largest query-doc cosines. Efficient ranking means computing a single cosine efficiently, and choosing the K largest cosine values efficiently. Can we do this without computing all N cosines?

Efficient cosine ranking (Sec. 7.1)

What we're doing in effect: solving the K-nearest-neighbor problem for a query vector. In general, we do not know how to do this efficiently for high-dimensional spaces. But it is solvable for short queries, and standard indexes support this well.

Special case – unweighted queries (Sec. 7.1)

No weighting on query terms: assume each query term occurs only once. Then, for ranking, we don't need to normalize the query vector. This is a slight simplification of the algorithm from Lecture 6.

Computing the K largest cosines: selection vs. sorting (Sec. 7.1)

Typically we want to retrieve the top K docs in the cosine ranking for the query, not to totally order all docs in the collection. Can we pick off the docs with the K highest cosines? Let J = the number of docs with nonzero cosines; we seek the K best of these J.

Use heap for selecting top K (Sec. 7.1)

A heap is a binary tree in which each node's value > the values of its children. It takes 2J operations to construct; then each of the K "winners" is read off in 2 log J steps. For J = 1M and K = 100, this is about 10% of the cost of sorting. (A sketch follows at the end of this section.)

[Figure: an example max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1.]

Bottlenecks (Sec. 7.1.1)

The primary computational bottleneck in scoring is the cosine computation. Can we avoid all this computation? Yes, but we may sometimes get it wrong: a doc not in the top K may creep into the list of K output docs. Is this such a bad thing?

Cosine similarity is only a proxy (Sec. 7.1.1)

The user has a task and a query formulation; cosine matches docs to the query. Thus cosine is anyway a proxy for user happiness. If we get a list of K docs "close" to the top K by the cosine measure, that should be OK.

Generic approach (Sec. 7.1.1)

Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A; think of A as pruning the non-contenders. The same approach is also used for other (non-cosine) scoring functions. We will look at several schemes following this approach.

Index elimination (Sec. 7.1.2)

The basic cosine computation algorithm only considers docs containing at least one query term. Take this further: only consider high-idf query terms, and only consider docs containing many query terms.

High-idf query terms only (Sec. 7.1.2)

For a query such as catcher in the rye, only accumulate scores from catcher and rye. Intuition: in and the contribute little to the scores and so don't alter the rank ordering much.
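To make the heap slide concrete, here is a minimal sketch of top-K selection in Python. It uses the standard library's heapq.nlargest in place of the hand-built binary heap pictured on the slide; scores is assumed to be the accumulator dict produced by a scoring pass like cosine_scores above.

```python
import heapq

def top_k(scores, k):
    """Select the K highest-scoring docs without sorting all J of them.

    scores: dict mapping doc_id -> cosine score (J nonzero entries).
    Heap construction is O(J); extracting K winners costs O(K log J),
    versus O(J log J) to sort everything.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

With the slide's numbers, J = 10^6 and K = 100: roughly 2J + 2K log2 J, about 2,004,000 operations, against J log2 J, about 20,000,000 for a full sort, which is the slide's "about 10%".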
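The last two slides suggest an easy pruning pass. Below is a sketch of index elimination under the same assumed postings layout; the cutoff values and the function name are invented for illustration.

```python
import math

def pruned_candidates(query_terms, postings, N, idf_cutoff=1.0, min_terms=2):
    """Index elimination (Sec. 7.1.2): build a contender set A.

    Keeps only query terms whose idf exceeds idf_cutoff (dropping
    low-idf terms such as 'in' and 'the'), then keeps only docs
    containing at least min_terms of the surviving terms.
    """
    strong = [t for t in query_terms
              if t in postings
              and math.log10(N / len(postings[t])) > idf_cutoff]
    hits = {}  # doc_id -> how many strong query terms it contains
    for t in strong:
        for doc_id, _tf in postings[t]:
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return {d for d, n in hits.items() if n >= min_terms}
```

Scoring is then restricted to docs in A, so a doc outside the true top K can slip in, which is the trade-off the "Bottlenecks" slide accepts.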
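Finally, a toy run that chains the three sketches, assuming they are all in scope; the collection, tf values, and document lengths below are made up purely for illustration.

```python
# Toy index: 3 documents; postings map term -> [(doc_id, tf), ...].
postings = {
    "catcher": [(1, 2), (3, 1)],
    "rye":     [(1, 1), (3, 3)],
    "the":     [(1, 5), (2, 4), (3, 6)],  # appears in every doc, so idf = 0
}
doc_lengths = {1: 2.1, 2: 1.7, 3: 2.9}    # assumed precomputed |d| values

query = ["catcher", "in", "the", "rye"]
A = pruned_candidates(query, postings, N=3, idf_cutoff=0.1, min_terms=2)
scores = {d: s for d, s in cosine_scores(query, postings, doc_lengths, N=3)
          if d in A}                       # keep only the contenders in A
print(top_k(scores, k=2))                 # e.g. [(1, ...), (3, ...)]
```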