UT Dallas CS 6359 - Lecture9

CS6322: Information Retrieval
Sanda Harabagiu
Lecture 9: Scoring and results assembly

Recap: tf-idf weighting (Ch. 6)
• The tf-idf weight of a term is the product of its tf weight and its idf weight:
  w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)
• Best-known weighting scheme in information retrieval
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection

Recap: Queries as vectors (Ch. 6)
• Key idea 1: Do the same for queries: represent them as vectors in the space
• Key idea 2: Rank documents according to their proximity to the query in this space
• proximity = similarity of vectors

Recap: cosine(query, document) (Ch. 6)
• \cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}
• The dot product of unit vectors
• cos(q,d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d

This lecture (Ch. 7)
• Speeding up vector space ranking
• Putting together a complete search system
• Will require learning about a number of miscellaneous topics and heuristics

Computing cosine scores (Sec. 6.3.3)
• [This slide presents the term-at-a-time cosine-scoring algorithm as a figure, which did not survive the text extraction; a sketch follows below.]
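
Since the algorithm figure is lost, here is a minimal term-at-a-time sketch of cosine scoring in the spirit of Sec. 6.3.3, using the tf-idf weighting recapped above. The index layout (term -> list of (doc_id, tf) postings), the `doc_lengths` map, and all names are assumptions for illustration, not the deck's exact pseudocode.

```python
import heapq
import math
from collections import Counter, defaultdict

def cosine_score(query, index, df, doc_lengths, N, K=10):
    """Term-at-a-time cosine scoring with per-document accumulators.

    query:       list of query terms
    index:       term -> list of (doc_id, tf) postings (assumed layout)
    df:          term -> document frequency
    doc_lengths: doc_id -> Euclidean length of the doc's tf-idf vector
    N:           number of documents in the collection
    """
    scores = defaultdict(float)
    for term, q_tf in Counter(query).items():
        if term not in index:
            continue
        idf = math.log10(N / df[term])
        w_tq = (1 + math.log10(q_tf)) * idf      # query-side weight
        for doc_id, tf in index[term]:
            w_td = (1 + math.log10(tf)) * idf    # document-side weight
            scores[doc_id] += w_tq * w_td
    # Dividing by |q| would scale every score equally and not change the
    # ranking, so it is skipped; dividing by |d| does matter, so it isn't.
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
```
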
Efficient cosine ranking (Sec. 7.1)
• Find the K docs in the collection "nearest" to the query ⇒ the K largest query-doc cosines
• Efficient ranking means:
  • Computing a single cosine efficiently
  • Choosing the K largest cosine values efficiently
• Can we do this without computing all N cosines?

Efficient cosine ranking (Sec. 7.1)
• What we're doing in effect: solving the K-nearest-neighbor problem for a query vector
• In general, we do not know how to do this efficiently for high-dimensional spaces
• But it is solvable for short queries, and standard indexes support this well

Special case – unweighted queries (Sec. 7.1)
• No weighting on query terms
• Assume each query term occurs only once
• Then, for ranking, there is no need to normalize the query vector
• A slight simplification of the algorithm from Lecture 6

Faster cosine: unweighted query (Sec. 7.1)
• [This slide shows the FastCosineScore algorithm (Fig. 7.1) as a figure, which did not survive the extraction; a sketch follows below.]
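
A sketch of the unweighted-query simplification under the same assumed index layout as above: every query term counts once with weight 1, each posting just adds the stored document-side weight to its accumulator, and the query vector is never normalized. The `(term, doc_id) -> weight` map stands in for weights precomputed at index-build time; it is an assumption, not Fig. 7.1 itself.

```python
import heapq
from collections import defaultdict

def fast_cosine_score(query, index, weight, doc_lengths, K=10):
    """Unweighted-query cosine: each query term occurs once, with weight 1.

    weight: (term, doc_id) -> precomputed tf-idf weight of the term in the
            doc, standing in for weights stored in the postings at build time.
    """
    scores = defaultdict(float)
    for term in set(query):                    # query terms are unweighted
        for doc_id, _tf in index.get(term, []):
            scores[doc_id] += weight[(term, doc_id)]   # just add w_{t,d}
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # document normalization only
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
```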


Computing the K largest cosines: selection vs. sorting (Sec. 7.1)
• Typically we want to retrieve the top K docs (in the cosine ranking for the query)
• not to totally order all docs in the collection
• Can we pick off the docs with the K highest cosines?
• Let J = number of docs with nonzero cosines
• We seek the K best of these J

Use heap for selecting top K (Sec. 7.1)
• A binary tree in which each node's value > the values of its children
• Takes 2J operations to construct, then each of the K "winners" is read off in 2 log J steps
• For J = 1M and K = 100, this is about 10% of the cost of sorting
• [The slide's heap diagram, with node values 1, .9, .3, .8, .3, .1, .1, did not survive the extraction.]
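
In Python, `heapq.nlargest` implements essentially the slide's recipe (heapify all J candidates, then pop off K winners). The streaming variant below, which instead keeps a min-heap of only the K best seen so far, is another common way to avoid the full sort; the scores in the usage line are made up for illustration.

```python
import heapq

def top_k(doc_scores, K):
    """Select the K highest-scoring (score, doc_id) pairs without sorting all J.

    Keeps a size-K min-heap: the root is the weakest current winner, so each
    candidate is compared against it in O(1) and swapped in with O(log K) work.
    """
    heap = []
    for doc_id, score in doc_scores:
        if len(heap) < K:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

print(top_k([(1, 0.9), (2, 0.3), (3, 0.8), (4, 0.1)], K=2))
# -> [(0.9, 1), (0.8, 3)]
```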

Bottlenecks (Sec. 7.1.1)
• Primary computational bottleneck in scoring: the cosine computation
• Can we avoid all this computation?
• Yes, but we may sometimes get it wrong: a doc not in the top K may creep into the list of K output docs
• Is this such a bad thing?

Cosine similarity is only a proxy (Sec. 7.1.1)
• The user has a task and a query formulation
• Cosine matches docs to the query
• Thus cosine is in any case a proxy for user happiness
• If we get a list of K docs "close" to the top K by the cosine measure, that should be OK

Generic approach (Sec. 7.1.1)
• Find a set A of contenders, with K < |A| << N
• A does not necessarily contain the top K, but has many docs from among the top K
• Return the top K docs in A
• Think of A as pruning non-contenders
• The same approach is also used for other (non-cosine) scoring functions
• We will look at several schemes following this approach

Index elimination (Sec. 7.1.2)
• The basic algorithm, FastCosineScore of Fig. 7.1, only considers docs containing at least one query term
• Take this further:
  • Only consider high-idf query terms
  • Only consider docs containing many query terms

High-idf query terms only (Sec. 7.1.2)
• For a query such as "catcher in the rye", only accumulate scores from catcher and rye
• Intuition: in and the contribute little to the scores and so don't alter the rank ordering much
• Benefit: postings of low-idf terms contain many docs → these (many) docs get eliminated from the set A of contenders

Docs containing many query terms (Sec. 7.1.2)
• Any doc with at least one query term is a candidate for the top-K output list
• For multi-term queries, only compute scores for docs containing several of the query terms, say at least 3 out of 4
• This imposes a "soft conjunction" on queries, as seen on web search engines (early Google)
• Easy to implement in a postings traversal (see the example and sketch below)

3 of 4 query terms (Sec. 7.1.2)

  Antony    → 3  4  8  16  32  64  128
  Brutus    → 2  4  8  16  32  64  128
  Caesar    → 1  2  3  5   8   13  21  34
  Calpurnia → 13 16 32

• Scores are only computed for docs 8, 16, and 32.
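
A sketch of the candidate filter this example illustrates: traverse the postings of the query terms (after dropping low-idf ones per the previous heuristic, if desired), count how many lists each docID appears in, and only pass docs meeting the threshold on for scoring. The postings dict reproduces the slide's example; everything else is illustrative.

```python
from collections import Counter

# The slide's example postings lists (docIDs per term).
postings = {
    "Antony":    [3, 4, 8, 16, 32, 64, 128],
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16, 32],
}

def soft_conjunction(query_terms, postings, min_terms):
    """Docs containing at least min_terms of the query terms."""
    counts = Counter()
    for term in query_terms:
        counts.update(postings.get(term, []))   # count list memberships
    return sorted(doc for doc, c in counts.items() if c >= min_terms)

print(soft_conjunction(["Antony", "Brutus", "Caesar", "Calpurnia"],
                       postings, min_terms=3))
# -> [8, 16, 32], matching the slide
```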

Champion lists (Sec. 7.1.3)
• Precompute, for each dictionary term t, the r docs of highest weight in t's postings
• Call this the champion list for t (aka the fancy list or top docs for t)
• Note that r has to be chosen at index-build time; thus it's possible that r < K
• At query time, only compute scores for docs in the champion list of some query term, and pick the K top-scoring docs from amongst these

Exercises (Sec. 7.1.3)
• How can champion lists be implemented in an inverted index?
• Note that the champion list has nothing to do with small docIDs
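
One possible shape of an answer to the exercise, as a sketch: keep, alongside each full postings list, the top-r docs by precomputed weight, and take the union of the query terms' champion lists as the candidate set A. The `(doc_id, weight)` postings layout and all names here are assumptions, not a prescribed implementation.

```python
import heapq

def build_champion_lists(weighted_postings, r):
    """weighted_postings: term -> list of (doc_id, weight) pairs.

    Keeps only the r highest-weight docs per term. r is fixed at
    index-build time, so it can turn out smaller than a query-time K,
    which is the caveat noted on the slide.
    """
    return {term: heapq.nlargest(r, plist, key=lambda p: p[1])
            for term, plist in weighted_postings.items()}

def champion_candidates(query_terms, champions):
    """Candidate set A: union of the query terms' champion lists."""
    return {doc_id
            for term in query_terms
            for doc_id, _w in champions.get(term, [])}
```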

Static quality scores
• We want ... [the preview truncates here; the remaining slides are not included]
