UT Dallas CS 6359 - Lecture9 - D3096298

School name University of Texas at Dallas

Course Cs 6359- Object-Oriented Analysis and Design

CS6322: Information Retrieval
Sanda Harabagiu
Lecture 9: Scoring and results assembly

Recap: tf-idf weighting (Ch. 6)
- The tf-idf weight of a term is the product of its tf weight and its idf weight:

$$w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$$

- Best known weighting scheme in information retrieval
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection

Recap: Queries as vectors (Ch. 6)
- Key idea 1: Do the same for queries: represent them as vectors in the space
- Key idea 2: Rank documents according to their proximity to the query in this space
- proximity = similarity of vectors

Recap: cosine(query, document) (Ch. 6)

$$\cos(\vec{q},\vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

- The numerator of the first form is the dot product; the second form is the dot product of the unit vectors.
- cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.

This lecture (Ch. 7)
- Speeding up vector space ranking
- Putting together a complete search system
- Will require learning about a number of miscellaneous topics and heuristics

Computing cosine scores (Sec. 6.3.3)

Efficient cosine ranking (Sec. 7.1)
- Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.
- Efficient ranking:
  - Computing a single cosine efficiently.
  - Choosing the K largest cosine values efficiently.
  - Can we do this without computing all N cosines?
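
The recap formulas for the tf-idf weight and the cosine can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vectors are stored as sparse dicts, both logarithms are taken base 10, and the names `tfidf_vector` and `cosine` as well as the document-frequency table `df` are hypothetical and not from the lecture.

```python
import math
from collections import Counter


def tfidf_vector(tokens, df, n_docs):
    """Map each term to (1 + log10 tf) * log10(N / df).

    tokens: list of terms in one document or query
    df:     term -> number of documents containing the term (assumed given)
    n_docs: N, the number of documents in the collection
    """
    weights = {}
    for term, tf in Counter(tokens).items():
        if df.get(term, 0) > 0:  # skip terms unknown to the collection
            weights[term] = (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return weights


def cosine(q, d):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_q * norm_d)
```

A term that occurs in every document gets idf 0 and therefore weight 0, which is why very common terms barely influence the ranking.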

Efficient cosine ranking (Sec. 7.1)
- What we’re doing in effect: solving the K-nearest neighbor problem for a query vector
- In general, we do not know how to do this efficiently for high-dimensional spaces
- But it is solvable for short queries, and standard indexes support this well

Special case – unweighted queries (Sec. 7.1)
- No weighting on query terms
  - Assume each query term occurs only once
- Then for ranking, don’t need to normalize query vector
  - Slight simplification of algorithm from Lecture 6

Faster cosine: unweighted query (Sec. 7.1)

Computing the K largest cosines: selection vs. sorting (Sec. 7.1)
- Typically we want to retrieve the top K docs (in the cosine ranking for the query)
  - not to totally order all docs in the collection
- Can we pick off docs with K highest cosines?
- Let J = number of docs with nonzero cosines
  - We seek the K best of these J

Use heap for selecting top K (Sec. 7.1)
- Binary tree in which each node’s value > the values of children
- Takes 2J operations to construct, then each of K “winners” read off in 2 log J steps.
- For J=1M, K=100, this is about 10% of the cost of sorting.

[Figure: example heap holding the values 1, .9, .3, .8, .3, .1, .1]

Bottlenecks (Sec. 7.1.1)
- Primary computational bottleneck in scoring: cosine computation
- Can we avoid all this computation?
- Yes, but may sometimes get it wrong
  - a doc not in the top K may creep into the list of K output docs
- Is this such a bad thing?
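
The two ideas above, scoring an unweighted query term by term and then selecting the K best with a heap instead of sorting, can be combined in a short Python sketch. It is not the lecture's own pseudocode: the function name `fast_cosine_top_k`, the layout of `postings` (term to list of `(doc_id, weight)` pairs) and the `doc_length` table are assumptions made for illustration. Python's `heapq` is a min-heap, so scores are negated to read off the largest first.

```python
import heapq
from collections import defaultdict


def fast_cosine_top_k(query_terms, postings, doc_length, k):
    """Score docs for an unweighted query, then pick the k best with a heap.

    postings:   term -> list of (doc_id, weight of the term in that doc)
    doc_length: doc_id -> Euclidean length of the document vector
    """
    scores = defaultdict(float)
    for term in set(query_terms):              # each query term counts once
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight           # query weight is implicitly 1

    # Only the J docs with a nonzero accumulator enter the heap.
    heap = [(-score / doc_length[doc_id], doc_id)
            for doc_id, score in scores.items()]
    heapq.heapify(heap)                        # linear in J
    winners = [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
    return [(doc_id, -neg_score) for neg_score, doc_id in winners]
```

As a rough check of the 10% figure, counting about J log2 J operations for a full sort: with J = 1,000,000 and K = 100, the heap costs about 2J + 2K log2 J ≈ 2,004,000 operations, against roughly 19,900,000 for sorting.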

Cosine similarity is only a proxy (Sec. 7.1.1)
- User has a task and a query formulation
- Cosine matches docs to query
- Thus cosine is anyway a proxy for user happiness
- If we get a list of K docs “close” to the top K by cosine measure, should be ok

Generic approach (Sec. 7.1.1)
- Find a set A of contenders, with K < |A| << N
  - A does not necessarily contain the top K, but has many docs from among the top K
  - Return the top K docs in A
- Think of A as pruning non-contenders
- The same approach is also used for other (non-cosine) scoring functions
- Will look at several schemes following this approach

Index elimination (Sec. 7.1.2)
- Basic algorithm FastCosineScore of Fig 7.1 only considers docs containing at least one query term
- Take this further:
  - Only consider high-idf query terms
  - Only consider docs containing many query terms

High-idf query terms only (Sec. 7.1.2)
- For a query such as catcher in the rye
- Only accumulate scores from catcher and rye
- Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much
- Benefit:
  - Postings of low-idf terms have many docs → these (many) docs get eliminated from set A of contenders

Docs containing many query terms (Sec. 7.1.2)
- Any doc with at least one query term is a candidate for the top K output list
- For multi-term queries, only compute scores for docs containing several of the query terms
  - Say, at least 3 out of 4
  - Imposes a “soft conjunction” on queries seen on web search engines (early Google)
- Easy to implement in postings traversal
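
Both index-elimination heuristics can be sketched as filters that build the contender set A before any cosine is computed. The sketch below is a minimal illustration, not the lecture's algorithm: the function names, the `min_idf` and `min_matches` thresholds, and the representation of postings as plain lists of docIDs are assumptions.

```python
import math
from collections import Counter


def high_idf_terms(query_terms, postings, n_docs, min_idf):
    """Keep only the query terms whose idf, log10(N / df), reaches min_idf."""
    kept = []
    for term in set(query_terms):
        plist = postings.get(term)
        if plist and math.log10(n_docs / len(plist)) >= min_idf:
            kept.append(term)
    return sorted(kept)


def docs_with_min_matches(query_terms, postings, min_matches):
    """Return the docIDs that contain at least min_matches of the query terms."""
    matches = Counter()
    for term in set(query_terms):
        for doc_id in postings.get(term, []):
            matches[doc_id] += 1
    return sorted(doc for doc, count in matches.items() if count >= min_matches)
```

Only the docs returned by `docs_with_min_matches` would then be scored. A counter over all postings is the simplest way to express the idea; a production index would more likely merge the sorted postings lists in a single traversal.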

3 of 4 query terms (Sec. 7.1.2)

Antony:     3 4 8 16 32 64 128
Brutus:     2 4 8 16 32 64 128
Caesar:     1 2 3 5 8 13 21 34
Calpurnia:  13 16 32

Scores only computed for docs 8, 16 and 32.

Champion lists (Sec. 7.1.3)
- Precompute for each dictionary term t, the r docs of highest weight in t’s postings
  - Call this the champion list for t (aka fancy list or top docs for t)
- Note that r has to be chosen at index build time
  - Thus, it’s possible that r < K
- At query time, only compute scores for docs in the champion list of some query term
  - Pick the K top-scoring docs from amongst these

Exercises (Sec. 7.1.3)
- How can Champion Lists be implemented in an inverted index?
- Note that the champion list has nothing to do with small docIDs

Static quality scores
- We want
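
Returning to the champion lists of Sec. 7.1.3, the index-time and query-time steps might look like the following Python sketch. It assumes postings stored as `(doc_id, weight)` pairs; the names `build_champion_lists` and `champion_candidates` are hypothetical, and the sketch stops at producing the contender set, leaving the scoring of those contenders to the usual cosine computation.

```python
import heapq


def build_champion_lists(weighted_postings, r):
    """Index time: keep, for each term, the r postings of highest weight.

    weighted_postings: term -> list of (doc_id, weight) pairs
    """
    return {
        term: heapq.nlargest(r, plist, key=lambda posting: posting[1])
        for term, plist in weighted_postings.items()
    }


def champion_candidates(query_terms, champions):
    """Query time: union of the champion lists of the query terms."""
    docs = set()
    for term in query_terms:
        docs.update(doc_id for doc_id, _ in champions.get(term, []))
    return docs
```

Because r is fixed when the index is built, a query with few terms can yield fewer than K candidates, which is the r < K situation noted on the slide.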
