UT Dallas CS 6359 - Lecture6

This preview shows page 1-2-3-21-22-23-43-44-45 out of 45 pages.


Unformatted text preview:

CS6322: Information Retrieval
Sanda Harabagiu
Lecture 6: Scoring, Term Weighting and the Vector Space Model

Recap of lecture 5
- Collection and vocabulary statistics: Heaps’ and Zipf’s laws
- Dictionary compression for Boolean indexes
  - Dictionary string, blocks, front coding
- Postings compression: gap encoding, prefix-unique codes
  - Variable-Byte and Gamma codes

Data structure                           Size in MB
collection (text, xml markup etc)           3,600.0
collection (text)                             960.0
Term-doc incidence matrix                  40,000.0
postings, uncompressed (32-bit words)         400.0
postings, uncompressed (20 bits)              250.0
postings, variable byte encoded               116.0
postings, γ-encoded                           101.0

This lecture; IIR Sections 6.2-6.4.3
- Ranked retrieval
- Scoring documents
- Term frequency
- Collection statistics
- Weighting schemes
- Vector space scoring

Ranked retrieval (Ch. 6)
- Thus far, our queries have all been Boolean.
  - Documents either match or don’t.
- Good for expert users with precise understanding of their needs and the collection.
  - Also good for applications: Applications can easily consume 1000s of results.
- Not good for the majority of users.
  - Most users incapable of writing Boolean queries (or they are, but they think it’s too much work).
  - Most users don’t want to wade through 1000s of results.
  - This is particularly true of web search.

Problem with Boolean search: feast or famine
- Boolean queries often result in either too few (=0) or too many (1000s) results.
- Query 1: “standard user dlink 650” → 200,000 hits
- Query 2: “standard user dlink 650 no card found”: 0 hits
- It takes a lot of skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many
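
The feast-or-famine behaviour of Boolean search can be reproduced on a toy collection. The sketch below is only an illustration: the four documents and the helper names `tokenize`, `boolean_and` and `boolean_or` are invented for this example and do not come from the lecture.

```python
def tokenize(text):
    """Lowercase and split on whitespace; returns the SET of terms."""
    return set(text.lower().split())

# Hypothetical four-document collection (made up for illustration).
DOCS = {
    1: "standard user manual for dlink 650",
    2: "dlink 650 card not found by standard user",
    3: "user guide for a standard router",
    4: "no card found",
}

def boolean_and(query, docs):
    """Documents containing ALL query terms."""
    terms = tokenize(query)
    return sorted(d for d, text in docs.items() if terms <= tokenize(text))

def boolean_or(query, docs):
    """Documents containing AT LEAST ONE query term."""
    terms = tokenize(query)
    return sorted(d for d, text in docs.items() if terms & tokenize(text))

query = "standard user dlink 650 no card found"
and_hits = boolean_and(query, DOCS)  # famine: no document has every term
or_hits = boolean_or(query, DOCS)    # feast: every document has some term
short_hits = boolean_and("standard user dlink 650", DOCS)
```

On this collection the long conjunctive query matches nothing, the disjunctive version matches every document, and only the shorter conjunctive query returns a manageable subset (documents 1 and 2).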

Ranked retrieval models
- Rather than a set of documents satisfying a query expression, in ranked retrieval models, the system returns an ordering over the (top) documents in the collection with respect to a query
- Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language
- In principle, there are two separate choices here, but in practice, ranked retrieval models have normally been associated with free text queries and vice versa

Feast or famine: not a problem in ranked retrieval
- When a system produces a ranked result set, large result sets are not an issue
  - Indeed, the size of the result set is not an issue
  - We just show the top k (≈ 10) results
  - We don’t overwhelm the user
- Premise: the ranking algorithm works

Scoring as the basis of ranked retrieval
- We wish to return in order the documents most likely to be useful to the searcher
- How can we rank-order the documents in the collection with respect to a query?
- Assign a score – say in [0, 1] – to each document
- This score measures how well document and query “match”.

Query-document matching scores
- We need a way of assigning a score to a query/document pair
- Let’s start with a one-term query
- If the query term does not occur in the document: score should be 0
- The more frequent the query term in the document, the higher the score (should be)
- We will look at a number of alternatives for this.
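
The one-term scoring idea and the "show only the top k" step can be sketched as follows. This is a minimal illustration under stated assumptions: the raw occurrence count is used as the score (so it is not normalized into [0, 1]), and the function names `score` and `top_k` and the sample documents are hypothetical.

```python
import heapq

def score(term, tokens):
    """Score for a one-term query: raw number of occurrences.
    Assumption: raw count, not scaled into [0, 1]."""
    return tokens.count(term)

def top_k(term, docs, k=10):
    """Return up to k (doc_id, score) pairs, highest score first.
    Documents that do not contain the term (score 0) are left out."""
    scored = ((score(term, text.lower().split()), doc_id)
              for doc_id, text in docs.items())
    matches = [(s, d) for s, d in scored if s > 0]
    best = heapq.nlargest(k, matches, key=lambda pair: pair[0])
    return [(d, s) for s, d in best]

# Hypothetical documents, made up for illustration.
docs = {
    "d1": "caesar praised brutus and caesar thanked brutus",
    "d2": "brutus left",
    "d3": "the tempest",
    "d4": "caesar caesar caesar",
}

ranking = top_k("caesar", docs, k=2)
no_match = top_k("calpurnia", docs)
```

However large the collection, the caller only ever sees k results, which is why result-set size stops being a problem once a ranking is available.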

Take 1: Jaccard coefficient
- Recall from Lecture 3: A commonly used measure of overlap of two sets A and B
- jaccard(A,B) = |A ∩ B| / |A ∪ B|
- jaccard(A,A) = 1
- jaccard(A,B) = 0 if A ∩ B = ∅
- A and B don’t have to be the same size.
- Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example
- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
- Query: ides of march
- Document 1: caesar died in march
- Document 2: the long march

Issues with Jaccard for scoring
- It doesn’t consider term frequency (how many times a term occurs in a document)
- Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t consider this information
- We need a more sophisticated way of normalizing for length
- Later in this lecture, we’ll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Recall (Lecture 1): Binary term-document incidence matrix (Sec. 6.2)

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                        1              1            0       0        0        1
Brutus                        1              1            0       1        0        0
Caesar                        1              1            0       1        1        1
Calpurnia                     0              1            0       0        0        0
Cleopatra                     1              0            0       0        0        0
mercy                         1              0            1       1        1        1
worser                        1              0            1       1        1        0

Each document is represented by a binary vector ∈ {0,1}^|V|

Term-document count matrices
- Consider the number of occurrences of a term in a document:
- Each document is a count vector in ℕ^|V|: a column below

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                      157             73            0       0        0        0
Brutus                        4            157            0       1        0        0
Caesar                      232            227            0       2        1        1
Calpurnia                     0             10            0       0        0        0
Cleopatra                    57              0            0       0        0        0
mercy                         2              0            3       5        5        1
worser                        2              0            1       1        1        0
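
The Jaccard scoring example can be checked with a few lines of code, which also shows why counts are needed: Jaccard works on term sets, so repeating a term in a document does not change the score. This is a small sketch, not a reference implementation; returning 0.0 for two empty sets is an assumption made here to avoid division by zero.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: score two empty sets as 0
    return len(a & b) / len(union)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()

s1 = jaccard(query, doc1)  # 1 shared term out of 6 distinct terms
s2 = jaccard(query, doc2)  # 1 shared term out of 5 distinct terms

# Repeating "march" leaves the term set, and so the score, unchanged.
s2_repeated = jaccard(query, "the long march march march".split())
```

Document 2 scores 1/5 and Document 1 scores 1/6, so the shorter document ranks higher even though both contain exactly one query term once.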

Bag of words model
- Vector representation doesn’t consider the ordering of words in a document
- “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
- This is called the bag of words model.
- In a sense, this is a step back: The positional index was able to distinguish these two documents.
- We will look at “recovering” positional information later in this course.
- For now: bag of words model

Term frequency tf
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing
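
As a small illustration of the bag of words model and of the term frequency definition, the sketch below builds count vectors for the two example sentences. The helper names `bag_of_words` and `tf` are hypothetical, and tokenization is plain lowercasing plus whitespace splitting, which is an assumption rather than anything the lecture specifies.

```python
from collections import Counter

def bag_of_words(text):
    """Map each term to its number of occurrences; word order is discarded."""
    return Counter(text.lower().split())

def tf(term, bag):
    """Term frequency tf_{t,d}: how often term t occurs in document d.
    A Counter returns 0 for terms that are absent."""
    return bag[term]

d1 = bag_of_words("John is quicker than Mary")
d2 = bag_of_words("Mary is quicker than John")

same_vector = (d1 == d2)  # the two sentences are indistinguishable
```

Because both sentences map to the same counts, any score computed from these vectors alone cannot tell them apart.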

