CISC689/489-010 Information Retrieval Homework 1

Answer the following questions in your own words. You may discuss them with other students, but you must turn in your own work.

1. (25 points) Suppose we have a collection of N documents and V unique terms. For simplicity, assume that each term appears at most once per document. Answer the following questions assuming Zipf's law holds.

(a) What proportion of terms appear just once (i.e. in only one document)?
(b) What proportion of terms appear ten times (i.e. in ten documents)?
(c) What proportion of terms appear N times (i.e. in every document)?

Now suppose we are going to index the documents and terms using an inverted index. If V = 100 and Zipf's law is true, roughly how many integers will be stored in the inverted file? What about for V = 1000? What is the general expression in terms of V?

Suppose we used a bit vector index instead of an inverted index. How big (in bytes) would the bit vector index be in terms of N and V?

Suppose we used a signature index with the signature size set to k = 8192 bits. How big (in bytes) would the signature file be in terms of N and V? What would k need to be to ensure the signature file is smaller than the inverted file?

2. (25 points) This question asks you to compare three different compression algorithms: Elias-γ encoding, Elias-δ encoding, and v-byte encoding.

(a) Suppose a word appears at least once in all N documents in a collection. What compression ratio would be achieved using each of the three algorithms to compress document numbers? (Define compression ratio as the total size of the compressed numbers divided by the total size of the uncompressed numbers. Assume an uncompressed number requires 4 bytes of storage.)

(b) Suppose a word appears in every 10th document, and it appears 5 times in each of those documents (i.e. 5 times in document 1, 5 times in document 11, ...).
What compression ratio would be achieved using each of the three algorithms to compress the full inverted list?

(c) Suppose a word appears in n documents such that the maximum gap between any two document numbers is 25. Suppose it occurs at least once in each of those documents, but no more than 10 times in any document. What is the minimum compression ratio that could be achieved by each of the three algorithms? What is the maximum compression ratio?

(d) Extra credit (5 points): Is there a way to compress inverted lists such that terms that appear in all N documents require just 1 bit? If so, describe it. Does it provide lossless compression? Can it be decompressed unambiguously? Will it provide better compression ratios on average than the three algorithms above?

3. (25 points) This question asks you to compare retrieval models. Suppose we have the query "oil producing nations", and the three query terms have inverted lists of the form I_k → (df_k, ctf_k, (doc_i, tf_ik), ...):

oil       → (5, 18, (1, 4), (4, 3), (6, 1), (7, 2), (8, 8))
producing → (4, 20, (1, 6), (2, 2), (5, 4), (8, 8))
nations   → (3, 11, (1, 1), (3, 2), (8, 8))

Further suppose we have a collection of documents with lengths as follows:

d1 → 498    d6  → 639
d2 → 627    d7  → 566
d3 → 551    d8  → 423
d4 → 648    d9  → 589
d5 → 621    d10 → 525

and the total number of terms in the corpus is 5687.
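As an aside, the statistics above can be captured as plain data structures so that hand calculations for the scoring questions can be spot-checked. This is only a sketch; the variable names are illustrative, not part of the assignment.

```python
# Inverted lists from question 3: term -> (df, ctf, [(doc_id, tf), ...]).
inverted_lists = {
    "oil":       (5, 18, [(1, 4), (4, 3), (6, 1), (7, 2), (8, 8)]),
    "producing": (4, 20, [(1, 6), (2, 2), (5, 4), (8, 8)]),
    "nations":   (3, 11, [(1, 1), (3, 2), (8, 8)]),
}

# Document lengths and the corpus size given above.
doc_lengths = {1: 498, 2: 627, 3: 551, 4: 648, 5: 621,
               6: 639, 7: 566, 8: 423, 9: 589, 10: 525}
total_terms = 5687

# Consistency checks: df equals the number of postings, ctf equals the sum of
# the term frequencies, and the ten document lengths sum to the corpus total.
for term, (df, ctf, postings) in inverted_lists.items():
    assert df == len(postings)
    assert ctf == sum(tf for _, tf in postings)
assert sum(doc_lengths.values()) == total_terms
```

Note that the last check passes because the ten lengths sum to exactly 5687: these ten documents constitute the entire corpus.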
What are the scores of the 10 documents using each of the following retrieval models?

(a) Vector space model with term weighting w_ik = (tf_ik / len_i) * log((N + 1) / (0.5 + df_k))
(b) Binary independence model
(c) BM25 model with parameters k1 = 1.2, b = 0.75, k3 = 0
(d) Language model with Jelinek-Mercer smoothing parameter λ = 0.2
(e) Language model with Dirichlet smoothing parameter µ = 2000

Explain what makes these models different from each other, focusing especially on how they use term frequency, document frequency, and document length to calculate document scores.

Extra credit (5 points): Pick one of these models and describe how you might modify it to give more weight to terms appearing in the title of a document. You may introduce additional term statistics to support your model, but be sure to define them carefully.

4. (25 points) Now suppose the documents from question 3 have been judged by an assessor as follows:

d1 → not relevant        d6  → relevant
d2 → not relevant        d7  → not relevant
d3 → relevant            d8  → highly relevant
d4 → highly relevant     d9  → not relevant
d5 → not relevant        d10 → not relevant

Calculate precision at rank 5, recall at rank 5, average precision, and R-precision for each of the five models from question 3. Which performed the best by each measure? Which performed the worst?

If the gain of a nonrelevant document is 0, the gain of a relevant document is 2, and the gain of a highly relevant document is 4, calculate DCG at rank 10 for each of the models from question 3. Which performed the best?

Extra credit (5 points): Is there a way to define term weights w_ik for the vector space model such that it performs the best by every evaluation measure? If not, why not? If so, describe it. (Note that you cannot simply assign a number to each term. The weights must be a function of term statistics.)
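The measures in question 4 can be sketched in code for checking hand calculations. This sketch assumes the graded judgments collapse to binary relevance for precision, recall, average precision, and R-precision, and that DCG discounts the gain at rank i by log2(i) with rank 1 undiscounted; verify both conventions against the definitions used in class. The ranking in the example is hypothetical, not the output of any of the five models.

```python
from math import log2

def precision_at(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks where relevant docs appear."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

def dcg_at(ranking, gains, k):
    """DCG@k: gain at rank i discounted by log2(i); rank 1 is undiscounted."""
    return sum(gains.get(d, 0) / max(log2(i), 1.0)
               for i, d in enumerate(ranking[:k], start=1))

# Judgments from question 4: binary relevant set and graded gains.
relevant = {3, 4, 6, 8}
gains = {3: 2, 4: 4, 6: 2, 8: 4}

# A hypothetical ranking of the ten documents, purely for illustration.
ranking = [8, 1, 4, 2, 6, 3, 5, 7, 9, 10]

p5 = precision_at(ranking, relevant, 5)   # 3 of the top 5 are relevant -> 0.6
r5 = recall_at(ranking, relevant, 5)      # 3 of the 4 relevant found -> 0.75
ap = average_precision(ranking, relevant)
rprec = precision_at(ranking, relevant, len(relevant))  # R-precision, R = 4
dcg10 = dcg_at(ranking, gains, 10)
```

R-precision falls out of `precision_at` by setting k to the number of relevant documents, so no separate function is needed.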


UD CISC 689 - CISC 689 Information Retrieval Homework
