UD CISC 689 - Information Retrieval Midterm Exam

CISC689/489-010 Information Retrieval Midterm Exam

You have 2 hours to complete the following four questions. You may use notes and slides. You can use a calculator, but nothing that connects to the internet (no laptops, Blackberries, iPhones, etc.). Good luck!

1. Short answer (5 points each)

Answer each of the following questions in a few sentences.

(a) Why are term frequency and inverse document frequency used so often in document scoring functions?

⇒ Inverse document frequency: terms that appear in many documents are less likely to be important than terms that appear in few documents, because the more documents a term appears in, the more likely the content of those documents is to be unrelated to the term. Therefore greater inverse document frequency indicates greater importance.
Term frequency: a term that appears many times in a document is an indicator that the term is important to that document, and therefore that the document is more likely to be "about" that term. Therefore greater term frequency indicates greater importance.

(b) How do stopping and stemming reduce the size of an inverted index?

⇒ Stopping: by eliminating the terms with very long inverted lists.
Stemming: by reducing the number of inverted lists, consolidating the lists for two or more terms with the same stem.

(c) With 5,000 documents and 10,000 unique vocabulary terms, a bit vector index requires 5 × 10^7 bits of storage. Suppose documents have 200 terms on average. If we added 2,200 more documents to the collection, roughly how big would the bit vector index become? Use Heaps' law with k = 10 and β = 0.5.

⇒ Heaps' law tells us that 10,000 = 10 · (5000 · 200)^0.5. If we add 2,200 documents, we have V = 10 · (7200 · 200)^0.5 = 12,000 vocabulary terms, so the bit vector index requires 7200 · 12000 ≈ 9 × 10^7 bits.

(d) The figure below depicts interpolated precision-recall curves for two search engines that index research articles. There is no difference between the engines except in how they score documents. Imagine you're a scientist looking for all published work on some topic. You don't want to miss any citation. Which engine would you prefer and why?

[Figure: interpolated precision-recall curves for Engine 1 and Engine 2; recall on the x-axis, interpolated precision on the y-axis, both from 0.0 to 1.0.]

⇒ Engine 1 ranks more relevant documents in the top results, but Engine 2 finds all of the relevant documents faster. If there are R relevant documents total, and precision at recall = 1 is roughly 0.25 for Engine 2, then I have to go to rank R/0.25 = 4R in Engine 2 to find all of the relevant documents. If precision at recall = 1 is roughly 0.01 for Engine 1, then I have to go to rank R/0.01 = 100R in Engine 1 to find all of the relevant documents. That's 25 times more documents I have to look at in Engine 1's results. Thus I prefer Engine 2.

(e) Describe the advantages and disadvantages of language models and vector space models with respect to each other.

⇒ Vector space models are flexible: you can set term weights to be anything you want, and you can easily add additional features. They are easy to understand and implement. The main disadvantage is that there is no formally motivated way to determine what the term weights should be, and there is a practically infinite space of possibilities to search in.
Language models have a formal motivation in terms of probabilities of terms being sampled from documents, which limits the search space of term weights. Language models can incorporate arbitrarily complex features of natural language. The disadvantages are that there are many parameters to estimate (probabilities of features in every document), it is difficult to incorporate non-language features, and there is no explicit model of relevance.

(f) You have developed a new method for parsing documents that uses semantic information to decide which sentences to index and which to skip. How would you determine whether your method produces better retrieval results than indexing every sentence?

⇒ I assume I already have an index that includes every sentence and a query processing engine for that index. Now I will re-parse every document in my collection and re-build the index from scratch. I will use a sample of queries as inputs to both engines, giving me two document rankings for each query: one from the original index, one from the index with the new parsing method. Then I will have assessors judge the relevance of documents, and use those relevance judgments to calculate measures like precision, recall, and average precision. Whichever engine has the highest precision or recall or average precision is the one that was better.

(g) What is the primary difference between a signature file index and a bit vector index? How does this difference affect performance (storage space and retrieval performance)?

⇒ In a bit vector index, every document is represented by a bit vector of length V (the vocabulary size). In a signature file index, documents are represented by bit vectors of length k (a parameter set by the engine developer). In terms of storage space, k can be set so that the index has much lower storage requirements than the bit vector index. In terms of retrieval performance, the smaller k is, the more "collisions" there will be in query processing. This results in more false matches.

(h) Suppose you have observed that users click on the second result half as often as the first result, on the third result 1/3rd as often as the first result, and so on (clicks on the ith result = 1/i times clicks on the first result). How would you modify the discount in DCG to model this behavior?

⇒ I would set the discount function to 1/i instead of 1/log(i + 1).

2. Indexing (20 points)

Sketch pseudo-code for indexing a collection of documents with an inverted file. Be sure to include all the steps you have performed in the project. It does not have to be exactly correct, but it should cover all the major points of building an inverted index. You do not need to include pseudo-code for stemming and compression, but you should include calls to stemming and compression functions where appropriate.
(Be sure to manage your time spent on this problem. Do not spend an hour making sure every detail works right. Focus on including the steps in the right order.)

⇒ Here's one possibility. Obviously there are many possible answers, though the basic steps do not change.

function ParseAndTokenize(D)
    determine which parts of D are important (according to pre-determined rules)
    tokenize
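The preview cuts the pseudo-code off after the tokenization step. As one way to see the remaining major steps (stopping, stemming, posting accumulation, and a compression pass), here is a minimal runnable sketch. It is an illustrative reconstruction, not the exam's official answer: the stopword list is a tiny stand-in, `stem` is a trivial suffix-stripping placeholder for a real stemmer such as Porter's, and "compression" is shown as delta-encoding of docID gaps.

```python
import re
from collections import defaultdict

def parse_and_tokenize(doc_text):
    """Keep the 'important' parts (here: all text) and split into tokens."""
    return re.findall(r"[a-z0-9]+", doc_text.lower())

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # tiny illustrative list

def stem(term):
    """Placeholder for a real stemmer: strip a trailing 's'."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def delta_encode(postings):
    """Compression step: store gaps between sorted docIDs instead of raw IDs."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def build_inverted_index(docs):
    """docs: list of document strings; a document's docID is its list position."""
    index = defaultdict(list)  # term -> list of docIDs, sorted by construction
    for doc_id, text in enumerate(docs):
        seen = set()
        for token in parse_and_tokenize(text):
            if token in STOPWORDS:
                continue                  # stopping
            term = stem(token)            # stemming
            if term not in seen:          # one posting per (term, document)
                index[term].append(doc_id)
                seen.add(term)
    # compress every inverted list before "writing to disk"
    return {term: delta_encode(postings) for term, postings in index.items()}

docs = ["Cats chase mice", "The mice hide", "Dogs chase cats and dogs"]
index = build_inverted_index(docs)
```

A production indexer would process documents in batches, sort (term, docID) pairs, and merge partial runs on disk, but the order of steps matches the outline above: parse, tokenize, stop, stem, accumulate postings, then compress the lists.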