DOC PREVIEW
Stanford CS 276 - CS276 Problem Set 1

This preview shows page 1 out of 4 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS276 Problem Set #1 5 questions, 10 points each. Assigned: Tuesday April 5th 2011 Due: Thursday April 14th 2011 Delivery: Assignments must be submitted by 5 p.m. Pacific on the due date. Problem sets should be handed to TAs in class or left in the box outside of Professor Manning's office. All assignments not delivered in class must have the time and date of submission clearly marked on the front page. SCPD students: Fax to (650) 736-1266, (650) 725-4138, or email to scpd- [email protected] Late policy: Refer to the course webpage. Honor code: Please review the collaboration and honor code policy on the course webpage. Also note: because some questions may be drawn from previous years’ problem sets or exams, students are forbidden to consult solution sets from previous years unless we explicitly provide them. !!!1. Querying Positional Index Shown below is a portion of the positional index in the format term: doc1: <position1, position2, …>; doc2: <position1, position2, …>; etc. angels: 2: <36,174,252,651>; 4: <12,22,102,432>; 7: <17>; fools: 2: <1,17,74,222>; 4: <8,78,108,458>; 7: <3,13,23,193>; fear: 2: <87,704,722,901>; 4: <13,43,113,433>; 7: <18,328,528,706>; in: 2: <3,37,76,444,851>; 4: <10,20,110,470,500>; 7: <5,15,25,195>; rush: 2: <2,66,194,321,702>; 4: <9,69,149,429,569>; 7: <4,14,404>; to: 2: <47,86,234,999>; 4: <14,24,774,944>; 7: <199,319,599>; tread: 2: <57,94,333>; 4: <15,35,155>; 7: <20,320,709>; where: 2: <67,124,393,1001>; 4: <11,41,101,421,431>; 7: <16,36,736>; Which document(s) if any meet each of the following queries, where each expression within quotes is a phrase query? (a) “fools rush in” (b) “fools rush in” AND “angels fear to tread” (c) What is the set of results returned for query fear /3 tread (query bold) on the index !!2. Query Processing (a) For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn’t. (b) The first lecture covered the simple merging algorithm used for intersection of two postings list, i.e., p1 AND p2 (IIR Section 1.3). Give pseudo-code for a new algorithm that will instead compute p1 AND NOT p2. Briefly explain how you would handle it in the case of p1 OR NOT p2. (c) We can incorporate position information into the index for phrase queries. Algorithms for construction and intersection of such positional indexes were discussed in Lecture 2. These algorithms enable '/k' proximity queries (IIR Section 1.4, 2.4.2). In this question, you will design a scheme to additionally support '/s' (within-sentence) queries. (i) In the form of positional indexing discussed in class, each document is considered as a single unit. Is this sufficient to support '/s' queries? If so, why? If not, what else is needed? (ii) A positional index can be thought of as having two levels – the document “level” and, for each document, the postings “level.” Describe the structure of a “three-level” index that can support both '/s' and '/k' queries. (iii) Give pseudo-code for a new algorithm, SentencePositionalIntersect(p1, p2, k), able to support a joint '/s' and '/k' query on the data structure from (ii). Make your algorithm as efficient as possible. As input, your function should accept two (extended) postings p1, p2 and a term distance k. The function should return the set of docIDs satisfying two conditions: both terms must be in the same sentence and both terms must also be within distance k of each other. _____________________________________________________________________________ 3. Comparison of variable byte codes, γ codes, and δ codes (a) Compute variable byte and γ codes for the postings list <777, 17743, 294068, 31251336>. Assume that the first gap encodes offset from 0th document. Write binary codes in 8-bit blocks. (b) Reconstruct the postings list for the γ!code 10111111011011111000111110001111010. Assume the first gap encodes offset from 0th document. (c) γ codes are relatively inefficient for large numbers as they encode the length of the offset in inefficient unary code. δ codes differ from γ codes in that they encode the first part of the code (length) in γ code instead of unary code. The encoding of the second part of the code (offset) is the same. For example, δ code of 7 is 10,0,11 (commas are added for readability). 10,0 is the γ code for length of 2 and the encoding of offset (11) is unchanged. Compute the δ code for the postings list in (ii) above. For what range of numbers is the δ code shorter than the γ code?(d) In variable byte encoding we can notice that a gap cannot be of length 0 and that the stuff left to encode after shifting to next byte cannot be 0. Based on these observations suggest a modification to variable byte encoding that allows you to encode slightly larger gaps in the same amount of space? What is the largest gap you can encode in 2 bytes? 4. Map Reduce MapReduce is a form of parallel computation developed at Google. It works by breaking a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs. Schema of map and reduce functions map: input ? list (k, v) reduce: (k,list(v)) ? output You are expected to write code similar to the following pseudo-code: map(String key, String value): // Key: explain what it stands for // Value: explain what it stands for // Pseudo-code here EmitIntermediate( X,Y ); // Fill in appropriate values in place of X and Y reduce(String key, Iterator values): // Key: explain what it stands for // Value: explain what it stands for // Pseudo-code here Emit(X,Y); //Fill in X and Y How would you do the following tasks using MapReduce? a. Build a bigram index (as in Lecture 3) For example the final output should have entries like : $b -> list(ball, bat, boy) oy -> list(boy, oyster) ... b. To find word co-occurrence frequency on a per sentence level. For example: Sentence1: "he was there" Sentence2: "he was not there" Output: he - there => 2, not - there => 1 ... Note: Assume stemmed sentences with no duplicate words5. Wild-card Queries (a) How many character bigram dictionary entries and character bigram posting entries are generated by indexing the bigrams in the terms in the following text as discussed in IIR Section 3.2.2: humpty dumpty


View Full Document

Stanford CS 276 - CS276 Problem Set 1

Documents in this Course
Load more
Download CS276 Problem Set 1
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view CS276 Problem Set 1 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view CS276 Problem Set 1 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?