Princeton COS 435 - Problem Set 5 - D2394893

Home> Schools> Princeton University> Computer Science (COS) > COS 435> Problem Set 5

Princeton COS 435 - Problem Set 5

Course Cos 435- Networks, Economics and Computing

Pages 2

Download Save

Unformatted text preview:

COS 435, Spring 2008 - Problem Set 5 Due at 5:00pm, Monday April 14, 2008 (end of the business day). Collaboration Policy You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently. Lateness Policy A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements: • Penalized 10% of the earned score if submitted by 5:00 pm Tuesday (4/15/08). • Penalized 25% of the earned score if submitted by 5:00 pm Saturday (4/19/08). • Penalized 50% if submitted later than 5:00 pm Saturday (4/19/08). Problem 1 (divisive clustering): On slide #11 of the class presentation April 2, there is an iterative improvement algorithm intended to be used for divisive partitioning. This problem addresses recalculating the min-max cut cost (slides #8 and #9) incrementally for use with that algorithm. Assume that for any objects v and w, sim(v,w)=sim(w,v) (we have been assuming this in class). Also assume that for any object v, sim(v,v)=0. Part a: Give an equation for cutcost(Cp)-cutcost(Cp –{x}) when x is an object in Cp. Here Cp is an arbitrary cluster and Cp –{x} is Cp with x removed. Your equation should be in terms of similarities between x and other objects. Hint: the quantity ∑ sim(vi, x) where U is the set of all objects vi in U is useful because it is a function of x independent of the clustering and can be precomputed before the clustering construction is begun. Part b: Using your equation of Part a, express cutcost(Cp –{x}) as an incremental change to cutcost(Cp). Also express cutcost(Cq U {x}) as an incremental change to cutcost(Cq) when Cq is an arbitrary cluster that does not contain x. (Cq U {x} is Cq with x added.)Part c: Give an equation for intracost(Cp)-intracost(Cp –{x}) when x is an object in Cp. Your equation should be in terms of similarities between x and other objects. Part d: Using your equation of Part c, express intracost(Cp –{x}) as an incremental change to intracost(Cp) and express intracost(Cq U {x}) as an incremental change to intracost(Cq) when Cq does not contain x. Part e: Given your equations in Parts b and d, what is the time complexity of the step: move vj to that cluster, if any, such that move gives maximum decrease in cost of the iterative improvement algorithm on slide #11. You may assume ∑ sim(vi, x) where U is the set of all objects vi in U is precomputed before the initial clustering is chosen; don’t include the cost of this precomputation. Justify your answer. Problem 2 (detecting near-duplicate documents): Part a: Let D denote a document that is 500 words long and contains each of the words “philanthrepist” , “pendantic” and “androgenous” exactly once each, with “philanthrepist” occurring in word position 100, “pendantic” in position 205, and “androgenous” in position 320. Each of these words is misspelled. Let Dcor be the document with these spelling errors corrected (“philanthropist” , “pedantic” and “androgynous”). For a 5-shingling of each document, give a lower bound and an upper bound on the resemblance r(D, Dcor) as given on slide #7 of the class presentation (April 7 and 9). (Do not expect your lower and upper bounds to be close.) Part b: Let E denote a document that is 500 words long and contains each of the words “philanthrepist” , “pendantic” and “androgenous” exactly once each but as the phrase “pendantic androgenous philanthrepist” starting at word position 200. Let Ecor be the document with the spelling errors in this phrase corrected ( “pedantic androgynous philanthropist”). For a 5-shingling of each document, give a lower bound and an upper bound on the resemblance r(E, Ecor) as given on slide #7 of the class presentation. (Again, do not expect your lower and upper bounds to be close.) Part c: When computing the 5-shingling of each of D, Dcor, E, and Ecor, suppose 25% of all possible shingles are repeated shingles. In this case, what are r(D,Dcor) and r(E,Ecor)? For what threshold would one pair be considered near-duplicates and the other not? Which is which? In your opinion, is this a desirable

View Full Document


School:
Email:
New Password:
Confirm Password:

Princeton COS 435 - Problem Set 5

Sign up for free to view:

Please select your school