CS674/INFO 630: Advanced Language Technologies, Fall 2007
11/29/07: Final exam guide

The final exam for the Fall 2007 semester is Thursday, Dec 13, 2:00-4:30 pm, Thurston 202. You may bring to the exam up to five sheets of notes (8"x11", both sides OK), but no other reference material. Use of calculators, laptops, etc. is not permitted.

Note that the exam is comprehensive, in the following sense: all material from lecture 1 (8/28/07) onwards, including lectures for which no lecture guides are available [1], is fair game. On the other hand, the exam will also be held to approximately four to six questions to allay time pressure; hence, it is easily inferred that the coverage will be very heavily weighted towards material not tested on the midterm.

It bears repeating that what is of most interest is whether one understands how the various methods and models we have discussed were developed. This is because we have introduced fundamental concepts and tools that one simply must be familiar with to understand current research in human-language technologies, and because such understanding should help enable one to develop new methods and models. As has been previously mentioned, an argument can be made that one really understands a concept when one can understand the implications when some assumption or other aspect of the setting is changed. My favorite kind of question is based on this principle.

I will be holding drop-in office hours on Thursday Nov 29, Thursday Dec 6, and Tuesday Dec 11, all 3-4 pm. You can, of course, also make an appointment (with advance notice).

The five questions from the Spring 2006 final exam appear on the following pages. Some notation differs from this year's, but this shouldn't pose any particular problem.

[1] Those that are available are/will be posted at the course homepage, http://www.cs.cornell.edu/courses/cs674/2007fa/ .

(1) [11 points] Figure 1 below represents hypothetical results produced by Google in response to two different queries.
We have indicated whether or not the summaries were clicked on; we have also indicated what "class" the URLs of the corresponding documents belong to, where we assume some predefined set of possible classes (e.g., trustworthy vs. unknown vs. untrustworthy). For simplicity's sake, assume that the summaries shown below are actually the full text of the corresponding documents (hence the use of "d" instead of "s" to label the summaries).

    q_a: "cats versus dogs"
        click     d_a1: "dogs outclass cats"     [URL class 0]
        no click  d_a2: "cats are great"         [URL class 1]

    q_b: "garfield versus snoopy"
        no click  d_b1: "garfield is snoopy"     [URL class -1]
        click     d_b2: "critics prefer snoopy"  [URL class 1]

    Figure 1: Results for two queries.

Finally, assume the following term index:

    v(1)=dogs      v(2)=outclass  v(3)=cats    v(4)=are       v(5)=great
    v(6)=garfield  v(7)=versus    v(8)=snoopy  v(9)=is        v(10)=critics  v(11)=prefer

a) Suppose that Figure 1 represents the training data for Joachims' 2002 system (which ignores query chains). Furthermore, assume that the query-document representation scheme is

    Phi(q, d) = ( cos(~q, ~d), URL class of d )^T,

where, for simplicity, we use tf weighting (instead of tf-idf weighting) to create ~q and ~d.

Is ~w = (0, 1)^T a valid choice for Joachims' algorithm based on the above training data? Justify your response. Your answer should include

  - an intuitive explanation of the kinds of items that would be preferred if we did indeed use ~w = (0, 1)^T for ranking (sample, probably incorrect answer: "documents containing exactly one word of the query");
  - an explicit but brief explanation of what constraints, if any, are inferred from the data above, as well as what potential constraints aren't inferred, if any; and,
  - explicit numerical computation of those Phi(q, d) that you use in checking whether the proposed weight vector is valid.

b) Suppose instead that we treat the clicks and non-clicks in Figure 1 as explicit relevance feedback.
If we apply the Rocchio algorithm upon the results of q_b with alpha = 1 and gamma = 0, what is the minimum value of beta, if any, that will result in ranking d_b2 above d_b1 with respect to the new version of q_b (i.e., will cause the Rocchio algorithm to "do the right thing")? Be sure to justify your answer, showing all steps and providing brief but clear explanations of them.

To simplify your calculations, use non-normalized tf weighting to form document vectors.

(2) [6 points] Fall 2007 note: one should not necessarily expect a midterm question to be repeated. In the Spring 2006 semester, a question from the midterm was repeated due to unusual circumstances.

This question, which also appeared in extremely similar form on the midterm, modifies the setting that resulted in our second derivation of the LM-based approach.

Assume a finite set of document-topic language models t_1, t_2, ..., t_n, where the parameters for each t_i are known and where we assume that each document was generated by exactly one of the models t_i. Suppose that the system is issued a query whose semantics is, "A document is relevant if it was generated by t_1 or by t_2". You should consider the query to be fixed and to be not a term sequence and hence not "generatable" by an LM (for instance, perhaps the system gets information requests through the user clicking on some checkboxes).

Derive (with adequate explanation of your steps) a scoring function that results from expanding P(R = y | D = d) based on the information just given, where it is required that most, if not all, of the quantities in your function can be directly estimated in a reasonable way.
For each quantity, be sure to explain either how you would estimate it, justifying your choice, or why a problem arises (despite good-faith effort on your part) in estimating it.

Note: topic LMs are allowed to "generate" documents that are not in the corpus.

(3) [11 points]

a) In the following three-matrix product, a, b, c, d, and e are variables and V is a matrix:

    [ sqrt(2)/2  0 ]
    [ sqrt(2)/2  a ]  [ 5  c ]  [ V^T ]
    [ 0          b ]  [ d  e ]

(the size of the box surrounding V^T is not meant to indicate anything about V's dimensionality).

Suppose someone asserts the following statement:

    The above represents a singular-value decomposition for some matrix.

Give the values or most specific ranges of values for the variables a through e and the dimension of V that can be inferred from this statement. Be sure to give the reason(s) for each of your inferences.

b) Suppose we have a corpus consisting of just two document vectors, ~d(1) = (x,
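As a study aid for Rocchio-style questions such as (1b), the update and the resulting cosine scores can be sanity-checked numerically. The sketch below uses the Figure 1 data for q_b under non-normalized tf weighting; the helper names (tf, rocchio, cos_sim) are ours, not from the course, and printing the scores for several beta values is no substitute for the worked justification the question asks for.

```python
import numpy as np

# Fixed term index v(1)..v(11) from question (1).
TERMS = ["dogs", "outclass", "cats", "are", "great",
         "garfield", "versus", "snoopy", "is", "critics", "prefer"]

def tf(text):
    """Non-normalized term-frequency vector over the fixed term index."""
    tokens = text.split()
    return np.array([tokens.count(t) for t in TERMS], dtype=float)

def rocchio(q, relevant, nonrelevant, alpha, beta, gamma):
    """Classic Rocchio update: alpha*q + beta*centroid(rel) - gamma*centroid(nonrel)."""
    q_new = alpha * q
    if relevant:
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q_new = q_new - gamma * np.mean(nonrelevant, axis=0)
    return q_new

def cos_sim(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

q_b  = tf("garfield versus snoopy")
d_b1 = tf("garfield is snoopy")      # shown but not clicked
d_b2 = tf("critics prefer snoopy")   # clicked -> treated as relevant feedback

# With alpha = 1 and gamma = 0 (as in the question), sweep beta and watch
# when d_b2 overtakes d_b1 under cosine ranking against the updated query.
for beta in (0.0, 0.25, 0.5, 0.75, 1.0):
    q_new = rocchio(q_b, [d_b2], [d_b1], alpha=1.0, beta=beta, gamma=0.0)
    print(f"beta={beta:4.2f}  cos(q', d_b1)={cos_sim(q_new, d_b1):.3f}"
          f"  cos(q', d_b2)={cos_sim(q_new, d_b2):.3f}")
```

On the exam itself, of course, the point is to carry out this computation by hand and explain each step.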

