Ling 289 Homework #3
Due: Wed 17 Oct 2007

1. There’s a small subfield of humanities computing that attempts to use statistics from texts for authorship attribution. Commonly this involves looking at the frequencies of fairly common function words or function word sequences, which are assumed to be fairly consistently used by an author across texts, but on which another author might have idiosyncratically different usage. This problem considers texts of Jane Austen. In particular, when she died, she left a partly finished novel (Sanditon I), which was completed by a fan, attempting to write in her style (Sanditon II), and the composite was then published. The table gives the relative frequency of the word "a" preceded by and not preceded by "such" (i.e., the latter is the sum of the counts for all other words X in "X a"), the word "and" followed by or not followed by "I", and the word "the" preceded by or not preceded by "on", in the two halves of Sanditon and two other Jane Austen novels. Was Austen consistent in these habits of style from one work to another? Did her imitator successfully copy these aspects of her style? What evidence can you draw in favor of or against it being a successful copy? (You should be looking to use a significance test from class. . . .)

Word sequence   Sense and Sensibility   Emma   Sanditon I   Sanditon II
such a                             14     16            8             2
¬such a                           133    180           93            81
and I                              12     14           12             1
and ¬I                            241    285          139           153
on the                             11      6            8            17
¬on the                           259    265          221           204

2. In Wasow (1997), in the section on Collocations and HNPS, one finds the following data (graphed in Figure 12 of the paper):

                           HNPS   not   Total
Transparent collocations     90   102     192
Non-collocations             59   329     388

(i) Confirm the figure given in the paper for the chi-square test.
(ii) Calculate the odds ratio for heavy NP shift for transparent collocations versus non-collocations. That is, how many times larger are the odds of HNPS with a transparent collocation than with a non-collocation?

3. Consider the grammar for Adam I in the Suppes (1970) article.
Suppes notes that the major reason for the only limitedly good fit of the grammar to the model is the use of the NP → NP NP rule, which ends up badly underestimating the number of times you would see N N (see Table 1). However, this aspect of the grammar can be changed. Note that Suppes includes a special rule NP → AdjP N in the grammar, even though sequences like A N could have been generated using the NP → NP NP rule. Try making a change to the grammar that will improve its estimates (it isn’t important that you succeed, providing that you do the calculations below correctly and present the results). Work out maximum likelihood estimates for the rules in your new grammar, and then the predicted (‘theoretical’) frequencies of each form (recall that there are 2434 total noun phrases in the corpus in Table I). This isn’t quite as difficult as Suppes makes it look! Adopt the same simplifying assumption that he did, under which each terminal sequence of parts of speech is given its ‘simplest’ analysis under the grammar. You should then be able to give an analysis to each string, and to count how often NP and any other nonterminals you use (i.e., also AdjP in Suppes’ grammar) appear in the grammar, and how often they are rewritten in different ways. This will give maximum likelihood estimates for the rules, and will allow you to calculate predicted frequencies for different strings of parts of speech over a corpus this size.
(a) Give your grammar as a simple probabilistic CFG.
(b) Show a table corresponding to Table I for your data.
(c) Work out the goodness of fit of the grammar to the data using a chi-square test. (To work out the number of degrees of freedom to use, pay attention in class, and/or read carefully p. 112. The number of degrees of freedom is the number of cells in the table minus the number of parameters set from the data minus 1.)
Is your grammar better than Suppes’ grammar?
(d) How much of the probability mass of the grammar is given to strings that were not observed at all in the data of Table I?
The easiest way to do this problem is probably by adapting the spreadsheet I used in
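
As an alternative to a spreadsheet, the arithmetic behind these problems can be sketched in a few lines of Python. This is a minimal sketch, not the course's own method: the function names (pearson_chi_square, odds_ratio, goodness_of_fit) are hypothetical, the test is assumed to be the uncorrected Pearson chi-square, and no p-value is computed, so the statistic still has to be compared against a chi-square table at the returned degrees of freedom. The goodness-of-fit helper uses the degrees-of-freedom rule stated in Problem 3(c) and assumes every expected count is nonzero.

```python
from typing import Sequence, Tuple


def pearson_chi_square(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson chi-square test of independence for an r x c table of counts.

    Returns (statistic, df) with df = (r - 1) * (c - 1).
    Assumption: no continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """Odds ratio for the 2 x 2 table [[a, b], [c, d]]: (a / b) / (c / d)."""
    return (a / b) / (c / d)


def goodness_of_fit(observed: Sequence[float],
                    expected: Sequence[float],
                    n_params: int) -> Tuple[float, int]:
    """Chi-square goodness of fit of predicted to observed frequencies.

    df = number of cells - number of parameters set from the data - 1,
    as stated in Problem 3(c). Assumes all expected counts are nonzero.
    """
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - n_params - 1
    return statistic, df


# Problem 2 data (Wasow 1997): rows are transparent collocations and
# non-collocations; columns are HNPS and not HNPS.
wasow = [[90, 102], [59, 329]]
chi2, df = pearson_chi_square(wasow)
ratio = odds_ratio(90, 102, 59, 329)
print(f"chi-square = {chi2:.2f} on {df} df, odds ratio = {ratio:.2f}")
```

The same pearson_chi_square function accepts the 2 x 4 tables of Problem 1, for example [[14, 16, 8, 2], [133, 180, 93, 81]] for "such a", or any subset of its columns when comparing only the Austen-authored texts with each other or Sanditon II with the rest.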



Stanford LING 289 - Homework
