Massachvsetts Institvte of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J, Natural Language, Spring 2011

Reading & Response 1

Handed out: Wednesday, February 2nd
Due: Sunday, February 6th, 6pm EST

February 2: Walking the walk, talking the talk – what is the enterprise of computation and natural language all about?

Assignment Goals:
(1) How to be a good consultant and learn to argue both sides of the same story;
(2) Getting acquainted with the NLTK system and Python; installing these on your own computer as needed, or else learning how to run NLTK on Athena;
(3) Learning about n-grams and the statistical/grammatical 'dialectic.'

Readings (note, for future reference, BRT = "Berwick reading time"):
(1) Jurafsky & Martin textbook, extract from the chapter on n-grams, ch. 4, pp. 1–13, on the website (below) or from the text, ch. 13, pp. 83–94 and pp. 114–116. [BRT: 30 min]
http://web.mit.edu/6.863/www/readings/ngrampages.pdf
(2) S. Abney, extract from "Statistics and Linguistics," in R. Bod, Statistical Linguistics, section 4.2. [BRT: 10 min]
http://web.mit.edu/6.863/www/readings/abney96pages.pdf
(3) N. Chomsky, extract on grammaticality from The Logical Structure of Linguistic Theory, 1955. [BRT: 10 min]
http://web.mit.edu/6.863/www/readings/chomsky55b.pdf

What you must do for this assignment:

1. Read the assigned material above. Limit your written response to two pages, single-sided, in 11 pt. type or larger, with reasonable interline spacing and margins. Please use plain text or PDF and email your response to me over the weekend, by 6pm Sunday EST. Responses received after this time will not receive any credit. Email to: [email protected], and include in the Subject: line of your email the text: 6.863 Reading and Response 1. (The handout seems long, but that is because of the instructions on how to set up and use the software for the course.)

2. Bring a hardcopy of your written response to class on Monday so that you can refer to it, scribble changes on it, and be prepared to defend it. You can (and should) revise your response in light of our discussion, and you may re-submit it by the end of the day on Monday (i.e., midnight EST).

3. Page 1 of your response should consist of brief answers to the four NLTK/Python warm-up questions below, on page 7. These really are intended to make sure you can successfully run NLTK on Athena or on your own computer. The amount of time this should take, apart from reading, is minimal. If you have any qualms or issues about NLTK, the TA will be able to help you on Thursday. (A short sanity check appears just after these instructions.)

4. Page 2 of your response should be your 'consultant's report,' as described below.

Please read the assignments listed above and pay attention to what the assignment says you should think about when you read them. There's a reason for the strict page limitation: we want you to learn to be concise. Before you start writing, carefully read the Style Guide as described on the course web page. We are very picky about the points listed, so you should review every paper you write to be sure that you have adhered to the commandments. Otherwise, you will drive us crazy with rage.
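Before the warm-up questions, here is one quick way to confirm that NLTK is installed and can load data, whether on Athena or your own machine. This is an illustrative sketch, not one of the graded questions; it just exercises NLTK's bundled book corpora:

    # Sanity check: can NLTK load its book corpora and answer a query?
    import nltk
    nltk.download('book')        # one-time fetch of the NLTK book data

    from nltk.book import text1  # text1 is Moby Dick in the book data
    print(len(text1))            # total number of tokens
    text1.concordance("whale")   # show occurrences of "whale" in context

If the concordance lines print, your installation is working and the warm-up questions should go quickly.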
Now, on to the assignment itself. You have just been hired as a consultant at Google, to assist with their new "Google Speech" project. Much like their Google Books project, Google Speech aims to collect many, many millions of examples of spoken sentences, initially just in English, transcribed into written text. You overhear Google employee #25 talking to Google employee #200 about their plans for this data:

"Look," she says, "plainly, the number of sentences one person will ever hear or speak in a lifetime is finite – and so, therefore, is the collection of all the sentences we'll ever put into Google Speech, even if it's trillions and trillions of examples. This collection, a 'corpus,' constitutes the set of 'observables' for natural language. It is this corpus that we have to model. It's just like when we observe some other natural phenomenon, like the motion of the planets. We can use lots and lots of astronomical observations, and then, once we know the position of, say, Saturn at many points in time, we can predict where it will be in the next moment, just by using our collected data. So, for example, the probability of where Saturn is at time t is conditioned on where it was at some finite number of measured instants in the past. With sentences, we can do the same thing via the method called n-grams, as described by Jurafsky & Martin, ch. 4 (see your reading #1). An n-gram is just a way of predicting what the nth word in a sentence will be, given the n–1 preceding words. And that's what we have lots of data about. We can use the probabilities of such sequences to capture what we need to know. For example, as J&M say, if we see the sequence 'I'd like to make a collect…' then a very likely next word is call, or phone, or international, but not the. It should be a snap. We can do fancier statistics if we need to – I know that sometimes specific, very long sequences won't ever show up in our corpus, so they'll have a frequency of zero, but we now have sophisticated ways of estimating this kind of missing data."

Employee #200 replies, "Wait a minute. Are you sure that's the right thing to study? Isn't the set of sentences that even one person can potentially produce countably infinite? How do you determine what goes on your list, and what does not? And I'm a bit troubled by your physics analogy. I don't think Newton would have appreciated it. Sure, Copernicus and Kepler collected lots and lots of data sequences, but what underlies them, F=ma, isn't just a statistical approximation – it's an absolute principle. A theory. What you want to model – the true 'observables' – isn't what's in the 'outside world,' the sequences of words or sentences, but rather the principles of the 'inside world' – the 'cognitive machinery' that produces or perceives this or that collection of sentences."

Keep these arguments in mind. You're going to write a brief report on them in a bit, but first you get some experience of your own with NLTK.
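To make Employee #25's proposal concrete, here is a minimal sketch of a bigram (n = 2) predictor in Python with NLTK, conditioning on just the single preceding word. The function name predict_next and the choice of the Brown corpus are ours, not part of the assignment; treat this as an illustration of the counting scheme, not a required implementation:

    # Minimal bigram sketch: estimate P(next word | previous word) by
    # maximum likelihood from raw corpus counts. Illustrative only.
    from collections import Counter, defaultdict
    import nltk
    from nltk.corpus import brown

    nltk.download('brown')  # one-time fetch of the Brown corpus

    # Count how often each word follows each other word.
    follows = defaultdict(Counter)
    words = [w.lower() for w in brown.words()]
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

    def predict_next(prev, k=3):
        """The k most likely words to follow `prev`, with MLE probabilities."""
        counts = follows[prev]
        total = float(sum(counts.values()))
        return [(w, c / total) for w, c in counts.most_common(k)]

    print(predict_next("collect"))  # likely continuations of "collect"

Note that any bigram absent from the corpus gets probability zero under this raw scheme; that is exactly the 'missing data' problem Employee #25 alludes to, and the smoothing methods in Jurafsky & Martin ch. 4 (e.g., add-one estimation) are designed to patch it.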