Massachvsetts Institvte of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J, Natural Language, Spring 2011

Reading & Response 1

Handed out: Wednesday, February 2nd
Due: Sunday, February 6th, 6pm EST

February 2: Walking the walk, talking the talk – what is the enterprise of computation and natural language all about?

Assignment Goals:
(1) How to be a good consultant and learn to argue both sides of the same story;
(2) Getting acquainted with the NLTK system and Python; installing these on your own computer as needed, or else learning how to run NLTK on Athena;
(3) Learning about n-grams and the statistical/grammatical 'dialectic.'

Readings (note, for future reference, BRT = "Berwick reading time"):
(1) Jurafsky & Martin textbook, extract from the chapter on n-grams, ch. 4, pp. 1–13, on the website (below) or from the text, ch. 13, pp. 83–94 and pp. 114–116. [BRT: 30 min]
http://web.mit.edu/6.863/www/readings/ngrampages.pdf
(2) S. Abney, extract from "Statistics and Linguistics," in R. Bod, Statistical Linguistics, section 4.2. [BRT: 10 min]
http://web.mit.edu/6.863/www/readings/abney96pages.pdf
(3) N. Chomsky, extract on grammaticality from The Logical Structure of Linguistic Theory, 1955. [BRT: 10 min]
http://web.mit.edu/6.863/www/readings/chomsky55b.pdf

What you must do for this assignment:

1. Read the assigned material above. Limit your written response to two pages, single-sided, in 11 pt. type or larger, with reasonable interline spacing and margins. Please use plain text or PDF and email your response to me over the weekend, by 6pm Sunday EST. Responses received after this time will not receive any credit. Email to: [email protected], and include in the Subject: line of your email the text: 6.863 Reading and Response 1. (The handout seems long, but that is because of the instructions on how to set up and use the software for the course.)

2. Bring a hardcopy of your written response to class on Monday so that you can refer to it, scribble changes on it, and be prepared to defend it. You can (and should) revise your response in light of our discussion, and you may re-submit it by the end of the day on Monday (i.e., midnight EST).

3. Page 1 of your response should consist of brief answers to the four NLTK/Python warm-up questions below, on page 7. These really are intended to make sure you can successfully run NLTK on Athena or on your own computer. The amount of time this should take, apart from reading, is minimal. If you have any qualms or issues about NLTK, the TA will be able to help you on Thursday. (A short sanity check appears just after these instructions.)

4. Page 2 of your response should be your 'consultant's report,' as described below.

Please read the assignments listed above and pay attention to what the assignment says you should think about when you read them. There's a reason for the strict page limitation: we want you to learn to be concise. Before you start writing, carefully read the Style Guide as described on the course web page. We are very picky about the points listed, so you should review every paper you write to be sure that you have adhered to the commandments. Otherwise, you will drive us crazy with rage.
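Before the warm-up questions, here is one quick way to confirm that NLTK is installed and can load data, whether on Athena or your own machine. This is an illustrative sketch, not one of the graded questions; it just exercises NLTK's bundled book corpora:

    # Sanity check: can NLTK load its book corpora and answer a query?
    import nltk
    nltk.download('book')        # one-time fetch of the NLTK book data

    from nltk.book import text1  # text1 is Moby Dick in the book data
    print(len(text1))            # total number of tokens
    text1.concordance("whale")   # show occurrences of "whale" in context

If the concordance lines print, your installation is working and the warm-up questions should go quickly.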
Now, on to the assignment itself. You have just been hired as a consultant at Google, to assist with their new "Google Speech" project. Much like their Google Books project, Google Speech aims to collect many, many millions of examples of spoken sentences, initially just in English, transcribed into written text. You overhear Google employee #25 talking to Google employee #200 about their plans for this data:

"Look," she says, "plainly, the number of sentences one person will ever hear or speak in a lifetime is finite – and so, therefore, is the collection of all the sentences we'll ever put into Google Speech, even if it's trillions and trillions of examples. This collection, a 'corpus,' constitutes the set of 'observables' for natural language. It is this corpus that we have to model. It's just like when we observe some other natural phenomenon, like the motion of the planets. We can use lots and lots of astronomical observations, and then, once we know the position of, say, Saturn at many points in time, we can predict where it will be in the next moment, just by using our collected data. So, for example, the probability of where Saturn is at time t is conditioned on where it was at some finite number of measured instants in the past. With sentences, we can do the same thing via the method called n-grams, as described by Jurafsky & Martin, ch. 4 (see your reading #1). An n-gram is just a way of predicting what the nth word in a sentence will be, given the n–1 preceding words. And that's what we have lots of data about. We can use the probabilities of such sequences to capture what we need to know. For example, as J&M say, if we see the sequence 'I'd like to make a collect…' then a very likely next word is call, or phone, or international, but not the. It should be a snap. We can do fancier statistics if we need to – I know that sometimes specific, very long sequences won't ever show up in our corpus, so they'll have a frequency of zero, but we now have sophisticated ways of estimating this kind of missing data."

Employee #200 replies, "Wait a minute. Are you sure that's the right thing to study? Isn't the set of sentences that even one person can potentially produce countably infinite? How do you determine what goes on your list, and what does not? And I'm a bit troubled by your physics analogy. I don't think Newton would have appreciated it. Sure, Copernicus and Kepler collected lots and lots of data sequences, but what underlies them, F=ma, isn't just a statistical approximation – it's an absolute principle. A theory. What you want to model – the true 'observables' – isn't what's in the 'outside world,' the sequences of words or sentences, but rather the principles of the 'inside world' – the 'cognitive machinery' that produces or perceives this or that collection of sentences."

Keep these arguments in mind. You're going to write a brief report on them in a bit, but first you get some experience of your own with NLTK.
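To make Employee #25's proposal concrete, here is a minimal sketch of a bigram (n = 2) predictor in Python with NLTK, conditioning on just the single preceding word. The function name predict_next and the choice of the Brown corpus are ours, not part of the assignment; treat this as an illustration of the counting scheme, not a required implementation:

    # Minimal bigram sketch: estimate P(next word | previous word) by
    # maximum likelihood from raw corpus counts. Illustrative only.
    from collections import Counter, defaultdict
    import nltk
    from nltk.corpus import brown

    nltk.download('brown')  # one-time fetch of the Brown corpus

    # Count how often each word follows each other word.
    follows = defaultdict(Counter)
    words = [w.lower() for w in brown.words()]
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

    def predict_next(prev, k=3):
        """The k most likely words to follow `prev`, with MLE probabilities."""
        counts = follows[prev]
        total = float(sum(counts.values()))
        return [(w, c / total) for w, c in counts.most_common(k)]

    print(predict_next("collect"))  # likely continuations of "collect"

Note that any bigram absent from the corpus gets probability zero under this raw scheme; that is exactly the 'missing data' problem Employee #25 alludes to, and the smoothing methods in Jurafsky & Martin ch. 4 (e.g., add-one estimation) are designed to patch it.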