Massachvsetts Institvte of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J, Natural Language Processing

Laboratory 3: Beyond Simple Parsing
Handed out: March 12, 2008    Due: March 21, 2008

Goals of the Laboratory. Having been introduced to context-free parsing and how to learn context-free rules from corpora, Laboratory 3 turns to a deeper analysis of context-free grammars, more sophisticated probabilistic context-free parsers, and natural language processing. In particular, in this lab we examine constraints on the structure of human phrases and sentences where phrases have been "displaced" from their usual positions, as in, This laboratory, I never wanted to finish, where the object of finish, this laboratory, does not appear in its usual position after the verb, but has been displaced to the front. Many would consider this one of the fundamental properties of human language, one that distinguishes it from artificial languages, and it naturally causes difficulties for parsing (since phrases do not wind up where one would naturally expect them to be).[1] You will also be introduced to some of the tools we have for looking at sentence parses, in preparation for future laboratories.

What you must turn in. As before, email the write-up for this lab as a plain-text or PDF file, or just email a URL to the address below. We've obtained some graders, so you should now email your answers to the graders' email: [email protected]. Also as usual, please include the following in the "Subject:" line of your email: 6.863 Lab 3. You may collaborate with whomever you wish; just please write the names of your collaborators on your report. Your report should be recognizably your own work.

Initial Preparation (from Lab 2): First, be sure you have a version of Python with NLTK and NumPy installed. Some of the tools you use in this lab will need to be run on Athena, and some of them are in Java.
You will need to run the following commands to set up your environment:

athena% add 6.863
athena% add java

New: Systems for running probabilistic parsers

Finally, we have prepared the following systems for use on Athena. You created a simple probabilistic parser using NLTK in Laboratory 2. For this lab, on Athena we've set up an installation of the Bikel-Collins parser, and a separate installation of the Collins parser, both pre-trained on the Wall Street Journal (WSJ) section of the Penn Treebank (PTB), a collection of approximately one million words of running text from, ah, the Wall Street Journal, which was then converted, partially by hand, into parse trees.[2] (If you wish to try setting Bikel/Collins up on your own system, you can try downloading it from http://www.cis.upenn.edu/~dbikel/software.html#stat-parser – but it will not necessarily be easy to set up this tool and we cannot offer support if it does not work for you.)

These parsers require their input to be tagged – their rules do not go down to the level of individual words, only their parts of speech (POS). These part-of-speech names are partly a historical artifact, derived and elaborated from much older corpus work first done at Brown University in the 1960s. For example, the tag IN denotes any preposition, while VBD stands for a past-tense verb, sometimes ending in ed or en, such as taken (but VBD could also be the tag associated with an irregular past-tense verb such as sung).
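To make the tagged format concrete, here is a toy sketch in Python. It is emphatically not the MXPOST tagger (which uses a maximum-entropy model); it is just a dictionary lookup over a small hand-invented lexicon, showing the kind of (word, tag) pairs a parser like Bikel/Collins consumes. The lexicon entries here are illustrative assumptions, not part of any real tagger:

```python
# Toy illustration of Penn Treebank POS tags (NOT the MXPOST tagger):
# a lookup tagger over a tiny hand-built lexicon, defaulting to NN
# for unknown words, just to show the (word, tag) format parsers expect.
PTB_LEXICON = {
    "in": "IN", "on": "IN",          # prepositions
    "the": "DT", "a": "DT",          # determiners
    "dog": "NN", "laboratory": "NN", # singular nouns
    "sang": "VBD", "took": "VBD",    # past-tense verbs, incl. irregulars
}

def toy_tag(tokens):
    """Tag each token from the lexicon, defaulting to NN."""
    return [(tok, PTB_LEXICON.get(tok.lower(), "NN")) for tok in tokens]

print(toy_tag("the dog sang".split()))
# [('the', 'DT'), ('dog', 'NN'), ('sang', 'VBD')]
```

A real tagger instead predicts tags from contextual features, which is why MXPOST can tag words it has never seen and disambiguate words with multiple possible tags.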
A list of 48 of the most important tags used for the Penn Treebank is given below, and also appended to the end of this laboratory. You'll soon grow to know and love or hate this list. (For example, what information do these tags include? What information do they omit?) http://web.mit.edu/6.863/www/labs/PTTags.pdf

[1] This illustrates, incidentally, that whatever natural language has been "designed" for, it is not necessarily designed for easy parsing: it would be much easier if the phrases stayed in their positions to be easily interpreted as the arguments to functions. If anything, this indicates that the conceptual-intensional constraints have "taken priority" – that is, constraints that interface language with the reasoning and inference systems inside the mind/brain.
[2] As is standard, specifically this means the systems are trained on sections 02 to 21 of the Wall Street Journal, approximately 40,000 parsed sentences.

We will be using the Bikel-Collins parser with the MXPOST maximum entropy tagger. We will have more to say about this tagging method in lecture, but for now you can download it yourself here:
http://www.inf.ed.ac.uk/resources/nlp/local_doc/MXPOST.html
and read about the method via the following paper as described there:
http://web.mit.edu/6.863/www/readings/mxpost_doc.pdf

We've provided three ways to run the Bikel/Collins parser on Athena:

• bikel-wrapper, a GUI interface to the MXPOST tagger, the Bikel parser, and a tree viewer. Using this interface, you can even correct mistakes in the tagged sentence before it is parsed.
• bikel-parse, a pure text interface. It will prompt you for a sentence (after which you need to hit ctrl-D to end the input), run it through the tagger and parser, and print the parse tree as an S-expression.
• bikel-draw, which is like bikel-parse except it displays a graphical window with the parse tree at the end.

Similarly, in section ?? below, you will need to run the Collins parser.
We have provided the followingto do this on Athena:• collins-parse, a pure text interface. It will prompt you for a sentence (after which you need tohit ctrl-D to end the input), run it through the tagger and parser, and print the parse tree as anS-expression.• collins-draw, which is like collins-parse except it displays a graphical window with the parse treeat the end.In all cases, you need to give the parser tokenized input, as described in Laboratory 2: There need tobe spaces between everything that the parser considers as a separate token, not just where spaces would beused in natural English.As an example, when you run bikel-wrapper, you get a window that looks like this:2To use it, type a tokenized sentence into the top text box. Then, click the “Tag” button below it.The output from MXPOST will go into the middle box. Edit the output if you wish (e.g., if you want tocorrect any part of speech tags you think are incorrect), and then click “Parse” to run the Bikel parser onit. The parser’s progress will be shown in the bottom window. (It will take some time to load a large tablecorresponding to the probability estimates