Massachvsetts Institvte of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J, Natural Language Processing
Laboratory 3: Advanced Parsing & Lexical Semantics
Handed out: March 20, 2009    Due: April 6, 2009

Goals of the Laboratory. Laboratory 2 and Competitive Grammar Writing introduced you to probabilistic context-free parsing. In this laboratory you will learn how well state-of-the-art statistical context-free parsers work: their strengths and weaknesses.

In Part I, you will learn about the following:

1. An introduction to current state-of-the-art probabilistic parsing. These systems use machine learning methods to acquire their own rules, trained on a large number of pre-parsed sentences, using a more sophisticated method than the one you used in Laboratory 2. After your own work on grammar writing, you might appreciate the effort that such machine learning saves. In the first few questions we'll have you do some basic warm-up in running these systems, so as to get used to the particular interfaces we have set up for them.

2. What are the strengths and weaknesses of current state-of-the-art statistical natural language parsers? If they were perfect, we'd be done, at least for the syntax part of natural language processing. But the parsers are not perfect. We will look at ambiguity (again). One of the things modern statistical parsers do better is to use information about particular words to figure out how to prune parsing possibilities. A classic example is "I saw the guy on the hill with the telescope." Does "with the telescope" associate more strongly with "the hill" or with "saw"?

Part II will give you a deeper understanding of how statistical parsers work and how they interact with lexical frequencies and with syntactic and semantic regularities. In particular, you will:

1. Investigate the connection between lexical (word-level) semantics and parsing, using the Penn Tree Bank (PTB) as a concrete test bed.

2. Explore how a state-of-the-art probabilistic parser will handle these issues.

What you must turn in. You will need to turn in two (or more) report files.

1. a writeup of Part I: lab3a.pdf

2. one or more reports for Part II, one per verb, each named [your verb].pdf.

As before, please email your write-ups as pdf files to [email protected]. You may use the write-up templates provided here:
http://web.mit.edu/6.863/spring2009/writeups/lab3/
Please rename verbN.pdf to [your verb].pdf when you submit. You will be assigned multiple verbs to analyze; you are required to report on one. Extra reports you submit will count as extra credit. In your email include 6.863 Lab 2 as your "Subject:". As usual, you may collaborate with whomever you wish; just note the names of your collaborators in your report. Your report should be recognizably your own work.

Part I

Initial Preparation:

• Background reading: Please read our semi-Google books version of Chapter 14 of the 2nd edition JM text, here:
http://www.mit.edu/~6.863/spring2009/jmnew/ch14.pdf

• Software: You will need to run the tools in this lab on Athena, by logging in to linux.mit.edu and setting up your environment properly, as described in Laboratories 1 and 2. Alternatively, you can download the Java program to run the parsing tools here, along with installation instructions:
http://web.mit.edu/~6.863/spring2009/code/lab3.zip

1 Using modern probabilistic parsers

1.1 Running statistical parsers on Athena

You created a simple probabilistic parser using NLTK in Laboratory 2. For this lab, we've set up installations of two widely used statistical parsers on Athena.
Each is pre-trained on the Wall Street Journal (WSJ) section of the Penn Treebank (PTB), a collection of approximately one million words of running text from, ah, the Wall Street Journal, which was then converted, partially by hand, into parse trees.[1]

Such parsers require their input to be tagged: their rules do not go down to the level of individual words, only to their parts of speech (POS). These part-of-speech names, which you first saw in Lab 2, are partly an historical artifact, derived and elaborated from much older corpus work first done at Brown University in the 1960s. For example, the tag IN denotes any preposition, while VBD stands for a past-tense verb, usually ending in ed, such as walked (but VBD is also the tag for an irregular past-tense verb such as sang). A list of 48 of the most important tags used for the Penn Tree Bank is given at the link just below. You'll soon grow to know and love or hate this list. (For example, what information do these tags include? What information do they omit?)
http://bulba.sdsu.edu/jeanette/thesis/PennTags.html

Since the sentences don't come with their tags on them, and since the same word can have more than one tag, parsing systems must pre-process sentences and assign a tag to each word. We will be using these parsers with the MXPOST maximum entropy tagger, which takes information about the endings of words plus some local context, such as the tag of the preceding word, and combines this information probabilistically to figure out the most likely tag to assign to the word it is currently looking at.
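To make the idea of combining word endings and local context concrete, here is a minimal sketch in Python of a log-linear (maximum-entropy-style) tag scorer. It is not MXPOST itself: the tiny tag set, the feature templates, and every weight below are invented for illustration, whereas a real tagger estimates its weights from tagged training data and uses a much richer feature set. The sketch also tags greedily from left to right, which is a simplifying assumption.

```python
import math

# A tiny subset of Penn Treebank tags, chosen only for this illustration.
TAGS = ["DT", "NN", "VBD", "IN"]

# Hypothetical hand-set feature weights (NOT taken from MXPOST);
# a trained tagger would learn these from tagged sentences.
WEIGHTS = {
    ("word=the", "DT"): 3.0,
    ("word=on", "IN"): 3.0,
    ("suffix=ed", "VBD"): 2.0,   # word ending suggests a past-tense verb
    ("prev=DT", "NN"): 2.0,      # a noun often follows a determiner
    ("prev=NN", "VBD"): 1.0,
    ("prev=VBD", "IN"): 0.5,
    ("prev=VBD", "DT"): 0.5,
    ("prev=IN", "DT"): 0.5,
}

def features(word, prev_tag):
    """Features for one position: the word, the previous tag, short suffixes."""
    word = word.lower()
    feats = ["word=" + word, "prev=" + prev_tag]
    for n in (2, 3):
        if len(word) > n:
            feats.append("suffix=" + word[-n:])
    return feats

def tag_distribution(word, prev_tag):
    """Probability of each tag: softmax over summed feature weights."""
    feats = features(word, prev_tag)
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in feats) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

def greedy_tag(words):
    """Tag left to right, feeding each chosen tag in as the next prev_tag."""
    prev = "<START>"
    tagged = []
    for w in words:
        dist = tag_distribution(w, prev)
        prev = max(dist, key=dist.get)
        tagged.append((w, prev))
    return tagged

print(greedy_tag("the guy walked on the hill".split()))
```

With these made-up weights, the ed ending plus a preceding NN pushes walked toward VBD, and a preceding DT pushes guy and hill toward NN. Changing a weight and rerunning shows how each piece of evidence shifts the distribution returned by tag_distribution.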
We will have more to say about this tagging method in lecture, but for now, if you want, you can download it yourself here:
http://www.inf.ed.ac.uk/resources/nlp/local_doc/MXPOST.html
and read about the method in the following paper, as described there:
http://www.mit.edu/~6.863/spring2009/readings/mxpost_doc.pdf

1.1.1 The Java installation for the Bikel-Collins & Stanford parsers

For a warm-up introduction to these parsers, we will use a Java-based "wrapper" implementation that lets you switch between the Bikel-Collins and the Stanford statistical parsers, and automatically loads in the entire training set of Wall Street Journal sentences. (It also lets you load in the entire test set of Wall Street Journal sentences, and has other useful functionality that we'll run through as we go; see below.)

[1] Specifically, and as is standard, this means the systems are trained on sections 02 to 21 of the Wall Street Journal, approximately 40,000 parsed sentences.

This parser wrapper comes packaged as a .jar file, parsers.jar, so it should be possible to install and run on your local linux, Windoze, or (with more difficulty) Mac OS X machine. See the readme file for how to do this, here:
http://web.mit.edu/~6.863/spring2009/code/lab3/parsers-install.txt

To run the program on Athena, ssh using X-windows