Stanford CS 224 - Language modeling

CS 224N / Ling 237
Programming Assignment 1: Language Modeling
Due Wednesday 16 April 2008

This assignment may be done individually or in groups of two. We strongly encourage collaboration; however, your submission must include a statement describing the contributions of each collaborator. See the collaboration policy on the website (http://cs224n.stanford.edu/assignments.html#collab).

Please read this assignment soon and go through the Setup section to ensure that you are able to access the relevant files and compile the code. Especially if your programming experience is limited, start working early so that you will have ample time to discover stumbling blocks and ask questions.

1 Setup

On the Leland machines (such as bramble.stanford.edu)[1], make sure you can access the following directories:

    /afs/ir/class/cs224n/pa1/java/ : the Java code provided for this course
    /afs/ir/class/cs224n/pa1/data/ : the data sets used in this assignment

Copy the pa1/java/ directory to your local directory and make sure you can compile the code without errors. The code compiles under JDK 1.5, which is the version installed on the Leland machines.

To ease compilation, we’ve installed ant in the class bin/ directory. ant is similar in function to the Unix make command, but ant is smarter, is tailored to Java, and uses XML configuration files. When you invoke ant, it looks in the current directory for a file called build.xml which contains project-specific compilation instructions. The java/ directory contains a build.xml file suitable for this assignment (and a symlink to the ant executable).
Thus, to copy the source files and compile them with ant, you can use the following sequence of commands:

    cd ~
    mkdir -p cs224n/pa1
    cd cs224n/pa1
    cp -r /afs/ir/class/cs224n/pa1/java .
    cd java
    ./ant

If you don’t want to use ant, you are welcome to write a Makefile, or for a simple project like this one, you can just do

    cd ~/cs224n/pa1/java/
    mkdir classes/
    javac -source 5 -d classes src/*/*/*.java

[1] See http://www.stanford.edu/services/unixcomputing/environments.html for a list of Leland machines.

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible.

Once you’ve compiled the code successfully, you need to make sure you can run it. In order to execute the compiled code, Java needs to know where to find your compiled class files. As should be familiar to every Java programmer, this is normally achieved by setting the CLASSPATH environment variable. If you have compiled with ant, your class files are in java/classes, and the following commands will do the trick. Type printenv CLASSPATH. If nothing is printed, your CLASSPATH is empty and you can set it as follows:

    setenv CLASSPATH ./classes

Otherwise, if something was printed out, enter the following to append to the variable:

    setenv CLASSPATH ${CLASSPATH}:./classes

Now you’re ready to run the test. From directory ~/cs224n/pa1/java/ enter:

    java cs224n.assignments.LanguageModelTester

If everything’s working, you’ll get some output describing the construction and testing of a (pretty bad) language model. The next section will help you make sense of what you’re seeing.

2 Using the LanguageModelTester

Take a look at the main() method of LanguageModelTester.java, and examine its output. This class has the job of managing data files and constructing and testing a language model. Its behavior is controlled via command-line options.
Each command-line option has a default value, and the effective values are printed at the beginning of each run. You can use shell scripts to easily configure options for a run; we’ve supplied a shell script called run that will give you the idea.

The -model option specifies the fully qualified class name of a language model to be tested. Its default value is cs224n.langmodel.EmpiricalUnigramLanguageModel, a bare-bones language model implementation we’ve provided. Although this is a very poor language model, it illustrates the interface (cs224n.langmodel.LanguageModel) that you’ll need to follow in implementing your own language models. A LanguageModel should implement a no-argument constructor, and must implement five other methods:

• train(Collection<List<String>> trainingSentences). Trains the model from the supplied collection of training sentences. Note that these sentence collections are disk-backed, so doing anything other than iterating over them will be very slow, and should be avoided.
• getWordProbability(List<String> sentence, int index). Returns the probability of the word at index, according to the model, within the specified sentence.
• getSentenceProbability(List<String> sentence). Returns the probability, according to the model, of the specified sentence. Note that this method and the previous method should be consistent with one another, and in all likelihood this method will call that method.
• checkModel(). Returns the sum of the probability distribution. A proper probability distribution should sum to 1. checkModel() will not be run if -check is set to false.
• generateSentence(). Returns a random sentence sampled according to the model.

The -data option to LanguageModelTester specifies the directory in which to find data.
By default, this is /afs/ir/class/cs224n/pa1/data/; if you copy data to your own machine, you’ll want to override this option.

The -train and -test options specify the names of sentence files (containing one sentence per line) in the data directory to be used as training and test data. The default values are europarl-train.sent.txt and europarl-test.sent.txt. These files contain sentences from the Europarl corpus. (For more details on the origin of the data, see the README files in the data directories.)

After loading the training and test sentences, the LanguageModelTester will create a language model of the specified class, and train it using the specified training sentences. It will then compute the perplexity of the test sentences with respect to the language model. When the supplied unigram model is trained on the Europarl data, it gets a perplexity between 800 and 900, which is very poor. A reasonably good perplexity number should be around 200; a competitive perplexity can be around 100 on the test data.

Next, if the -hub option is set to true (the default), the LanguageModelTester will do an evaluation using a set of HUB speech
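
To make the LanguageModel methods and the perplexity figure described above more concrete, here is a minimal sketch. It is not the provided EmpiricalUnigramLanguageModel and does not implement the real cs224n.langmodel.LanguageModel interface, which is not reproduced in this handout. The class name SketchUnigramModel, the add-one smoothing, the STOP and UNK tokens, the convention that index == sentence.size() refers to the end-of-sentence token, and the perplexity formula (base 2, per token, counting one STOP per sentence) are all assumptions made for illustration; the LanguageModelTester may compute things differently. generateSentence() is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical add-one smoothed unigram model (illustration only).
class SketchUnigramModel {
    static final String STOP = "</S>";  // assumed end-of-sentence token
    static final String UNK = "<UNK>";  // assumed stand-in for all unseen words;
                                        // assumed never to occur in the training data

    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private double total = 0.0;

    // Only iterates over the collection once, since the handout warns that
    // the sentence collections are disk-backed.
    public void train(Collection<List<String>> trainingSentences) {
        for (List<String> sentence : trainingSentences) {
            for (String word : sentence) {
                increment(word);
            }
            increment(STOP);
        }
    }

    private void increment(String word) {
        Integer c = counts.get(word);
        counts.put(word, c == null ? 1 : c + 1);
        total += 1.0;
    }

    // Add-one estimate over the seen vocabulary plus one UNK type.
    private double probabilityOf(String word) {
        Integer c = counts.get(word);
        double count = (c == null) ? 0.0 : c;
        return (count + 1.0) / (total + counts.size() + 1.0);
    }

    // Assumption: index == sentence.size() asks for the STOP token.
    public double getWordProbability(List<String> sentence, int index) {
        String word = (index == sentence.size()) ? STOP : sentence.get(index);
        return probabilityOf(word);
    }

    // Product of the word probabilities, including the final STOP.
    public double getSentenceProbability(List<String> sentence) {
        double p = 1.0;
        for (int i = 0; i <= sentence.size(); i++) {
            p *= getWordProbability(sentence, i);
        }
        return p;
    }

    // Sums the distribution: every seen word plus the shared UNK mass.
    public double checkModel() {
        double sum = probabilityOf(UNK);
        for (String word : counts.keySet()) {
            sum += probabilityOf(word);
        }
        return sum;
    }

    // Perplexity = 2^(-(1/N) * sum of log2 P(sentence)), N = tokens incl. STOP.
    static double perplexity(SketchUnigramModel model,
                             Collection<List<String>> testSentences) {
        double logSum = 0.0;
        long tokens = 0;
        for (List<String> sentence : testSentences) {
            logSum += Math.log(model.getSentenceProbability(sentence)) / Math.log(2.0);
            tokens += sentence.size() + 1;
        }
        return Math.pow(2.0, -logSum / tokens);
    }

    public static void main(String[] args) {
        List<List<String>> train = new ArrayList<List<String>>();
        train.add(Arrays.asList("a", "b"));
        train.add(Arrays.asList("a"));
        SketchUnigramModel model = new SketchUnigramModel();
        model.train(train);
        System.out.println("sum of distribution: " + model.checkModel());
        System.out.println("perplexity on training data: " + perplexity(model, train));
    }
}
```

Note that getSentenceProbability() calls getWordProbability(), which keeps the two methods consistent as the method descriptions require. Multiplying raw probabilities can underflow on long sentences, so a real implementation would more likely accumulate log probabilities directly.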

