Stanford CS 224 - Language modeling - D1914876

School name Stanford University

Course Cs 224- N Natural Language Processing with Deep Learning

CS 224N / Ling 237
Programming Assignment 1: Language Modeling
Due Wednesday 16 April 2008

This assignment may be done individually or in groups of two. We strongly encourage collaboration; however, your submission must include a statement describing the contributions of each collaborator. See the collaboration policy on the website (http://cs224n.stanford.edu/assignments.html#collab).

Please read this assignment soon and go through the Setup section to ensure that you are able to access the relevant files and compile the code. Especially if your programming experience is limited, start working early so that you will have ample time to discover stumbling blocks and ask questions.

1 Setup

On the Leland machines (such as bramble.stanford.edu) [1], make sure you can access the following directories:

    /afs/ir/class/cs224n/pa1/java/ : the Java code provided for this course
    /afs/ir/class/cs224n/pa1/data/ : the data sets used in this assignment

Copy the pa1/java/ directory to your local directory and make sure you can compile the code without errors. The code compiles under JDK 1.5, which is the version installed on the Leland machines.

To ease compilation, we’ve installed ant in the class bin/ directory. ant is similar in function to the Unix make command, but ant is smarter, is tailored to Java, and uses XML configuration files. When you invoke ant, it looks in the current directory for a file called build.xml, which contains project-specific compilation instructions. The java/ directory contains a build.xml file suitable for this assignment (and a symlink to the ant executable).
Thus, to copy the source files and compile them with ant, you can use the following sequence of commands:

    cd ~
    mkdir -p cs224n/pa1
    cd cs224n/pa1
    cp -r /afs/ir/class/cs224n/pa1/java .
    cd java
    ./ant

If you don’t want to use ant, you are welcome to write a Makefile, or for a simple project like this one, you can just do

    cd ~/cs224n/pa1/java/
    mkdir classes/
    javac -source 5 -d classes src/*/*/*.java

[1] See http://www.stanford.edu/services/unixcomputing/environments.html for a list of Leland machines.

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible.

Once you’ve compiled the code successfully, you need to make sure you can run it. In order to execute the compiled code, Java needs to know where to find your compiled class files. As should be familiar to every Java programmer, this is normally achieved by setting the CLASSPATH environment variable. If you have compiled with ant, your class files are in java/classes, and the following commands will do the trick. Type printenv CLASSPATH. If nothing is printed, your CLASSPATH is empty and you can set it as follows:

    setenv CLASSPATH ./classes

Otherwise, if something was printed out, enter the following to append to the variable:

    setenv CLASSPATH ${CLASSPATH}:./classes

Now you’re ready to run the test. From directory ~/cs224n/pa1/java/ enter:

    java cs224n.assignments.LanguageModelTester

If everything’s working, you’ll get some output describing the construction and testing of a (pretty bad) language model. The next section will help you make sense of what you’re seeing.

2 Using the LanguageModelTester

Take a look at the main() method of LanguageModelTester.java, and examine its output. This class has the job of managing data files and constructing and testing a language model. Its behavior is controlled via command-line options.
Each command-line option has a default value, and the effective values are printed at the beginning of each run. You can use shell scripts to easily configure options for a run; we’ve supplied a shell script called run that will give you the idea.

The -model option specifies the fully qualified class name of a language model to be tested. Its default value is cs224n.langmodel.EmpiricalUnigramLanguageModel, a bare-bones language model implementation we’ve provided. Although this is a very poor language model, it illustrates the interface (cs224n.langmodel.LanguageModel) that you’ll need to follow in implementing your own language models. A LanguageModel should implement a no-argument constructor, and must implement five other methods:

• train(Collection<List<String>> trainingSentences). Trains the model from the supplied collection of training sentences. Note that these sentence collections are disk-backed, so doing anything other than iterating over them will be very slow, and should be avoided.

• getWordProbability(List<String> sentence, int index). Returns the probability of the word at index, according to the model, within the specified sentence.

• getSentenceProbability(List<String> sentence). Returns the probability, according to the model, of the specified sentence. Note that this method and the previous method should be consistent with one another, and in all likelihood this method will call that method.

• checkModel(). Returns the sum of the probability distribution. A proper probability distribution should sum to 1. checkModel() will not be run if -check is set to false.

• generateSentence(). Returns a random sentence sampled according to the model.

The -data option to LanguageModelTester specifies the directory in which to find data.
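As a rough illustration of the five LanguageModel methods listed above, here is a minimal sketch of an add-one smoothed unigram model. It is not the provided EmpiricalUnigramLanguageModel and does not implement the real cs224n.langmodel.LanguageModel interface; the class name, the return types, the </S> and *UNK* tokens, and the choice of add-one smoothing are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: mirrors the method names from the handout but does NOT
// implement the real cs224n.langmodel.LanguageModel interface.
class AddOneUnigramSketch {
    static final String STOP = "</S>";  // assumed end-of-sentence marker
    static final String UNK = "*UNK*";  // assumed single bucket for unseen words

    private final Map<String, Double> counts = new HashMap<String, Double>();
    private double total = 0.0;
    private final Random random = new Random();

    public AddOneUnigramSketch() {
        counts.put(STOP, 0.0);  // STOP and UNK are always part of the vocabulary
        counts.put(UNK, 0.0);
    }

    // Single pass over the sentences, since the handout says they are disk-backed.
    public void train(Collection<List<String>> trainingSentences) {
        for (List<String> sentence : trainingSentences) {
            for (String word : sentence) {
                increment(word);
            }
            increment(STOP);
        }
    }

    private void increment(String word) {
        Double c = counts.get(word);
        counts.put(word, c == null ? 1.0 : c + 1.0);
        total += 1.0;
    }

    // Add-one estimate: (count + 1) / (total tokens + vocabulary size).
    private double probabilityOf(String word) {
        Double c = counts.get(word);
        double count = (c == null) ? 0.0 : c;  // unseen words get the UNK estimate
        return (count + 1.0) / (total + counts.size());
    }

    public double getWordProbability(List<String> sentence, int index) {
        return probabilityOf(sentence.get(index));
    }

    // Product of the word probabilities times the probability of stopping.
    public double getSentenceProbability(List<String> sentence) {
        double p = probabilityOf(STOP);
        for (int i = 0; i < sentence.size(); i++) {
            p *= getWordProbability(sentence, i);
        }
        return p;
    }

    // Sums the distribution over the vocabulary (including STOP and UNK).
    public double checkModel() {
        double sum = 0.0;
        for (String word : counts.keySet()) {
            sum += probabilityOf(word);
        }
        return sum;
    }

    // Samples words until STOP is drawn; sampled UNK tokens are kept as-is.
    public List<String> generateSentence() {
        List<String> sentence = new ArrayList<String>();
        while (true) {
            String word = sampleWord();
            if (word.equals(STOP)) return sentence;
            sentence.add(word);
        }
    }

    private String sampleWord() {
        double r = random.nextDouble();
        double cumulative = 0.0;
        for (String word : counts.keySet()) {
            cumulative += probabilityOf(word);
            if (r < cumulative) return word;
        }
        return STOP;  // guards against floating-point round-off
    }

    public static void main(String[] args) {
        AddOneUnigramSketch model = new AddOneUnigramSketch();
        List<List<String>> data = new ArrayList<List<String>>();
        data.add(Arrays.asList("a", "b"));
        data.add(Arrays.asList("a"));
        model.train(data);
        System.out.println("checkModel: " + model.checkModel());
        System.out.println("P(a): " + model.getSentenceProbability(Arrays.asList("a")));
        System.out.println("sample: " + model.generateSentence());
    }
}
```

Because getSentenceProbability calls getWordProbability, the two stay consistent by construction, and routing every unseen word to one reserved bucket is what lets checkModel sum to 1 in this sketch.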
By default, the data directory is /afs/ir/class/cs224n/pa1/data/; if you copy data to your own machine, you’ll want to override this option.

The -train and -test options specify the names of sentence files (containing one sentence per line) in the data directory to be used as training and test data. The default values are europarl-train.sent.txt and europarl-test.sent.txt. These files contain sentences from the Europarl corpus. (For more details on the origin of the data, see the README files in the data directories.)

After loading the training and test sentences, the LanguageModelTester will create a language model of the specified class, and train it using the specified training sentences. It will then compute the perplexity of the test sentences with respect to the language model. When the supplied unigram model is trained on the Europarl data, it gets a perplexity between 800 and 900, which is very poor. A reasonably good perplexity number should be around 200; a competitive perplexity can be around 100 on the test data.

Next, if the -hub option is set to true (the default), the LanguageModelTester will do an evaluation using a set of HUB speech
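The perplexity figures quoted above can be made concrete with a short sketch. The handout text here does not spell out the formula the tester uses, so the code below assumes the standard definition, 2 raised to the negative average log2 probability per token, and assumes that each sentence contributes one extra end-of-sentence token. The SentenceScorer interface and the class name are hypothetical stand-ins for a trained language model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PerplexitySketch {
    // Hypothetical minimal view of a language model: sentence scoring only.
    interface SentenceScorer {
        double getSentenceProbability(List<String> sentence);
    }

    // Assumed definition: 2^(-(1/N) * sum of log2 P(sentence)), where N counts
    // every word plus one end-of-sentence token per sentence.
    static double perplexity(SentenceScorer model, List<List<String>> sentences) {
        double logProbSum = 0.0;
        long tokenCount = 0;
        for (List<String> sentence : sentences) {
            double p = model.getSentenceProbability(sentence);
            logProbSum += Math.log(p) / Math.log(2.0);
            tokenCount += sentence.size() + 1;
        }
        return Math.pow(2.0, -logProbSum / tokenCount);
    }

    public static void main(String[] args) {
        // Toy model: every token (including the stop token) has probability 1/4.
        SentenceScorer uniform = new SentenceScorer() {
            public double getSentenceProbability(List<String> sentence) {
                return Math.pow(0.25, sentence.size() + 1);
            }
        };
        List<List<String>> test = new ArrayList<List<String>>();
        test.add(Arrays.asList("a", "b"));
        test.add(Arrays.asList("c"));
        // A uniform model over 4 outcomes should score a perplexity close to 4.
        System.out.println(perplexity(uniform, test));
    }
}
```

Under this definition, a perplexity of 800 to 900 means the model is about as uncertain as a uniform choice among 800 to 900 words at each position, which is why lower numbers indicate a better model.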
