Final Project - Classifying Reading Level Using Language Models
Sameer Shariff, Johnson Hsieh
Stanford CS 224, 6/6/08

Abstract

The problem we address in this paper is building a reading-level classifier using natural language processing techniques. Prior work on this task includes standard, widely-used readability formulas such as Flesch-Kincaid, along with some work using language models for classification. We set out to use statistical language models, together with multinomial logistic regression models and features, to try to improve upon existing techniques.

Dataset

For our training and test data, we needed large passages of written prose labeled at specific reading levels. We decided to focus on novels that are suggested on reading lists at different grade levels, since this gave us the large amount of text we needed for building statistical language models, and labeled data was readily available.

A few caveats about our training data: First, due to copyright issues, we were only able to pull full text for books whose copyright had expired, so our dataset consists only of the more "classic" books. This turned out to be an advantage: by limiting ourselves to classic English novels, we avoid the problem of our classifier learning a different problem, for instance becoming a topical classifier. Additionally, since we pulled these books and manually cross-referenced them against reading lists, we weren't able to obtain a large number of books for our dataset (although each document does provide a large amount of text to model on). Finally, the labels for our data are also quite noisy: while looking at different reading lists online, we noticed a lot of variation in the recommended reading level of several of our books. In general, we tried to pick books that were representative of each class, and in the case of discrepancies we chose the class in which the book most commonly appeared.

In total we drew 48 books, distributed as follows:

5th-7th Grade: 15 books
8th-10th Grade: 16 books
11th-12th Grade: 17 books

The actual list of books is included in the Appendix at the end of this report.

Framework and Organization

Our main module, ReadabilityTester, reads in all the necessary data files, builds the models, and tests and evaluates the overall performance of the system. Our organization/program flow is as follows:

First, we read in the labeled training data one document at a time, and add the sentence data from each document to a language model corresponding to its class. That is, we maintain three different language models, one for each of our reading-level classes. When we encounter a new document A labeled reading level "5-7", we add all the sentences from document A to the language model for "5-7", and similarly for documents of the other classes.

Then, at test time, we classify each unseen document by computing the perplexity of the document according to each of the language models and choosing the class whose language model had the lowest perplexity. This, in effect, chooses the model that would most likely have produced the document; that is, we choose the model that assigns the highest probability to generating this document.
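To make this train-and-classify flow concrete, here is a minimal sketch in Python. The report does not show code, so everything here is an illustration: the names (SmoothedUnigramModel, perplexity, classify) are our own, and add-one (Laplace) smoothing is assumed as one simple choice of smoothing.

    import math
    from collections import Counter

    class SmoothedUnigramModel:
        # One model per reading-level class ("5-7", "8-10", "11-12").
        def __init__(self, vocab_size):
            self.counts = Counter()
            self.total = 0
            self.vocab_size = vocab_size

        def train(self, sentences):
            # Add every token from every training sentence to the counts.
            for sentence in sentences:
                for token in sentence:
                    self.counts[token] += 1
                    self.total += 1

        def log_prob(self, token):
            # Add-one smoothing so unseen tokens get nonzero probability.
            return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def perplexity(model, sentences):
        # exp of the average negative log-probability per token.
        log_sum, n_tokens = 0.0, 0
        for sentence in sentences:
            for token in sentence:
                log_sum += model.log_prob(token)
                n_tokens += 1
        return math.exp(-log_sum / n_tokens)

    def classify(document, models):
        # models: dict mapping class label -> trained language model.
        # Pick the class whose model is least "surprised" by the document;
        # lowest perplexity = highest probability of generating the document.
        return min(models, key=lambda label: perplexity(models[label], document))

Here a document is a list of tokenized sentences, and classify(document, models) returns a label such as "5-7".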
K-fold Cross-validation

Due to our limited dataset and our desire to both train and test on as many books as possible, we used standard K-fold cross-validation, enabling us to train and test on all of our data over the course of several trials. By default, we used 10-fold cross-validation, but our tester had a flag that allowed us to change this dynamically. Essentially, the K-fold cross-validation method splits the data into K subsamples, and during each trial holds out one of these subsamples and trains on the rest. This held-out subsample is then used for evaluation in that trial. This is repeated K times, holding each subsample (and consequently each document) out exactly once, and therefore using each document for evaluation exactly once. The overall cross-validated accuracy is then computed as the average of all the trial accuracies, weighted by the number of documents in each trial, since the subsamples may differ in size.

Evaluation Methods

Our primary metric for model evaluation was simply accuracy, which we defined as the overall fraction of books that were predicted correctly. That is,

Accuracy = (# of books predicted correctly) / (total # of books)

Since cross-validation allowed us to use every book in our dataset for evaluation exactly once, our accuracy was essentially a fraction with a denominator of 48. In the results below, we report this accuracy as a decimal.

In order to get finer granularity in our accuracy measurements, and additionally to reward the system for being "not too far off", we also created a weighted accuracy metric which gives partial credit to guesses that are adjacent to the true class. This made sense in the context of our problem since our classes are ordered. The formula for weighted accuracy is

Weighted Accuracy = ((# predicted correctly) + 0.5 * (# off by one)) / (total # of books)

Language Models

As a starting point, we began with one of our simplest models from PA1: the Smoothed Unigram language model. We generated a Smoothed Unigram model for each class based on the training data, and then used these models to do classification as described above. The overall performance of the system is shown in the table below:

Model Name          10-Fold Cross-Validated Accuracy
Smoothed Unigram    0.3958

Based on our observations during error analysis and while mining our books for signals, we came up with several modifications to our base model which ultimately resulted in better overall model performance. Some of the modifications we made, along with their justifications, are listed below.

Using the histogram of sentence lengths as a feature

While mining our documents for various signals, we noticed that the distribution of sentence lengths seemed to vary between classes. As a result, we thought that capturing this information would help us differentiate between classes during testing. As a simple way to take sentence lengths into account during classification, we added the sentence lengths right into our existing language model: we simply emitted an extra "token" at the end of each sentence denoting the sentence length, and then, when computing the overall sentence probability, we made sure to include the length token in the calculation. As shown below, this indeed improved our model performance.
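As an illustration of the length-token trick (the report does not specify the token spelling; the <LEN_n> format below is our own):

    def with_length_token(sentence):
        # Append a pseudo-token encoding the sentence length, so the unigram
        # model also learns each class's distribution of sentence lengths.
        return sentence + ["<LEN_%d>" % len(sentence)]

    # Applied identically at training and test time, e.g.:
    # model.train([with_length_token(s) for s in document])

One design consideration: exact lengths can be sparse, so in practice one might bucket them (e.g. a single token for lengths 10-14) so that rare exact lengths still accumulate useful counts.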

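Finally, the evaluation metrics described earlier reduce to a few lines; the fold weighting matches the cross-validated average above (function names are ours, and classes are assumed encoded as ordered integers):

    def weighted_accuracy(predicted, gold):
        # Classes are ordered (e.g. 0 = "5-7", 1 = "8-10", 2 = "11-12"), so a
        # guess one level away from the truth earns half credit.
        correct = sum(p == g for p, g in zip(predicted, gold))
        off_by_one = sum(abs(p - g) == 1 for p, g in zip(predicted, gold))
        return (correct + 0.5 * off_by_one) / len(gold)

    def cross_validated_accuracy(folds):
        # folds: list of (num_docs_in_fold, fold_accuracy) pairs; weighting by
        # fold size ensures each document counts exactly once overall.
        total = sum(n for n, _ in folds)
        return sum(n * acc for n, acc in folds) / total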
