Classifying Movie Scripts by Genre with a MEMM Using NLP-Based Features

Alex Blackstock    Matt Spitz

6/4/08

Abstract

In this project, we hope to classify movie scripts into genres based on a variety of NLP-related features extracted from the scripts. We devised two evaluation metrics to analyze the performance of two separate classifiers, a Naive Bayes Classifier and a Maximum Entropy Markov Model Classifier. Our two biggest challenges were the inconsistent format of the movie scripts and the multiway classification problem that arises because each movie script is labeled with several genres. Despite these challenges, our classifier performed surprisingly well given our evaluation metrics. We have some doubts, though, about the reliability of these metrics, mainly because of the lack of a larger test corpus.

1 Introduction

Classifying bodies of text, by either NLP or non-NLP features, is nothing new. There are numerous examples of classifying books, websites, or even blog entries either by genre or by author [7, 8]. In fact, a previous CS224N final project was centered around classifying song lyrics by genre [6].

Despite the large body of work on genre classification for other types of text, there is very little involving movie script classification. A paper by Jehoshua Eliashberg [1] describes his work at The Wharton School on predicting how well a movie will perform in various countries. His research group developed a tool (MOVIEMOD) that uses a Markov chain model to predict whether a given movie will gross more or less than the median return on investment for the producers and distributors. His research centers on different types of consumers and how they will respond to a given movie.

Our project focuses on classifying movie scripts into genres purely on the basis of the script itself. We convert the movie scripts into an annotated-frame format, breaking down each piece of dialogue and stage direction into chunks. We then classify these scripts into genres by observing a number of features. These features include but are not limited to standard natural language processing techniques. We also observe components of the script itself, such as the ratio of speaking to non-speaking frames and the average length of each speaking part. The classifiers we explore are the Maximum Entropy Markov Model from Programming Assignment 3 and an open-source Naive Bayes Classifier.

2 Corpus and Data

The vast majority of our movie scripts were scraped from online databases like dailyscript.com and other sites that provide a front-end to what is apparently a common collection of online hypertext scripts, ranging from classics like Casablanca (1942) to current pre-release drafts like Indiana Jones 4 (2008). Our raw pull yielded over 500 scripts in .html and .pdf format, the latter of which had to be coerced into a plain text format to become useful. Thanks to surprisingly consistent formatting conventions, the vast majority of these scripts were immediately ready for parsing into object files. However, the scripts varied in year produced, genre, format, and writing style; the latter two posed significant problems for our ability to parse the scripts reliably. After discarding the absolutely incorrigible data, we were left with 399 scripts to be divided between train and test sets.

The second piece of raw data we acquired was a long-form text database of movie titles linked to their various genres, as reported by imdb.com.
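To make the title-to-genre pairing concrete, here is a minimal illustrative sketch (not the authors' code) of loading such a listing and computing the per-movie label statistics reported below. The file name and the tab-separated title/genre layout are assumptions; the preview does not show the exact format of the IMDb export.

# Illustrative sketch, not the authors' code: load a hypothetical
# tab-separated "title<TAB>genre" listing (one genre per line) and
# compute per-movie label statistics.
from collections import defaultdict

def load_genres(path):
    genres = defaultdict(set)          # title -> set of genre labels
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                title, genre = parts
                genres[title].add(genre)
    return genres

def label_stats(genres):
    counts = [len(g) for g in genres.values()]
    return max(counts), min(counts), sum(counts) / len(counts)

# Usage (file name is hypothetical):
# most, fewest, avg = label_stats(load_genres("imdb_genres.tsv"))
# e.g. 7, 1, and 3.02 for the corpus statistics quoted below.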
The movies in our corpus had 22 different genre labels. The most labels any movie had was 7, the fewest was 1, and the average was 3.02. The exact breakdown is given below:

Genre       Count      Genre       Count
Drama         236      War            17
Thriller      172      Biography      14
Comedy        119      Animation      13
Action        103      Music          12
Crime         102      Short          11
Romance        76      Family         10
Sci-Fi         72      History         6
Horror         71      Western         5
Adventure      62      Sport           5
Fantasy        48      Musical         4
Mystery        45      Film-Noir       3

Table 1: All genres in our corpus with appearance counts

2.1 Processing

The transformation of a movie script from a raw text file to the complex annotated binary file we used as a datum during training required several rounds of pulling higher-level information out of the current datum and adding that information back into the script. Our goal was to compute something of an “information closure” for each script, to maximize our options for designing helpful features.

Building a movie script datum

The first step was to use various formatting conventions in the usable scripts to break each movie apart into a sequence of “frames,” consisting of either character dialogue tagged with the speaker, or stage direction / non-spoken cues tagged with the setting. The generated list of frames, which still consisted only of raw text at this point, was serialized into a .msr binary object file.

Raw frames:

FRAME
--------------------------
TYPE: T_OTHER
Setting: ANGLE ON: A RING WITH DR. EVIL’S INSIGNIA ON IT.
Text: THE RINGED HAND IS STROKING A WHITE FLUFFY CAT
--------------------------------

FRAME
--------------------------
TYPE: T_DIALOGUE
Speaker: DR. EVIL
Text: I spared your lives because I need you to help me rid the world of
the only man who can stop me now. We must go to London.
I’ve set a trap for Austin Powers!
--------------------------------

... ... ...

FRAME
--------------------------
TYPE: T_DIALOGUE
Speaker: SUPERMODEL 1
Text: Austin, you’ve really outdone yourself this time.
--------------------------------

FRAME
--------------------------
TYPE: T_DIALOGUE
Speaker: AUSTIN
Text: Thanks, baby.
--------------------------------

Using the textual search capabilities of Lucene (discussed below), we then paired the .msr files with the correct genre labels, to be used in training and testing. Finally, the text content of each labeled .msr was run through the Stanford NER system [2] and the Stanford POS tagger [9], generating two output files with the relevant part-of-speech and named-entity tags attached to each word. The .msr was annotated with this data and then re-serialized, producing our complete .msa (“movie script annotated”) object file to be used as a datum.

Annotated frames:

FRAME
--------------------------
TYPE: T_OTHER
Setting: INT. DR. EVIL’S PRIVATE QUARTERS
Text: Dr. Evil, Scott and the evil associates finish dinner.
Dr.:        [NNP][PERSON]
Evil,:      [NNP][PERSON]
Scott:      [NNP][PERSON]
and:        [CC][O]
the:        [DT][O]
evil:       [JJ][O]
associates: [NNS][O]
finish:     [VBP][O]
dinner.:    [NN][O]
--------------------------------

... ... ...

FRAME
--------------------------
TYPE: T_DIALOGUE
Speaker: BASIL EXPOSITON
Text: Our next move is to infiltrate Virtucon. Any ideas?
Our:
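The parser itself is not shown in this preview, so the following is only a sketch of the frame-segmentation step under assumed formatting conventions (INT./EXT. scene headings and short ALL-CAPS speaker cues); the authors' actual rules were derived from the scripts' own conventions and were likely more involved.

# Sketch of frame segmentation under assumed conventions; the real
# parser's rules are not shown in the preview.
import re
from dataclasses import dataclass, field

@dataclass
class Frame:
    type: str                  # "T_DIALOGUE" or "T_OTHER"
    tag: str                   # speaker (dialogue) or setting (other)
    lines: list = field(default_factory=list)

def segment(script_lines):
    frames, current = [], None
    for raw in script_lines:
        line = raw.strip()
        if not line:
            continue
        if re.match(r"(INT\.|EXT\.|ANGLE ON)", line):
            current = Frame("T_OTHER", line)        # stage direction / cue
            frames.append(current)
        elif line.isupper() and len(line.split()) <= 3:
            current = Frame("T_DIALOGUE", line)     # speaker cue
            frames.append(current)
        elif current is not None:
            current.lines.append(line)              # body text of the frame
    return frames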


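Finally, a sketch of two of the script-level features mentioned in the introduction, computed over Frame objects like those in the segmentation sketch above. How the authors measured the "length of each speaking part" (words, lines, or characters) is not specified in this preview, so words per dialogue frame is an assumption here.

# Sketch of two script-level features: the ratio of speaking to
# non-speaking frames, and the average length of a speaking part
# (measured here, by assumption, in words per dialogue frame).
def script_features(frames):
    dialogue = [f for f in frames if f.type == "T_DIALOGUE"]
    other = [f for f in frames if f.type == "T_OTHER"]
    speak_ratio = len(dialogue) / max(len(other), 1)
    words = [len(" ".join(f.lines).split()) for f in dialogue]
    avg_speech_len = sum(words) / max(len(words), 1)
    return {"speak_ratio": speak_ratio, "avg_speech_len": avg_speech_len}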