Classifying the Sentiment of Movie Review Data
CS 224N Final Project Report
Cheng-Tao Chu, Ryohei Takahashi, Pei-Chin Wang

1: Introduction

With the rapidly growing amount of text available on the Internet, organizing this vast body of information has become increasingly important. Many researchers in natural language processing have studied the problem of automatically assigning documents to categories. One type of categorization that has been studied is classifying the sentiment expressed in documents. Although many documents, such as movie reviews, already carry some measure of sentiment, such as star ratings, many do not, and automatically classifying documents as having positive or negative sentiment would be useful for these unlabelled documents.

In our final project, we investigate the effectiveness of different feature extraction heuristics and feature selection methods, and make an in-depth comparison among three popular classifiers on the task of classifying movie reviews as having positive or negative sentiment. The three classifiers we examine are the Maximum Entropy (MaxEnt) classifier, the Support Vector Machine (SVM), and the Decision Tree (DT). The MaxEnt classifier models the probability distribution that maximizes entropy subject to the constraints imposed by the training data. The SVM classifies data by maximizing the margin between the support vectors, which define the classification boundary. Finally, the DT classifies data by recursively partitioning the feature space into two parts and assigning a category based on which region of the divided space a document falls into, given its features.

The training and test data consist of 1000 positive reviews and 1000 negative reviews, available at http://www.cs.cornell.edu/people/pabo/movie-review-data/ and used by Pang et al. (2002). Since the data has no standard division into training and test sets, we use cross-validation to evaluate performance and to decrease the variance of the resulting estimate.

2: Methods

The procedure our system applies is as follows. First, we perform part-of-speech tagging on the reviews using the Stanford Tagger provided by the Stanford NLP Group (2004). We then perform further preprocessing. We replace all stop words with a special token, so that our feature counts are not dominated by commonly occurring words. We also perform stemming using the Porter stemming algorithm (Porter 2002), so that words with the same stem are counted as the same feature, reducing both the number of features and the sparsity of the data.

In addition, we use an idea from Pang et al. (2002): the prefix "NOT_" is added to words that follow a negation word, such as "not" or a contraction ending in "n't." However, we modify their approach in two ways. First, we append the tag only to adjectives rather than to all words, since it does not make sense to negate words such as prepositions or pronouns. Second, instead of tagging every word between a negation word and the first punctuation mark that follows it, we only consider words within a three-word window after the negation word. We do this because, although a negation word tends to negate the adjectives that follow it, its "range" generally does not extend very far, only a few words (e.g., "not very good," "not very likely").
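To make this heuristic concrete, here is a minimal sketch (our illustration, not the project's actual code). It assumes tokens arrive as (word, POS) pairs with Penn Treebank adjective tags (JJ, JJR, JJS); all names are ours.

```python
# Sketch of the negation-tagging heuristic: prefix "NOT_" to adjectives
# within a three-word window after a negation word. Assumes (word, tag)
# pairs with Penn Treebank tags; names here are illustrative.

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}
WINDOW = 3  # how far past the negation word the tagging extends

def is_negation(word):
    """True for "not" and for contractions ending in "n't"."""
    w = word.lower()
    return w == "not" or w.endswith("n't")

def tag_negations(tagged_tokens):
    """Prefix NOT_ to adjectives within WINDOW words of a negation."""
    out = []
    remaining = 0  # tokens left inside the current negation window
    for word, pos in tagged_tokens:
        if remaining > 0 and pos in ADJECTIVE_TAGS:
            out.append(("NOT_" + word, pos))
        else:
            out.append((word, pos))
        remaining = WINDOW if is_negation(word) else max(remaining - 1, 0)
    return out

# "the movie is not very good" -> "good" becomes "NOT_good"
tokens = [("the", "DT"), ("movie", "NN"), ("is", "VBZ"),
          ("not", "RB"), ("very", "RB"), ("good", "JJ")]
print(tag_negations(tokens))
```

Note that the window resets whenever another negation word appears, matching the intuition that each negation scopes over only the few words that follow it.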
Next, we use the feature extraction heuristics to build the feature matrix. In this matrix, each document is a row and each feature is a column, and each cell holds the heuristic's value for that feature in that document. After feature extraction, we optionally apply a feature selection method to keep only the most significant features, reducing the dimension of the feature matrix. We then pass the reduced feature matrix to each classifier and evaluate their performance. A diagram showing the control flow in our system is shown below.

[Figure: system control flow. Positive and negative docs pass through POS Tagging, Feature Extraction, and Feature Selection, then to the MaxEnt, SVM, and DT classifiers for Evaluation.]

2.1: Feature Extraction

Feature Set       Number of Features   Description
Unigram           14408                The presence of each unigram as a feature
Unigram+POS       21921                The presence of each unigram concatenated with its POS tag as a feature
Unigram+Bigram    96725                The presence of each unigram and each bigram as features
Unigram+Length    14508                The presence of each unigram, plus the percentage of sentences in the document having each particular length

As shown in the table above, we have four feature sets that can be used with the classifiers. In the Unigram+Length feature set, each length feature is the percentage of sentences in a document having a particular length; for example, if 5% of the sentences in a document have length 7, then the LENGTH-7 feature has a value of 0.05 for that document (a code sketch of these feature computations appears at the end of Section 2.2). The "Number of Features" column shows the number of features extracted from a subset of the whole data set, consisting of 150 positive and 150 negative reviews, which we used for some parameter-tuning experiments. As the table shows, the number of features increases significantly in the Unigram+Bigram case: bigrams reflect, to some extent, the context of a word, but they dramatically increase the dimension of the feature matrix and thus the training and testing time.

In keeping with the results of Pang et al. (2002), we use only presence features for unigrams, unigrams with POS tags, and bigrams, rather than frequencies. This also allows us to use the MaxEnt classifier with the same features, since our MaxEnt classifier requires features that are either present or absent.

2.2: Feature Selection

As shown in the previous section, there are many more features than documents, which creates several problems. With such a large number of features, the feature matrix is very large, and most classifiers cannot train on it within a reasonable amount of time or without running out of memory. In addition, with so many features the matrix is very sparse, leading to inaccurate estimates of the probability that certain features are present. We therefore wish to select the important features and discard the uninformative ones, both to speed up computation and to improve performance. Selecting only the most important features may also eliminate noise in the data, further improving the performance of the classifiers. Our system provides two feature selection algorithms. We describe them in
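As referenced in Section 2.1, here is a minimal sketch of the unigram-presence and sentence-length feature computations (our reconstruction for illustration, not the report's code); the feature-name prefixes are made up.

```python
# Sketch of two feature heuristics from Section 2.1: binary unigram
# presence and sentence-length percentages. A document is a list of
# sentences, each a list of (already stemmed) tokens.

from collections import Counter

def unigram_presence(sentences):
    """1 for every unigram that occurs in the document at all."""
    return {"UNI_" + tok: 1 for sent in sentences for tok in sent}

def length_features(sentences):
    """Fraction of the document's sentences having each length."""
    counts = Counter(len(sent) for sent in sentences)
    total = len(sentences)
    return {"LENGTH-%d" % n: c / total for n, c in counts.items()}

def extract_features(sentences):
    """Unigram+Length feature set: presence plus length percentages."""
    feats = unigram_presence(sentences)
    feats.update(length_features(sentences))
    return feats

# A document where 1 of 2 sentences has length 7 -> LENGTH-7 = 0.5
doc = [["this", "movie", "was", "not", "very", "NOT_good", "."],
       ["avoid", "it", "."]]
print(extract_features(doc)["LENGTH-7"])  # 0.5
```

Rows of the feature matrix would then be built by vectorizing one such dictionary per document over a shared feature index.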

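Finally, to round out the pipeline, below is a minimal sketch of the evaluation step using k-fold cross-validation over the three classifiers. It assumes scikit-learn purely for illustration (not the tools the project used), with logistic regression as the usual stand-in for a MaxEnt classifier over binary features.

```python
# Sketch of the evaluation step: k-fold cross-validation over the three
# classifiers. scikit-learn is assumed here for illustration only.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def evaluate(feature_dicts, labels, folds=10):
    """Print mean cross-validated accuracy for MaxEnt, SVM, and DT.

    feature_dicts: one feature dict per document (see sketches above).
    labels: 1 for positive reviews, 0 for negative.
    """
    X = DictVectorizer().fit_transform(feature_dicts)  # sparse matrix
    classifiers = {
        "MaxEnt": LogisticRegression(max_iter=1000),
        "SVM": LinearSVC(),
        "DT": DecisionTreeClassifier(),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, labels, cv=folds)
        print("%s: %.3f" % (name, scores.mean()))
```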
