CS 585: Natural Language Processing
Fall 2004

Programming Assignment 2: Naive Bayes Classifier
Out: Tue, September 28, 2004
Due: Tue, October 12, 2004

Spam Filtering using Naive Bayes

Naive Bayes is a simple, effective machine learning solution to the problem of document classification. For this assignment, you will implement a Naive Bayes classifier to classify email messages as spam (junk mail) or ham (legitimate messages).

Data

The data was collected from http://spamassassin.org/publiccorpus/. For this assignment, use the slightly modified version found at http://canberra.cs.umass.edu/~culotta/cs585/ass1-data.tgz (8.3M). The data consists of 4150 ham messages and 1897 spam messages, with original header information intact.

Tasks

Read data: Read in all messages and store them in some efficient manner. You must decide how to tokenize each message, i.e. designate which characters indicate the start of a new "word." See http://www.paulgraham.com/spam.html for one way of doing this. (You will probably also want to assign each unique word an integer index that you can use as an index into arrays of word counts and word probabilities.) If you choose to do this assignment in Java, you can use the skeleton code for this task from http://www.cs.umass.edu/~mccallum/courses/inlp2004/pa2-skeleton.tgz. A sketch of one possible tokenizer and word index follows.
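For concreteness, here is a minimal sketch of this step, assuming each message has already been read into a String. The Vocabulary class and its methods are illustrative, not part of the course skeleton, and the character set used by the tokenizer is only one reasonable choice (roughly in the spirit of the Graham article above).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative word index: maps each unique word to a small integer so that
// word counts and word probabilities can be stored in plain arrays.
class Vocabulary {
    private final Map<String, Integer> index = new HashMap<String, Integer>();
    private final List<String> words = new ArrayList<String>();

    // Return the index of a word, assigning a fresh index on first sight.
    public int indexOf(String word) {
        Integer i = index.get(word);
        if (i == null) {
            i = words.size();
            index.put(word, i);
            words.add(word);
        }
        return i;
    }

    public int size() { return words.size(); }

    // One possible tokenizer: a "word" is any maximal run of letters, digits,
    // apostrophes, dashes, or dollar signs; any other character ends a word.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int j = 0; j < text.length(); j++) {
            char c = text.charAt(j);
            if (Character.isLetterOrDigit(c) || c == '\'' || c == '-' || c == '$') {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}

Each message can then be stored as an int[] of word indices, which keeps the corpus compact and makes the counting in the training step a matter of array increments.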
Split Data: Randomly split the data into a training set (70% of the messages) and a testing set (30%).

Train Classifier: Using the training set only, estimate and store the prior class distributions P(spam) and P(ham), as well as the conditional probability distributions P(w|spam) and P(w|ham). It is crucial that you store probability measures as log-probabilities to avoid running out of floating-point resolution. This also means you need to do arithmetic in log-space; i.e., multiplications of probabilities become additions of log-probabilities.

Test Classifier: Classify each message in the testing set as spam or ham according to the Naive Bayes formulation. A sketch covering both the training and testing steps follows.
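The sketch below makes two stated assumptions: messages arrive as int[] arrays of word indices over a vocabulary built from the training set (test tokens outside that vocabulary are marked with index -1 and skipped), and the smoothing constant is exposed as a parameter k, with k = 1 giving "plus one" smoothing, so the "alternate priors" experiment below only changes one number. All names are illustrative.

import java.util.List;

// Illustrative Naive Bayes model: class 0 = ham, class 1 = spam.
class NaiveBayesModel {
    private final double[] logPrior = new double[2];  // log P(class)
    private final double[][] logLikelihood;           // log P(w|class)

    // Estimate log P(class) and log P(w|class) from tokenized training
    // messages, smoothing each word count by k (k = 1 is "plus one").
    public NaiveBayesModel(List<int[]> messages, List<Integer> labels,
                           int vocabSize, double k) {
        logLikelihood = new double[2][vocabSize];
        int[] docCount = new int[2];
        double[] tokenCount = new double[2];
        double[][] wordCount = new double[2][vocabSize];

        for (int m = 0; m < messages.size(); m++) {
            int c = labels.get(m);
            docCount[c]++;
            for (int w : messages.get(m)) {
                wordCount[c][w]++;
                tokenCount[c]++;
            }
        }
        for (int c = 0; c < 2; c++) {
            logPrior[c] = Math.log(docCount[c] / (double) messages.size());
            for (int w = 0; w < vocabSize; w++) {
                logLikelihood[c][w] = Math.log(
                        (wordCount[c][w] + k) / (tokenCount[c] + k * vocabSize));
            }
        }
    }

    // Log-space score: log P(c) + sum of log P(w|c) over the message's tokens.
    // Tokens unseen in training (index -1, by the assumption above) are skipped.
    public double logScore(int[] message, int c) {
        double score = logPrior[c];
        for (int w : message) {
            if (w >= 0) score += logLikelihood[c][w];
        }
        return score;
    }

    // Classify as spam (1) if the spam score beats the ham score.
    public int classify(int[] message) {
        return logScore(message, 1) > logScore(message, 0) ? 1 : 0;
    }
}

Because scores are sums of logs, a product of thousands of per-word probabilities never underflows, and comparing log-scores is equivalent to comparing posteriors. For the precision-recall graph you will also want the posterior probability of the predicted class, which can be recovered from the two log-scores with a single exp-normalization.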
Evaluate Classifier: Evaluate the performance of your classifier using two methods: a confusion matrix and a precision-recall graph. A confusion matrix summarizes the types of errors your classifier makes, as in Table 1. Here, TP is the number of spam messages classified as spam, TN is the number of ham messages classified as ham, FP is the number of ham messages misclassified as spam, and FN is the number of spam messages misclassified as ham.

                       Actual
    Predicted     Spam      Ham
    Spam          TP        FP
    Ham           FN        TN

Table 1: Confusion Matrix: TP = "true positive", TN = "true negative", FP = "false positive", FN = "false negative".

A precision-recall graph plots the classifier's precision at various points of recall, where precision = (TP + TN) / testingSetSize, and recall = TP / (TP + FN). To construct a precision-recall graph, sort the predicted messages in decreasing order of the posterior probability of the predicted class (i.e. in decreasing order of confidence that the classification is correct). Next, iterate through the list in descending order and, at each message, calculate the precision and recall of the classifications of the current message and the messages already iterated through. Plot these (recall, precision) points on the graph. A sketch of this construction follows.
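A sketch of the (recall, precision) bookkeeping, with one assumption flagged: the precision denominator is taken to be the number of messages iterated through so far (one reading of the definition above), while recall's denominator is the total number of actual spam messages in the test set (TP + FN). The array names are illustrative.

import java.util.Arrays;
import java.util.Comparator;

// Illustrative construction of the (recall, precision) points.
class PrecisionRecall {
    // confidence[i] = posterior probability of the predicted class for test
    // message i; predicted[i] and actual[i] are 0 (ham) or 1 (spam).
    public static double[][] points(final double[] confidence,
                                    final int[] predicted, final int[] actual) {
        final int n = confidence.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort message indices by decreasing classifier confidence.
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(confidence[b], confidence[a]);
            }
        });

        int totalSpam = 0;  // TP + FN over the whole test set
        for (int a : actual) if (a == 1) totalSpam++;

        double[][] points = new double[n][2];
        int correct = 0, truePositives = 0;
        for (int seen = 0; seen < n; seen++) {
            int i = order[seen];
            if (predicted[i] == actual[i]) correct++;
            if (predicted[i] == 1 && actual[i] == 1) truePositives++;
            points[seen][0] = truePositives / (double) totalSpam;  // recall
            points[seen][1] = correct / (double) (seen + 1);       // precision
        }
        return points;
    }
}

Writing these points to a text file and plotting them with any plotting tool is sufficient for the report.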
Additional Experiments

Perform 2 of the following 4 experiments, using the same evaluation techniques as in the spam experiment.

- Effects of train/test split: Split the data into 50% training, 50% testing (instead of 70-30).

- Different word feature sets: Try some different word features and investigate their accuracies. Try at least two different word features. Some examples include: (1) parse each document to extract the TO, FROM, CC, and SUBJECT fields, and use only the tokens from these fields to represent each message; (2) downcase all the words so that case information is lost; (3) remove MIME attachments; (4) remove HTML tags; and (5) remove words with fewer than two occurrences in your training set.

- Alternate priors: Instead of using "plus one" smoothing, try numbers different than one. Does performance degrade if you use numbers much smaller than one? How about larger than one? Is there some number other than one that gives higher classification accuracy than "plus one"?

- Prune vocabulary size by information gain: Instead of using all the words in the training data, use only the top 1000 words with the highest information gain (mutual information with the class label) found from the training set. Try the top 1000, 100, and 10 words, plus some others. Does the test set accuracy change? Why do you think so? A sketch of the information-gain computation appears after this list.
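A sketch of the information-gain computation, assuming the word feature is binary ("w occurs in the document or not"), a common choice for this kind of pruning. The count arrays and names are illustrative.

// Illustrative information gain: mutual information between the binary
// "word w appears in the document" feature and the class label.
class InfoGain {
    // Binary entropy; the base of the log does not matter for ranking words.
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;  // 0 log 0 taken as 0
        return -p * Math.log(p) - (1 - p) * Math.log(1 - p);
    }

    // docsWithWord[c][w] = number of class-c training documents containing w;
    // docCount[c] = number of class-c training documents (0 = ham, 1 = spam).
    static double[] gains(int[][] docsWithWord, int[] docCount, int vocabSize) {
        double n = docCount[0] + docCount[1];
        double classEntropy = entropy(docCount[1] / n);
        double[] gain = new double[vocabSize];
        for (int w = 0; w < vocabSize; w++) {
            double present = docsWithWord[0][w] + docsWithWord[1][w];
            double absent = n - present;
            double pSpamGivenPresent =
                    present > 0 ? docsWithWord[1][w] / present : 0.0;
            double pSpamGivenAbsent =
                    absent > 0 ? (docCount[1] - docsWithWord[1][w]) / absent : 0.0;
            // IG(w) = H(C) - [P(present) H(C|present) + P(absent) H(C|absent)]
            gain[w] = classEntropy
                    - (present / n) * entropy(pSpamGivenPresent)
                    - (absent / n) * entropy(pSpamGivenAbsent);
        }
        return gain;
    }
}

Sorting the words by gain[w], keeping only the top 1000 (or 100, or 10) indices, and rebuilding the model on the pruned vocabulary completes the experiment.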
What to turn in

Code: Print out all source code written for the project.

Report: Write a 2-page report that includes the following:

1. Problems encountered - What difficulties did you have and how did you resolve them?

2. Tokenization method - If you did not use the Java skeleton or if you performed experiments with different word feature sets, how did you tokenize the input and why?

3. Experimental results - In addition to the confusion matrix and precision-recall graphs for each experiment, also include the overall accuracy for each.

4. Discussion - Explain your results. What additional experiments do you think would improve your classifier?