Stanford CS 276 - Exercise #2

CS 276 Programming Exercise #2 (100 points)

Assigned: Tuesday, May 17, 2011
Due Date: Thursday, May 26, 2011 by 11:59 pm
Delivery: All students should submit their work electronically. See below for details.
Collaboration: You are allowed (but not required) to work in pairs for this assignment. Teams of two should submit only one copy of their work.
Late policy: Refer to the course webpage.
Honor code: Please review the collaboration and honor code policy on the course webpage.

1. Overview

In this assignment, you will conduct an exercise in supervised machine learning by training classifiers for Usenet newsgroup messages. We provide you with a data set and starter code that iterates through the messages. Your task is to classify new messages into one of the Usenet newsgroup categories by training classifiers such as a Naïve Bayes classifier. This assignment may be more challenging than programming exercise #1, but it should also be more conceptually interesting, and it offers greater opportunity for creativity and extra credit.

a. Programming language and computing environment

Although Java may be the obvious choice for this assignment, you are free to program in any other language you are comfortable with. You should probably complete the assignment using the Stanford UNIX Computing Resources, possibly on one of the intensive-computing machines such as corn or pod. See the following webpage for more details:

http://www.stanford.edu/services/unixcomputing/which.html#intensive

While you are free to develop your code on any platform you choose, your deliverables must run on corn, where they will be invoked by the grading script.

b. Starter code and data set

The starter code is located in /afs/ir.stanford.edu/class/cs276/pe2-2011. Make a new directory and copy everything into it. We are using the 20 Newsgroups collection, an archive of 1000 messages from each of 20 different Usenet newsgroups.
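To make the classification task concrete, here is a hypothetical minimal multinomial Naïve Bayes classifier (a sketch, not the starter code): it counts token frequencies per newsgroup, applies add-one smoothing, and classifies by maximum log posterior, working in log space to avoid floating-point underflow on long messages. The class name, labels, and example tokens are all illustrative.

```java
import java.util.*;

// Hypothetical minimal multinomial Naive Bayes (a sketch, not the starter code):
// per-class token counts, add-one (Laplace) smoothing, maximum log posterior.
public class TinyNaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> classTokens = new HashMap<>(); // total tokens per class
    private final Map<String, Integer> classDocs = new HashMap<>();   // documents per class
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    public void train(String label, List<String> tokens) {
        totalDocs++;
        classDocs.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokens) {
            counts.merge(w, 1, Integer::sum);
            classTokens.merge(label, 1, Integer::sum);
            vocab.add(w);
        }
    }

    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : classDocs.keySet()) {
            // Sum log probabilities instead of multiplying raw probabilities,
            // which would underflow on realistic message lengths.
            double score = Math.log(classDocs.get(label) / (double) totalDocs); // log prior
            Map<String, Integer> counts = wordCounts.get(label);
            int total = classTokens.getOrDefault(label, 0);
            for (String w : tokens) {
                int c = counts.getOrDefault(w, 0);
                score += Math.log((c + 1.0) / (total + vocab.size())); // add-one smoothing
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("rec.sport.hockey", Arrays.asList("goal", "puck", "ice"));
        nb.train("sci.space", Arrays.asList("orbit", "launch", "nasa"));
        System.out.println(nb.classify(Arrays.asList("puck", "goal"))); // rec.sport.hockey
    }
}
```

Add-one smoothing is the textbook baseline; the modifications you will implement later in the assignment (following the Rennie paper cited below) improve on exactly these kinds of poor assumptions.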
The version we selected for this assignment excludes cross-posts (messages that were posted to more than one of the 20 newsgroups), so the actual number of messages is slightly below 1000 per group. For more information, please visit http://people.csail.mit.edu/people/jrennie/20Newsgroups.

A copy of the data set is located in /afs/ir.stanford.edu/data/linguistic-data/TextCat/20Newsgroups/20news-18828, with each newsgroup in a subdirectory containing one file per message. If you are using the Stanford UNIX Computing Resources and/or have access to the Stanford AFS file system, you should be able to read the data set from its current location. There is no need to make a copy of the files.

More details about the starter files:

cs276/pe2-2011/setup.sh
Builds the code and runs MessageParser to produce parsed messages. To get started, run:
./setup.sh /afs/ir/data/linguistic-data/TextCat/20Newsgroups/20news-18828/
If you aren't programming in Java or make any major modifications to MessageParser, make sure that this script still works, as it will be called by the automated grading script. This script explicitly calls Sun Java 1.6.0 instead of gij 4.2.3. Memory allocation has been doubled to account for the increased memory needed to store pointers in 64-bit computing environments such as the corn machines.

cs276/pe2-2011/runNaiveBayes.sh
A wrapper around NaiveBayesClassifier.java. We will use this script to call your code, so make sure it works before you submit. This script also explicitly calls Sun Java 1.6.0 instead of gij 4.2.3.

cs276/pe2-2011/MessageParser.java
Reads in the messages to produce an iterator for the data. Creates separate features for the subject and body fields, applies the stop word list and the Porter stemmer, and counts word frequencies. Outputs a file containing the data, including the newsgroup labels for those vectors. You may modify this file.
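To illustrate the kind of preprocessing MessageParser performs, here is a hypothetical sketch of tokenization with stop-word removal and frequency counting. The class name, the tiny inline stop list, and the example sentence are all illustrative; the real starter code reads the full english.stop list and also applies the Porter stemmer, which is omitted here.

```java
import java.util.*;

// Hypothetical sketch of message preprocessing: lowercase, split on
// non-letters, drop stop words, and count term frequencies.
// (Stemming omitted; the starter code uses the included Porter stemmer.)
public class SimpleTokenizer {
    // Illustrative stop list; the real one is read from english.stop.
    private static final Set<String> STOP =
            new HashSet<>(Arrays.asList("the", "a", "is", "to", "of"));

    public static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (tok.isEmpty() || STOP.contains(tok)) continue; // skip stop words
            counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "the", "is", "to" are dropped; "puck", "passed", "goal" are counted.
        System.out.println(termCounts("The puck is passed to the goal"));
    }
}
```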
cs276/pe2-2011/MessageIterator.java
Sample implementation that iterates over the parsed output of MessageParser.

cs276/pe2-2011/MessageFeatures.java
The data class for the iterator, representing a single parsed message.

cs276/pe2-2011/NaiveBayesClassifier.java
The main entry point for your code. Right now it just parses the command line arguments and calls functions that you need to fill in. Feel free to add all of your code here or add additional files as needed.

cs276/pe2-2011/Stemmer.java
Porter stemmer implementation.

english.stop
List of stop words.

cs224n/util/Counter.java
A map from objects to doubles. Includes convenience methods for getting, setting, and incrementing element counts. (You might find this useful, but you don't have to use it.)

cs224n/util/PriorityQueue.java
A priority queue based on a binary heap. (You might find this useful, but you don't have to use it.)

c. Related literature

A significant portion of this assignment is based on Jason Rennie's paper on improving the text classification accuracy of Naïve Bayes classifiers. You will find the paper "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" at http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.

2. Write-Ups and Deliverables

This project requires you to implement a number of modifications to the Naïve Bayes classifier. Please do them step by step and get results at each step, so that even if you don't finish the whole project, you can still deliver a working partial project. Be sure to save your work at every milestone. You may wish to read the whole assignment before you start.

a. Automatic grading script

Most of your code will be graded by an automated grading script! Please pay attention to the required inputs and outputs for each part of your program described below. In particular, do NOT output any extra text to stdout. Instead, output any debugging messages to stderr (i.e. in Java use System.err.println(...)).
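The stdout/stderr requirement can be sketched as follows (the class name and messages are purely illustrative): the required answer is the only thing printed to System.out, which the grading script captures, while diagnostics go to System.err, which it ignores.

```java
// Illustrative sketch of output discipline for the automated grader:
// stdout carries ONLY the required answer; stderr carries diagnostics.
public class OutputDiscipline {
    // Placeholder for whatever result a given part of the assignment requires.
    static String requiredAnswer() {
        return "rec.sport.hockey";
    }

    public static void main(String[] args) {
        System.err.println("debug: finished training"); // ignored by the grader
        System.out.println(requiredAnswer());           // read by the grader
    }
}
```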
Before we test your code, we will run setup.sh once. You don't need to modify this file to complete the assignment, but if you make significant changes to MessageParser/MessageIterator or choose not to program in Java, you may need to. Then, for each part of the assignment below, we will call:

./runNaiveBayes.sh <mode> <train>

where mode specifies which part of the assignment you should run and train is the location of a directory containing the training data.

