Columbia COMS W4705 - Document Classification

Homework 2 – Document Classification
Natural Language Processing
Due: Oct 29, 2007 at 2:00 p.m.

Assignment:

Your assignment is to run a number of machine learning experiments on a set of news data and describe your experiments and findings. The task involves classifying news stories in a number of different ways:

1. By Source Type
The training and testing material will include both newswire (NWIRE) text digests and broadcast news (BN) transcripts. This is a binary classification: BROADCAST vs. TEXT.

2. By Source Language
The materials are drawn from news sources in three languages: Mandarin Chinese, American English, and Modern Standard Arabic. The Chinese and Arabic materials are translated into English, so you will be classifying the translations. This is a three-way classification: Mandarin (MAN) vs. English (ENG) vs. Arabic (ARB).

3. By Source News Organization
There are 20 news organizations that contribute to the materials. No news organization crosses source types or source languages, so the previous two classifications may be helpful here. E.g., The New York Times is an English newswire organization; it will never appear as the source of Mandarin text or BN. This is a 20-way classification. See Appendix A for a listing of news organizations along with their source type and language.

4. By Broad Topic
The data has been manually annotated for a broad class of topics comprising general topics like "Accidents" or "Sports News". There are thirteen broad classes; some helpful information about the classes can be found in Appendix B. Bear in mind that not every story is annotated for topic. You are only asked to classify those that are; therefore, you will only construct feature vectors for a subset of the stories. You are only required to classify broad topics. You may not use the narrow topic labels referring to specific events (for example, "Deadly Fire in Bangladeshi Garment Factory") as features!

You are to use the machine learning toolkit weka to run your classification experiments.
To this end, one part of your submission will be a program that generates weka .arff formatted files. As discussed in class and in the weka documentation, these files describe your data set as a series of class-labeled feature vectors. Your program should read the data set and produce one .arff file for each classification, for a total of 4 files.

The feature set that you extract for use in these classification experiments is completely up to you; however, you obviously must not use any of the document labels (<SOURCE_TYPE>, <SOURCE_LANG>, <DOC_DATE>, <NARROW_TOPIC>, etc.) as features in your feature vector. You may extract different features for different classification tasks, but you are not required to. You should try at least three different classification algorithms for each task so you can see how they operate on different tasks.

For these classification experiments you should use 10-fold cross-validation. It is essential that you use the weka package found at /home/cs4705/bin/weka.jar to run your experiments. If you do not, there is no guarantee that it will be possible to evaluate your final models.

You must also export the model that yielded the best results for each task, and submit it along with your feature extractor code – if you do not, evaluating your submission will be impossible. It is also essential that you indicate the classifier and parameters that generated the submitted model.

You may find that you want to use features that are calculated relative to the entire data set. For example: "Does this story have more or fewer words than the average story in the training data?" These types of features can be very useful. However, you need to be careful when using them in a cross-validation setting: these features should never be calculated using any testing material. This may force you to run the cross-validation evaluation "manually" – that is, randomly dividing the training data into training and testing sets for feature extraction and evaluation.
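As a concrete sketch of the target format, a generated sourceType.arff might begin like the fragment below. The two numeric attributes and their values are placeholders for illustration only, since the feature set is left entirely up to you; the class values come from the task description above.

```
@RELATION sourceType

@ATTRIBUTE word_count NUMERIC
@ATTRIBUTE type_token_ratio NUMERIC
@ATTRIBUTE class {BROADCAST,TEXT}

@DATA
412,0.63,TEXT
288,0.71,BROADCAST
```

Each @DATA row is one story's feature vector followed by its class label, matching the order of the @ATTRIBUTE declarations.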
For your submission you may build a model on your entire training set.

For some of the classifications (Source Type, Source Language, Source News Organization) every story in a document will have the same class. However, the classification should still operate on the story level, not the document level. Therefore, it might make sense for every story from a document to have an identical feature vector.

Submission:

Your submission should require as little human interaction as possible to test; therefore you MUST follow these instructions.

In your submission you must include the following files generated by your system:
1) sourceType.arff and sourceType.model
2) sourceLanguage.arff and sourceLanguage.model
3) sourceNO.arff and sourceNO.model
4) topicBroad.arff and topicBroad.model

The following scripts are crucial:
1) Submit one script to compile your code: make.sh
2) Submit four additional scripts, one for each classification task. Each of these scripts generates an .arff file and runs weka on a given directory that contains the input files. These scripts will be used to test your models on unseen data. For example:

./runSourceType.sh sourceType.model /home/nlp/hw2-testfiles

will extract features from all *.input files in /home/nlp/hw2-testfiles, generate a sourceTypeTest.arff file, and then run weka using sourceType.model and sourceTypeTest.arff to produce a weka result report. To get these results from the command line you can use the following (assuming that the J48 algorithm was used when you built your model):

java -Xmx1G -cp /home/cs4705/bin/weka.jar weka.classifiers.trees.J48 -l sourceType.model -T sourceTypeTest.arff

The remaining scripts are invoked the same way:
./runSourceLang.sh sourceLang.model /home/nlp/hw2-testfiles
./runSourceNOLang.sh sourceNO.model /home/nlp/hw2-testfiles
./runTopicBroad.sh topicBroad.model /home/nlp/hw2-testfiles

You must also produce a write-up of your experiments.
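A minimal runSourceType.sh wrapper could be generated as sketched below. The FeatureExtractor class name is a placeholder for whatever your own extractor program is called, and J48 is assumed only because it is the classifier used in the example command above.

```shell
# Sketch: write out a runSourceType.sh wrapper via a heredoc.
# FeatureExtractor is a hypothetical name for your own feature
# extraction program -- substitute your actual entry point.
cat > runSourceType.sh <<'EOF'
#!/bin/sh
# Usage: ./runSourceType.sh sourceType.model /path/to/testfiles
MODEL="$1"
TESTDIR="$2"
# Extract features from all *.input files into sourceTypeTest.arff
# (FeatureExtractor is a placeholder for your own program).
java -cp . FeatureExtractor sourceType "$TESTDIR" sourceTypeTest.arff
# Evaluate the saved model on the generated test set
# (assumes the model was built with J48).
java -Xmx1G -cp /home/cs4705/bin/weka.jar weka.classifiers.trees.J48 \
    -l "$MODEL" -T sourceTypeTest.arff
EOF
chmod +x runSourceType.sh
```

The weka invocation with -l (load a saved model) and -T (supply a test set) mirrors the command given in the instructions above, so the script produces the same result report the graders expect.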
This write-up should describe your experiments and must also include a discussion of the processes you followed and the results you obtained. Some questions that should be addressed can be found in the grading section below. This write-up should definitely include the cross-validation result reports of the experiments you ran. Make your discussion empirical rather than impressionistic (i.e., refer to specific

