Outline
• SVMLight
• Training Step
• Format of input file (training data)
• Testing Step
• Example
• Confusion Matrix
• Evaluations of Performance

SVMLight
• SVMLight is an implementation of Support Vector Machines (SVM) in C.
• Download the source from: http://svmlight.joachims.org/
• The site gives a detailed description of:
  – What are the features of SVMLight?
  – How to install it?
  – How to use it?
  – …

Training Step
• svm_learn [-option] train_file model_file
• train_file contains the training data;
• The filename of train_file can be anything; its extension can be chosen arbitrarily by the user;
• model_file contains the model that SVM builds from the training data.

Format of input file (training data)
• For text classification, the training data is a collection of documents;
• Each line represents one document;
• Each feature represents a term (word) in the document;
  – The label and each of the feature:value pairs are separated by a space character;
  – The feature:value pairs MUST be ordered by increasing feature number;
• The feature value is a term weight, e.g., tf-idf.

Testing Step
• svm_classify test_file model_file predictions
• The format of test_file is exactly the same as that of train_file;
• Feature values need to be scaled into the same range as the training data;
• We use the model built from the training data to classify the test data, and compare the predictions with the original label of each test document.

Example
• In test_file, we have:
  1 101:0.2 205:4 209:0.2 304:0.2 …
  -1 202:0.1 203:0.1 208:0.1 209:0.3 …
  …
• After running svm_classify, the predictions may be:
  1.045
  -0.987
  …
  which means this classifier classifies both documents correctly;
• or:
  1.045
  0.987
  …
  which means the first document is classified correctly but the second one incorrectly.

Confusion Matrix

                     Predicted
                     negative   positive
  Actual  negative      a          b
          positive      c          d

• a is the number of correct predictions that an instance is negative;
• b is the number of incorrect predictions that an instance is positive;
• c is the number of incorrect predictions that an instance is negative;
• d is the number of correct predictions that an instance is positive.

Evaluations of Performance
• Accuracy (AC) is the proportion of the total number of predictions that were correct:
  AC = (a + d) / (a + b + c + d)
• Recall is the proportion of actual positive cases that were correctly identified:
  R = d / (c + d)
• Precision is the proportion of the predicted positive cases that were correct:
  P = d / (b + d)

Example
• Actual test cases: 450 "-", 550 "+"

                     Predicted
                     "-"    "+"
  Actual   "-"       400     50
           "+"        20    530

• For this classifier: a = 400, b = 50, c = 20, d = 530
  Accuracy = (400 + 530) / 1000 = 93%
  Precision = d / (b + d) = 530 / 580 = 91.4%
  Recall = d / (c + d) = 530 / 550 = 96.4%
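The arithmetic in the worked example can be checked in a few lines of Python, a minimal sketch using the counts from the confusion matrix above:

```python
# Confusion-matrix counts from the example: a, d are correct
# predictions; b, c are incorrect predictions.
a, b, c, d = 400, 50, 20, 530

accuracy = (a + d) / (a + b + c + d)   # (400 + 530) / 1000
precision = d / (b + d)                # 530 / 580
recall = d / (c + d)                   # 530 / 550

print(f"Accuracy  = {accuracy:.1%}")   # 93.0%
print(f"Precision = {precision:.1%}")  # 91.4%
print(f"Recall    = {recall:.1%}")     # 96.4%
```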
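The sparse input format described earlier (a label followed by space-separated feature:value pairs sorted by feature number) can be sketched in Python. The helper name `to_svmlight_line` is hypothetical, not part of SVMLight; the produced file is what svm_learn and svm_classify would read:

```python
# Write examples in SVMLight's sparse format:
#   <label> <feature>:<value> <feature>:<value> ...
# feature:value pairs must be sorted by increasing feature number.

def to_svmlight_line(label, features):
    """features: dict mapping feature number -> value (e.g., a tf-idf weight).
    (Illustrative helper, not part of SVMLight itself.)"""
    pairs = " ".join(f"{f}:{v}" for f, v in sorted(features.items()))
    return f"{label} {pairs}"

# The two documents from the Example section.
docs = [
    (1, {101: 0.2, 205: 4, 209: 0.2, 304: 0.2}),     # positive document
    (-1, {202: 0.1, 203: 0.1, 208: 0.1, 209: 0.3}),  # negative document
]

with open("train_file", "w") as fh:
    for label, feats in docs:
        fh.write(to_svmlight_line(label, feats) + "\n")

# Sorting puts features in increasing order regardless of insertion order.
print(to_svmlight_line(1, {304: 0.2, 101: 0.2}))  # 1 101:0.2 304:0.2
```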
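The predictions file contains one real-valued score per test document, and the sign of the score gives the predicted class; a small sketch of comparing scores against true labels, using the numbers from the Example section:

```python
# A score > 0 predicts class +1; a score < 0 predicts class -1.
true_labels = [1, -1]           # labels taken from test_file

predictions = [1.045, -0.987]   # scores from the predictions file
correct = [(p > 0) == (y > 0) for p, y in zip(predictions, true_labels)]
print(correct)   # [True, True]: both documents classified correctly

# With scores 1.045 and 0.987 instead, the second (negative) document
# would be misclassified as positive.
predictions2 = [1.045, 0.987]
correct2 = [(p > 0) == (y > 0) for p, y in zip(predictions2, true_labels)]
print(correct2)  # [True, False]
```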