Stanford CS 224n - Comparing Naïve Bayesian and k-NN algorithms for automatic email classification

Comparing Naïve Bayesian and k-NN algorithms for automatic email classification

Louis Eisenberg
Stanford University M.S. student
PO Box 18199
Stanford, CA
[email protected]

ABSTRACT

The problem of automatic email classification has numerous possible solutions; a wide variety of natural language processing algorithms are potentially appropriate for this text classification task. Naïve Bayes implementations are popular because they are relatively easy to understand and implement, they offer reasonable computational efficiency, and they can achieve decent accuracy even with a small amount of training data. This paper seeks to compare the performance of an existing Naïve Bayesian system, POPFile [1], to a hand-tuned k-nearest neighbors system. Previous research has generally shown that k-NN should outperform Naïve Bayes in text classification. My results fail to support that trend, as POPFile significantly outperforms the k-NN system. The likely explanation is that POPFile is a system specifically tuned to the email classification task that has been refined by numerous people over a period of years, whereas my k-NN system is a crude attempt at the problem that fails to exploit the full potential of the general k-NN algorithm.

INTRODUCTION

Using machine learning to classify email messages is an increasingly relevant problem as the rate at which Internet users receive emails continues to grow. Though classification of desired messages by content is still quite rare, many users are the beneficiaries of machine learning algorithms that attempt to distinguish spam from non-spam (e.g. SpamAssassin [2]). In contrast to the relative simplicity of spam filtering – a binary decision – filing messages into many folders can be fairly challenging. The most prominent non-commercial email classifier, POPFile, is an open-source project that wraps a user-friendly interface around the training and classification of a Naïve Bayesian system.
My personal experience with POPFile suggests that it can achieve respectable results, but it leaves considerable room for improvement. In light of the conventional wisdom in NLP research that k-NN classifiers (and many other types of algorithms) should be able to outperform a Naïve Bayes system in text classification, I adapted TiMBL [3], a freely available k-NN package, to the email filing problem and sought to surpass the accuracy obtained by POPFile.

DATA

I created the experimental dataset from my own inbox, considering the more than 2000 non-spam messages that I received in the first quarter of 2004 as candidates. Within that group, I selected approximately 1600 messages that I felt confident classifying into one of the twelve "buckets" that I arbitrarily enumerated (see Table 1). I then split each bucket and allocated half of the messages to the training set and half to the test set.

As input to POPFile, I kept the messages in Eudora mailbox format. For TiMBL, I had to convert each message to a feature vector, as described in section 3.

Table 1. Classification buckets (* – training and test combined)

POPFILE

POPFile implements a Naïve Bayesian algorithm. Naïve Bayesian classification depends on two crucial assumptions (both of which follow from the single Naïve Bayes assumption of conditional independence among features, as described in Manning and Schütze [4]): (1) each document can be represented as a bag of words, i.e. the order and syntax of words is completely ignored; (2) in a given document, the presence or absence of a given word is independent of the presence or absence of any other word. Naïve Bayes is thus incapable of appropriately capturing any conditional dependencies between words, guaranteeing a certain level of imprecision; however, in many cases this flaw is relatively minor and does not prevent the classifier from performing well.

To train and test POPFile, I installed the software on a Windows system and then used a combination of Java and Perl to perform the necessary operations.
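The two assumptions above can be illustrated with a minimal multinomial Naïve Bayes sketch. This is a toy example with invented training data, not POPFile's actual implementation, which adds its own tuning for email:

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes over bags of words (assumption 1: word
# order ignored; assumption 2: word occurrences conditionally
# independent given the class). Illustrative only -- not POPFile code.
train = [
    ("work", "meeting agenda project deadline"),
    ("work", "project report meeting notes"),
    ("personal", "dinner friday movie tickets"),
    ("personal", "birthday party friday dinner"),
]

class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for label, text in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def classify(text):
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum over words of log P(word | class),
        # with add-one (Laplace) smoothing for unseen words
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("project meeting tomorrow"))  # -> work
```

Because the per-word likelihoods multiply independently, the classifier cannot represent dependencies such as "project meeting" being more work-like than the two words in isolation, which is exactly the imprecision noted above.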
To train the classifier I fed the mbx files (separated by category) directly to the provided utility script insert.pl. For testing, I split each test set mbx file into its individual messages, then used a simple Perl script to feed the messages one at a time to the provided script pipe.pl, which reads in a message and outputs the same message with POPFile's classification decision prepended to the Subject header and/or added in a new header called X-Test-Classification. After classifying all of the messages, I ran a Java program, popfilescore, to tabulate the results and generate a confusion matrix.

k-NN

To implement my k-NN system I used the Tilburg Memory-Based Learner, a.k.a. TiMBL. I installed and ran the software on various Unix-based systems. TiMBL is an optimized version of the basic k-NN algorithm, which attempts to classify new instances by seeking "votes" from the k existing instances that are closest/most similar to the new instance. The TiMBL reference guide [5] explains:

Memory-Based Learning (MBL) is based on the idea that intelligent behavior can be obtained by analogical reasoning, rather than by the application of abstract mental rules as in rule induction and rule-based processing. In particular, MBL is founded in the hypothesis that the extrapolation of behavior from stored representations of earlier experience to new situations, based on the similarity of the old and the new situation, is of key importance.

Preparing the messages to serve as input to the k-NN algorithm was considerably more difficult than in the Naïve Bayes case. A major challenge in using this algorithm is deciding how to represent a text document as a vector of features. I chose to consider five separate sections of each email: the attachments, the from, to and subject headers, and the body. For attachments, each feature was a different file type, e.g. jpg or doc. For the other four sections, each feature was an email address, hyperlink URL, or stemmed and lowercased word or number. I discarded all other headers.
I also ignored any words of …
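The tokenization and neighbor-voting scheme described above can be sketched roughly as follows. All names and the similarity measure here are hypothetical simplifications: TiMBL's real input is a fixed-length symbolic feature vector and its distance metrics are weighted overlap measures, and a real system would stem words and handle each header section separately as the paper describes:

```python
import re
from collections import Counter

# Crude sketch of email-to-features plus basic k-NN voting.
# Hypothetical simplification of the scheme described in the text;
# not TiMBL's actual feature format or distance metric.
def features(subject, body):
    # Lowercase and split into word-like tokens; a fuller version
    # would also stem words and pull out addresses, URLs, and
    # attachment file types per section.
    tokens = re.findall(r"[a-z0-9@.]+", (subject + " " + body).lower())
    return Counter(tokens)

def similarity(a, b):
    # Count of shared tokens (a simple stand-in for TiMBL's
    # weighted overlap metrics).
    return sum((a & b).values())

def knn_classify(train, subject, body, k=3):
    target = features(subject, body)
    # Take "votes" from the k stored instances most similar to the
    # new instance, as in the basic k-NN algorithm above.
    neighbors = sorted(
        train,
        key=lambda ex: similarity(features(ex[0], ex[1]), target),
        reverse=True,
    )[:k]
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ("project deadline", "the report is due friday", "work"),
    ("meeting notes", "agenda for the project meeting", "work"),
    ("movie night", "tickets for friday dinner and a movie", "personal"),
    ("party", "birthday dinner on saturday", "personal"),
]
print(knn_classify(train, "project meeting", "report and agenda"))  # -> work
```

Unlike Naïve Bayes, nothing here summarizes the training data into per-class statistics; every stored message participates directly in classification, which is the memory-based extrapolation the TiMBL guide describes.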