6.891 Final Project

Email Filtering: Machine Learning Techniques and an Implementation for the UNIX Pine Mail System

Yu-Han Chang
M.I.T. A.I. Lab & L.C.S.
Cambridge, MA 02139
[email protected]

December 10, 1999

1. Introduction

The problem of email filtering is a very practical one. As our lives become ever more tightly tied to the online world, the volume of email coming into our inboxes has been increasing steadily. The need to make sense of all this information is critical if one is to retain one's sanity amid spam, smut, silliness, and semblances of sense. The current solution usually consists of a mail filtering program that sorts incoming mail based on user-specified rules. Many hours go into crafting filtering rules that sort out mail from specific addresses or with specific subject lines. But keeping track of all these rules is, yet again, enough to drive one up the wall. New junk mail from new addresses keeps coming through the door, and real correspondents change email addresses all the time.

Here is where machine learning comes to the rescue. If we can train a machine, or rather a computer program, to learn what email should go where, then we would no longer have to deal with the hassle of sorting email ever again. Junk mail would be discarded. Personal correspondence could be filtered into folders specific to the correspondent. Business or coursework mail could go to their respective folders. Life would again be simple, or at least organized. Preferably, the machine would be an online learner that can incrementally update its filtering abilities over time. Thus, when new types of email arrive, it can update the way it filters such mail accordingly. It learns as it watches us file this mail, and it can learn from its own mistakes. With such an intelligent mail filtering program at our side, we can finally just sit back, relax, and read our mail.

This paper presents techniques to achieve this goal based on well-understood machine learning algorithms. It also describes our implementation, called sortmail, which we have written for users of the Pine mail system under UNIX. Section 2 provides the theory, and Section 3 describes the implementation.

2. Techniques

Machine learning offers a myriad of methods for classifying objects into their respective categories. From estimating probability distributions to regression techniques to reinforcement methods to finding separators in feature spaces, we have a wide-ranging set of tools at our disposal. In our case, we must narrow our domain to the problem of text classification. Text classification takes a document consisting of a set of words and tries to categorize the document among a specified set of topics, or classes. In our setting, we thus classify each given email as belonging to a particular folder. However, an email is not exactly a set of words, and we must decide how to extract the relevant words from a given email. This is in itself a topic of much research. We found these techniques rather interesting, so Section 2.1 is devoted to a brief summary of them. The next two sections then outline two different machine learning techniques for text classification.

2.1. Feature Extraction from Email

An email message does not fall neatly into the domain of text classification.
First of all, it contains much extraneous information that is irrelevant (or at least redundant) to our text classification problem. For example, the major portion of the email header need not be considered part of our document. Lines such as:

    Received: from lychee ([170.1.11.124]) by apple.webjuice.com (8.9.3/8.9.3) with SMTP id KAA43232; Wed, 8 Dec 1999 10:54:40 -0800 (PST) (envelope-from [email protected])

do not convey much useful information to us for classification purposes.

We might imagine that some of these IP addresses contain useful information in the sense that they suggest which user has sent us the email. This in turn might suggest a particular classification, since most messages from this particular user are filed in a particular folder. However, we observe that most email programs hide this information from the user. Thus, if we are trying to mimic the user's preferred method of classification, we can also ignore this information. Even if it contains relevant information in the sense just described, that information must be strongly correlated with other user-observable elements of the email message. This is because the user uses only the observed information for classification, and thus if the hidden information leads to correct classification, the only reason is that it is highly correlated with the observed inputs. Essentially, our learning problem seeks to learn the user's classification function, which maps documents to folders. If our learned function is equivalent to the user's true function, then the inputs of our function can be transformed into the inputs of the user's function, and vice versa. Hence, we use only observed information in our learning problem. We thus use only the To:, From:, and Subject: fields of each message's header as inputs for our classifier. We also include the entire text of the message body.

Once we have done this, there are still many refinements we can make to the document which will facilitate the learning process. Since we are classifying messages based on the presence of certain words in each message, we can simplify our problem by eliminating those words which appear with high frequency in all documents, regardless of classification. These are words such as "the", "he", or "it." Common pronouns, modifiers, and simple adverbs and verbs all fall into this category. We call this list of words the stoplist. Since these words appear in almost all documents, we can ignore them without losing any information; essentially, they are just noise that does not help with classification. Andrew McCallum's text processing package bow contains a stoplist of 524 words [McCal], which we utilize in our implementation. When we encounter any of these words in the process of lexing the message, we simply discard them.
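To make this preprocessing step concrete, the following is a minimal sketch of the feature extraction just described, written in Python rather than as part of sortmail itself. The function name extract_features and the tiny inline STOPLIST are illustrative stand-ins (the actual implementation draws on the full 524-word bow stoplist); the sketch keeps only the To:, From:, and Subject: header fields plus the body, lexes the text into word tokens, and drops any token found in the stoplist.

    import email
    import re

    # Stand-in stoplist; the real implementation uses the 524-word stoplist
    # that ships with Andrew McCallum's bow package.
    STOPLIST = {"the", "he", "it", "a", "an", "and", "of", "to", "is", "in"}

    def extract_features(raw_message: str) -> list[str]:
        """Turn a raw message into a bag of lower-cased word tokens, keeping
        only the To:, From:, and Subject: headers plus the body, and dropping
        stoplist words."""
        msg = email.message_from_string(raw_message)

        # Keep only the user-visible header fields; Received: lines, IP
        # addresses, and other hidden headers are ignored, mirroring what
        # the user actually sees when filing mail.
        parts = [msg.get("To", ""), msg.get("From", ""), msg.get("Subject", "")]

        # Append the message body (plain-text parts only, for simplicity).
        if msg.is_multipart():
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    parts.append(str(part.get_payload()))
        else:
            parts.append(str(msg.get_payload()))

        # Lex into word tokens and filter out stoplist words.
        tokens = re.findall(r"[a-z']+", " ".join(parts).lower())
        return [t for t in tokens if t not in STOPLIST]

    if __name__ == "__main__":
        import sys
        # Read one raw message on stdin and print its first 20 feature tokens.
        print(extract_features(sys.stdin.read())[:20])

The resulting token list is exactly the "set of words" that the classifiers in the following sections operate on; everything else in the raw message is discarded before learning begins.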