Princeton COS 116 - Laboratory 11 - D1114438


School: Princeton University

Course: COS 116 - The Computational Universe

Pages: 6


COS 116 The Computational Universe
Laboratory 11: Machine Learning

In the lecture, we surveyed many machine learning algorithms and their applications. In this lab, you will explore algorithms for two of those applications in greater detail: spam filtering and text generation.

Lab submission: Submit by Tuesday, May 6 in Room 410 of the CS building (Ms. Donna O’Leary’s office) during normal business hours. Turn in answers to all the questions posed in the body of the lab and in the “Additional Questions” section.

Part 1: Introduction to Spam Filtering

Modern spam filters use statistical inference to classify an email as spam or “ham” (i.e. non-spam). These filters require access to a large corpus of emails, containing both spam and ham, with each email in the corpus labeled appropriately. The larger and more diverse the corpus, the better the performance of the filter. Even the simplest of these kinds of filters can detect more than 90% of spam.

Here is the procedure that a simple spam filter follows to classify an email:

• Step 1: Split the email to be classified into individual words.
  Ex. “Your loan request approved!” becomes ‘your’, ‘loan’, ‘request’, and ‘approved’.

• Step 2: Compute the spam score for each word in the email. The formula for the spam score of a word is:

                      Fraction of spam emails in the corpus that contain word
    SpamScore(word) = -------------------------------------------------------
                      Fraction of ham emails in the corpus that contain word

  Remark: If word is much more prevalent in spam than in ham, then SpamScore(word) will be a big number. Conversely, if word is much more prevalent in ham than in spam, then SpamScore(word) will be a small number. So the spam score correlates with the “spammy-ness” of a word.

• Step 3: Multiply the spam scores for all the words in the email to get a spam score for the email itself.
  For example: SpamScore(“Your loan request approved!”) = SpamScore(‘your’) × SpamScore(‘loan’) × SpamScore(‘request’) × SpamScore(‘approved’).

  Remark: The spam score for an email will be large if it contains many “spammy” words, and small otherwise.

• Step 4: Classify the email as spam if its spam score is above a certain threshold, and classify it as ham otherwise. The value of the threshold controls how aggressively spam is filtered. A lower threshold will cause more email to be classified as spam.

Review the preceding procedure carefully, as understanding it is necessary for completing the lab. Ask your TA for help if you have any questions.

Part 2: Experiments with Spam Filtering

1. Open this web page:
   http://www.cs.princeton.edu/courses/archive/spring08/cos116/lab11/spam1.html
   This is a web interface to a spam filter. The spam filter is connected to a corpus of roughly 8,000 emails: 6,000 ham and 2,000 spam.

2. Select “First Data Set”, and then click “Classify”. (Don’t adjust the value of “Threshold” yet.) This data set contains ten emails. When you click “Classify”, the procedure described in Part 1 is performed for each email in the data set, and the results are displayed.

3. Read the emails that were misclassified, and comment in your report on why you think each one was misclassified. Identify some words in each email that you suspect misled the spam filter.

4. Adjust “Threshold” so that all the spam emails are correctly classified while minimizing the number of misclassified ham emails. Note this threshold in your report. Click “Classify” after each adjustment to “Threshold”. Recall that an email is classified as spam if its spam score is above the threshold and as ham otherwise.

5. Adjust “Threshold” so that all the ham emails are correctly classified while minimizing the number of misclassified spam emails. Note this threshold in your report.

6. Which of the previous approaches to setting the threshold would you prefer to use for your own inbox?
   Explain your answer.

Part 3: More Experiments with Spam Filtering

1. Open this web page:
   http://www.cs.princeton.edu/courses/archive/spring08/cos116/lab11/spam2.html
   This web interface allows you to enter your own emails and have them classified by the spam filter. You can enter any text you like; it doesn’t need to be in the format of an email (i.e. with the ‘From:’, ‘To:’, etc. at the top). You can even enter a single word to determine the “spammy-ness” of that word.

2. Enter each of the following words individually into the text area, and click “Classify” for each word:
   a. Spam words: viagra, potency, money-back, mortgage, lender, mega
   b. Ham words: blog, management, dialog, ouch, alumni, administrivia

3. From Part 2, Step 3, obtain the words you identified as having misled the spam filter. Paste each of these words individually into the text area, and classify them. Compared to the words in the previous step, are they as spammy/hammy as you suspected? Report your findings.

4. From your own inbox, copy a ham email and a spam email, and classify each of them. (If you are concerned about privacy, know that the web page does not record anything you enter.) Does the spam filter correctly give the spam email a higher spam score than the ham email? Put the text of both emails in your report.

5. Try writing a couple of emails that defeat the spam filter.
   a. Write an email that conveys a legitimate message, but nonetheless has a high spam score. Try to make the spam score as high as you can.
      Ex: “Let’s meet for lunch at Frist, I am mega hungry. I hear the chili has a lot of potency.”
   b. Write an email that conveys a spammy message, but nonetheless has a low spam score. Try to make the spam score as low as you can. Can you force the score below 1.0?
      Ex: "How about some medicine that rhymes with a famous waterfall?"
   Put the text of both emails in your report.

Part 4: Introduction to Text Generation

It is surprisingly easy to generate novel, semantically-plausible text from a small amount of sample text. For example, from the 2007 State of the Union address, one can automatically generate text like the following:

“This war is more competitive by strengthening math and science skills. The lives of our nation was attacked, I ask you to make the same standards, and a prompt up-or-down vote on the work we've done and reduce gasoline usage in the NBA.”

Below is a simple procedure for generating this kind of text from a sample text. The procedure outputs one word at a time. There is a single
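
As a recap of Part 1, the four-step spam scoring procedure can be sketched in Python. This is a minimal illustration, not the lab's actual filter: the toy corpus, the function names, and the tokenizing rule are all assumptions made for the example. The lab also does not say what happens when a word never appears in the ham corpus (which would mean dividing by zero), so the sketch assumes add-one smoothing of the fractions.

```python
import re
from math import prod

# Hypothetical toy corpus; the lab's real corpus has roughly 8,000 emails.
SPAM = ["loan approved today", "cheap loan offer", "your loan request"]
HAM = ["lunch today at noon", "your meeting notes", "project meeting today"]


def tokenize(text):
    """Step 1: split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def containing_fraction(word, emails):
    """Fraction of emails that contain word.

    Assumption: add-one smoothing, so the fraction is never exactly 0
    and the division in word_score is always defined.
    """
    hits = sum(1 for email in emails if word in tokenize(email))
    return (hits + 1) / (len(emails) + 2)


def word_score(word, spam, ham):
    """Step 2: spam fraction divided by ham fraction."""
    return containing_fraction(word, spam) / containing_fraction(word, ham)


def email_score(text, spam, ham):
    """Step 3: multiply the scores of all words (repeats count each time)."""
    return prod(word_score(w, spam, ham) for w in tokenize(text))


def classify(text, spam, ham, threshold=1.0):
    """Step 4: spam if the score is above the threshold, ham otherwise."""
    return "spam" if email_score(text, spam, ham) > threshold else "ham"


if __name__ == "__main__":
    for message in ["Your loan request approved!", "meeting today"]:
        score = email_score(message, SPAM, HAM)
        print(message, round(score, 3), classify(message, SPAM, HAM))
```

Lowering the threshold argument makes classify label more messages as spam, which is the trade-off explored in Part 2, Steps 4 and 5.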
