Berkeley STAT 133 - Text Manipulation: Creating Spam-related Variables

Stat 133, Fall 04
Homework 3: Text Manipulation: Creating Spam-related Variables
Due: Monday, 4 Oct

For this homework, you need to create three variables from the email messages. These messages are available in an R dataset in the rda file located at http://www.stat.berkeley.edu/users/nolan/stat133/data/Emails.rda.

Descriptions of thirty variables appear below. They are split into three groups: A, B, C. You are to transform the email data into three of these variables, one from each group. The list found at the end of the homework determines which three you will write code to create. For at least one of the three, your code must be in an R function.

Email the code you use to create these three variables, as plain text in the body of your email. In addition, turn in a graphical comparison of spam and ham for each of these three variables. Discuss whether or not you think each variable will be useful in predicting if an email message is ham or spam.

GROUP A:
1. The subject is "Re: something or other."
2. The number of lines in the body of the email.
3. The number of characters in the body of the email.
4. The Reply-To has an underscore and numbers/letters.
5. The number of exclamation marks in the subject.
6. The number of question marks in the subject.
7. The number of attachments.
8. The X-priority or X-Msmail-Priority set to high.
9. The number of recipients.

GROUP B:
1. The average length of words in the body.
2. The Received time in the current time zone.
3. The From: ends in numbers, e.g. david gezi <[email protected]>
4. The subject is all capitals (excluding punctuation and numbers).
5. The percent of lines in the body of the email that begin with >.
6. The subject contains one of the following words: viagra, pounds, free, weight, guarantee, millions, dollars, credit, risk, prescription, generic, drug, money back, credit card.
7. The Message-Id has no hostname.
8. The body of the email contains a line with the two words "Original" and "Message" and no other alpha characters.
9. The percentage of blanks in the subject.
10. The body of the email contains the word "wrote:" or "schrieb:" or "ecrit:" or a similar expression in another language.

GROUP C:
1. The email contains the recipient's email address.
2. The recipient list is sorted by address.
3. The subject has punctuation or digits surrounded by characters, e.g. V?agra and pay1ng, but not New!
4. The difference between the Date and the Received date (be careful with time differences).
5. The header states that the message is multipart, but it is mostly text or html (i.e. the number of attachments that are plain text or html).
6. The email contains images.
7. The number of dollar signs in the body of the email.
8. The body contains Dear something, such as DEAR SIR, or Dear Madam.
9. The Message-Id has a hostname that does not match the sender's hostname, but does match a hostname at a relay point.
10. For an HTML attachment, the characters in the html tags as a percentage of the total number of characters in the message (excluding blanks). Note that html tags start < and end >.
11. The percentage of the characters in the body of the email that are upper case (excluding blanks, numbers, and punctuation).

The emails you will use for this assignment are in the list Emails, where each element of the list contains one email message. Each email is itself a list consisting of three elements:
• The element named "header" is a named character vector, where each name corresponds to a key in the email header and the value of the element corresponds to the text following the : in the key:value of the header.
• The element named "body" is itself a list, the first element of which is named "text" and contains the body of the email message. This element is a character vector, with one string per line in the email message. A second element, if it exists, is named "attachments." This element is a list containing one element per attachment. The individual attachment element is a list of two elements: one containing information about the format of the attachment and the other containing the contents of the attachment.
• The element named "spam" is a logical vector of length 1 that indicates whether the message is spam (TRUE) or ham (FALSE).

To determine which three variables you are to write the code to create, look up the last letter in your SCF login, i.e. if your login is s133bu then your assignment is #8 in group A, #1 in group B, and #4 in group C.

A  B  C  SCF login    A  B  C  SCF login
1  1  1  a            2  2  2  b
3  3  3  c            4  4  4  d
5  5  5  e            6  6  6  f
7  7  7  g            8  8  8  h
9  9  9  i            9  10 10 j
5  3  4  k            9  2  6  l
8  2  7  m            7  4  8  n
6  5  9  o            5  6  10 p
4  7  3  q            3  8  1  r
2  9  2  s            1  10 3  t
8  1  4  u            7  2  5  v
6  3  7  w            5  4  6  x
1  7  2  y            4  5  9
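
As a rough illustration of the Emails structure described above, the following R sketch builds a tiny stand-in list with the same shape and computes two of the Group A variables from it. The toy messages, the header key names ("From", "Subject"), and the helper names getHeader, numExclaimSubject, and numBodyLines are all assumptions made for this example; the real key names in Emails.rda may differ, and this is only one possible approach, not a model solution.

```r
# Toy stand-in for the Emails list (hypothetical data, NOT from Emails.rda).
# Each message has the three elements described above: header, body, spam.
Emails <- list(
  list(
    header = c(From = "alice@example.com", Subject = "Re: lunch on Friday?"),
    body   = list(text = c("Sounds good.", "", "> Are you free on Friday?")),
    spam   = FALSE
  ),
  list(
    header = c(From = "deals99@example.com", Subject = "FREE money!!! Act now!"),
    body   = list(text = c("Dear Sir,", "You have won $$$.")),
    spam   = TRUE
  )
)

# Hypothetical helper: return a header field, or NA if the key is absent.
getHeader <- function(email, key) {
  if (key %in% names(email$header)) email$header[[key]] else NA_character_
}

# Group A, item 5: number of exclamation marks in the subject.
# Assumes the subject is stored under the key "Subject".
numExclaimSubject <- function(email) {
  subject <- getHeader(email, "Subject")
  if (is.na(subject)) return(NA_integer_)
  hits <- gregexpr("!", subject, fixed = TRUE)[[1]]
  # gregexpr returns -1 when there is no match at all.
  if (hits[1] == -1L) 0L else length(hits)
}

# Group A, item 2: number of lines in the body (one string per line).
numBodyLines <- function(email) length(email$body$text)

# Apply each function to every message to get one value per email.
isSpam    <- sapply(Emails, function(e) e$spam)
exclaims  <- sapply(Emails, numExclaimSubject)
bodyLines <- sapply(Emails, numBodyLines)
```

With the real data, load("Emails.rda") after downloading the file would supply the Emails list in place of the toy one, and a call such as boxplot(exclaims ~ isSpam) is one possible way to start the graphical comparison of spam and ham.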

