Stanford CS 276 - Characteristics Identifier Scoring (13 pages)

School:
Stanford University
Course:
CS 276 - Information Retrieval and Web Search


Characteristic Identifier Scoring and Clustering for Email Classification
By Mahesh Kumar Chhaparia

Email Clustering
  Given a set of unclassified emails, the objective is to produce high-purity clusters while keeping the training requirements low.

Outline
  Characteristic Identifier Scoring and Clustering (CISC)
  Identifier Set
  Scoring
  Clustering
  Directed Training
  Comparison of CISC with some of the traditional ideas in email clustering
  Comparison of CISC with POPFile (Naïve Bayes classifier)
  Caveats
  Conclusion

Evaluation
  Evaluation on the Enron Email Dataset for the following users (purity measured w.r.t. the grouping already available):

  User          Number of folders   Number of messages   Messages in smallest folder   Messages in largest folder
  Lokay M              11                 2489                       6                          1159
  Beck S              101                 1971                       3                           166
  Sanders R            30                 1188                       4                           420
  Williams w3          18                 2769                       3                          1398
  Farmer D             25                 3672                       5                          1192
  Kitchen L            47                 4015                       5                           715
  Kaminski V           41                 4477                       3                           547

CISC: Identifier Set
  Sender and recipients
  Words from the subject starting with uppercase
  Tokens from the message body:
    Word sequences with each word starting in uppercase (length 2-5 only; split about stopwords, excluding them)
    Acronyms (length 2-5 only)
    Words followed by an apostrophe and s (e.g., TW's, extracted to TW)
    Words or phrases in quotes (e.g., "Trans Western")
    Words where any character excluding the first is in uppercase (e.g., eSpeak, ThinkBank, etc.)

CISC: Scoring
  Sender
  Initial idea: generate clusters of email addresses with frequency of communication above some threshold
    Identifies good clusters of communication
    Difficult to score when an email has addresses spread across more than one cluster
    Fixed partitioning and difficult to update

CISC: Scoring (Contd.)
  Sender
  Need a notion of soft clustering with both recipients and content
  Generate a measure of its non-variability with respect to the addresses it co-occurs with or the content it discusses in emails
  Example: 1 2 3 3 4 2 3 4 in Folder 1; 2 1 3 4 1 3 1 3 in Folder 2
  Emphasizes social clusters: 1 2 3 1 3 4
  Classify: 2 1 3 4
  Traditionally: Folder 2 (address frequency
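
The identifier-set rules listed on the "CISC: Identifier Set" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names, the regular expressions, and the small STOPWORDS list are all hypothetical, and "length 2-5" is read here as 2 to 5 words for capitalized sequences and 2 to 5 letters for acronyms.

```python
import re

# Hypothetical stopword list; the slides do not say which list was used.
STOPWORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or", "the", "to", "with"}


def subject_identifiers(subject):
    """Words from the subject that start with an uppercase letter."""
    return set(re.findall(r"\b[A-Z][A-Za-z0-9]*\b", subject))


def capitalized_sequences(body, min_len=2, max_len=5):
    """Runs of consecutive capitalized words; stopwords and punctuation end a run."""
    sequences = set()
    run = []
    # Words become one token each; any other non-space character is its own token.
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9]*|\S", body)
    for tok in tokens + [""]:  # the empty sentinel flushes the final run
        if tok and tok[0].isupper() and tok.lower() not in STOPWORDS:
            run.append(tok)
        else:
            if min_len <= len(run) <= max_len:
                sequences.add(" ".join(run))
            run = []
    return sequences


def body_identifiers(body):
    """Apply the body-token rules from the slide (assumed interpretation)."""
    ids = set(capitalized_sequences(body))
    # Acronyms of 2 to 5 uppercase letters.
    ids.update(re.findall(r"\b[A-Z]{2,5}\b", body))
    # Words followed by an apostrophe and s, reduced to the bare word.
    ids.update(re.findall(r"\b([A-Za-z]\w*)'s\b", body))
    # Words or phrases in double quotes.
    ids.update(m.strip() for m in re.findall(r'"([^"]+)"', body))
    # Words with an uppercase character anywhere after the first one.
    for word in re.findall(r"[A-Za-z]\w*", body):
        if any(c.isupper() for c in word[1:]):
            ids.add(word)
    return ids


def extract_identifiers(sender, recipients, subject, body):
    """Union of addresses, subject words, and body tokens for one email."""
    return {sender} | set(recipients) | subject_identifiers(subject) | body_identifiers(body)


example_body = 'The Trans Western Pipeline team met. TW\'s capacity on "Red Rock" was discussed at eSpeak.'
example_ids = body_identifiers(example_body)
```

Here `example_ids` collects the capitalized sequence, the acronym, the quoted phrase, and the mixed-case word from the sample sentence. A literal reading of the rules also picks up noise such as "It" from "It's" or any all-caps word, so a real system would presumably rely on the scoring step to down-weight such identifiers.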
