Penn CIS 400 - Learned Automatic Recognition Extraction of appointments from email - D2839522

School: University of Pennsylvania

Course: CIS 400 - Senior Project

Pages: 12

Learned Automatic Recognition Extraction of appointments from email

Lauren [email protected]: Fernando Pereira

Abstract

Email has become one of the most prominent forms of communication. Many people receive an abundance of email each day, making reading every email and extracting important information a very time-consuming and tedious daily task. A lot of people use email as a means of scheduling appointments and notifying others of upcoming events. In this case, the recipient of the email must identify the details of the event and manually add them to a calendar or remember them. If a user does not have her calendar available as she is reading the email, she may forget to add the event.

This project involves trying to automate the extraction of important information regarding events and appointments from email. Specifically, two problems will be addressed. The first is identifying whether the email contains information about an event. The second is, if there is information about an event, extracting things such as the title of the event, the date, the time, and the location so that it can automatically be added to a calendar. To perform these tasks, the program has two separate components: a classifier and an extractor. The classifier takes email as input and outputs a label indicating whether or not the email contains an event. The emails that do contain events are input into the extractor, and the title, date, time, and location are output. These components are combined to create a full system that takes emails as inputs and outputs event information. This system is integrated as part of Penn’s CALO project.

Related Work

Information Extraction is a common application of natural language processing. Extracting “named entities, expressions that denote locations, people, companies, times, and monetary amounts” helps interpret a block of text. A popular way to extract named entities is comparing text with “named entity recognizers” such as lists of people and geographic names (Bird et al, 2005).
Extraction from email is a more recent application of natural language processing and a more complex version of the general information extraction problem. It cannot be solved simply by using a series of named entity recognizers, considering the variety of email and the lack of a globally consistent format. In the past few years, there have been numerous research projects regarding event extraction and other similar extraction applications. To date, no one has been successful with the extraction of events from email.

The known research most similar to mine is the project done by Julie A. Black and Nisheeth Ranjan of Stanford University. They too were working on extracting events from email. They categorized all emails as one of three types: official meeting emails where the event information is separate from the rest of the email, personal meeting emails where the event information is in the body, and all other emails. Their classification algorithm is based on word frequencies. The extractor “uses pairs of sample documents and filled templates to induce pattern match rules that directly extract fillers for the slots in the template.” In addition, they have three specific features. They were not able to achieve their performance goals using this system (Black and Ranjan, 2004).

Some additional prior work has been done by Angelo Dalli of the University of Sheffield. Dalli’s goal was similar to mine and that of Black and Ranjan. He took a different approach to his project by summarizing emails. His project had four components: an entity recognizer, a threaded email filter, an email signature filter, and an email summarizer (Dalli, 2004).

At Carnegie Mellon University, William Cohen, Vitor Carvalho, and Tom Mitchell had a similar project about email classification that involved more natural language processing techniques. Their project was more general, not strictly focused on finding events but classifying all kinds of email.
They came up with many classifications for emails based on nouns and verbs contained in the email. Examples are request, propose, commit, amend, and deliver information. These are combined with information about the nouns to come up with a classification or combination of classifications (Mitchell et al, 2004).

Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew McCallum have done work with information extraction to fill in database forms. They extracted the information from unstructured resources such as documents or email. Their goal was to create an interactive information extraction system which is “to assist the user in filling in database forms while giving the user confidence of the integrity of the data.” They used Conditional Random Fields (CRFs), the same technique used in my extractor (Kristjansson et al, 2004).

My project takes parts from each of these projects. It is most similar to that of Black and Ranjan because it focuses on email and has separate classification and extraction components. Their system, however, is based on pattern matching with sample documents, whereas my project takes a more natural language processing approach. Building specific features should be more effective than sample documents. Samples are only effective if all of the data is similar. Features can take into account different kinds of input and will be more effective on unexpected data. Like Kristjansson, Culotta, and Viola, the extraction algorithm uses CRFs, but I am applying this to email. A combination of all the techniques should, in the future, prove to be more successful.

Technical Approach

This program is divided into two large components: the classifier and the extractor. The diagram below depicts how the components interact. The following will explain the technical details of each component.

Figure 1

Both components use machine learning techniques to train each system using pre-labeled emails.
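
The two-component flow described above (every email goes to the classifier, and only emails labeled as containing an event reach the extractor) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `Event`, `run_pipeline`, `toy_classifier`, and `toy_extractor` are hypothetical names, and the toy functions are hand-written stand-ins for the trained components.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Event:
    """The four fields the extractor is described as producing."""
    title: Optional[str] = None
    date: Optional[str] = None
    time: Optional[str] = None
    location: Optional[str] = None


def run_pipeline(
    emails: Iterable[str],
    contains_event: Callable[[str], bool],
    extract_event: Callable[[str], Event],
) -> List[Event]:
    """Classify every email; pass only the positives on to the extractor."""
    events = []
    for email in emails:
        if contains_event(email):
            events.append(extract_event(email))
    return events


# Hypothetical stand-ins so the sketch runs end to end.
# The real components are trained models, not keyword rules.
def toy_classifier(email: str) -> bool:
    return "meeting" in email.lower()


def toy_extractor(email: str) -> Event:
    # Assumption for illustration only: treat the first line as the title.
    return Event(title=email.splitlines()[0])


if __name__ == "__main__":
    inbox = [
        "Project meeting\nFriday at 3pm in the lab",
        "Lunch menu\nPizza today",
    ]
    for event in run_pipeline(inbox, toy_classifier, toy_extractor):
        print(event)
```

Passing the classifier and extractor in as functions keeps the two stages independent, so either one could be retrained or swapped without touching the other.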
Each component has its own array of features that are specifically engineered to improve its accuracy. During training, each feature is assigned a weight. The weight of a feature represents the likelihood that an email or token matching this feature is a positive instance of what the feature is testing for. For example, in the classifier, all of the features are designed to determine if the message contains an event. If a feature is weighted positively, that signifies that when that feature is found, the message most likely contains an event.

When an email initially
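
The role of feature weights in the classifier can be illustrated with a small sketch. It assumes a simple linear scoring rule (the sum of the weights of the features that fire, plus a bias), which the text does not specify. The feature names, weights, and bias below are invented for illustration; in the project the weights are assigned during training on pre-labeled emails.

```python
import re
from typing import Callable, Dict

# Hypothetical binary features: each returns True when it fires on the email text.
FEATURES: Dict[str, Callable[[str], bool]] = {
    "has_clock_time": lambda t: re.search(
        r"\b\d{1,2}(:\d{2})?\s?(am|pm)\b", t, re.IGNORECASE) is not None,
    "has_weekday": lambda t: re.search(
        r"\b(mon|tues|wednes|thurs|fri|satur|sun)day\b", t, re.IGNORECASE) is not None,
    "has_unsubscribe": lambda t: "unsubscribe" in t.lower(),
}

# Illustrative hand-picked weights (NOT learned). A positive weight means the
# feature is evidence for an event; a negative weight is evidence against.
WEIGHTS: Dict[str, float] = {
    "has_clock_time": 1.5,
    "has_weekday": 1.0,
    "has_unsubscribe": -2.0,
}
BIAS = -1.0  # assumed default leaning toward "no event"


def event_score(text: str) -> float:
    """Sum the weights of all features that fire on this email, plus the bias."""
    return BIAS + sum(WEIGHTS[name] for name, fires in FEATURES.items() if fires(text))


def contains_event(text: str) -> bool:
    """Label the email as containing an event when the score is positive."""
    return event_score(text) > 0


if __name__ == "__main__":
    for email in ["Let's meet Friday at 3pm", "Weekly deals, click to unsubscribe"]:
        print(email, "->", event_score(email), contains_event(email))
```

Under these assumed weights, an email that mentions both a weekday and a clock time scores above zero, while a newsletter-style message is pushed below zero by the negatively weighted feature.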
