Stanford CS 224 - Lecture Notes


Applying Name Entity Recognition to Informal Text

Yu-shan Chang
Department of Computer Science, Stanford University
[email protected]

Yun-Hsuan Sung
Department of Electrical Engineering, Stanford University
[email protected]

Abstract – Although Named Entity Recognition (NER) has been well studied in recent years, it is seldom applied to informal documents such as e-mail messages and newsgroup postings. Unlike formal text, which is well structured and contains few errors, informal text makes named entities much harder to recognize. The key problems are that informal text is unstructured and contains more grammatical and spelling errors. These properties degrade the performance of existing classifiers and of the carefully designed features that suit formal-text NER. In this project, we apply two approaches that are often used for formal-text NER, the Maximum Entropy classifier (MaxEnt) and the Conditional Random Field (CRF), to informal-text NER. We run experiments to test whether they remain effective on the informal task. We also focus on how to extract efficient and effective features specifically for informal text.

I. INTRODUCTION

Named Entity Recognition (NER) has been a well-studied problem for formal text. The task is to extract named entities, such as person names, locations, and organizations, from a text. Several classification methods have been applied successfully to this task. Chieu and Ng [1] and Bender et al. [2] used the Maximum Entropy approach as the classifier. McCallum and Li [3] explored Conditional Random Fields (CRF) for NER. Mayfield et al. [4] applied Support Vector Machines (SVM) to classify each named entity. Florian et al. [5] combined Maximum Entropy and hidden Markov models (HMM) under different conditions.

Other research has focused on extracting efficient and effective features for NER.
Chieu and Ng [1] successfully combined local features, which are drawn from the words near the target word, with global features, which are drawn from the whole document. Klein et al. [6] and Whitelaw et al. [7] report that character-based features are useful for recognizing the special structure of some named entities.

All of these works were applied only to formal text, namely a collection of newswire articles from the Reuters Corpus¹. Such text is well structured and well organized, and it contains few grammatical or spelling errors. The sentences are well formed, so sentence boundaries are easy to detect. The text also contains few non-word characters, which are easily confused during NER. All of these properties make NER on formal text relatively easy compared with informal text.

¹ http://about.reuters.com/researchandstandards/corpus/

So far, there has been little work on NER for informal text. Minkov [8] reports some early results on informal-text NER using hidden Markov models (HMM) and Conditional Random Fields (CRF). In our project, we explore the Maximum Entropy classifier (MaxEnt) and the Conditional Random Field (CRF) for informal-text NER. The informal text is divided into two subsets, e-mail messages and newsgroup postings. We combine special features designed for e-mail with features that are generally used in formal-text NER.

The rest of this report is organized as follows. We first clarify the problem setting and define the performance metrics in Section II. The two approaches we used are explained in Section III. In Section IV, we describe the feature engineering work for the experiments. Section V describes the corpora and experiment settings. The experiment results and error analysis are shown in Section VI. Finally, we conclude the report in Section VII.

II. PROBLEM STATEMENT

The main problem of NER is to recognize the named entities in a document, which are generally grouped into person names, locations, and organizations. In general, we extract features for each word, either as string labels or as indicator functions. We then use labeled training data and the extracted features to train a classifier, obtaining the optimized parameters with a numerical optimization algorithm. Finally, given the features of a word in the test data, we use this classifier to decide which label the word belongs to.

The features we used fall into two categories. The first is the string label, which is simply a description of the feature. For example, "isFirstWord=1" means that the word is the first word in a sentence. The second is the indicator function, which can be seen as the answer to a true-or-false question. For example:

\[ f(w) = \begin{cases} 1, & \text{if } w \text{ is the first word of a sentence} \\ 0, & \text{otherwise} \end{cases} \]

Informal text has special properties that formal text lacks, and these properties make the original, carefully designed features unsuitable for the new task. We need to observe these properties and find new features that fit them.

The first property is that e-mail messages and newsgroup postings always begin with a well-defined header, which has a special format but little syntactic structure. Take e-mail as an example: if we treat the header as an ordinary paragraph, we will not get good results because of its special format. However, we found that the named entities in this part occur only in a few specific header fields, such as From, To, etc. We can process these fields specially to extract the named entities.

Another special zone is the "signature", which is generally found at the end of an e-mail message or posting. The signature almost always contains a name. However, like the header, it lacks sentence structure, so it is difficult to find the named entity with general features. The signature also lacks a common format, which makes it even more difficult to handle than the header.
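The two feature representations described in this section, indicator functions and string labels, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `is_capitalized` feature, the `INDICATORS` table, and the sample sentence are hypothetical and were chosen only to show how an indicator value maps onto a "name=value" string.

```python
from typing import Callable, Dict, List

Sentence = List[str]


def is_first_word(sentence: Sentence, i: int) -> int:
    """Indicator f(w): 1 if the word at position i starts the sentence, else 0."""
    return 1 if i == 0 else 0


def is_capitalized(sentence: Sentence, i: int) -> int:
    """Hypothetical extra indicator: 1 if the word begins with an uppercase letter."""
    return 1 if sentence[i][:1].isupper() else 0


# Hypothetical feature table: feature name -> indicator function.
INDICATORS: Dict[str, Callable[[Sentence, int], int]] = {
    "isFirstWord": is_first_word,
    "isCapitalized": is_capitalized,
}


def string_label_features(sentence: Sentence, i: int) -> List[str]:
    """Render every indicator for word i as a 'name=value' string label."""
    return [f"{name}={fn(sentence, i)}" for name, fn in INDICATORS.items()]


if __name__ == "__main__":
    sentence = ["Please", "call", "Alice", "tomorrow"]
    for i, word in enumerate(sentence):
        print(word, string_label_features(sentence, i))
```

Keeping the indicators in one table means a new feature only needs one new function and one new entry, and both representations stay in sync.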
There is also prior work on extracting signatures from e-mail (Carvalho and Cohen, 2004). For example:

[Example omitted in the extracted text.]

Even in the text zone, sentences are often not well formed and contain many grammatical and spelling errors. Further, because the set of senders and receivers is restricted, the messages contain more abbreviations and more group-specific and task-related jargon. For example:

[Example omitted in the extracted text.]

We use four performance metrics: accuracy, precision, recall, and F1. They are defined as follows:

                Correct    Not correct
Selected        tp         fp
Not selected    fn         tn

\[ \text{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn}, \qquad \text{Precision: } P = \frac{tp}{tp + fp}, \]

\[ \text{Recall: } R = \frac{tp}{tp + fn}, \qquad F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \]
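The four metrics can be computed directly from the contingency counts tp, fp, fn, and tn. The sketch below assumes the weighted harmonic-mean form of the F measure, F = 1 / (α/P + (1 − α)/R), where α = 0.5 gives F1. The function name `ner_metrics` and the example counts are hypothetical and are not results from the report.

```python
from typing import Dict


def ner_metrics(tp: int, fp: int, fn: int, tn: int, alpha: float = 0.5) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F from contingency counts.

    alpha weights precision against recall; alpha = 0.5 yields F1.
    Undefined ratios (zero denominators) are reported as 0.0 by convention here.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision > 0 and recall > 0:
        f_measure = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f": f_measure,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in ner_metrics(tp=8, fp=2, fn=4, tn=86).items():
        print(f"{name}: {value:.4f}")
```

Because most words in a document are not named entities, tn dominates the counts and accuracy can look high even when few entities are found, which is why precision, recall, and F are reported alongside it.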

