Stanford CS 224 - Lecture Notes


Figure 2. Flow Chart to Apply HMM in Bondec
Figure 3. Original Sentence Format in the Train/Test Files
Figure 4. A Document Unit for the HMM Package

Bondec – A Sentence Boundary Detector

Haoyi Wang
Stanford Engineering Informatics
Stanford University
Stanford, CA
[email protected]

Yang Huang
Stanford Medical Informatics
Stanford University
Stanford, CA
[email protected]

Abstract

The Bondec system is a sentence boundary detection system with three independent applications (Rule-based, HMM, and Maximum Entropy). The Maximum Entropy model is the central part of the system; it achieved an error rate of less than 2% on part of the Wall Street Journal (WSJ) corpus using only eight binary features. The performance of the three applications is illustrated and discussed.

Keywords: Sentence boundary disambiguation, Maximum Entropy Model, Features, Generalized Iterative Scaling, Hidden Markov Model.

1. INTRODUCTION

Sentence boundary disambiguation is the task of identifying the sentence units within a paragraph or an article. Because the sentence is the basic textual unit immediately above the word and phrase, Sentence Boundary Disambiguation (SBD) is an essential problem for many applications of Natural Language Processing: parsing, information extraction, machine translation, and document summarization. The accuracy of an SBD system directly affects the performance of these applications. However, past research in this field has already achieved very high performance, and the area is not very active now; the problem seems too simple to attract researchers' attention.

In fact, the problem is not as simple as it appears. We usually think of a sentence as a sequence of words ending with a terminal punctuation mark, such as '.', '?', or '!', and most sentences end with a period. However, a period can also belong to an abbreviation, such as "Mr.", or represent a decimal point in a number such as $12.58.
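The ambiguity described above is easy to demonstrate: a splitter that breaks on every terminal punctuation mark followed by whitespace mishandles abbreviations. A minimal sketch (the function name and the example sentence are our own, not part of Bondec):

```python
import re

def naive_split(text):
    """Split on '.', '?', or '!' followed by whitespace -- the naive rule."""
    return re.split(r"(?<=[.?!])\s+", text)

# The period after "Mr." is wrongly treated as a sentence boundary:
print(naive_split("Mr. Smith paid $12.58. He left."))
# -> ['Mr.', 'Smith paid $12.58.', 'He left.']  (three pieces instead of two)
```

Note that the decimal point in $12.58 is handled correctly by accident, because it is not followed by whitespace; the abbreviation case is the one that genuinely requires disambiguation.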
In these cases the period is part of an abbreviation or a number, and it does not delimit a sentence because it has a different meaning. On the other hand, the trailing period of an abbreviation can also mark the end of a sentence at the same time. In most such cases, the word following the period is a capitalized common word (e.g., "The President lives in Washington D.C. He likes that place."). Moreover, if the following word is a proper noun or part of a proper phrase, which is always capitalized, the SBD system usually should not label the period as the start of the next sentence but as part of the same sentence (e.g., "P. R. China"). Disambiguating a proper name from a common word is a challenging problem in itself and makes sentence boundary ambiguity even more complicated.

The original SBD systems were built from manually generated rules in the form of regular-expression grammars, augmented with lists of abbreviations, common words, proper names, etc. For example, the Alembic system (Aberdeen et al., 1995) deploys over 100 regular-expression rules written in Flex. Such systems may work well on the language or corpus for which they were designed. Nevertheless, developing and maintaining an accurate rule-based system requires substantial hand-coding effort and domain knowledge, which is very time-consuming. Another drawback of such systems is that it is difficult to port an existing system to other domains or to corpora of other languages; such a switch amounts to building a new system from scratch.

Current research in SBD focuses on machine learning techniques, such as decision trees, neural networks, Maximum Entropy, and Hidden Markov Models, which treat the SBD task as a standard classification problem.
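A rule-based splitter of the kind described above can be sketched in a few lines. The abbreviation list here is a tiny hypothetical sample; real systems such as Alembic rely on hundreds of rules and much larger lexicons:

```python
# Hypothetical abbreviation list; a real rule-based system would use a far larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "inc.", "u.s."}

def rule_based_split(text):
    """Split after terminal punctuation, unless the token is a known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "?", "!")) and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing words that lack terminal punctuation
        sentences.append(" ".join(current))
    return sentences

print(rule_based_split("Mr. Smith arrived. He sat down."))
# -> ['Mr. Smith arrived.', 'He sat down.']
```

The sketch also illustrates the portability problem the text mentions: the abbreviation list is language- and domain-specific, so moving to a new corpus means rebuilding it.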
The general principle of these systems is to train on a (usually annotated) training set so that the system "remembers" the features of the local context around sentence-breaking punctuation, or global information such as lists of abbreviations and proper names, and then to recognize real sentence boundaries in text using the trained model.

Because the system we developed implements only machine-learning techniques, we limit our discussion to this category. In this project report, Section One describes the problem we want to solve; Section Two summarizes related research on machine-learning systems; Section Three illustrates the approach we chose, including an introduction to the mathematical background; Section Four discusses the principal algorithms and explains the architecture of the Bondec system; Section Five evaluates the performance of our system and compares the three applications; Section Six presents the experience and lessons we derived from our work.

2. RELATED RESEARCH

Palmer and Hearst (1997) developed a system, SATZ, that uses the local syntactic context to classify a potential sentence boundary. To obtain syntactic information for the local context, SATZ needs the words in the context to be tagged with part-of-speech (POS) tags. "However, requiring a single part-of-speech assignment for each word introduces a processing circularity: because most part-of-speech taggers require predetermined sentence boundaries, the boundary disambiguation must be done before tagging. But if the disambiguation is done before tagging, no part-of-speech assignments are available for the boundary determination system."

To bypass this problem, SATZ maps the Penn Treebank POS tags into 18 generic POS categories, such as noun, article, proper noun, and preposition. These 18 categories are combined with two other non-POS categories, capitalization and following a punctuation mark, to compose a syntactic category set for each word.
Thus, the sets for the three tokens before and the three tokens after the boundary candidate constitute the local syntactic context, which is the input for two types of
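The local-context classification idea running through this section can be sketched as a binary feature extractor around a candidate period. The feature names and abbreviation list below are illustrative only; they are neither Bondec's actual eight features nor SATZ's category set:

```python
def boundary_features(prev_tok, next_tok):
    """Binary features describing the local context of a candidate period.

    prev_tok is the token ending in the candidate period; next_tok is the
    token that follows it. Feature names are hypothetical examples.
    """
    abbrevs = {"mr", "mrs", "dr", "prof", "inc"}  # tiny hypothetical list
    core = prev_tok.rstrip(".")  # token with trailing period(s) removed
    return {
        "prev_is_abbrev": core.lower() in abbrevs,
        "prev_is_single_char": len(core) == 1,   # e.g. the "P." in "P. R. China"
        "next_is_capitalized": next_tok[:1].isupper(),
        "next_is_digit": next_tok[:1].isdigit(),
    }

print(boundary_features("Mr.", "Smith"))
# prev_is_abbrev and next_is_capitalized are both true here, which is exactly
# the ambiguous case a classifier must learn to resolve.
```

A trained classifier (Maximum Entropy, decision tree, etc.) would weight such features to decide whether the candidate period ends a sentence.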

