Stanford CS 224N - Hidden Markov Models for Information Extraction


Hidden Markov Models for Information Extraction
Nancy R. Zhang
June, 2001

Contents
Abstract
I. Introduction
II. Methods
  II.i Data
  II.ii HMM Structure Template and Learning Algorithm
  II.iii Parameter Estimation
III. Experimental Results
IV. Conclusion & Further Work
V. Related Work
VI. References

Abstract

As compared to many other techniques used in natural language processing, hidden Markov models (HMMs) are an extremely flexible tool and have been successfully applied to a wide variety of stochastic modeling tasks. This paper uses a machine learning approach to examine the effectiveness of HMMs at extracting information of varying levels of structure. A stochastic optimization procedure is used to find the optimal structure for a given task, and a modified version of the Baum-Welch algorithm is used for parameter estimation.

I. Introduction

The purpose of this project is to investigate the relationship between structure and performance of HMMs applied to information extraction problems. It is intuitive that different state configurations are appropriate for different types of extraction problems. What would be the effect of using the same structural template to train HMMs for different extraction tasks of varying levels of complexity?

A simple structural template will be used for HMM structure learning by the stochastic optimization algorithm described in Freitag & McCallum (1999).
Due to the simplicity of the structural template (which will be explained in detail in II.ii), I hypothesize that the search space it defines is sufficient for good performance on simple extraction tasks (extracting fields that are relatively consistent in symbol content and phrase structure), but will not generalize well to harder tasks.

A more important objective of this project is to gain real-life experience in the use of HMMs for information extraction, and to explore the fascinating, less-understood domain of automated structural learning.

II. Methods

II.i Data

The corpus used for this project is created from the L.A. Weekly Restaurants data set taken from the RISE archive of Ion Muslea's group (http://www.isi.edu/~muslea/RISE/). It consists of lists of restaurant information and reviews in HTML format. The text is semi-structured: the restaurants are organized in HTML list format, with each entry headed by the restaurant name, location, and phone number and followed by the review. The corpus contains listings of restaurants of a variety of different cuisines, which will be useful for the less-structured information retrieval tasks. Each datapoint from the corpus consists of the HTML entry for a specific restaurant. Thus, the task in this project is to extract an informational record (e.g. name, phone, cuisine) from each datapoint.

For the purpose of HMM training, each datapoint will be labelled to reflect the target and non-target fields. Accompanying each data symbol sequence will be a binary sequence {Li}, with Li = 1 if the ith symbol in the sequence belongs to the target field, and 0 otherwise.

II.ii HMM Structure Template and Learning Algorithm

The structure learning algorithm used is based on that described in Freitag & McCallum. The states in the HMM will be denoted by S = {s1, …, sn}, and the emission symbols by V = {w1, …, wK}. O = {o1, …, om} represents the symbol sequence of a single datapoint.
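The binary labelling scheme of II.i can be sketched as follows. This is a minimal illustration: the whitespace tokenizer, the sample entry, and the function name are mine, not the paper's.

```python
# Sketch of the {Li} labelling described in II.i (illustrative example;
# the tokenizer and the sample datapoint are not taken from the corpus).

def label_sequence(tokens, target_span):
    """Return labels L, with L[i] = 1 iff token i lies in the target field."""
    start, end = target_span  # half-open token-index span of the target field
    return [1 if start <= i < end else 0 for i in range(len(tokens))]

# One hypothetical datapoint: an HTML restaurant entry, whitespace-tokenized.
tokens = "<li> <b> Casa Bianca </b> Italian cuisine , Eagle Rock </li>".split()
labels = label_sequence(tokens, target_span=(2, 4))  # target field: the name

assert labels == [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

During training, each observation sequence O would be paired with such a label sequence so that target and non-target emissions can be attributed to the appropriate state types.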
The probability of transition to state sj while in state si will be denoted by P(sj | si), and that of emitting symbol wk in state si by P(wk | si).

The HMM models used in this study will have four different kinds of states: Background, Prefix, Suffix, and Target. Target states emit tokens of the target field. Prefix and Suffix states emit tokens that appear respectively before and after the target field. All other tokens are emitted in the background state. Each state will carry a label denoting its type.

The structure learning algorithm follows a fixed template in its search and finds the optimal structure from all of the structures reachable given the template. This reduces the search space significantly, and also adds human guidance to the process. Designing the template for the search is no trivial matter, since it determines which structures are and are not considered by the algorithm. The Freitag and McCallum study allows for any number of background states, each having outgoing transitions only to the first prefix state and incoming transitions only from the last suffix state. This may be useful in the case where the target field is distributed in several regions of the text and must be extracted in fragments, since then the background symbols between the fragments may follow different distributions. However, in the information extraction tasks considered in this project, each field appears complete and unfragmented, at only one location in each observation sequence. Thus there is not much incentive to differentiate one background state from another, which would significantly increase the complexity of the model. Only one background state is therefore allowed in the template; it has outgoing transitions to itself and the prefix states, and incoming transitions from the suffix states.

To allow variable prefix length, transition is allowed from the background state to any prefix state, and from any suffix state to the background state.
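The single-background-state template described above can be written out as an explicit set of typed states and allowed transitions. A minimal sketch, assuming two prefix states, two suffix states, and one target state; the state names and chain lengths are illustrative, not the paper's:

```python
# Sketch of the structural template: one background state with a self-loop,
# a prefix chain, a target state, and a suffix chain. Sizes are illustrative.

n_prefix, n_suffix = 2, 2
states = (["background"]
          + [f"prefix{i}" for i in range(n_prefix)]
          + ["target"]
          + [f"suffix{i}" for i in range(n_suffix)])

allowed = set()
allowed.add(("background", "background"))           # background self-loop
allowed.add(("background", "prefix0"))              # enter the prefix chain
for i in range(n_prefix - 1):
    allowed.add((f"prefix{i}", f"prefix{i + 1}"))   # walk the prefix chain
allowed.add((f"prefix{n_prefix - 1}", "target"))
allowed.add(("target", "target"))                   # multi-token target fields
allowed.add(("target", "suffix0"))
for i in range(n_suffix - 1):
    allowed.add((f"suffix{i}", f"suffix{i + 1}"))   # walk the suffix chain
allowed.add((f"suffix{n_suffix - 1}", "background"))  # return to background

# Transition probabilities P(sj | si) are estimated only over pairs in
# `allowed`; all other transitions are fixed at zero by the template.
```

The structure search then varies quantities such as the chain lengths within this template, rather than searching over arbitrary topologies.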
This is another deviation from the Freitag & McCallum study, which allows transitions only between the background state and the first prefix and final suffix states. This increase in model complexity will prove beneficial, especially for extracting less-structured data. For example, consider the task of extracting the type of cuisine from the restaurant review. The following sequences appear in the corpus:

… light but traditional Mexican faire …
… intriguing mix of Japanese, Indian, and Italian cuisines …
… authentic faire from Szechuan China …

If the prefix (suffix) were fixed length, then the further away the prefix (suffix)
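The deviation described above amounts to extra allowed transitions in the template. A sketch, continuing the illustrative naming from before and assuming three prefix and three suffix states (not the paper's code):

```python
# Fixed-length windows (Freitag & McCallum): the background state may enter
# only the FIRST prefix state, and only the LAST suffix state may return.
n_prefix, n_suffix = 3, 3
fixed = {("background", "prefix0"),
         (f"suffix{n_suffix - 1}", "background")}

# Variable-length windows (this paper's template): the background state may
# enter ANY prefix state, and ANY suffix state may return to background, so
# a context of length 1, 2, or 3 is modelled by entering or leaving the
# chain part-way through.
variable = ({("background", f"prefix{i}") for i in range(n_prefix)}
            | {(f"suffix{i}", "background") for i in range(n_suffix)})

assert fixed <= variable  # strictly more paths become reachable
```

Under the variable-length template, a short prefix such as "of" and a longer one such as "light but traditional" can both reach the target state without padding the context window.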

