Stanford CS 224 - Lexical Chains
Topic Segmentation of Meetings Using Lexical Chains

Francisco Adarve, Helen Kwong, Mike Speriosu
CS224N, Spring 2007, Final Project

Abstract

Topic segmentation attempts to divide a document into segments, where each segment corresponds to a particular discourse topic. Lexical chains are a disambiguation tool often used for text summarization and, more recently, for topic segmentation. A lexical chain encapsulates the concept of a single word (or a group of closely related words) that occurs repeatedly across some portion of a document. While it might be uninteresting to attempt topic segmentation on news articles, which often revolve around a single topic for the duration of the article, meeting conversations typically move across several topics and are more interesting to segment by topic. Some work has been done using lexical chains to perform topic segmentation on transcribed meeting corpora (Galley et al., 2003), but that work used a very simple implementation of lexical chains in which only identical words could belong to the same chain. We present here an implementation of topic segmentation on meeting text using more advanced lexical chains that exploit synonymy and other relationships between distinct words.

Introduction

The natural language task known as topic segmentation (or discourse segmentation) involves deciding where the transitions from one topic to another occur within a document. In some genres, such as news articles, a single document often discusses only a single topic. One way to apply topic segmentation to this genre is to concatenate many articles on different subjects into one large transcript and then attempt to determine the boundaries between the articles. However, another interesting genre for topic segmentation is conversation or meeting text. In these types of text, speakers typically discuss a variety of different topics as time progresses.

There are difficulties that arise from working with transcribed spoken text.
For instance, a large fraction of spoken language is ungrammatical due to the spontaneous manner in which it is generated. Words may be unnecessarily repeated, sentences restarted or abandoned altogether, or speakers interrupted mid-utterance. However, the usefulness of natural language applications capable of processing spoken text is obvious, and we decided to attempt topic segmentation on a transcribed meeting corpus.

One method often used in text summarization, a related but distinct problem, is known as lexical chains. A lexical chain represents the repeated occurrence of a single word, or of several closely related words, over the course of some fraction of a document. Barzilay and Elhadad (1997) showed an implementation of lexical chains as applied to summarization.

Galley et al. (2003) presented a lexical chain approach to topic segmentation of the ICSI meeting corpus. In their experiment, they used a very simple conception of lexical chains in which each item in a chain is identical to all the other items. To generalize somewhat, however, they performed stemming on each token in the transcript as a preprocessing step. In this way, “rat” and “rats” would belong to the same lexical chain, but “mouse” would not.

Following our intuition that lexical chains should encapsulate an entire concept, rather than just one particular word that can express that concept, we decided to use more advanced lexical chains such as those implemented by Silber and McCoy (2002). In this version, synonymy, hypernymy, hyponymy, and other relations are used to group words into a lexical chain. We used WordNet (Miller et al.), a well-known lexical information database, to retrieve these relationships.

Since lexical chains are intended to contain concepts or ideas, we formed lexical chains only out of nouns for many of our experiments. Verbs, prepositions, and other such parts of speech tend to connect concepts together in various ways rather than introduce new concepts.
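The paper does not include code, but the relation-based chaining idea can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: a tiny hand-coded relatedness table stands in for WordNet's synonym/hypernym lookups, and the names `RELATED`, `related`, and `build_chains` are ours.

```python
# Sketch: group the nouns of a document into lexical chains, where two
# words join the same chain if they are related (identical, or listed
# together in the relatedness table). In a real system, WordNet's
# synonymy/hypernymy/hyponymy relations would supply this table.

# Hypothetical stand-in for WordNet: each frozenset is a group of
# words treated as expressing one concept.
RELATED = [
    frozenset({"rat", "mouse", "rodent"}),
    frozenset({"meeting", "discussion", "conversation"}),
]

def related(w1, w2):
    """True if two words should fall into the same lexical chain."""
    if w1 == w2:
        return True
    return any(w1 in group and w2 in group for group in RELATED)

def build_chains(nouns):
    """Greedily assign each noun to the first chain containing a related word."""
    chains = []  # each chain is a list of (position, word) pairs
    for pos, word in enumerate(nouns):
        for chain in chains:
            if any(related(word, w) for _, w in chain):
                chain.append((pos, word))
                break
        else:  # no existing chain matched: start a new one
            chains.append([(pos, word)])
    return chains

nouns = ["rat", "meeting", "mouse", "budget", "discussion", "rat"]
for chain in build_chains(nouns):
    print([w for _, w in chain])
# Groups: rat/mouse/rat, meeting/discussion, budget
```

Under Galley et al.'s identical-words-only definition, "rat" and "mouse" would land in separate chains; the relatedness table is what implements the paper's "entire concept" intuition.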
We used a part-of-speech tagger to determine which words were nouns. As we will discuss later, we also used adjectives in some experiments.

In the following sections, we first discuss the preprocessing we performed on the corpus. Next, we give a detailed explanation of the two different types of advanced lexical chains we implemented. Then we explain the methods we used to determine topic boundaries given lexical chains. Finally, we present our results according to two common evaluation measures, followed by concluding remarks.

Data Preprocessing

For this paper we used the same 25 meetings from the ICSI meeting corpus as Galley et al. (2003). The meetings are transcribed as speaker turns (i.e., one instance of one person speaking without significant pause) rather than sentences. For some of our experiments we wanted to work in terms of sentences, so a separate copy of the corpus, broken into sentences, had to be made. We used the simple heuristic that periods (.), question marks (?), and exclamation marks (!) usually serve as sentence boundaries. A common problem with this method is that abbreviations (e.g. “O.K.”) tend to contain periods, but the data was preprocessed to handle these cases (e.g. “O_K”), which helped with sentence boundary determination.

Since we used only nouns (and adjectives in some experiments) in our lexical chains, we needed to run a part-of-speech tagger on the data. We used the Stanford Log-linear Part-Of-Speech Tagger (Toutanova et al., 2003). It did not make a significant difference to the output of the tagger whether we ran it on the speaker turns or on the sentences.

To aid in successfully looking up words in WordNet, we used the first step of the Porter stemmer, the one that removes plural suffixes. We performed stemming only on nouns, as the most important suffix we wanted to remove was pluralization. When attempting a lookup in WordNet, we first queried the stemmed form of the noun.
If the stemmed form was found in WordNet, we used the WordNet information for it. If it wasn’t found, we tried querying the unstemmed form. If the unstemmed form wasn’t found as a noun, we queried it as an adjective. If it still wasn’t found, we discarded that word. In summary, WordNet serves as a check on our stemming: relying on the stemmed form alone is not satisfactory, since a single stem can correspond to several distinct words and meanings.

We thought it might help make more accurate lexical chains if we performed anaphora
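The lookup fallback order described in this section (stemmed noun, then unstemmed noun, then adjective, then discard) can be sketched as follows. This is an illustration under our own assumptions, not the authors' code: `depluralize` is a rough stand-in for the first step of the Porter stemmer, and the `LEXICON` dictionary stands in for WordNet.

```python
def depluralize(noun):
    """Rough stand-in for the plural-removal step of the Porter stemmer:
    strip a trailing 's' unless the word ends in 'ss'."""
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun

# Stub lexicon standing in for WordNet: word -> set of parts of speech
# under which the word has entries.
LEXICON = {
    "rat": {"noun"},
    "glass": {"noun"},
    "red": {"adjective"},
}

def lookup(word):
    """Apply the fallback order from the text: stemmed form as a noun,
    then unstemmed form as a noun, then as an adjective; return None
    (discard the word) if every query fails."""
    stemmed = depluralize(word)
    if "noun" in LEXICON.get(stemmed, set()):
        return (stemmed, "noun")
    if "noun" in LEXICON.get(word, set()):
        return (word, "noun")
    if "adjective" in LEXICON.get(word, set()):
        return (word, "adjective")
    return None  # not found under any fallback: discard

print(lookup("rats"))   # -> ('rat', 'noun'): stemmed form found
print(lookup("glass"))  # -> ('glass', 'noun'): 'ss' ending left intact
print(lookup("red"))    # -> ('red', 'adjective'): adjective fallback
print(lookup("zzz"))    # -> None: discarded
```

Querying the lexicon at each step, rather than trusting the stemmer, is what lets the real system catch cases where stripping an "s" produces a different word.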