Combining Lexical and Grammatical Features to Improve Readability Measures

Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts
Michael J. Heilman, Kevyn Collins-Thompson, Jamie Callan, Maxine Eskenazi
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
4502 Newell Simon Hall, Pittsburgh, PA 15213-8213
{mheilman,kct,callan,max}@cs.cmu.edu
Abstract
This work evaluates a system that uses interpolated predictions of reading difficulty that are based on both vocabulary and grammatical features. The combined approach is compared to individual grammar- and language modeling-based approaches. While the vocabulary-based language modeling approach outperformed the grammar-based approach, grammar-based predictions can be combined using confidence scores with the vocabulary-based predictions to produce more accurate predictions of reading difficulty for both first and second language texts. The results also indicate that grammatical features may play a more important role in second language readability than in first language readability.
1 Introduction
The REAP tutoring system (Heilman et al., 2006) aims to provide authentic reading materials of the appropriate difficulty level, in terms of both vocabulary and grammar, for English as a Second Language students. An automatic measure of readability that incorporated both lexical and grammatical features was thus needed.
For first language (L1) learners (i.e., children learning their native tongue), reading level has been predicted using a variety of techniques, based on models of a student’s lexicon, grammatical surface features such as sentence length (Flesch, 1948), or combinations of such features (Schwarm and Ostendorf, 2005). Collins-Thompson and Callan (2004) showed that a vocabulary-based language modeling approach was effective at predicting the grade-level readability (grades 1 to 12) of Web documents of varying length, even with high levels of noise.
Prior work on first language readability by Schwarm and Ostendorf (2005) incorporated grammatical surface features such as parse tree depth and average number of verb phrases. This work combining grammatical and lexical features was promising, but it was not clear to what extent the grammatical features improved predictions. Also, discussions with second language (L2) instructors suggest that a more detailed grammatical analysis of texts that examines features such as passive voice and various verb tenses can provide better features with which to predict reading difficulty. One goal of this work is to show that the use of pedagogically motivated grammatical features (e.g., passive voice, rather than the number of words per sentence) can improve readability measures based on lexical features alone.
One of the differences between L1 and L2 readability is the timeline and processes by which first and second languages are acquired. First language acquisition begins at infancy, and the primary grammatical structures of the target language are acquired by age four in typically developing children (Bates, 2003). That is, most grammar is acquired prior to the beginning of a child’s formal education. Therefore, most grammatical features seen at high reading levels such as high school are present with similar frequencies at low reading levels such as grades 1-3 that correspond to elementary school-age children. It should be noted that sentence length is one grammar-related difference that can be observed as L1 reading level increases. Sentences are kept short in texts for low L1 reading levels in order to reduce the cognitive load on child readers. The average sentence length of texts increases with the age and reading level of the intended audience. This phenomenon has been utilized in early readability measures (Flesch, 1948).
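To make the sentence-length signal concrete, the following is a minimal Python sketch. It computes average sentence length with a naive punctuation-based splitter (an assumption for illustration, not the tokenization used in any cited work) and plugs raw counts into the Flesch Reading Ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The function names are hypothetical, and syllable counts are supplied by the caller rather than estimated.

```python
import re


def split_sentences(text):
    """Naively split on runs of '.', '!' or '?' (illustrative assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def average_sentence_length(text):
    """Return the mean number of whitespace-separated words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    word_count = sum(len(s.split()) for s in sentences)
    return word_count / len(sentences)


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


sample = "The cat sat. The dog ran away quickly."
print(average_sentence_length(sample))
print(flesch_reading_ease(words=8, sentences=2, syllables=10))
```

Longer sentences raise the words-per-sentence term and lower the score, which is the effect described above for texts aimed at older readers.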
Vocabulary change, however, continues even into adulthood, and has been shown to be a more effective predictor of L1 readability than simpler measures such as sentence length (Collins-Thompson and Callan, 2005).
Second language learners, unlike their L1 counterparts, are still very much in the process of acquiring the grammar of their target language. In fact, even intermediate and advanced students of second languages, who correspond to higher L2 reading levels, often struggle with the grammatical structures of their target language. This phenomenon suggests that grammatical features may play a more important role in predicting and measuring L2 readability. That is not to say, however, that vocabulary cannot be used to predict L2 reading levels. Second language learners are learning both vocabulary and grammar concurrently, and reading materials for this population are chosen or authored according to both lexical and grammatical complexity. Therefore, the authors predict that a readability measure for texts intended for second language learners that incorporates both grammatical and lexical features could clearly outperform a measure based on only one of these two types of features.
This paper begins with descriptions of the language modeling and grammar-based prediction systems. A description of the experiments follows that covers both the evaluation metrics and corpora used. Experimental results are presented, followed by a discussion of these results, and a summary of the conclusions of this work.
2 Language Model Readability Prediction for First Language Texts
Statistical language modeling exploits patterns of use in language. To build a statistical model of text, training examples are used to collect statistics such as word frequency and order. Each training example has a label that tells the model the ‘true’ category of the example. In this approach, one statistical model is built for each grade level to be predicted.
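The idea of one model per grade level can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the paper: it assumes unigram (word frequency only) models, whitespace tokenization, and add-one smoothing, none of which are specified in the text above. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter


def train_grade_models(labeled_texts):
    """Build one unigram count model per grade from (grade, text) pairs."""
    models = {}
    for grade, text in labeled_texts:
        models.setdefault(grade, Counter()).update(text.lower().split())
    return models


def log_likelihood(model, vocab_size, tokens):
    """Log-probability of tokens under one grade model (add-one smoothing, assumed)."""
    total = sum(model.values())
    return sum(math.log((model[t] + 1) / (total + vocab_size)) for t in tokens)


def predict_grade(models, text):
    """Return the grade whose model assigns the text the highest likelihood."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    tokens = text.lower().split()
    return max(models, key=lambda g: log_likelihood(models[g], vocab_size, tokens))


# Toy labeled examples (hypothetical): the label is the 'true' grade level.
training = [
    (1, "the cat sat on the mat"),
    (1, "the dog ran"),
    (8, "the committee evaluated the proposal"),
    (8, "economic policy influenced the election"),
]
models = train_grade_models(training)
print(predict_grade(models, "the cat ran"))
print(predict_grade(models, "the committee evaluated policy"))
```

Scoring a passage against every grade model and taking the best match treats readability prediction as text classification, with the grade label playing the role of the category.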
The statistical language modeling approach has several advantages over traditional readability formulas, which are usually based on linear regression with two or three variables. First, a language modeling approach generally gives much better accuracy for Web documents and short passages (Collins-Thompson and Callan, 2004). Second, language modeling provides a probability

