Penn CIS 700 - Comprehending Technical Texts


Comprehending Technical Texts: Predicting and Defining Unfamiliar Terms

Noemie Elhadad, Ph.D.
Department of Computer Science, City College of New York, New York, NY

We investigate how to improve access to medical literature for health consumers. Our focus is on medical terminology. We present a method to predict automatically in a given text which medical terms are unlikely to be understood by a lay reader. Our method, which is linguistically motivated and fully unsupervised, relies on how common a specific term is in texts that we already know are familiar to a lay reader. Once a term is identified as unfamiliar, an appropriate definition is mined from the Web to be provided to the reader. Our experiments show that the prediction and the addition of definitions significantly improve lay readers’ comprehension of sentences containing technical medical terms.

INTRODUCTION

The field of health literacy has gained much attention recently. Studies show that most documents targeted at health consumers are ill-fitted to the intended audience and its level of health literacy [1, 2, 3]. While there are many components involved in health literacy that are specific to the reader (e.g., reading level and cultural background), we investigate what can be done from the standpoint of the text to make sure it is adapted to the literacy level of a given reader.

Determining how difficult a text is for a reader has been a subject of research for several decades. However, most metrics have been developed to characterize texts written in general English, and, moreover, their efficacy has been questioned over the years. A typical readability metric counts the number of syllables in a word to assess its complexity.
This strategy is not well-suited to the medical domain. Previous work showed that the Dale-Chall familiarity score [4], for instance, is not a reliable indicator of term familiarity [5].

In this paper we investigate two questions: given a medical text and a reader at a given reading level, is it possible to predict automatically which terms in the text are unlikely to be familiar to the reader? Furthermore, if such complex terms are present in the text, is it possible to improve the reader’s comprehension by augmenting the text with extra information?

We propose a method that is corpus-based and fully unsupervised to determine whether a term is familiar for a reader population. It follows the psycholinguistic finding that the more common a term is in a body of texts known to a reader, the more familiar the term is likely to be to the reader. On the question of improving reader comprehension, we propose a simple method to provide appropriate definitions, as mined from the World Wide Web, for the terms predicted to be unfamiliar.

METHODS

The sentences we aim to adapt for lay readers appear in technical medical texts, such as clinical studies published in scientific journals. They are typically not understood by lay readers. Figure 1 shows an example of a technical sentence.

New acute myocardial infarction or death was predicted by left ventricular ejection fraction of 30% (OR 2.00, 95% CI 1.20 to 3.40; P = .008), prior angina (OR 2.70, 95% CI 1.34 to 5.57; P = .001), and age > 65 years (OR 1.64, 95% CI 1.00 to 2.70; P = .01).

Figure 1. A Technical Sentence.

We first describe our experimental setup: the resources we investigated to predict familiarity, and the ones used for term definitions. Next we explain how we obtained a gold standard to evaluate our methods.
We then turn to our techniques (1) to predict whether a term is familiar, and (2) to define unfamiliar terms in a given text.

Experimental Setup

Resources for familiarity prediction: Our method for familiarity prediction relies on examples of texts that are typically understandable to a lay reader. We investigated two types of corpora: an in-domain and an out-of-domain corpus.

Our lay corpus contains news stories summarizing clinical studies from the Reuters Health E-line newsfeed (http://www.reutershealth.com). Reuters journalists take technical publications and report the main findings, methods and sometimes interviews with the authors of the publication. There are two important characteristics of this corpus: (1) the stories are written for a lay audience at a 12th-grade reading level, and (2) every story in our corpus contains a reference to the original scientific publication. Thus, it is possible to gather the original texts, which convey the same information but were written for a more technical audience. The stories draw upon studies from reputable medical journals, such as Annals of Internal Medicine, New England Journal of Medicine and Lancet. Overall, we collected 9,775 such stories. Table 1 shows statistics about the corpus, which we call ReutersHealth. The ReutersHealth corpus is used in our method as an example of texts that are understandable to a college-educated lay reader.

Table 1. ReutersHealth Corpus Statistics.
Nb. of texts        9,775
Nb. of sentences    160,208
Nb. of words        4,373,104

To investigate whether the use of out-of-domain texts is helpful in gauging term familiarity, we relied on the Brown corpus [6], which is a one-million word gold-standard corpus of English, containing texts from different genres and domains.

Our method also investigates features other than how common a term is in a corpus, namely manual term familiarity indexing and term polysemy as a measure of familiarity. In this context, we looked at the information provided by the MRC Psycholinguistic Database [7].
This database contains 150,937 words of general English with up to 26 linguistic and psycholinguistic attributes for each. We looked, as well, at the electronic dictionary WordNet [8].

Resources for term definitions: We collected several glossaries of medical terms, but we did not find any that had sufficient coverage to provide definitions for most terms in our training set of unfamiliar terms. Instead, we rely on the Web as a resource for definitions, and use the Google “define:” functionality to retrieve them. Using Google is advantageous because the work of mining definitions from multiple glossaries and web pages is already done for us.

Gold standard for familiarity prediction: To evaluate our familiarity prediction algorithm, we collected a gold standard of 100 medical terms, as identified by UMLS, and for each term, a binary label designating it as understandable (i.e., familiar) to a college-level reader or not.

The 100 terms were randomly selected from our corpus of technical clinical studies, and ranged from “common”
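
The corpus-based idea described above (a term is more likely to be familiar when it is common in texts written for lay readers) can be illustrated with a small Python sketch. This is not the paper's method, whose details are not given in this excerpt. The tokenizer, the function names, the `min_count` threshold and the two-sentence toy corpus are all assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_frequencies(lay_texts):
    """Count how often each word occurs across a corpus of lay texts."""
    counts = Counter()
    for text in lay_texts:
        counts.update(tokenize(text))
    return counts


def is_familiar(term, counts, min_count=2):
    """Treat a term as familiar if every word in it is common enough.

    min_count is a hypothetical threshold, not a value from the paper.
    """
    words = tokenize(term)
    return bool(words) and all(counts[w] >= min_count for w in words)


# Toy stand-in for a lay corpus such as ReutersHealth (invented sentences).
lay_corpus = [
    "Older patients had a higher risk of heart attack.",
    "The risk of a heart attack rose with age, the study found.",
]
counts = build_frequencies(lay_corpus)

print(is_familiar("heart attack", counts))           # both words occur twice
print(is_familiar("myocardial infarction", counts))  # neither word occurs
```

With a real lay corpus, raw counts would probably need to be normalized by corpus size, and the threshold tuned against a labeled set of terms such as the gold standard described above.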