Generating Gene Summaries from Biomedical Literature: A Study of Semi-Structured Summarization

Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, Bruce Schatz

Department of Computer Science
Institute for Genomic Biology
University of Illinois at Urbana-Champaign, IL 61801
E-mail: {xuling,jiang4,xinhe2,qmei2,czhai,schatz}@uiuc.edu

Abstract

Most knowledge accumulated through scientific discoveries in genomics and related biomedical disciplines is buried in the vast amount of biomedical literature. Since understanding gene regulations is fundamental to biomedical research, summarizing all the existing knowledge about a gene based on literature is highly desirable to help biologists digest the literature. In this paper, we present a study of methods for automatically generating gene summaries from biomedical literature. Unlike most existing work on automatic text summarization, in which the generated summary is often a list of extracted sentences, we propose to generate a semi-structured summary which consists of sentences covering specific semantic aspects of a gene. Such a semi-structured summary is more appropriate for describing genes and poses special challenges for automatic text summarization. We propose a two-stage approach to generate such a summary for a given gene: first retrieving articles about a gene and then extracting sentences for each specified semantic aspect. We address the issue of gene name variation in the first stage and propose several different methods for sentence extraction in the second stage. We evaluate the proposed methods using a test set with 20 genes. Experimental results show that the proposed methods can generate useful semi-structured gene summaries automatically from biomedical literature, and our proposed methods outperform general purpose summarization methods.
Among all the proposed methods for sentence extraction, a probabilistic language modeling approach that models gene context performs the best.

Key words: Summarization, Genomics, Probabilistic language model

Preprint submitted to Elsevier Science 13 December 2006

1 Introduction

Biomedical literature has been playing a central role in the research activities of all biologists. The growing amount of scientific discoveries in genomics and related biomedical disciplines has led to a corresponding growth in the amount of literature information. Because of its daunting size and complexity, there have been increasing efforts devoted to integrating this huge resource so that biologists can digest it quickly.

Understanding gene functions is fundamental to biomedical research, and one fundamental task that biomedical researchers often have to perform is to find and summarize all the knowledge about a particular gene from the literature, a problem that we call gene summarization.

Because of the importance of genes, there has been much manual effort on constructing an informative summary of a gene based on literature information. For example, FlyBase¹ (R. A. Drysdale and Consortium, 2005), one of the model organism genome databases, provides a text summary for each Drosophila gene, including DNA sequence, functional description, mutant information, etc. Compressing and arranging all the knowledge from a huge amount of literature into different aspects enables biologists to quickly understand the target gene.

However, such gene summaries are currently generated by manually extracting information from literature, which is extremely labor-intensive and cannot keep up with the rapid growth of the literature information.
With the growing amount of scientific discoveries in genomics and related biomedical disciplines, automatic summarization of gene descriptions in multiple aspects from biomedical literature has become an urgent task.

One characteristic of an informative gene summary is that the summary should ideally consist of sentences that cover several important semantic aspects such as sequence information, mutant phenotype, and gene product. That is, the summary is semi-structured. For example, Figure 1 shows a sample gene summary in FlyBase retrieved in 2005. Here we see that the summary consists of sentences covering the following aspects of a gene: (1) Gene products (GP); (2) Expression location (EL); (3) Sequence information (SI); (4) Wild-type function and phenotypic information (WFPI); (5) Mutant phenotype (MP); and (6) Genetical interaction (GI), as annotated. We thus propose to frame the gene summarization problem as automatically generating a semi-structured summary consisting of sentences covering these six aspects of a gene. Such a summary is not only very useful in itself, but can also serve as a useful entry point to the literature by linking each aspect to the supporting evidence in the literature.

¹ http://flybase.bio.indiana.edu/

Fig. 1. Example Gene Summary In FlyBase.

Most existing work on automatic text summarization has focused on news summarization, and the generated summary is generally unstructured, consisting of a list of sentences. The existing summarization methods are thus inadequate for generating a semi-structured summary. In this paper, we present a study of methods for automatically generating semi-structured gene summaries from biomedical literature. Although our studies mainly focus on the biomedical literature domain, the approaches we propose are generally applicable to semi-structured summarization in other applications, such as product reviews.
Under the assumption that we have some training sentences for each aspect, generalizing our methods to other applications is straightforward.

We propose a two-stage approach to generate such a summary for a given gene, in which we first retrieve articles about a gene and then extract sentences for each of six specified semantic aspects. While the first stage can be implemented using any standard information retrieval technique, a standard IR technique generally cannot handle gene name variations well. We address this issue by adding some heuristic methods on top of regular keyword matching. For the second stage, we leverage some existing training resources and propose several different methods to learn from the training data and extract sentences in each semantic aspect.

We evaluate the proposed methods using a test set with 20 randomly selected genes. Experimental results show that the proposed methods are potentially useful in automatically generating informative semi-structured gene summaries from biomedical literature and outperform general
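
The two-stage approach described in the introduction can be illustrated with a minimal Python sketch. It is not the authors' implementation: the name-normalization heuristic, the word-overlap scoring, the function names, the gene name "abc1", and all example sentences are assumptions made purely for illustration. This excerpt does not specify the actual heuristics or extraction methods, and the abstract reports that a probabilistic language model performed best.

```python
import re
from collections import Counter


def normalize(name):
    """Hypothetical heuristic: lowercase and drop spaces, hyphens, punctuation."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def mentions_gene(text, synonyms):
    """Stage 1 helper: keyword match that tolerates simple name variations."""
    targets = {normalize(s) for s in synonyms}
    tokens = [normalize(t) for t in re.findall(r"[A-Za-z0-9-]+", text)]
    candidates = set(tokens)
    # Joining adjacent tokens lets "abc 1" match the synonym "abc1".
    candidates.update(a + b for a, b in zip(tokens, tokens[1:]))
    return bool(targets & candidates)


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(synonyms, documents, training):
    """Return one extracted sentence per aspect (assumed scoring: word overlap)."""
    # Stage 1: retrieve the documents that mention the gene.
    relevant = [d for d in documents if mentions_gene(d, synonyms)]
    sentences = [s.strip() for d in relevant
                 for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]

    # Stage 2: for each aspect, pick the sentence closest to its training data.
    summary = {}
    for aspect, examples in training.items():
        vocab = Counter(w for ex in examples for w in words(ex))

        def score(sentence):
            ws = words(sentence)
            return sum(vocab[w] for w in ws) / len(ws) if ws else 0.0

        best = max(sentences, key=score, default=None)
        if best is not None and score(best) > 0:
            summary[aspect] = best
    return summary


# Invented toy data; only two of the six aspects are shown.
documents = [
    "The ABC-1 gene encodes a kinase. abc 1 is expressed in the wing disc.",
    "An unrelated gene is expressed in the eye.",
]
training = {
    "GP": ["The gene encodes a transcription factor."],
    "EL": ["Transcripts are expressed in the embryo."],
}
summary = summarize(["abc1"], documents, training)
for aspect, sentence in summary.items():
    print(aspect, "->", sentence)
```

The result is a mapping from aspect label to extracted sentence, which mirrors the semi-structured form of the summary. Any scoring function that can be trained from per-aspect example sentences could replace the word-overlap score without changing the overall structure.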

