C-DEM: A Multi-Modal Query System for Drosophila Embryo Databases

Fan Guo, Lei Li, Christos Faloutsos, Eric P. Xing
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213, United States
{fanguo, leili, christos, epxing}@cs.cmu.edu

ABSTRACT

The amount of publicly available biological data has grown exponentially as technology advances. Online databases now play an important role, both as information repositories and as easily accessible platforms through which researchers communicate and contribute. Recent research projects in image bioinformatics have produced a number of image databases. The images visualize the spatial expression pattern of a gene (e.g., "fj"), and most of them also carry one or several annotation keywords (e.g., "embryonic hindgut").

C-DEM is an online system for Drosophila (= fruit-fly) Embryo images Mining. It supports queries from all three modalities to all three, namely, (a) genes, (b) images of gene expression, and (c) annotation keywords of the images. Thus, it can find images that are similar to a given image, and/or related to the desired annotation keywords, and/or related to specific genes. Typical queries are "what are the most suitable keywords to assign to image insitu28465.jpg?" or "find images that are related to gene 'fj' and to the keyword 'embryonic hindgut'". C-DEM uses state-of-the-art feature extraction methods for images (wavelets and principal component analysis). It envisions the whole database as a tri-partite graph (one type of node for each modality), and it uses fast and flexible proximity measures, namely, random walk with restarts (RWR).

In addition to flexible querying, C-DEM allows for navigation: the user can click on the results of an earlier query (image thumbnails and/or keywords and/or genes), and the system will report the most related images (and keywords, and genes). The demo is on a real Drosophila Embryo database, with 10,204 images, 2,969 distinct genes, and 113 annotation keywords.
The query response time is below one second on a commodity desktop.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

1. MOTIVATION AND SIGNIFICANCE

The goal of the C-DEM system is to provide a flexible way of querying and navigating large databases of biological images, and specifically annotated Drosophila (= fruit-fly) embryo images. The problem has several challenges: (a) the ideal system has to be multi-modal, because we want multiple data types (images, text annotations); (b) it has to be flexible, since domain experts do not always have a specific query to ask; instead, they try to develop and test hypotheses, where the ultimate, 50-year goal is to find a regulation map of thousands of genes in a Drosophila genome, and study how it evolves during the development of embryos in the first few hours or longer; (c) the system has to be fast, to handle the growing volume of experimental data; (d) the system has to be user-friendly, ideally requiring a few mouse-clicks per query, as opposed to an elaborate query language.

The significance and impact of such a system are growing. High-throughput technologies have become quite popular in bioinformatics research, providing a low-cost solution to data generation. This data explosion makes biological databases more and more important in scientific discovery and communication.
The number of molecular biology databases grew from fewer than one hundred in 1998 to almost one thousand by the end of 2007 [10], and the size of GenBank [7], one of the most accessed databases of biological sequences, doubles approximately every 18 months [3]. The accumulation of experimental evidence will make databases ever more powerful tools for scientific discovery, especially with the help of statistical methods developed in data mining and machine learning.

Our system is based on one of the most accessed datasets of biological images. It is released by the Berkeley Drosophila Genome Project (BDGP), and consists of more than 70,000 digital photographs documenting the expression patterns of more than 3,000 genes during the development of Drosophila embryos, annotated with a standardized set of terms for developmental anatomy [1]. Users can query and browse the database by gene name or its synonyms (e.g., CG10917 or "fj"), which displays a summary page of the corresponding image thumbnails and annotation keywords, as well as other relevant information on gene expression. However, more complicated queries like "what are the most suitable keywords to assign to image insitu28465.jpg?" or "find images that are related to gene 'fj' and to the keyword 'embryonic hindgut'" cannot be answered by BDGP or, to our knowledge, by any other existing query system for biological databases of images and texts.

In this paper, we propose to demonstrate C-DEM, the CMU system for Drosophila Embryo Mining, which meets the challenges of fast, flexible and user-friendly querying of biological databases like BDGP. The software architecture of C-DEM decouples the front-end web server from the back-end computing engine through a clear and stable API. The two are deployed on separate machines, which distributes the workload for better performance and also makes future updates of either component of the system easier.
Implementation of the system also includes offline data pre-processing, image feature extraction, and construction of a graph representation of the dataset. The back-end computing engine loads the graph representation and estimates the proximity from a query to a target of any type using random walk with restarts (RWR), for which fast algorithms already exist [17]. Algorithmic details of the system are presented in [12].

[Figure 1 diagram: browser-based UI <-> (queries / result pages, HTTP) <-> Tomcat web server with JSP application <-> (remote function calls / results, RMI) <-> computing engine]
Figure 1: The software architecture of the C-DEM system. It consists of three tiers: the browser-based UI, the Tomcat web server and the computing engine. They communicate via HTTP and RMI protocols.
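
To make the RWR proximity idea concrete, the following is a minimal Python sketch on a toy tri-partite graph. It is an illustration under stated assumptions, not the C-DEM implementation: it uses plain power iteration rather than the fast algorithms of [17], the edges and the "toy_*" node names are invented for the example, the restart probability of 0.15 is an arbitrary choice, and the function names (build_graph, random_walk_with_restart, rank) are hypothetical. Only gene-image and image-keyword edges are modeled; the actual graph construction is described in [12].

```python
from collections import defaultdict


def build_graph(edges):
    """Return an undirected adjacency map {node: {neighbor: weight}}."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return dict(adj)


def random_walk_with_restart(adj, query_nodes, restart_prob=0.15,
                             tol=1e-10, max_iter=1000):
    """Power iteration for r = c * e + (1 - c) * W r.

    W is the column-normalized adjacency matrix, e is uniform over the
    query nodes, and c is the restart probability (an assumed value).
    """
    nodes = list(adj)
    restart = {n: 0.0 for n in nodes}
    for q in query_nodes:
        restart[q] += 1.0 / len(query_nodes)
    total_weight = {n: sum(adj[n].values()) for n in nodes}

    scores = dict(restart)
    for _ in range(max_iter):
        nxt = {n: restart_prob * restart[n] for n in nodes}
        for u in nodes:
            # Spread the walker's mass at u over its neighbors by edge weight.
            share = (1.0 - restart_prob) * scores[u] / total_weight[u]
            for v, w in adj[u].items():
                nxt[v] += share * w
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores


def rank(scores, prefix):
    """Nodes whose name starts with `prefix`, most proximate first."""
    hits = [(n, s) for n, s in scores.items() if n.startswith(prefix)]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy data: node names reuse examples from the text, but every edge and
# every "toy_*" node is made up for illustration.
EDGES = [
    ("gene:fj", "img:insitu28465.jpg", 1.0),
    ("gene:fj", "img:toy_a.jpg", 1.0),
    ("gene:toy_gene", "img:toy_b.jpg", 1.0),
    ("img:insitu28465.jpg", "kw:embryonic hindgut", 1.0),
    ("img:toy_a.jpg", "kw:embryonic hindgut", 1.0),
    ("img:toy_a.jpg", "kw:toy keyword", 1.0),
    ("img:toy_b.jpg", "kw:toy keyword", 1.0),
]
GRAPH = build_graph(EDGES)

# Query 1: which keywords best fit a given image?
image_scores = random_walk_with_restart(GRAPH, ["img:insitu28465.jpg"])
print(rank(image_scores, "kw:"))

# Query 2: which images relate to both a gene and a keyword?
mixed_scores = random_walk_with_restart(
    GRAPH, ["gene:fj", "kw:embryonic hindgut"])
print(rank(mixed_scores, "img:"))
```

Because the query is just a set of start nodes and the target is just a node type to filter on, the same routine covers every modality-to-modality query the text describes. On this toy graph, the keyword attached directly to the query image ranks above the one that is several hops away.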

