Morpheus 2.0: A Data Transformation Management System

Pete Dobbins1, Tiffany Dohzen2, Christan Grant1, Joachim Hammer1, Malachi Jones1, Dev Oliver1, Mujde Pamuk2, Jungmin Shin1, Mike Stonebraker2

1 CISE Department, University of Florida, PO Box 116120, Gainesville, FL 32611-6120
Phone: (352) 262 - 7383, country code: 001
{pjd, cgrant, jhammer, mjones, doliver, jshin}@cise.ufl.edu

2 CSAIL, The Stata Center, Massachusetts Institute of Technology, Cambridge, MA 02139
Phone: (603) 714 - 4451, country code: 001
{dohzen, mujde}@mit.edu, [email protected]

ABSTRACT
The authors have previously built the Morpheus data transformation system. Based on feedback from demonstrating the system at SIGMOD 2006, as well as to numerous CIOs and researchers at IBM and Microsoft, we have completely redesigned the system to support a state-based browsing paradigm, the ability to filter transforms based on lineage and input-output properties, and best-fit search for composite transforms. Also included is a novel crawler that searches for transforms of interest either within an enterprise or across the web. The result is Morpheus 2.0, which is now operational and is described in this paper.

1. INTRODUCTION
Information integration has been listed on all four self-assessments of the DBMS community [1-4] as an “Achilles heel” of computing. Large enterprises have hundreds of operational systems, which are usually constructed by independent groups at different times, and a desire to share information between these systems.
This requires integrating a large collection of independently written database schemas, a task that most enterprises find enormously challenging. The industry thrust toward web services and the internet will widen the scope of this information integration problem from inside a single enterprise (intra-enterprise) to among enterprises (inter-enterprise), making information integration that much more daunting.

At the same time, the need for information integration is not limited to industry. The internet is also becoming the preferred method for disseminating scientific data from a variety of disciplines and domains such as astronomy, biology, the geosciences, public health, and health care. The number of independently developed schemas and databases is very large, and scientists have been struggling to keep up with this wealth of information. For example, in bioinformatics, the need to access and integrate data from the many, typically large, genomics repositories, which use a large number of data models, languages, and formats, is hampering the discovery of genes and their functions. Given the complexity of the genomics data integration problem, systems such as GUS (Genomics Unified Schema) at the University of Pennsylvania [5], which provides an integrated warehouse for portions of GenBank, EMBL, DDBJ, Swiss-Prot, and dbEST, remain the exception.

This schema integration problem exists when the goal is to integrate information by extracting it from operational systems, transforming it in some sort of middleware ETL system, and then loading it into a data warehouse. It exists equally when the goal is to share live information between operational systems through some sort of federated database system, such as the IBM Information Integrator [6] or BEA’s WebLogic Integration Suite [7]. There are three possible approaches to schema integration, which we explore in the following three subsections.
1.1 Schema Matching
Some researchers (such as [8-10]) have focused on the schema matching problem. For example, a positive result from such efforts would be to discover that wages in one human resources schema matches salary in a second schema. Although such research is well-intentioned, we believe that it solves only a small portion of the overall data integration problem. Specifically, independently constructed schemas never have identical data elements. For example, wages in the first schema might represent the salary of a French worker, which would be expressed in Euros, net after taxes, and include a lunch allowance. In contrast, salary might represent the compensation of a U.S. worker, which would be expressed in U.S. dollars, gross before taxes, and not include a lunch allowance. Hence, syntactically matching the numbers in the two fields, wages and salary, will produce garbage, since they represent different semantic objects, respectively French compensation and U.S. compensation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Database Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM. VLDB ’07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.

There are several reasons why identical elements do not exist. First, it is rarely, if ever, the case that two schemas use the same representation for the same semantic construct. There are many representations for calendar dates. All of the following are reasonable representations for one of our birthdays:
• October 11, 1943
• 10/11/43
• 11/10/43
• Oct. 11 1943
• 11-10-43
Obviously, one cannot merely match two attributes, even if they have the same or similar names, because they have different representations. Doing so produces a composite column with “jumble” in it. Instead, one must define a transform that maps the individual data elements in one representation to the representation used by the second schema. Unless an organization has company-wide standards for the representation of common data elements, it will face this issue.

A second more difficult issue occurs when the two attributes do not semantically mean
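
As an illustration of the calendar-date transform described in Section 1.1, the following minimal Python sketch maps each of the five birthday representations listed there to a single target representation. It is not part of Morpheus: the format labels in `SOURCE_FORMATS`, the function name `to_iso`, the choice of ISO 8601 as the target, and the `century` parameter for two-digit years are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical labels for the five representations listed in Section 1.1.
# A string such as "10/11/43" is ambiguous on its own, so each source schema
# must declare which representation it uses.
SOURCE_FORMATS = {
    "long_us": "%B %d, %Y",     # October 11, 1943
    "slash_mdy": "%m/%d/%y",    # 10/11/43
    "slash_dmy": "%d/%m/%y",    # 11/10/43
    "abbrev_us": "%b. %d %Y",   # Oct. 11 1943
    "dash_dmy": "%d-%m-%y",     # 11-10-43
}


def to_iso(value: str, source_format: str, century: int = 1900) -> str:
    """Transform one date string into ISO 8601 (YYYY-MM-DD).

    `century` is an assumption supplied by the caller: two-digit years
    carry no century information, so the transform cannot infer it.
    """
    fmt = SOURCE_FORMATS[source_format]
    parsed = datetime.strptime(value, fmt).date()
    if "%y" in fmt:
        # strptime guesses a century for two-digit years; replace that guess
        # with the century declared for this source.
        parsed = parsed.replace(year=century + parsed.year % 100)
    return parsed.isoformat()


if __name__ == "__main__":
    samples = [
        ("October 11, 1943", "long_us"),
        ("10/11/43", "slash_mdy"),
        ("11/10/43", "slash_dmy"),
        ("Oct. 11 1943", "abbrev_us"),
        ("11-10-43", "dash_dmy"),
    ]
    for text, label in samples:
        print(f"{text!r} ({label}) -> {to_iso(text, label)}")
```

With the correct label, all five inputs yield `1943-10-11`. Reading `10/11/43` under the day-first label instead yields `1943-11-10`, which is the kind of “jumble” that results from matching attributes by name alone.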