Interactive Generation of Integrated Schemas Laura Chiticariu Phokion G Kolaitis UC Santa Cruz laura cs ucsc edu Lucian Popa IBM Almaden IBM Almaden kolaitis almaden ibm com lucian almaden ibm com ABSTRACT General Terms Schema integration is the problem of creating a unified target schema based on a set of existing source schemas that relate to each other via specified correspondences The unified schema gives a standard representation of the data thus offering a way to deal with the heterogeneity in the sources In this paper we develop a method and a design tool that provide 1 adaptive enumeration of multiple interesting integrated schemas and 2 easy to use capabilities for refining the enumerated schemas via user interaction Our method is a departure from previous approaches to schema integration which do not offer a systematic exploration of the possible integrated schemas The method operates at a logical level where we recast each source schema into a graph of concepts with Has A relationships We then identify matching concepts in different graphs by taking into account the correspondences between their attributes For every pair of matching concepts we have two choices merge them into one integrated concept or keep them as separate concepts We develop an algorithm that can systematically output without duplication all possible integrated schemas resulting from the previous choices For each integrated schema the algorithm also generates a mapping from the source schemas to the integrated schema that has precise information preserving properties Furthermore we avoid a full enumeration by allowing users to specify constraints on the merging process based on the schemas produced so far These constraints are then incorporated in the enumeration of the subsequent schemas The result is an adaptive and interactive enumeration method that significantly reduces the space of alternative schemas and facilitates the selection of the final integrated schema Algorithms Design Keywords Schema integration model management schema mapping data integration interactive generation concept graph 1 INTRODUCTION Schema integration is the problem of creating a unified target schema from a set of existing source schemas that relate to each other possibly via correspondences between their elements or via some other forms of schema mappings such as constraints or views By providing a standard representation of the data the integrated target schema can be viewed as a means for dealing with heterogeneous data sources The schema integration problem is encountered in data integration and in several other related contexts In data integration 12 the unified schema yields a single access point against which queries are posed to access a set of heterogeneous sources Other applications include consolidating data sources of merged organizations into one database or warehouse and integrating related application silos to create aggregated intelligence In general schema integration is a form of metadata chaos reduction quite often many overlapping schemas variations or evolutions of each other exist even in the same computer and need to be consolidated into one Schema integration is recognized as one of the building blocks for metadata applications in the model management framework of Bernstein 3 Schema integration is a long standing research problem 2 6 14 18 21 and continues to be a challenge in practice All the approaches that we know require a substantial amount of human feedback during the integration process Furthermore the outcome of these approaches is only one integrated schema In general however there can be multiple possible schemas that integrate the data in different ways and each may be valuable in a given scenario In some of the previous approaches some of these choices appear implicitly as part of the design process while interacting with the user However there is no principled approach towards the enumeration of these choices In this paper we develop a method and a design tool that provide 1 adaptive enumeration of multiple interesting integrated schemas and 2 easy to use capabilities for refining the enumerated schemas via user interaction Furthermore the method operates at a logical conceptual level that abstracts away the physical details of relational or XML schemas and makes it easy to express user requirements Overview of our approach We assume that we are given a set of two or more source schemas describing data in a common domain together with a set of correspondences that relate pairs of Categories and Subject Descriptors H 2 1 Logical Design Schema and subschema H 2 5 Heterogeneous Databases Data translation Supported in part by NSF CAREER Award IIS 0347065 and NSF grant IIS 0430994 Work partially done while at IBM Almaden On leave from UC Santa Cruz Work partially funded by U S Air Force Office for Scientific Research under contract FA9550 07 1 0223 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page To copy otherwise to republish to post on servers or to redistribute to lists requires prior specific permission and or a fee SIGMOD 08 June 9 12 2008 Vancouver BC Canada Copyright 2008 ACM 978 1 60558 102 6 08 06 5 00 833 Schema Correspondences User schema S1 Integrated schema I1 1 3 2 S1 concepts mapping from the source schemas to the integrated schema that has precise information preserving properties The mapping generation component is in the spirit of the more general algorithms of mapping systems such as Clio 17 which construct mappings between independently designed source and target schemas However our mapping algorithm is more direct and has no ambiguity by taking full advantage of how the integrated schema is generated from the source schemas Conceptual enumeration of the candidate schemas We develop an enumeration algorithm that can systematically generate all possible integrated schemas and the associated mappings by considering all possible choices of which concepts to merge An essential feature of the enumeration algorithm is that it avoids exploring different configurations that yield the same schema This duplicationfree algorithm makes use as a subroutine of a polynomial delay algorithm by Creignou and He brard 8 for the enumeration of all Boolean
View Full Document