Bootstrapping Pay-As-You-Go Data Integration Systems

Anish Das Sarma (Stanford University, California, USA; anish@cs.stanford.edu)
Xin Dong (AT&T Labs-Research, New Jersey, USA; lunadong@research.att.com)
Alon Halevy (Google Inc., California, USA; halevy@google.com)

ABSTRACT

Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort in creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings, and these mappings are improved over time as deemed necessary.

This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.

Categories and Subject Descriptors: H. Information Systems
General Terms: Algorithms, Design, Experimentation
Keywords: data integration, pay-as-you-go, mediated schema, schema mapping

This work was supported by a Microsoft Graduate Fellowship and by NSF under grants IIS-0415175, IIS-0324431, and IIS-0414762. Work was partially done while these authors were at Google.

1. INTRODUCTION

Data integration systems offer a single-point interface to a set of data sources. A data integration application is typically built by creating a mediated schema for the domain at hand and creating semantic mappings between the schemas of the data sources and the mediated schema. The user (or application) poses queries using the terminology of the mediated schema, and the query is reformulated onto the sources using the semantic mappings.

Despite recent progress in the field, setting up and maintaining a data integration application still requires significant upfront and ongoing effort. Hence, reducing the effort required to set up a data integration application, often referred to as on-the-fly integration, has been a recurring challenge for the field. In fact, as pointed out in [12], many application contexts (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services. This observation led to the proposal of a pay-as-you-go approach to integration, where the system starts with very few (or inaccurate) semantic mappings, and these mappings are improved over time as deemed necessary.

This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system, and how well a completely automated system can perform. We evaluate our system on several domains, each consisting of 50-800 heterogeneous tables obtained from the Web. The key contribution of the paper is that we can obtain very good query precision and recall compared to the alternatives of (1) treating all the sources as text, or (2) performing full manual integration.

To completely automate data integration, we need to automatically create a mediated schema from the sources and automatically create semantic mappings between the sources and the mediated schema. Automatic creation of schema mappings has received considerable attention [5, 7, 8, 9, 13, 14, 18, 21, 23, 26, 28]. Recently, [10] introduced the notion of probabilistic schema mappings, which provides a foundation for answering queries in a data integration system with uncertainty about semi-automatically created mappings.
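To make this notion concrete, the following is a minimal sketch of how a probabilistic schema mapping could be represented and used to attach probabilities to query answers: each candidate mapping carries a probability, and an answer's probability is the total probability of the candidate mappings that produce it (in the spirit of the by-table semantics of [10]). All attribute names, data values, and function names here are hypothetical illustrations, not the paper's implementation.

```python
# Illustrative sketch: a probabilistic schema mapping as a distribution
# over candidate mappings, each a dict from a source attribute to a
# mediated-schema attribute. All names and data are hypothetical.

from collections import defaultdict

# Two candidate mappings for a source with attributes (phone, hPhone);
# the probabilities over all candidates sum to 1.
prob_mapping = [
    ({"phone": "contact-phone", "hPhone": "home-phone"}, 0.6),
    ({"phone": "home-phone",    "hPhone": "other-phone"}, 0.4),
]

source_tuples = [
    {"phone": "650-555-1234", "hPhone": "908-555-9876"},
]

def answer_probabilities(query_attr, tuples, prob_mapping):
    """Probability of each answer value for a mediated-schema attribute:
    the total probability of the candidate mappings that produce it."""
    answers = defaultdict(float)
    for mapping, p in prob_mapping:
        # Invert the candidate mapping to find which source attribute
        # (if any) feeds the queried mediated attribute.
        inverse = {med: src for src, med in mapping.items()}
        src_attr = inverse.get(query_attr)
        if src_attr is None:
            continue
        for t in tuples:
            answers[t[src_attr]] += p
    return dict(answers)

print(answer_probabilities("home-phone", source_tuples, prob_mapping))
# -> {'908-555-9876': 0.6, '650-555-1234': 0.4}
```

Under the first candidate, hPhone supplies home-phone, so "908-555-9876" receives probability 0.6; the second candidate contributes the remaining 0.4 to "650-555-1234".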
To complete the puzzle, we show how to automatically create a mediated schema from a set of data sources.

The specific contributions of the paper are the following. First, we show how to automatically create a mediated schema from a set of data sources. In doing so, we introduce the concept of a probabilistic mediated schema, which is a set of mediated schemas with probabilities attached to each. We show that probabilistic mediated schemas offer benefits in modeling uncertainty about the semantics of attributes in the sources. We describe how to create a deterministic mediated schema from the probabilistic one, which is the schema exposed to the user.

Our second contribution is an algorithm for creating probabilistic schema mappings from the data sources to the mediated schema. Since a mapping is constructed from a set of weighted attribute correspondences between a source schema and the mediated schema, and such weighted correspondences do not uniquely determine a semantic mapping [10], we construct a probabilistic mapping that is consistent with the correspondences and obtains the maximal entropy (a sketch of this constraint structure appears at the end of this section).

As our final contribution, we describe a set of experimental results establishing the efficacy of our algorithms. We compare the precision and recall of our system with several alternative approaches, including (1) a perfect integration, where the mediated schema and mappings are created manually; (2) a document-centric approach, where we perform keyword search on the sources; and (3) posing queries directly on the sources without a mediated schema. We show that our automatic methods achieve an F-measure of around 0.92 compared to (1), and significantly outperform (2) and (3) as well as several variants of our algorithms. Hence, we believe that our approach can substantially reduce the amount of time it takes to create a data integration application.

The paper is organized as follows. Section 2 gives an overview of our approach.
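As an illustration of the maximum-entropy construction referenced above, here is a minimal sketch: given weighted attribute correspondences, it enumerates the candidate one-to-one mappings and chooses their probabilities so that each correspondence's total (marginal) probability equals its weight while entropy is maximized. The attribute names, weights, and the use of scipy's SLSQP solver are assumptions for illustration only; the paper's actual algorithm may differ.

```python
# Illustrative sketch of the maximum-entropy step: pick a probability
# for each candidate one-to-one mapping so that (a) each correspondence's
# marginal probability equals its weight and (b) entropy is maximized.
# Attribute names, weights, and the solver choice are assumptions.

from itertools import combinations
import numpy as np
from scipy.optimize import minimize

# Weighted correspondences between source attrs (a*) and mediated attrs (b*).
corr = {("a1", "b1"): 0.5, ("a1", "b2"): 0.4,
        ("a2", "b1"): 0.3, ("a2", "b2"): 0.5}

def one_to_one(subset):
    """A valid mapping uses each source and mediated attribute at most once."""
    srcs = [s for s, _ in subset]
    meds = [m for _, m in subset]
    return len(set(srcs)) == len(srcs) and len(set(meds)) == len(meds)

# Candidate mappings: all subsets of correspondences that are one-to-one.
pairs = list(corr)
candidates = [frozenset(c) for r in range(len(pairs) + 1)
              for c in combinations(pairs, r) if one_to_one(c)]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))  # minimize this = maximize entropy

# Constraints: probabilities sum to 1; each correspondence's marginal
# (sum over candidate mappings containing it) equals its weight.
cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
for pair, w in corr.items():
    mask = np.array([pair in m for m in candidates], dtype=float)
    cons.append({"type": "eq",
                 "fun": lambda p, mask=mask, w=w: float(mask @ p) - w})

p0 = np.full(len(candidates), 1.0 / len(candidates))
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(candidates),
               constraints=cons, method="SLSQP")

for m, p in sorted(zip(candidates, res.x), key=lambda t: -t[1]):
    if p > 1e-3:
        print(sorted(m), round(float(p), 3))
```

Because the weights alone leave the distribution underdetermined, the entropy objective selects the least-committal distribution among all those consistent with the correspondences.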