Discovering Topical Structures of Databases Wensheng Wu Berthold Reinwald Yannis Sismanis Rajesh Manjrekar IBM Almaden Research Center San Jose CA 95120 wensheng reinwald syannis rajeshm us ibm com ABSTRACT hundreds or even thousands of tables storing several terabytes of data 30 To make things worse the documentation and metadata for these enterprise databases are often scattered throughout the IT departments of an enterprise they are incomplete inaccurate or simply missing 20 In fact a recent study 19 16 indicates that up to 70 of a data architect s time is actually spent on discovering the metadata of databases Thus the scale of the databases along with the prevalent lack of documentation make it a daunting task for data architects and application developers to understand the databases and incur signi cant cost in integrating the databases 26 16 To illustrate these challenges consider a data architect who tries to understand and integrate two large HumanResource databases HR2 and HR3 shown in the source and target panes of Figure 1 a respectively These databases are taken from a real world scenario described in Section 6 Suppose that HR2 has 200 tables HR3 has 300 tables and on the average there are 10 attributes per table Both databases were designed by contractors and have been in service for several years The designers left the company but did not leave any design documents Furthermore the implementation of the databases might not be consistent with the design For example the referential relationships of tables often are not enforced in the databases due to varied reasons including the cost of enforcing the constraints 10 28 All these make it extremely di cult for the data architect to understand reverse engineer and integrate the databases A key step in integrating the databases is to identify the semantic correspondences or mappings among the attributes from di erent databases 25 12 The scale of the databases again poses serious challenges to this schema matching task Existing matching solutions typically attempt to nd mappings between every two attributes 25 12 2 This all to all approach is based on the assumption that the databases are small and all attributes in one database are potentially relevant to all attributes in another database This assumption might not hold for large databases 26 For example tables in both HR2 and HR3 may be naturally divided into several subject areas or topics such as employee and claim and the tables from di erent subject areas are likely not very relevant As a result the all to all approach is ine cient in that it requires 6M attribute level comparisons and inaccurate in that it may match many attributes from irrelevant tables To illustrate arrows in Figure 1 a indicate tables whose attributes are matched by this approach To avoid the cluttering not all arrows are shown For exam The increasing complexity of enterprise databases and the prevalent lack of documentation incur signi cant cost in both understanding and integrating the databases Existing solutions addressed mining for keys and foreign keys but paid little attention to more high level structures of databases In this paper we consider the problem of discovering topical structures of databases to support semantic browsing and large scale data integration We describe iDisc a novel discovery system based on a multi strategy learning framework iDisc exploits varied evidence in database schema and instance values to construct multiple kinds of database representations It employs a set of base clusterers to discover preliminary topical clusters of tables from database representations and then aggregate them into nal clusters via meta clustering To further improve the accuracy we extend iDisc with novel multiple level aggregation and clusterer boosting techniques We introduce a new measure on table importance and propose an approach to discovering cluster representatives to facilitate semantic browsing An important feature of our framework is that it is highly extensible where additional database representations and base clusterers may be easily incorporated into the framework We have extensively evaluated iDisc using large real world databases and results show that it discovers topical structures with a high degree of accuracy Categories and Subject Descriptors H 2 8 Database Applications Data Mining General Terms Algorithms Design Experimentation Performance Keywords Database structure topical structure discovery clustering 1 INTRODUCTION A large enterprise typically has a huge number of databases that are increasingly complex 6 16 For example the database for a single SAP installation might now contain Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page To copy otherwise to republish to post on servers or to redistribute to lists requires prior specific permission and or a fee SIGMOD 08 June 9 12 2008 Vancouver BC Canada Copyright 2008 ACM 978 1 60558 102 6 08 06 5 00 1019 a Without the topical structures b With the topical structures Figure 1 Understanding integrating large databases 10 3 20 28 However the focus of previous research was mostly on the discovery of keys 10 28 foreign keys 10 21 and functional dependencies 3 while the problem of discovering topical structures has received little attention A related problem is schema summarization 31 which produces an overview of a complex schema with important elements in the schema It measures the importance of an element by the number of its relationship attribute links and the cardinality of its data values It is not concerned with the topics of the elements For example there may be multiple elements in the summary which are all on the same topic or there may be no elements in the summary to represent the less dominant topics in the schema In contrast our goal is to categorize the elements by their topics and exploit the topical structures to not only facilitate semantic browsing but also address the scalability issue in existing schema matching solutions to support a large scale integration In addition we introduce a new measure on table importance based on shortest paths and propose an approach to discovering representative tables within each topic In this paper we describe iDisc a system that
View Full Document