UIC CS 583 - Chapter 10 - Information-integration

Chapter 10: Information Integration
Introduction
Database integration (Rahm and Bernstein 2001)
Integrating two schemas
Integrating two schemas (contd)
Different types of matching
Pre-processing for integration (He and Chang SIGMOD-03, Madhavan et al. VLDB-01, Wu et al. SIGMOD-04)
Schema-level matching (Rahm and Bernstein 2001)
An example
Linguistic approaches (See (Liu, Web Data Mining book 2007) for many references)
Linguistic approaches (contd)
Constraint based approaches (See (Liu, Web Data Mining book 2007) for references)
Domain and instance-level matching (See (Liu, Web Data Mining book 2007) for references)
Match of simple domains
Match of simple domains (contd)
Handling composite domains
Combining similarities
1:m match: two types
Some other issues (Rahm and Bernstein 2001)
Web information integration (See (Liu, Web Data Mining book 2007) for references)
Global Query Interface (He and Chang, SIGMOD-03; Wu et al. SIGMOD-04)
Building global query interface (QI)
Schema model of query interfaces (He and Chang, SIGMOD-03)
Schema model of query interfaces (contd)
Interface matching  schema matching
Web is different from databases (He and Chang, SIGMOD-03)
The interface integration problem
Schema matching as correlation mining (He and Chang, KDD-04)
Slide 29
Correlation measures
A clustering approach (Wu et al., SIGMOD-04)
Using the transitive property
Complex Mappings
Complex Mappings (Cont’d)
Instance-based matching via query probing (Wang et al. VLDB-04)
Query Interface and Result Page
Constructing a global query interface (Dragut et al. VLDB-06)
Slide 38
NLP connection
Summary

Chapter 10: Information Integration
Bing Liu, UIC ACL-07

Introduction
At the end of the last topic, we identified the problem of integrating extracted data: column match and instance value match.
Unfortunately, limited research has been done in this specific context.
Much of the Web information integration research has focused on the integration of Web query interfaces.
In this part, we introducesome basic integration techniques, andWeb query interface integrationBing Liu, UIC ACL-073Database integration (Rahm and Berstein 2001)Information integration started with database integration, which has been studied in the database community since the early 1980s. Fundamental problem: schema matching, which takes two (or more) database schemas to produce a mapping between elements (or attributes) of the two (or more) schemas that correspond semantically to each other. Objective: merge the schemas into a single global schema.Bing Liu, UIC ACL-074Integrating two schemasConsider two schemas, S1 and S2, representing two customer relations, Cust and Customer. S1 S2Cust CustomerCNo CustIDCompName CompanyFirstName ContactLastName PhoneBing Liu, UIC ACL-075Integrating two schemas (contd)Represent the mapping with a similarity relation, , over the power sets of S1 and S2, where each pair in  represents one element of the mapping. E.g., Cust.CNo  Customer.CustIDCust.CompName  Customer.Company{Cust.FirstName, Cust.LastName}  Customer.ContactBing Liu, UIC ACL-076Different types of matchingSchema-level only matching: only schema information is considered.Domain and instance-level only matching: some instance data (data records) and possibly the domain of each attribute are used. This case is quite common on the Web. Integrated matching of schema, domain and instance data: Both schema and instance data (possibly domain information) are available.Bing Liu, UIC ACL-077Pre-processing for integration (He and Chang SIGMOG-03, Madhavan et al. VLDB-01, Wu et al. 
SIGMOD-04Tokenization: break an item into atomic words using a dictionary, e.g., Break “fromCity” into “from” and “city”Break “first-name” into “first” and “name”Expansion: expand abbreviations and acronyms to their full words, e.g., From “dept” to “departure”Stopword removal and stemmingStandardization of words: Irregular words are standardized to a single form, e.g., From “colour” to “color”Bing Liu, UIC ACL-078Schema-level matching (Rahm and Berstein 2001)Schema level matching relies on information such as name, description, data type, relationship type (e.g., part-of, is-a, etc), constraints, etc. Match cardinality:1:1 match: one element in one schema matches one element of another schema. 1:m match: one element in one schema matches m elements of another schema. m:n match: m elements in one schema matches n elements of another schema.Bing Liu, UIC ACL-079An examplem:1 match is similar to 1:m match. m:n match is complex, and there is little work on it.Bing Liu, UIC ACL-0710Linguistic approaches (See (Liu, Web Data Mining book 2007) for many references)They are used to derive match candidates based on names, comments or descriptions of schema elements:Name match:Equality of namesSynonymsEquality of hypernyms: A is a hypernym of B is B is a kind-of A. Common sub-stringsCosine similarityUser-provided name match: usually a domain dependent match dictionaryBing Liu, UIC ACL-0711Linguistic approaches (contd)Description match: in many databases, there are comments to schema elements, e.g., Cosine similarity from information retrieval (IR) can be used to compare comments after stemming and stopword removal.Bing Liu, UIC ACL-0712Constraint based approaches (See (Liu, Web Data Mining book 2007) for references)Constraints such as data types, value ranges, uniqueness, relationship types, etc. An equivalent or compatibility table for data types and keys can be provided. 
E.g.,string  varchar, and (primiary key)  uniqueFor structured schemas, hierarchical relationships such as is-a and part-of may be utilized to help matching. Note: On the Web, the constraint information is often not available, but some can be inferred based on the domain and instance data.Bing Liu, UIC ACL-0713Domain and instance-level matching (See (Liu, Web Data Mining book 2007) for references)In many applications, some data instances or attribute domains may be available. Value characteristics are used in matching.Two different types of domainsSimple domain: each value in the domain has only a single component (the

