
Learning to Create Data Integration Queries
Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha
VLDB 2008
Seminar presented by Noel Gunasekar, CSE Department, SUNY Buffalo

Outline
- Introduction
- Motivation
- Example
- Existing solutions
- Q System solution
- Q System architecture
- Query and query answers
- Executing query
- Learning from feedback
- Conclusion
- Experimental results
- Future work

Introduction

Motivation
- Non-expert users need to pose queries across multiple data resources.
- Non-expert user: not familiar with query languages.
- Multiple resources: databases, data warehouses, virtual integrated schemas.

Bio Science Field
- Many standardized databases with overlapping and cross-referenced information.
- Each site is being independently extended, corrected, and analyzed.
- Differing levels of data quality and confidence.
- Protein databases:
  - PDB: Protein Data Bank, information and service listings at Brookhaven National Laboratory (BNL)
  - PIR: Protein Information Resource database at JHU
  - PRF: Protein Research Foundation database at GenomeNet
  - SwissProt: protein database at ExPASy (Switzerland)

Example
- Question: what are the proteins and genes associated with the disease narcolepsy?
- A life sciences researcher queries data sources covering genomics, disease studies, and pharmacology.
- [Figure: a life sciences researcher querying genomics, disease studies, and pharmacology sources.]
- http://www.expasy.org/uniprot/P04049

Existing Solution
- Keyword-based queries on web forms.
- Match the keywords with terms in the tuples and form the query by joining different databases using foreign keys.
- The cost of the query is fixed and doesn't accommodate the context of the query.
- http://www.expasy.org/uniprot/P0C852
Proposed Solution: Q System
- Automatically generate web forms for a given set of keywords.
- Pose queries across multiple data resources using the generated web form.

Proposed Solution: Q System (workflow)
- Create a reusable web form: the user (author) gives the Q System keywords (protein, gene, disease) and receives a reusable web form for querying.
- Use the web form for querying: users (the author and others) enter parameters into the reusable web form and receive query results.

Q System Architecture

Architecture of Q System: Four Components
- Initial schema loader
- Query template creation
- Query execution
- Learning through feedback

Architecture of Q System
- [Figure: architecture diagram not preserved.]

Initial Setup: Schema Loader
- Input: a set of data sources, each with its own schema; foreign keys and links; schema mappings; record links.
- Output: schema graph.

Initial Setup: Example Schema Graph
- Node: databases and their attributes (UniProt database, Entrez GeneInfo db, term).
- Edge: relation based on foreign keys or cross-references (UniProt to PIR).
- Cost: reliability, completeness.
- [Figure: fragment of the schema graph with nodes b, c, and d and edge costs 0.07, 0.1, and 0.04.]

Query Template Creation

Query Template Creation: Example
- Input: keywords such as protein, plasma membrane, gene, and disease.
- Output: [figure not preserved]

Query Template Creation: Example (schema graph)
- [Figure: schema graph over nodes a through f with edge costs 0.1, 0.07, and 0.04.]
- Query keywords: a, e, f.
- Find trees connecting the red nodes.
- Q1: rank 1, cost 0.4.
- Q2: rank 2.

Query Formulation
- Trees can be easily written as executable queries.
- Steiner tree: nodes a, b, d, e, and f, joined by edges of cost 0.1 each.
- Conjunctive query: a(x, y), b(y, z), d(z, w), e(w, u), f(w, v)
- Next steps: view, refinement, web form.

Query Execution
- Input: web form.
- Output: result answers.
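
The Query Template Creation and Query Formulation slides can be illustrated with a small sketch. It is not the Q System's algorithm: it brute-forces every edge subset, which only works on a toy graph, and the edge list is an assumed reading of the flattened figure (a-b 0.1, b-c 0.07, b-d 0.1, c-d 0.04, d-e 0.1, d-f 0.1). The function names, the choice of binary relations, and the variable-naming scheme are all hypothetical.

```python
from itertools import chain, combinations, count

# Assumed reading of the slide's example graph: (node, node) -> edge cost.
EDGES = {
    ("a", "b"): 0.10, ("b", "c"): 0.07, ("b", "d"): 0.10,
    ("c", "d"): 0.04, ("d", "e"): 0.10, ("d", "f"): 0.10,
}
KEYWORD_NODES = {"a", "e", "f"}  # the "red" nodes matched by the keywords


def adjacency_of(edges):
    """Build an undirected adjacency map from a collection of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def is_steiner_tree(edges, terminals):
    """True if edges form a tree covering all terminals whose leaves are all terminals."""
    adjacency = adjacency_of(edges)
    nodes = set(adjacency)
    if not terminals <= nodes or len(edges) != len(nodes) - 1:
        return False
    seen, stack = set(), [next(iter(terminals))]
    while stack:  # depth-first search to check connectivity
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    leaves = {n for n, nbrs in adjacency.items() if len(nbrs) == 1}
    return seen == nodes and leaves <= terminals


def ranked_trees(edge_costs, terminals):
    """Return (cost, tree) pairs for every candidate tree, cheapest first."""
    candidates = []
    for size in range(1, len(edge_costs) + 1):
        for subset in combinations(sorted(edge_costs), size):
            if is_steiner_tree(subset, terminals):
                candidates.append((sum(edge_costs[e] for e in subset), subset))
    return sorted(candidates)


def to_conjunctive_query(tree_edges, root):
    """Write a tree as a conjunctive query; joined relations share one variable."""
    adjacency = adjacency_of(tree_edges)
    fresh = chain("xyzwuv", (f"t{i}" for i in count(1)))  # variable names
    atoms, visited = [], set()
    stack = [(root, next(fresh))]
    while stack:
        node, in_var = stack.pop()
        visited.add(node)
        out_var = next(fresh)  # shared with every child of this node
        atoms.append(f"{node}({in_var}, {out_var})")
        children = sorted(adjacency[node] - visited, reverse=True)
        stack.extend((child, out_var) for child in children)
    return ", ".join(atoms)


ranked = ranked_trees(EDGES, KEYWORD_NODES)
for rank, (cost, tree) in enumerate(ranked, start=1):
    print(f"Q{rank}: cost {cost:.2f}  {to_conjunctive_query(tree, 'a')}")
```

Under the assumed edge list, the cheapest tree is the one the slide calls Q1, and writing it out from root a gives the same conjunctive query shown on the Query Formulation slide.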
[Figure: result answers, each labeled with the query that produced it (Q1 or Q2).]
- The system determines producer queries using provenance.

Query Execution
- Needs a query processing engine with support for querying remote data sources and recording data provenance.
- Solution: ORCHESTRA, http://www.cis.upenn.edu/~zives/orchestra

Orchestra Project
- The ORCHESTRA project focuses on the challenges of data sharing scenarios in the sciences.
- Bioinformatics scenario: many standardized databases with overlapping information, similar but not identical data, and differing levels of data quality and confidence.
- Each site is being independently extended, corrected, and analyzed.
- The ORCHESTRA collaborative data sharing system (CDSS) focuses on how to support reconciliation across different schemas, with disagreeing users.

Orchestra Project: Data Provenance
- http://www.cis.upenn.edu/~zives/research/exchange.pdf

Learning through Feedback
- Input: ranked results with provenance.
- The user provides feedback on the ranked results.
- [Figure: ranked result answers labeled Q1 and Q2.]

Query Formulation Recap
- [Figure: the schema graph from the query template example, with query keywords a, e, f; Q1 at rank 1 (cost 0.4) and Q2 at rank 2.]

Learning through Feedback
- Before feedback: Q1 has rank 1 (cost 0.4) and Q2 has rank 2 (cost 0.41).
- Change weights so Q2 is cheaper than Q1: one edge weight drops from 0.07 to 0.05.
- After feedback: Q2 has rank 1 (cost 0.39) and Q1 has rank 2 (cost 0.4).
- Iterate.

Q System Challenges
- Computation of ranked queries, which in turn produce ranked tuples (K-best Steiner tree generation).
- Predicting new query rankings based on user feedback over tuples, and generalizing that feedback (learning).
- Maintaining associations between tuples and queries (query answers with provenance).
- Everything at interactive speed.

Cost of a Query
- Query cost: sum of the edge costs in the tree.
- Edge cost: sum of the weights of the features defined over the edge.
- Features are properties of the edges (e.g., the nodes connected). Each feature has a corresponding weight.
- Feature example: f = 1 if the edge connects the Term and Synonym tables, else 0.
- [Figure: edge between Term and Synonym with f = 1 and weight w8.]

Steiner Trees: Finding Lowest-Cost Queries
- A tree of minimal cost in a graph G which includes all the required
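
The cost definition from the Cost of a Query slide can be written out as a short sketch. Only the Term/Synonym indicator feature comes from the slides; the `default` feature, the `GeneInfo` table name, the weight values, and all function names are made up for illustration.

```python
def connects(table_1, table_2):
    """Build an indicator feature: 1 if the edge joins the two given tables, else 0."""
    def feature(edge):
        return 1.0 if set(edge) == {table_1, table_2} else 0.0
    return feature


# Hypothetical feature set. "term_synonym" mirrors the slide's example feature;
# "default" fires on every edge so that each edge has some base cost.
FEATURES = {
    "term_synonym": connects("Term", "Synonym"),
    "default": lambda edge: 1.0,
}

# Made-up weights, one per feature (the slide labels the Term/Synonym weight w8).
WEIGHTS = {"term_synonym": 0.05, "default": 0.10}


def edge_cost(edge, features, weights):
    """Edge cost: sum of the weights of the features that fire on the edge."""
    return sum(weights[name] * feature(edge) for name, feature in features.items())


def query_cost(tree_edges, features, weights):
    """Query cost: sum of the edge costs in the tree."""
    return sum(edge_cost(edge, features, weights) for edge in tree_edges)


tree = [("Term", "Synonym"), ("Term", "GeneInfo")]
print(round(query_cost(tree, FEATURES, WEIGHTS), 2))
```

Because costs are tied to feature weights rather than stored per edge, changing one weight changes the cost of every edge that shares the feature, which is what lets feedback on one query generalize to others.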

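The "change weights so Q2 is cheaper than Q1" step from the Learning through Feedback slides can be sketched as a margin-based update. The slides in this preview do not show the update rule, so this uses a generic passive-aggressive style step as a stand-in, with one indicator feature per edge (so weights equal edge costs). The edge list is an assumed reading of the flattened figure, and the margin value and function names are hypothetical.

```python
# Assumed reading of the slide's example graph: one weight per edge.
WEIGHTS = {
    ("a", "b"): 0.10, ("b", "c"): 0.07, ("b", "d"): 0.10,
    ("c", "d"): 0.04, ("d", "e"): 0.10, ("d", "f"): 0.10,
}
Q1 = [("a", "b"), ("b", "d"), ("d", "e"), ("d", "f")]              # cost 0.40
Q2 = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]  # cost 0.41


def tree_cost(tree, weights):
    """Cost of a query tree: sum of its edge weights."""
    return sum(weights[edge] for edge in tree)


def feedback_update(weights, preferred, rival, margin=0.01):
    """Return new weights under which `preferred` beats `rival` by at least `margin`.

    Makes the smallest (Euclidean) change to the weights that satisfies
    cost(preferred) + margin <= cost(rival); returns a copy if already satisfied.
    """
    # Feature difference: +1 on edges only in preferred, -1 on edges only in rival.
    diff = {edge: 0 for edge in weights}
    for edge in preferred:
        diff[edge] += 1
    for edge in rival:
        diff[edge] -= 1
    loss = tree_cost(preferred, weights) - tree_cost(rival, weights) + margin
    norm_sq = sum(d * d for d in diff.values())
    if loss <= 0 or norm_sq == 0:
        return dict(weights)
    step = loss / norm_sq
    return {edge: w - step * diff[edge] for edge, w in weights.items()}


# The user prefers an answer produced by Q2, so Q2 should outrank Q1.
updated = feedback_update(WEIGHTS, preferred=Q2, rival=Q1)
print(round(tree_cost(Q1, updated), 4), round(tree_cost(Q2, updated), 4))
```

This update spreads the change over the three edges on which the two trees differ, whereas the slide changes a single weight from 0.07 to 0.05; both reach the same goal of ranking Q2 first. A real system would also need to keep edge costs non-negative, which this sketch does not enforce.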


UB CSE 705 - Learning to Create Integration Queries
