Geographically Typed Geospatial Data Source Matching with High Quality Clustering and Multi-Attribute Matching

Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Funded by NGA, US Air Force

Topic Outline
- Problem Statement
- Background Information
- Matching Procedures
- Generalized Solution
- N-grams
- Non-Geographic Matching (NGT Matching)
- Geographic Matching (GT Matching)
- Attribute Weighting
- High Quality Clustering
- 1:N Matching
- Experimental Results
- Future Work

Motivation
Internet Architecture
- Highly Distributed
- Federated Architecture
Web Application Problems
- Low Performance for Information Retrieval
- Accuracy of Retrieved Information

Sample Scenario
Query: Publication of Academic Staff Rank
Data sources: MIT Ontology, UMBC Ontology, Karlsruhe Ontology
Concepts shown: Article, Book, Booklet, InBook, InCollection, InProceedings, Manual, Misc, Proceedings, Report, Technical Report, Project Report, Thesis, Master Thesis, PhD Thesis, Unpublished, Faculty Member, Lecturer

Different Bibliography Ontologies
[Figure: the UMBC, MIT, and Karlsruhe ontologies shown side by side.]

Problem Statement (Schema Matching)
Given 2 data sources, S1 and S2, each of which is composed of a set of tables, where S1 contains the tables $T_{11}, T_{12}, T_{13}, \ldots, T_{1k}, \ldots, T_{1m}$ and S2 contains the tables $T_{21}, T_{22}, T_{23}, \ldots, T_{2j}, \ldots, T_{2n}$, with $1 \le k \le m$ and $1 \le j \le n$, determine the similarity between $T_{1k}$ and $T_{2j}$.

Example tables from S1 and S2:

roadName    | City
Johnson Rd  | Plano
School Dr   | Richardson
Zeppelin St | Lakehurst
Alma Dr     | Richardson

Road       | County
Custer Pwy | Cooke
15th St    | Collin
Parker Rd  | Collin
Alma Dr    | Collin

City          | County
Anacortes     | Skagit
Friday Harbor | San Juan
Argyle        | San Juan
Kirkland      | King

COUNTY    | Destination
SNOHOMISH | Mukilteo
PIERCE    | Point Defiance
KITSAP    | Southworth
SNOHOMISH | Edmonds

Problem Statement (Ontology Matching)
Given 2 ontologies, O1 and O2, each of which is composed of a set of concepts, where O1 contains the concepts $C_{11}, C_{12}, C_{13}, \ldots, C_{1k}, \ldots, C_{1m}$ and O2 contains the concepts $C_{21}, C_{22}, C_{23}, \ldots, C_{2j}, \ldots, C_{2n}$, with $1 \le k \le m$ and $1 \le j \le n$, determine the similarity between $C_{1k}$ and $C_{2j}$.

Motivating Scenarios
1. Making Complex Business Decisions
   "Should we invest in a new cholesterol drug for the Asia Pacific region?"
   Departments consulted: Marketing, Regulatory Affairs, Corporate R&D, Manufacturing
   Possible answers: Yes, No, Maybe
2. Robust Semantic Web Applications
   "Find the group of friends around Jeff. Then find the most important person out of the group. Find out if this person was at an event of type Meeting that happened between 9AM and 11AM within 5 miles of UTD."
   Components: Social Network, RDFS Lookup, Geospatial Ontology, Temporal Logic
   Query parts: Jeff; Jeff's friends; Event of Type Meeting; Within 5 miles of UTD; 9:00am-11:00am
   Possible answers: Yes, No, Maybe

Matching Approaches
Mappings may be generated in several ways. Some approaches are:
1. Name Matching (e.g., Email and emailAddress)
2. Structure Matching
3. Instance Matching

Example instances for instance matching:

County  | DSP
Kitsap  | Kingston
Wahkiak | Puget Island

COUNTYNAME     | CID
TRAIL RANGE DR | 96
KITSAP         | 97

Some Definitions
Definition 1 (attribute): An attribute of a table T, denoted att(T), is a property of T that further describes it.
Definition 2 (instance): An instance x of an attribute att(T) is a data value associated with att(T).
Definition 3 (keyword): A keyword k of an instance x associated with attribute att(T) is a meaningful word (not a stopword) representing a portion of the instance.

Some Definitions (cont.)
Definition 4a (geographic type, GT): A geographic type GT associated with attribute att(T) is a class of instances of att(T) that represent the same geographic feature (e.g., lake, road).
Definition 4b (non-geographic type, NGT): A non-geographic type NGT associated with attribute att(T) is a group of keywords from instances of att(T) that are semantically related to each other.
Example keyword groups: Collin, Plano, Richardson; New Jersey, Trenton, Monmouth

Overview of Matching Algorithm
1. Select attribute pairs for comparison.
   Attributes shown: roadName, roadType, city; rType, rName
2. Match instances between compared attributes.
   roadName: K Ave, Jupiter Rd, Coit Rd
   rName: L Ave, LBJ Freeway, US 75
   Run Sim algorithms on the two instance sets.
3. Determine final attribute similarity.
   roadName and rName: EBD 98
   Other attributes shown: town, county

Determining Semantic Similarity
- We use Entropy Based Distribution (EBD).
- EBD is a measurement of type similarity between 2 attributes (or columns):
  $\mathrm{EBD} = \frac{H(C \mid T)}{H(C)}$
- EBD takes values in the range [0, 1].
- Greater EBD corresponds to more similar type distributions between compared attributes (columns).

Applying EBD to Semantic Matching
[Figure: instances of att1 and att2 labeled with types X, Y, and Z, illustrating the entropy H(C) and the conditional entropy H(C|T).]

Matching Using N-grams
Use commonly occurring N-grams [2, 3] in compared attributes to determine similarity (N = 2).

Table A:
StrName         | FENAME       | Status
LOCUST GROVE DR | LOCUST GROVE | BUILT
TRAIL RANGE DR  | TRAIL RANGE  | BUILT

Table B:
Columns: Street, Laddress, Raddress
Street values: LOUISE DOVER DR CR45 MANET CT
Laddress values: 1600, 2500
Raddress values: 1798, 2598

Some N-grams extracted from A.StrName: LO, OC, CU, ST, OV
Some N-grams extracted from B.Street: LO, OU, UI, OV, UI
[Figure: the n-grams LO, OV, and ST drawn between tables T_A and T_B, labeled "Conditional Entropy H(C|T)".]

[2] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar. Content-based ontology matching for GIS datasets. ACM SIGSPATIAL GIS 2008 (ACM GIS), Laguna Beach, California, Nov. 2008, 51.
[3] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar. Ontology Alignment Using Multiple Contexts. 7th International Semantic Web Conference (ISWC), Karlsruhe, Germany, Oct. 2008.

Faults of this Method
Semantically similar columns are not guaranteed to have a high similarity score.

Table A (T1):
City        | Country
Dallas      | USA
Houston     | USA
Kingston    | Jamaica
Halifax     | Canada
Mexico City | Mexico

Table B (T2):
ctyName      | country
Shanghai     | China
Beijing      | China
Tokyo        | Japan
New Delhi    | India
Kuala Lumpur | Malaysia

2-grams extracted from A: Da, al, la, as, Ho, ou, us
2-grams extracted from B: Sh, ha, an, ng, gh, ha, ai, Be, ei, ij

Non-Geographic Matching
- Use clustering methods to group keywords of instances together without relying on shared N-grams between instances [4].
- K-means is not suitable because we cannot compute a centroid among string instances, so we use K-medoid clustering.
- Use ...
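The reason K-medoid works where K-means does not can be seen in a small sketch: every cluster center is an actual keyword, so only pairwise distances are needed and no centroid is ever computed. The slides do not say which string distance is used, so `string_distance` below is a stand-in based on Python's `difflib`, and the function names, the first-k initialization, and the alternating assign/update loop are all assumptions made for illustration rather than the authors' procedure.

```python
from difflib import SequenceMatcher


def string_distance(a, b):
    """Stand-in distance (an assumption): 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def k_medoids(items, k, distance=string_distance, max_iter=100):
    """Group strings around k medoids; returns {medoid: [members]}."""
    items = list(dict.fromkeys(items))  # drop duplicates, keep order
    medoids = items[:k]                 # naive deterministic initialization
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: distance(item, m))
            clusters[nearest].append(item)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(distance(c, o) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # medoids are stable, so the clustering has converged
        medoids = new_medoids
    return clusters


if __name__ == "__main__":
    keywords = ["Collin", "Plano", "Richardson",
                "New Jersey", "Trenton", "Monmouth"]
    for medoid, members in k_medoids(keywords, k=2).items():
        print(medoid, "->", members)
```

A character-level distance like this one will not group keywords by meaning; swapping in a semantic distance only requires passing a different `distance` function.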
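The EBD ratio from the "Determining Semantic Similarity" slide, combined with the 2-gram types from "Matching Using N-grams", can be sketched as follows. This is one plausible reading, not the authors' implementation: it assumes C is the column an n-gram came from, T is the n-gram itself, n-grams are taken within each word, and probabilities are estimated from raw counts. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def ngrams(value, n=2):
    """Character n-grams taken inside each whitespace-separated word."""
    return [word[i:i + n]
            for word in value.split()
            for i in range(len(word) - n + 1)]


def ebd(column_a, column_b, n=2):
    """EBD = H(C|T) / H(C), with C = source column and T = n-gram."""
    counts = [Counter(), Counter()]
    for counter, column in zip(counts, (column_a, column_b)):
        for value in column:
            counter.update(ngrams(value, n))
    total = sum(sum(counter.values()) for counter in counts)
    if total == 0:
        return 0.0

    # H(C): entropy of which column an n-gram occurrence belongs to.
    h_c = 0.0
    for counter in counts:
        p = sum(counter.values()) / total
        if p > 0:
            h_c -= p * log2(p)
    if h_c == 0:
        return 0.0  # one column contributed no n-grams

    # H(C|T): remaining uncertainty about the column once the n-gram is known.
    h_c_given_t = 0.0
    for t in set(counts[0]) | set(counts[1]):
        n_t = counts[0][t] + counts[1][t]
        for counter in counts:
            if counter[t] > 0:
                h_c_given_t -= (counter[t] / total) * log2(counter[t] / n_t)
    return h_c_given_t / h_c


if __name__ == "__main__":
    # The city columns from the "Faults of this Method" slide share no 2-grams.
    print(ebd(["Dallas", "Houston"], ["Shanghai", "Beijing"]))
    print(ebd(["LOCUST GROVE DR", "TRAIL RANGE DR"], ["LOUISE", "DOVER DR"]))
```

Under this reading, columns whose n-grams never overlap score 0 even when they hold the same kind of data, which is the weakness the "Faults of this Method" slide points out.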