DOC PREVIEW
Collective Entity Resolution in Relational Data

This preview shows page 1-2-16-17-18-34-35 out of 35 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Collective Entity Resolution in Relational DataINDRAJIT BHATTACHARYAandLISE GETOORUniversity of Maryland, College ParkMany databases contain uncertain and imprecise references to real-world entities. The absenceof identifiers for the underlying entities often results in a database which contains multiple ref-erences to the same entity. This can lead not only to data redundancy, but also inaccuracies inquery processing and knowledge extraction. These problems can be alleviated through the useof entity resolution. Entity resolution involves discovering the underlying entities and mappingeach database reference to these entities. Traditionally, entities are resolved using pair-wise sim-ilarity over the attributes of references. However, there is often additional relational informationin the data. Specifically, references to different entities may co-occur. In these cases, collectiveentity resolution, in which entities for co-occurring references are determined jointly, rather thanindependently, can improve entity resolution accuracy. We propose a novel relational clusteringalgorithm that uses both attribute and relational information for determining the underlying do-main entities, and we give an efficient implementation. We investigate the impact that differentrelational similarity measures have on entity resolution quality. We evaluate our collective entityresolution algorithm on multiple real-world databases. We show that it improves entity resolu-tion performance over both attribute-based baselines and over algorithms that consider relationalinformation but do not resolve entities collectively. In addition, we perform detailed experimentson synthetically generated data to identify data characteristics that favor collective relationalresolution over purely attribute-based algorithms.Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: InformationSearch and Retrieval—Clustering; H.2.8 [Database Management]: Database Applications—Data miningGeneral Terms: AlgorithmsAdditional Key Words and Phrases: entity resolution, graph clustering, record linkage, datacleaning1. INTRODUCTIONIn many applications, there are a variety of ways of referring to the same under-lying real-world entity. For example in a census database, “J. Doe”, “JonathanDoe” and “Jon Doe” may all refer to the same person. Additionally, in many do-mains references to different entities often co-occur in the data. For example, thesame database may also have records showing that “Jonathan Doe” is married to“Jeanette Doe” and has dependents“James Doe” and “Jason Doe”, “Jon Doe” ismarried to “Jean Doe”, and “J. Doe” has dependents “Jim Doe”, “Jason Doe” and“Jackie Doe”. Such relationships between entity references are best represented asa graph, which we refer to as the reference graph, where the nodes are the entityreferences and edges (or often hyper-edges) in the graph indicate references whichco-occur. The problem is, for any real-world entity, there may be multiple refer-ences to it, and, accordingly, there is more than one node in the reference graphcorresponding to that entity.ACM Journal Name, Vol. v, No. n, mm 20yy, Pages 1–35.2 · Bhattacharya and GetoorThus an important first step in any graph mining algorithm is transforming sucha reference graph into an entity graph, where nodes are the entities themselves andedges are among entities. Given a collection of references to entities, we wouldlike to a) determine the collection of ‘true’ underlying entities and b) correctlymap the entity references in the collection to these entities. Figure 1(a) shows thereference graph from our census example and Figure 1(b) shows the entity graphafter the references in the reference graph have been resolved. Even in this simpleexample, the entity graph is much smaller than the reference graph and consists ofa single connected component. This provides a much more accurate picture of theunderlying domain structure than the collection of disconnected subgraphs in thereference graph.Entity resolution is a common problem that comes in different guises (and is givendifferent names) in many computer science domains. Examples include computervision, where we need to figure out when regions in two different images refer tothe same underlying object (the correspondence problem); natural language pro-cessing when we would like to determine which noun phrases refer to the sameunderlying entity (co-reference resolution); and databases, where, when mergingtwo databases or cleaning a database, we would like to determine when two tuplerecords are referring to the same real-world object (deduplication and data inte-gration). Deduplication [Hern´andez and Stolfo 1995; Monge and Elkan 1996] isimportant for both accurate analysis, for example determining the number of cus-tomers, and for cost-effectiveness, for example removing duplicates from mailinglists. In information integration, determining approximate joins [Cohen 2000] isimportant for consolidating information from multiple sources; most often therewill not be a unique key that can be used to join tables in distributed databases,and we must infer when two records from different databases, possibly with differ-ent structures, refer to the same entity. In many of these examples, co-occurrenceinformation in the input can be naturally represented as a graph.Traditional approaches to entity resolution and deduplication use a variety ofattribute similarity measures, often based on approximate string matching crite-ria. These work well for correcting typographical errors and other types of noisyreference attributes. More sophisticated approaches make use of domain specificattribute similarity measures and often learn such mapping functions from resolveddata. However, it is still difficult to decide when identical references are in fact dis-tinct. For example, two people with name ‘J. Doe’ and living at the same addressand of the same age may be brothers and not the same person.More recent approaches take structural (i.e., relational) similarity into account[Ananthakrishna et al. 2002; Bhattacharya and Getoor 2004; Kalashnikov et al.2005; Dong et al. 2005]. One approach simply looks at the attributes of relatedreferences, and incorporates them into the attribute similarity score. For example,if we are comparing two census records for ‘Jon Doe’ and ‘Jonathan Doe’, we shouldbe more likely to match


Collective Entity Resolution in Relational Data

Download Collective Entity Resolution in Relational Data
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Collective Entity Resolution in Relational Data and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Collective Entity Resolution in Relational Data 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?