Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema

Yizhou Sun, Yintao Yu, Jiawei Han
Department of Computer Science

ABSTRACT
A heterogeneous information network is an information network composed of multiple types of objects. Clustering on such a network may lead to a better understanding of both the hidden structures of the network and the individual role played by every object in each cluster. However, although clustering on homogeneous networks has been studied for decades, clustering on heterogeneous networks has not been addressed until recently.

A recent study proposed a new algorithm, RankClus, for clustering on bi-typed heterogeneous networks. However, a real-world network may consist of more than two types, and the interactions among multi-typed objects play a key role in disclosing the rich semantics that a network carries. In this paper, we study clustering of multi-typed heterogeneous networks with a star network schema and propose a novel algorithm, NetClus, that utilizes links across multi-typed objects to generate high-quality net-clusters. An iterative enhancement method is developed that leads to effective ranking-based clustering in such heterogeneous networks. Our experiments on DBLP data show that NetClus generates more accurate clustering results than the baseline topic model algorithm PLSA and the recently proposed algorithm, RankClus. Further, NetClus generates informative clusters, presenting good ranking and cluster membership information for each attribute object in each net-cluster.

Categories and Subject Descriptors
H.2.8 [Information Systems Applications]: Database Applications - Data Mining

∗The work was supported in part by the U.S. National Science Foundation grants IIS-08-42769 and BDI-05-15813, Office of Naval Research (ONR) grant N00014-08-1-0565, and the Air Force Office of Scientific Research MURI award FA9550-08-1-0265.
Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD’09, June 28–July 1, 2009, Paris, France.
Copyright 2009 ACM 978-1-60558-495-9/09/06 ...$10.00.

General Terms
Algorithms

Keywords
heterogeneous information network, clustering

1. INTRODUCTION
Information networks, containing a large number of individual agents or components interacting with each other, are ubiquitous in many applications, e.g., the Internet, which consists of a gigantic network of webpages, co-author networks and citation networks extracted from bibliographic data, user networks extracted from email systems, and friendship networks extracted from web sites like Facebook¹ and Myspace². Clustering on an information network based on links between objects may give us a grand view of the huge network. For example, communities can be detected by clustering on a co-author network [11]. Most current studies [20, 17, 19, 21] on information networks are on homogeneous networks, i.e., networks consisting of a single type of object, as shown above. However, in reality, objects could be of multiple types, forming a heterogeneous network. A recent algorithm, RankClus [16], deals with bi-typed heterogeneous networks. Unfortunately, in reality there often exist more than two types of interacting objects in a network. Among them, networks with a star network schema (called star networks), such as a bibliographic network centered on papers and a tagging network (e.g., http://delicious.com) centered on a tagging event, are popular and important.
In fact, any n-ary relation set, such as records in a relational database, can be mapped into a star network, with each relation as the center object and all attribute entities linking to it.

¹ http://www.facebook.com/
² http://www.myspace.com/

Example 1.1 (Bibliographic Information Network)
A bibliographic network consists of rich information about research papers, each written by a group of authors, using a set of terms, and published in a venue (a conference or a journal). Such a bibliographic network is composed of four types of objects: authors, venues, terms, and papers. Links exist between papers and authors by the relations “write” and “written by”, between papers and terms by the relations “contain” and “contained in”, and between papers and venues by the relations “publish” and “published by”. The topological structure of a bibliographic network is shown in the left part of Figure 1, which forms a star network schema, where paper is the center type and all other types of objects are linked via papers.

[Figure 1: Clustering on A Bibliographic Network. Schema labels: Research Paper, Term, Author, Venue; link labels: Publish, Write, Contain; output labels after NetClus: Computer Science, Database, Hardware, Theory.]

One possible way to cluster a heterogeneous network is to first extract from it a set of homogeneous networks and then apply traditional graph clustering algorithms. However, such an extraction is an information reduction process: some valuable information, e.g., the paper title or the venue published in, is lost in an extracted co-author network. Further, although clustering a co-author network may discover author communities, a research network contains not only authors, but also venues, terms, and papers. It is important to preserve such information by directly clustering on heterogeneous networks, which may lead to generating sub-network clusters carrying rich information.
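
The information loss described above can be illustrated with a minimal sketch. It assumes a hypothetical toy set of paper records (the paper IDs, author names, and terms below are invented for illustration and are not taken from the DBLP data), stores the star network as typed links from each paper to its attribute objects, and then projects it onto a co-author network. The function names `build_star_network` and `coauthor_projection` are likewise hypothetical and are not part of NetClus.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: each paper is a center object of the star network.
PAPERS = {
    "p1": {"authors": ["A. Author", "B. Author"], "venue": "KDD",
           "terms": ["clustering", "network"]},
    "p2": {"authors": ["B. Author", "C. Author"], "venue": "VLDB",
           "terms": ["query", "index"]},
    "p3": {"authors": ["A. Author", "B. Author"], "venue": "VLDB",
           "terms": ["query", "network"]},
}


def build_star_network(papers):
    """Return typed links (paper_id, attribute_type, attribute_value)."""
    links = []
    for pid, rec in papers.items():
        for author in rec["authors"]:
            links.append((pid, "author", author))
        links.append((pid, "venue", rec["venue"]))
        for term in rec["terms"]:
            links.append((pid, "term", term))
    return links


def coauthor_projection(links):
    """Collapse the star network into a weighted co-author graph.

    Only author links are used; venue and term links are discarded,
    which is the information reduction discussed in the text.
    """
    authors_of = defaultdict(set)
    for pid, attr_type, value in links:
        if attr_type == "author":
            authors_of[pid].add(value)
    weights = defaultdict(int)
    for authors in authors_of.values():
        for u, v in combinations(sorted(authors), 2):
            weights[(u, v)] += 1  # number of co-written papers
    return dict(weights)


links = build_star_network(PAPERS)
coauthors = coauthor_projection(links)
# Attribute types present in the star network but absent from the projection.
lost_types = {attr_type for _, attr_type, _ in links} - {"author"}

print(coauthors)
print(sorted(lost_types))
```

In this toy case the projection keeps only how often two authors wrote together; which venue a paper appeared in and which terms it used can no longer be recovered from the co-author graph alone.
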
This motivates us to develop NetClus, a method that discovers net-clusters, i.e., a set of sub-network clusters induced from the original heterogeneous network (Figure 1).

The second weakness of current clustering algorithms is that they do not consider the importance of each object in the network and merely output the cluster label for each object. As a result, clusters are difficult to understand, especially when the clusters are large. NetClus not only discovers net-clusters but also gives a ranking distribution for each type of object in each cluster, which makes the clusters so discovered quite meaningful, as shown in the following example.

Example 1.2.

