New version page

Graph Summarization with Bounded Error

This preview shows page 1-2-3-4 out of 13 pages.

View Full Document
View Full Document

End of preview. Want to read all 13 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document
Unformatted text preview:

Graph Summarization with Bounded ErrorSaket Navlakha∗Dept. of Computer ScienceUniversity of MarylandCollege Park, MD, [email protected] Rastogi†Yahoo! LabsBangalore, [email protected] ShrivastavaBell Labs ResearchBangalore, [email protected] propose a highly compact two-part representation of agiven graph G consisting of a graph summary and a set ofcorrections. The graph summary is an aggregate graph inwhich each node corresponds to a set of nodes in G, andeach edge represents the edges between all pair of nodes inthe two sets. On the other hand, the corrections portionsp ecifies the list of edge-corrections that should be appliedto the summary to recreate G. Our representations allowfor both lossless and lossy graph compression with boundson the introduced error. Further, in combination with theMDL principle, they yield highly intuitive coarse-level sum-maries of the input graph G. We develop algorithms to con-struct highly compressed graph representations with smallsizes and guaranteed accuracy, and validate our approachthrough an extensive set of exp eriments with multiple real-life graph data sets.To the best of our knowledge, this is the first work tocompute graph summaries using the MDL principle, and usethe summaries (along with corrections) to compress graphswith bounded error.Categories and Subject DescriptorsE.2 [Data Storage Representations]; H.3 [InformationStorage and Retrieval]: Information StorageGeneral TermsAlgorithmsKeywordsGraph Compression, Minimum Description Length, Approx-imation∗This work was done when the author was visiting Bell LabsResearch, Bangalore, India.†This work was done when the author was with Bell LabsResearch, Bangalore, India.Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.1. INTRODUCTIONGraphs are a fundamental abstraction that have been em-ployed for centuries to model real-world systems and phe-nomena. Today, numerous large-scale systems and appli-cations need to analyze and store massive amounts of datathat involve interactions between various entities – this datais best represented as a graph; for instance, the link structureof the World Wide Web, group of friends in social networks,data exchange between IP addresses, market basket data,etc., can all be repr esented as massive graph structures. Be-low, we look at some of these application domains.• World Wide Web. The Web has a natural graph struc-ture with a node for each page and a directed edge foreach hyperlink. This link structure of the Web hasb een exploited very successfully by search engines likeGoogle [4] to improve search quality. Other con-temporar y research works mine the Web graph to finddense bipartite cliques, and through them Web com-munities [21] and link spam [12]. Recent estimatesfrom search engines put the size of the Web graph ataround 3 billion nodes and more than 50 billion arcs[3]. (Note that these are clearly lower bounds since theWeb graph has been growing rapidly over the years asmore of the Web gets discovered and indexed.) Thus,the Web graph can easily occupy many terabytes ofstorage.• Social Networking. Popular social networking websiteslike Facebook, MySpace and LinkedIn cater to millionsof users at a time, and maintain information abouteach user (nodes) and their f riend-lists (edges). Miningthe social network graph can provide valuable informa-tion on social relationships between users, the music,movies, etc. that they like, and user communities withcommon interests.• IP Network Monitoring. IP routers export records con-taining source and destination IP addresses, numberof bytes transmitted, duration, etc. for each IP com-munication flow. Recently, Iliofotou et. al. [14] pro-p osed the idea of extracting Traffic Dispersion Graphs(TDGs) from network traces, where each node cor-resp onds to an IP address and there is an edge be-tween any two IP addresses who sent traffic to eachother. Such graphs can be used to detect interestingor unusual communication patterns, security vulnera-bilities, hosts that are infected by a virus or a worm,and malicious attacks against machines. These graphs,however, can be large – it has been reported in [7] thatthe AT&T IP backbone network alone generates 500GB of IP flow data per day (about ten billion fifty-byterecords).• Market Basket Data. Market basket data contains in-formation about products bought by millions of cus-tomers. This is essentially a bipartite graph with anedge between a customer and every product that heor she purchases. Mining this graph to find groups ofcustomers with similar buying patterns can help withcustomer segmentation and targeted advertising.A common theme in all of the above applications is theneed to analyze large graphs with millions and even bil-lions of nodes and edges. Visualizing such massive graphsis clearly a major challenge due to the difficulty of gettingeverything to fit in a single screen. Furthermore, develop-ing graph mining algorithms that can scale to such giganticproportions is another non-trivial challenge, especially whenthe graph is too large to fit entirely in main memory.In this paper, we propose information-theoretic techniquesfor computing compressed graph representations. Our graphrepresentation R has two parts: the first is a graph summaryS (much smaller than the input) that captures the impor-tant clusters and relationships in the input graph, while thesecond is a set of corrections C that helps to recreate theoriginal graph, if necessary. Moreover, if the user is will-ing to tolerate a certain amount of error in the recreationpro cess, we also show how to exploit this leeway to get fur-ther reduction in the size of the representation, and strike atrade-off between accuracy and memory.Our graph representation has the following benefits:• The summary S is itself a graph with substantiallyfewer nodes and links that can easily fit in memory.Thus, it is amenable to visualization and other graphanalysis techniques (e.g., finding


Loading Unlocking...
Login

Join to view Graph Summarization with Bounded Error and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Graph Summarization with Bounded Error and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?