Unformatted text preview:

Self-Similarity In the WebSTEPHEN DILL, RAVI KUMAR, KEVIN S. MCCURLEY, SRIDHARRAJAGOPALAN, D. SIVAKUMAR, and ANDREW TOMKINSIBM Almaden Research Center, San JoseAlgorithmic tools for searching and mining the Web are becoming increasingly sophisticated andvital. In this context, algorithms that use and exploit structural information about the Web performbetter than generic methods in both efficiency and reliability.We present an extensive characterization of the graph structure of the Web, with a view toenabling high-performance applications that make use of this structure. In particular, we showthat the Web emerges as the outcome of a number of essentially independent stochastic processesthat evolve at various scales. A striking consequence of this scale invariance is that the structureof the Web is “fractal”—cohesive subregions display the same characteristics as the Web at large. Anunderstanding of this underlying fractal nature is therefore applicable to designing data servicesacross multiple domains and scales.We describe potential applications of this line of research to optimized algorithm design forWeb-scale data analysis.Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval]: InformationSearch and Retrieval—information filteringGeneral Terms: Experimentation, Measurement, Theory, VerificationAdditional Key Words and Phrases: Fractal, graph structure, online information services, self-similarity, Web-based services, World-Wide-Web1. INTRODUCTIONAs the the size of the Web grows exponentially, data services on the Web arebecoming increasingly complex and challenging tasks. These include both basicservices such as searching and finding related pages, and advanced applicationssuch as Web-scale data mining, community extraction, constructions of indices,taxonomies, and vertical portals. Applications are beginning to emerge that arerequired to operate at various points on the “petabyte curve”—billions of Webpages that each have megabytes of data, tens of millions of users in a peer-to-peer setting each with several gigabytes of data, etc. The upshot of the rateand diversity of this growth is that data service applications for collections ofAuthors’ address: IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120; email:{dill,ravi,mccurley,sridhar,siva,tomkins}@almaden.ibm.com.Permission to make digital or hard copies of part or all of this work for personal or classroom use isgranted without fee provided that copies are not made or distributed for profit or direct commercialadvantage and that copies show this notice on the first page or initial screen of a display alongwith the full citation. Copyrights for components of this work owned by others than ACM must behonored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,to redistribute to lists, or to use any component of this work in other works, requires prior specificpermission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected]°2002 ACM 1533-5399/02/0800-0205 $5.00ACM Transactions on Internet Technology, Vol. 2, No. 3, August 2002, Pages 205–223.206•S. Dill et al.hyperlinked documents need to be efficient and effective at several scales ofoperation. As we will show, a form of “scale invariance” exists on the Web thatallows simplification of this multiscale data service design problem.The first natural approach to the wide range of analysis problems emergingin this new domain is to develop a general query language to the Web. Therehave been a number of proposals along these lines [Mendelzon et al. 1997;Abiteboul et al. 1997; Spertus 1997]. Further, various advanced mining opera-tions have been developed in this model using a Web-specific query languagelike those described above, or a traditional database encapsulating some do-main knowledge into table layout and careful construction of SQL programs[Chakrabarti et al. 1999a; Spertus and Stein 1998; Arocena et al. 1997].However, these applications are particularly successful precisely when theytake advantage of the special structure of the document collections and thehyperlink references among them. An early example of this phenomenon in themarketplace is the paradigm shift witnessed in search applications—rankingschemes for Web pages were vastly improved when link-based analysis wasadded to more traditional text-based schemes [Kleinberg 2000; Brin and Page1998].The success of these specialized approaches naturally led researchers to seeka finer understanding of the hyperlinked structure of the Web. Broadly, thereare two (very related) lines of research that have emerged. The first one is moretheoretical, and is concerned with proposing stochastic models that explain thehyperlink structure of the Web [Kumar et al. 2000; Barabasi and Albert 1999;Aiello et al. 2000]. The second line of research [Broder et al. 2000; Barabasiand Albert 1999; Adamic and Huberman 1999; Kumar et al. 1999a] is moreempirical; new experiments are conducted that either validate or refine existingmodels.There are several driving applications that motivate (and are motivated by)a better understanding of the neighborhood structure on the Web. In particu-lar, the “second generation” of data service applications on the Web—includingadvanced search applications [Chakrabarti et al. 1998a; Chakrabarti et al.1998b; Bharat and Henzinger 1998], browsing and information foraging[Botafogo and Shneiderman 1991; Carriere and Kazman 1997; Chakrabartiet al. 1999b; Pirolli et al. 1996; Pitkow and Pirolli 1997], community extraction[Kumar et al. 1999a], taxonomy construction [Kumar et al. 1999b, 2001]—haveall taken tremendous advantage of knowledge about the hyperlink structure ofthe Web. As just one example, let us mention the community extraction algo-rithm of [Kumar et al. 1999a]. In this algorithm, a characterization of degreesequences within Web-page neighborhoods allowed the development and anal-ysis of efficient pruning algorithms for a subgraph enumeration problem thatis in general intractable.Even more recently, new algorithms have been developed to benefit fromstructural information about the Web. Arasu et al. [2001] have shown howto take advantage of the macroscopic “bow-tie” structure of the Web [Broderet al. 2000] to design an efficient algorithmic partitioning method for cer-tain


View Full Document

UD ELEG 867 - Self-Similarity In the Web

Documents in this Course
Firewalls

Firewalls

53 pages

Load more
Download Self-Similarity In the Web
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Self-Similarity In the Web and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Self-Similarity In the Web 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?