The Web Changes Everything: Understanding the Dynamics of Web Content

Eytan Adar
University of Washington, Seattle, WA, USA
[email protected]

Jaime Teevan, Susan T. Dumais
Microsoft Research, Redmond, WA, USA
{teevan, sdumais}@microsoft.com

Jonathan L. Elsas
Carnegie Mellon University, Pittsburgh, PA, USA
[email protected]

ABSTRACT

The Web is a dynamic, ever changing collection of information. This paper explores changes in Web content by analyzing a crawl of 55,000 Web pages, selected to represent different user visitation patterns. Although change over long intervals has been explored on random (and potentially unvisited) samples of Web pages, little is known about the nature of finer grained changes to pages that are actively consumed by users, such as those in our sample. We describe algorithms, analyses, and models for characterizing changes in Web content, focusing on both time (by using hourly and sub-hourly crawls) and structure (by looking at page-, DOM-, and term-level changes). Change rates are higher in our behavior-based sample than found in previous work on randomly sampled pages, with a large portion of pages changing more than hourly. Detailed content and structure analyses identify stable and dynamic content within each page. The understanding of Web change we develop in this paper has implications for tools designed to help people interact with dynamic Web content, such as search engines, advertising, and Web browsers.

Categories and Subject Descriptors
H.5.4 [Information Systems]: Information Interfaces and Presentation (e.g., HCI) – Hypertext/Hypermedia: User issues.

General Terms: Human Factors, Measurement.

Keywords: Web page dynamics, change, re-finding.

1. INTRODUCTION

The content on the Web is different from most other types of content we normally interact with because it changes regularly.
For example, while most documents on a person’s desk remain constant, with perhaps only an annotation or two added over time, even Web pages we think of as relatively static, like the WSDM conference home page, change in subtle ways (see Figure 1).

In this paper, we characterize Web change by analyzing a multi-week Web crawl of 55,000 pages. The pages in the crawl were selected to represent a range of visitation patterns, and our dataset is unique in that it reflects how the observed Web is changing. Understanding this segment of the Web is important for crawlers and search engines that must keep pace with changing content, as well as applications intended to help people interact with dynamic Web content.

Our research confirms a number of previously identified trends in Web evolution and highlights several important differences for pages that are visited by users compared with those selected at random. Through analysis of hourly, sub-hourly, and concurrent crawls we explore page change on a fine-grained time scale. We find that many pages change in a way that can be represented using a two-segment, piece-wise linear model that would be unobservable in a coarser crawl. We further extend previous work by characterizing the nature of the changes to Web page content and structure.

The analysis of content change presented in this paper focuses on understanding which terms within a page appear consistently over time, and which come and go. We introduce the notion of the staying power of a term within a document over time. It appears there is a bi-modal distribution of terms, with most terms either being very stable over time or changing very rapidly. Stable terms reflect the ongoing central topic of a page as well as common function words or navigational elements. Our analyses can inform algorithms for improved search engine ranking and contextualized advertising.

We also explore the dynamics of Web page structure through analysis of DOM-level changes.
In particular we concentrate on short-term survivability of various DOM elements, an important metric for Web clipping and template extraction applications which rely on the DOM structure to function [1][3][6]. The large amounts of data analyzed in our DOM analysis necessitated the creation of an efficient algorithm for tracking the motion of blocks of text, changes in the DOM structure, and identification of blocks of simultaneously changing content. The algorithm we develop can be used for both analysis and higher level tasks like predicting content flow in the page and block identification.

We begin this paper with a discussion of related work and a detailed description of our unique dataset. We then explore both content and structural change in greater detail. We conclude with a discussion of future work and potential applications.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSDM’09, February 9-12, 2009, Barcelona, Spain.
Copyright 2009 ACM 978-1-60558-390-7…$5.00.

Figure 1. Web content changes regularly. Here changes to the WSDM 2009 homepage are highlighted.

2. RELATED WORK

Characterizing the amount of change on the Web has been of considerable interest to researchers [4], [8], [11], [12], [13], [14], [16]. For example, Cho and Garcia-Molina [4] crawled 720,000 pages once a day for a period of four months and looked at how the pages changed. Ntoulas, Cho, and Olston [14] studied page change through weekly snapshots of 150 websites collected over a year. They found that most pages did not change according to a bag-of-words measure of similarity. Even for pages that did change, the changes were minor.
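
To make the bag-of-words comparison mentioned above concrete, the following is a minimal sketch of one way two versions of a page could be compared. It assumes cosine similarity over raw term counts and a simple alphanumeric tokenizer; both are illustrative choices, not necessarily the measure used in [14], and the function names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences, ignoring word order."""
    # Assumption: a token is any run of ASCII letters or digits.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two page versions."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    # Counter returns 0 for terms missing from b, so unshared terms add nothing.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: two empty pages count as identical.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)
```

Under such a measure, a page whose text is merely reordered scores the same as an unchanged page, while two versions with no terms in common score 0. This is why a bag-of-words comparison captures changes in vocabulary but not the structural or positional changes examined through DOM-level analysis.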
Frequency of change was not a great predictor of the degree of change, but the degree of change was a good predictor of the future degree of change.

Perhaps the largest scale study of Web page change was conducted by Fetterly et al. [8]. They crawled 150 million pages once a week for 11 weeks, and compared the change across pages. Like Ntoulas, Cho, and Olston [14], they found a relatively small amount of change, with 65% of all page pairs remaining exactly the same. The study additionally found that past change was a good predictor of future change, that page length was correlated with change, and that the top-level domain of a page was

