Penn CIS 112 - Network Science and the Web: A Case Study


Contents:
• Network Science and the Web: A Case Study
• The Web as Network
• Graph Structure in the Web [Broder et al. paper]
• Five Easy Pieces
• Size of the Five
• Diameter Measurements
• Degree Distributions
• Exploiting Web Structure: Google and PageRank
• The PageRank Algorithm
• The “Random Surfer” Model
• Looking Ahead: Left Side vs. Right Side

Network Science and the Web: A Case Study
Networked Life, CIS 112, Spring 2009
Prof. Michael Kearns

The Web as Network
• Consider the web as a network
  – vertices: individual (html) pages
  – edges: hyperlinks between pages
  – will view it as both a directed and an undirected graph
• What is the structure of this network?
  – connected components
  – degree distributions
  – etc.
• What does it say about the people building and using it?
  – page and link generation
  – visitation statistics
• What are the algorithmic consequences?
  – web search
  – community identification

Graph Structure in the Web [Broder et al. paper]
• Report on the results of two massive “web crawls”
• Executed by AltaVista in May and October 1999
• Details of the crawls:
  – automated script following hyperlinks (URLs) from pages found
  – large set of starting points collected over time
  – crawl implemented as breadth-first search
  – have to deal with web spam, infinite paths, timeouts, duplicates, etc.
• May ’99 crawl:
  – 200 million pages, 1.5 billion links
• Oct ’99 crawl:
  – 271 million pages, 2.1 billion links
• Unaudited, self-reported Sep ’03 stats:
  – 3 major search engines claim > 3 billion pages indexed

Five Easy Pieces
• The authors did two kinds of breadth-first search:
  – ignoring link direction → weak connectivity
  – only following forward links → strong connectivity
• They then identify five different regions of the web:
  – strongly connected component (SCC): can reach any page in the SCC from any other in directed fashion
  – component IN: can reach any page in the SCC in directed fashion, but not the reverse
  – component OUT: can be reached from any page in the SCC, but not the reverse
  – component TENDRILS: weakly connected to all of the above, but cannot reach the SCC or be reached from the SCC in directed fashion (e.g. pointed to by IN)
  – SCC + IN + OUT + TENDRILS form the weakly connected component (WCC)
  – everything else is called DISC (disconnected from the above)
  – [the slide shows a visualization of this structure; the figure is not included in this text preview]
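To make the decomposition above concrete, here is a minimal Python sketch (not from the Broder et al. paper; the toy graph, the seed-based simplification, and the function names five_pieces and bfs are invented for illustration). It runs the two kinds of breadth-first search named on the slide relative to the SCC containing a chosen seed page, whereas the paper works with the single giant SCC of the crawl graph.

    from collections import deque

    def bfs(start_nodes, adj):
        """Return the set of nodes reachable from start_nodes via the edges in adj."""
        seen = set(start_nodes)
        queue = deque(start_nodes)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    def five_pieces(edges, scc_seed):
        """Classify nodes relative to the SCC containing scc_seed (toy-sized graphs only)."""
        nodes = {u for edge in edges for u in edge}
        fwd, rev, und = {}, {}, {}
        for u, v in edges:
            fwd.setdefault(u, []).append(v)
            rev.setdefault(v, []).append(u)
            und.setdefault(u, []).append(v)
            und.setdefault(v, []).append(u)
        reach_fwd = bfs([scc_seed], fwd)   # pages reachable from the seed (forward links only)
        reach_rev = bfs([scc_seed], rev)   # pages that can reach the seed
        scc = reach_fwd & reach_rev        # strongly connected component around the seed
        out = reach_fwd - scc              # reachable from the SCC, but not the reverse
        in_ = reach_rev - scc              # can reach the SCC, but not the reverse
        wcc = bfs([scc_seed], und)         # weakly connected component (ignore direction)
        tendrils = wcc - scc - in_ - out   # weakly attached, no directed path to or from the SCC
        disc = nodes - wcc                 # disconnected from all of the above
        return scc, in_, out, tendrils, disc

    # Toy web: a 3-page SCC, one IN page, one OUT page, a tendril hanging off IN,
    # and a disconnected pair.
    edges = [("a", "b"), ("b", "c"), ("c", "a"),   # SCC = {a, b, c}
             ("i", "a"),                           # IN = {i}
             ("c", "o"),                           # OUT = {o}
             ("i", "t"),                           # TENDRILS = {t} (pointed to by IN)
             ("x", "y")]                           # DISC = {x, y}
    print(five_pieces(edges, "a"))

On this toy graph the function prints the five sets named on the slide; at web scale the same idea needs the large-scale crawling machinery described above rather than an in-memory edge list.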
Size of the Five
• SCC: ~56M pages, ~28%
• IN: ~43M pages, ~21%
• OUT: ~43M pages, ~21%
• TENDRILS: ~44M pages, ~22%
• DISC: ~17M pages, ~8%
• WCC > 91% of the web: the giant component
• One interpretation of the pieces:
  – SCC: the heart of the web
  – IN: newer sites not yet discovered and linked to
  – OUT: “insular” pages like corporate web sites

Diameter Measurements
• Directed worst-case diameter of the SCC: at least 28
• Directed worst-case diameter of IN → SCC → OUT: at least 503
• Over 75% of the time, there is no directed path between a random start and finish page in the WCC
  – when there is a directed path, the average length is 16
• Average undirected distance in the WCC is 7
• Moral:
  – the web is a “small world” when we ignore direction
  – otherwise the picture is more complex

Degree Distributions
• They are, of course, heavy-tailed
• Power law distribution of component sizes
  – consistent with the Erdos-Renyi model
• Undirected connectivity of the web is not reliant on “connectors”
  – what happens as we remove high-degree vertices?

Exploiting Web Structure: Google and PageRank

The PageRank Algorithm
• Let’s define a measure of page importance we will call the rank
• Notation: for any page p, let
  – N(p) be the number of forward links (pages p points to)
  – R(p) be the (to-be-defined) rank of p
• Idea: important pages distribute importance over their forward links
• So we might try defining
  – R(p) := sum of R(q)/N(q) over all pages q → p
  – can define an iterative algorithm for computing the R(p)
  – (if it converges, the solution has an eigenvector interpretation)
  – problem: cycles accumulate rank but never distribute it
• The fix:
  – R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
  – E(p) is some external or exogenous measure of importance
  – some technical details omitted here (e.g. normalization)
• Let’s play with the PageRank calculator (a small code sketch of the iteration appears after the last slide of this preview)

The “Random Surfer” Model
• Let’s suppose that E(p) sums to 1 (normalized)
• Then the resulting PageRank solution R(p) will
  – also be normalized
  – and can be interpreted as a probability distribution
• R(p) is the stationary distribution of the following process (a simulation sketch also appears after the last slide):
  – starting from some random page, just keep following random links
  – if stuck in a loop, jump to a random page drawn according to E(p)
  – so the surfer periodically gets “bored” and jumps to a new page
  – E(p) can thus be personalized for each surfer
• PageRank is one (important) component of Google’s search technology
  – don’t forget the words!
  – information retrieval and statistical language processing

Looking Ahead: Left Side vs. Right Side
• So far we have been discussing the “left hand” search results on Google
  – a.k.a. “organic” search
• “Right hand” or “sponsored” search: paid advertisements in a formal market
  – we will spend a lecture on these markets later in the term
• Same two types of search/results on Yahoo!, MSN, …
• Common perception:
  – organic results are “objective”, based on content, importance, etc.
  – sponsored results are subjective advertisements
• But both sides are subject to “gaming” (strategic behavior)…
  – organic: invisible terms in the html, link farms and web spam, reverse engineering
  – sponsored: bidding behavior, “jamming”
  – optimization of each side has its own industry: SEO and SEM
• … and perhaps to outright fraud
  – organic: typo squatting
  – sponsored: click fraud
• More
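Referring back to “The PageRank Algorithm” slide above: the following is a minimal, self-contained sketch of the iterative computation. It is not Google’s implementation and not the course’s PageRank calculator; the toy link graph and the function name pagerank are invented for illustration, and the normalization folds the E(p) term in with the commonly used damping weight of 0.85, which is one standard way to handle the details the slide omits.

    def pagerank(links, e=None, damping=0.85, iterations=100):
        """Iteratively compute R(p) on a dict {page: [pages it links to]}.

        Uses the damped form R(p) = (1 - damping) * E(p)
        + damping * sum of R(q)/N(q) over pages q linking to p,
        one common normalization of the "+ E(p)" fix on the slide.
        """
        pages = set(links) | {v for outs in links.values() for v in outs}
        if e is None:                                # default: uniform exogenous importance
            e = {p: 1.0 / len(pages) for p in pages}
        rank = {p: 1.0 / len(pages) for p in pages}  # start from the uniform distribution
        for _ in range(iterations):
            new = {p: (1.0 - damping) * e[p] for p in pages}
            for q in pages:
                outs = links.get(q, [])
                if outs:                             # q spreads its rank over its N(q) forward links
                    share = damping * rank[q] / len(outs)
                    for p in outs:
                        new[p] += share
                else:                                # dangling page: spread its rank according to E
                    for p in pages:
                        new[p] += damping * rank[q] * e[p]
            rank = new
        return rank

    # Toy example: every other page links to "hub", which links back to two of them.
    # Without the E(p) term, the closed loop among hub, a, and b would soak up all the
    # rank and page c would end up with none.
    links = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
    for page, r in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {r:.3f}")

The ranks sum to 1 and most of the rank lands on the hub page, since every other page points to it; the exact values depend on the damping choice.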

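And referring back to “The Random Surfer Model” slide: here is a hedged simulation sketch of that process, with the surfer following random forward links and jumping to a page drawn from E(p) when bored or stuck. The function name random_surfer, the 15% boredom probability (chosen to mirror 1 minus the damping factor in the sketch above), and the step budget are all assumptions for illustration.

    import random
    from collections import Counter

    def random_surfer(links, e, steps=200_000, boredom=0.15, seed=0):
        """Estimate the stationary distribution of the random-surfer process by simulation."""
        rng = random.Random(seed)
        pages = list(e)
        weights = [e[p] for p in pages]
        page = rng.choices(pages, weights)[0]    # start at a page drawn according to E
        visits = Counter()
        for _ in range(steps):
            visits[page] += 1
            outs = links.get(page, [])
            if outs and rng.random() > boredom:  # usually: follow a random forward link
                page = rng.choice(outs)
            else:                                # bored, or no forward links: jump according to E(p)
                page = rng.choices(pages, weights)[0]
        return {p: visits[p] / steps for p in pages}

    # Reuse the toy link graph from the PageRank sketch; visit frequencies should roughly
    # match the ranks it computes.
    links = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
    e = {p: 0.25 for p in ["hub", "a", "b", "c"]}
    print(random_surfer(links, e))

Up to sampling noise, the visit frequencies should approach the values returned by pagerank() above, which is the stationary-distribution interpretation the slide describes.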
