Node RankingLidan Wang and Brad SkaggsCMSC 828G1TiTopicsWhat is PageRankWhat is PageRank PageRank algorithm [Brin and Page 2002] Topic-sensitive PageRank [Richardson et al. 2002, Haveliwala 2002] HITS [Kleinberg 2002] Other variationsApplicationsApplicationsIt d tiIntroductionWeb search results: quantity vs qualityWeb search results: quantity v.s. qualityAmount of web pages increases rapidlyHuman ability to inspect information is limitedypUsers only want to see a few very relevant pagesNeeded: a reliable measure for relevance and importanceSolution: analysis of link structurePRkititiPageRank intuitionHow can we tell a page’s importance?How can we tell a pages importance? Relies on the link structure of the web to indicate an page’s value. A hyperlink from page A to page B as a vote, for page B by page A.A page is “important” if:A page is “important” if: There are many pages pointing to it. These pages are considered valuable.ElExampleTaken from http://en.wikipedia.org/wiki/PageRankPageRank (Definition) [Brin and Page 2002]2002] Use the link structure of the web to calculate a qualityUse the link structure of the web to calculate a quality ranking (PageRank) for each web page. Assigns a numerical value to each element of a hyperlinked set of documents, such as WWW. “Measuring” an element’s relative importance within the setthe set The numerical weight assigned to an element E is called the PageRank of E.Li k t (1)Links as votes (1)A link from page B to page A denotes endorsement:A link from page B to page A denotes endorsement: Page B considers page A an authority on a subject An authority value or PageRank(PR) value is assigned to every page (document)every page (document)025PR(A) = 0.75AB0.250250.25DC0.25Li k t (2)Links as votes (2)PR conferred by an outgoing link is the document’sownPRPR conferred by an outgoing link is the document s own PR divided by the number of outgoing links:PR(A) = PR(B)/2 + PR(C) + PR(D)/3ABPR(B)/2PR(D)/3DCPR(C)PR(D)/3PR(A) = PR(t1)/O(t1)+ +PR(tn)/O(tn); t= page linking to ADCPR(A) = PR(t1)/O(t1)+…+PR(tn)/O(tn); ti= page linking to A, O(ti): number of links from tiSi lifi d P R k l ithSimplified PageRank algorithm0.2A BThe Web as a directed graph G= (V, E). Let the total number of pages be n. The PageRank score of the page i0.4 0.20.20.20.4PR(i) =PR(j)Oj(ji)E∑,Ojis the number of out-link of j(denoted by PR(i)) is defined by:PageRank computations require several iterations until convergenceOj(j,i)∈Eoutlink of jC0.4Probability distribution represents the likelihood that a person randomly clicking th li k ill i t ti lPRk=[0.4; 0.2; 0.4]PRk+1=[0.4; 0.2; 0.4]on the links will arrive at a particular pageStationary PageRank vectorMti tiMatrix notionHyperlink matrixH=[H]in⎧ABHyperlink matrix H=[Hij] in which the entry in the ithrow and jthcolumn is:Hij=1Oiif (i, j) ∈ E0 otherwise⎧ ⎨ ⎪ ⎩ ⎪ C⎩ H=0 12 12001⎡ ⎢ ⎢ ⎤ ⎥ ⎥ ABCABLet PR= (PR(1), …, PR(n))TPR=HTPRH 0 0 1 1 0 0⎣ ⎢ ⎢ ⎦ ⎥ ⎥ BCStochastic matrix: all entries >=0; We can write equations: PRHPReach row sums up to 1Sl f P R kSolve for PageRankABABCA well known mathematicalPRk+1HTPRkABH = 0 12 120 0 1 ⎡ ⎢ ⎢ ⎢⎤ ⎥ ⎥ ⎥ABCABA well known mathematical technique called power iteration can be used to find PR ie PRk+1=HTPRkC1 0 0⎣ ⎢ ⎦ ⎥ C PR0PR1PR2PR3PRki.e.The solution to PR is an eigenvector with thePR PRPRPRPR0.33 0.33 0.33 0.42 0.40 0.33 0.17 0.25 0.17 0.20 eigenvector with the corresponding eigenvalue 1.If some conditions are satisfied, 0.33 0.50 0.42 0.42 0.40 ,1 is the largest eigenvalue and the PageRank vector PR is the principal eigenvector. Problem: the Web graph does not meet the conditions (later).Rd fiRandom surfingMarkov chain.Markov chain. Each web page in the Web graph is a state. A link is a transition from one state to another.We can model web surfing as a stochastic process.We can model web surfing as a stochastic process. It models a web surfer randomly clicks links without clicking back buttonclicking back button. The fraction of time the surfer would spend at a page if the random process were iterated at infinitum is itsthe random process were iterated at infinitum is its PageRank.CConvergence Markov Chain Theorem: A finite Markov chain defined by the stochastic matrix H has a unique stationary probability distribution if H is irreducible and aperiodic. After a number of transitions PRkwill converge to a steady-state probability vectorlimk→∞PRk=PR When the steady-state is reached, then PRk= PRk+1= PRPRHTPRk→∞PR=HTPR PR is the principal eigenvector of HTwith eigenvalue 1.Wb hWeb graph Questions: Is His a stochastic matrix? IsHirreducible?Is Hirreducible? Is Hand aperiodic? If the above conditions are not satisfied, how to extend the simplified equation to the actual PageRank formula?IHthti?(1)IsHstochastic? (1)Is the sum of entries in each row equal to 1?Is the sum of entries in each row equal to 1? If web pages have no out-links, i.e. some rows in transition matrix Hhave all 0’s. They are called the dangling pages:00001⎛ ⎞ v1H =0000110 00001/2 0 01/20012012⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ v1v2 v5f 00 000⎝ ⎜ ⎜ ⎜ ⎠ ⎟ ⎟ ⎟ v4v3Entries of all 0’s!IHthti?(2)IsHstochastic? (2)Is the sum of entries in each row equal to 1?Is the sum of entries in each row equal to 1? If web pages have no out-links, i.e. some rows in transition matrix Hhave all 0’s. They are called the dangling pages:00001⎛ ⎞ v1H =000011000001/20 01/20012012⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ v1v2 v5 1/5 1/5 1/5 1/5 1/5⎝ ⎜ ⎜ ⎜ ⎠ ⎟ ⎟ ⎟ v4v3FixedIHi d bl d i di ?IsHirreducable and aperiodic?Irreducible:Aperiodic: States vi and vj can communicate in the directed graph, i.e. a path exists between them.Aperiodic:State vi is periodic.Aperiodic Markov chain: none Irreducible Markov chain: all states communicateAperiodic Markov chain: none of the states is periodicv2v1v2v1v0v3v4v0v3v4H t fi th t iti t iHow to fix the transition matrixW fi th bl b ddi li k f h tWe can fix these problems by adding a link from each page to every page and give each link a small transition probability controlled by a parameter d. Then, the
View Full Document