More on Rankings

Query-independent LAR
• Have an a-priori ordering π of the web pages
• Q: the set of pages that contain the keywords of the query q
• Present the pages in Q ordered according to π
• What are the advantages of such an approach?

InDegree algorithm
• Rank pages according to their in-degree
  – w_i = |B(i)|, where B(i) is the set of pages that link to page i

[Example graph: pages ranked by in-degree – 1. Red Page (w=3), 2. Yellow Page (w=2), 3. Blue Page (w=2), 4. Purple Page (w=1), 5. Green Page (w=1)]

PageRank algorithm [BP98]
• Good authorities should be pointed to by good authorities
• Random walk on the web graph
  – pick a page at random
  – with probability 1−α jump to a random page
  – with probability α follow a random outgoing link
• Rank according to the stationary distribution

  PR(p) = α · Σ_{q: q→p} PR(q)/|F(q)| + (1−α) · 1/n

  where F(q) is the set of out-links of page q and n is the total number of pages

[Example graph: pages ranked by PageRank – 1. Red Page, 2. Purple Page, 3. Yellow Page, 4. Blue Page, 5. Green Page]

Markov chains
• A Markov chain describes a discrete-time stochastic process over a set of states S = {s1, s2, …, sn} according to a transition probability matrix P = {P_ij}
  – P_ij = probability of moving to state j when at state i
• Σ_j P_ij = 1 (stochastic matrix)
• Memorylessness property: the next state of the chain depends only on the current state and not on the past of the process (first-order MC)
  – higher-order MCs are also possible

Random walks
• Random walks on graphs correspond to Markov chains
  – the set of states S is the set of nodes of the graph G
  – the transition probability matrix gives the probability of following an edge from one node to another

An example
(graph on nodes v1, …, v5; A is the adjacency matrix, P the transition matrix)

      | 0  1  1  0  0 |        |  0  1/2 1/2  0   0 |
      | 0  0  0  0  1 |        |  0   0   0   0   1 |
  A = | 0  1  0  0  0 |    P = |  0   1   0   0   0 |
      | 1  1  1  0  0 |        | 1/3 1/3 1/3  0   0 |
      | 1  0  0  1  0 |        | 1/2  0   0  1/2  0 |

State probability vector
• The vector q^t = (q^t_1, q^t_2, …, q^t_n) stores the probability of being at state i at time t
  – q^0_i = the probability of starting from state i
• q^t = q^{t−1} P

An example
  q^{t+1}_1 = 1/3 q^t_4 + 1/2 q^t_5
  q^{t+1}_2 = 1/2 q^t_1 + q^t_3 + 1/3 q^t_4
  q^{t+1}_3 = 1/2 q^t_1 + 1/3 q^t_4
  q^{t+1}_4 = 1/2 q^t_5
  q^{t+1}_5 = q^t_2

Stationary distribution
• A stationary distribution for a MC with transition matrix P is a probability distribution π such that π = πP
• A MC has a unique stationary distribution if
  – it is irreducible: the
underlying graph is strongly connected
  – it is aperiodic: for random walks, the underlying graph is not bipartite
• The probability π_i is the fraction of time that we visit state i as t → ∞
• The stationary distribution is an eigenvector of the matrix P
  – the principal left eigenvector of P; stochastic matrices have maximum eigenvalue 1

Computing the stationary distribution
• The Power Method
  – initialize to some distribution q^0
  – iteratively compute q^t = q^{t−1} P
  – after enough iterations, q^t ≈ π
  – called the power method because it computes q^t = q^0 P^t
• Rate of convergence
  – determined by the second eigenvalue λ2

The PageRank random walk
• Vanilla random walk
  – make the adjacency matrix stochastic and run a random walk

  P = |  0  1/2 1/2  0   0 |
      |  0   0   0   0   1 |
      |  0   1   0   0   0 |
      | 1/3 1/3 1/3  0   0 |
      | 1/2  0   0  1/2  0 |

The PageRank random walk
• What about sink nodes?
  – what happens when the random walk moves to a node without any outgoing links?
  – the row of a sink node in P is all zeros, so P is no longer stochastic

The PageRank random walk
• Replace the rows of sink nodes with a vector v
  – typically, the uniform vector
• P' = P + dv^T, where d_i = 1 if i is a sink and 0 otherwise
  – in the 5-node example, the sink's row becomes (1/5, 1/5, 1/5, 1/5, 1/5)

The PageRank random walk
• How do we guarantee irreducibility?
  – add a random jump to vector v with probability 1−α
    • typically, to a uniform vector
• P'' = αP' + (1−α) u v^T, where u is the vector of all 1s

Effects of random jump
• Guarantees irreducibility
• Motivated by the concept of the random surfer
• Offers additional flexibility
  – personalization
  – anti-spam
• Controls the rate of convergence
  – the second eigenvalue of matrix P'' is α

A PageRank algorithm
• Performing the vanilla power method is now too expensive: the matrix P'' is not sparse

  q^0 = v
  t = 0
  repeat
    t = t + 1
    q^t = (P'')^T q^{t−1}
    δ = ||q^t − q^{t−1}||
  until δ < ε

• Efficient computation of y = (P'')^T x:

  y = α P^T x
  β = ||x||_1 − ||y||_1
  y = y + β v

  (P here is the sparse matrix before the sink and jump adjustments; β reinjects the probability mass lost to sinks and to the random jump)

Random walks on undirected graphs
• In the stationary distribution of a random walk on an undirected graph, the probability of being at node i is proportional to the (weighted) degree of the vertex
• Random walks on undirected graphs are not
“interesting”

Research on PageRank
• Specialized PageRank
  – personalization [BP98]
    • instead of picking a node uniformly at random, favor specific nodes that are related to the user
  – topic-sensitive PageRank [H02]
    • compute many PageRank vectors, one for each topic
    • estimate the relevance of the query to each topic
    • produce the final PageRank as a weighted combination
• Updating PageRank [Chien et al. 2002]
• Fast computation of PageRank
  – numerical analysis tricks
  – node aggregation techniques
  – dealing with the “Web frontier”

Topic-sensitive PageRank
• HITS-based scores are very inefficient to compute
• PageRank scores are independent of the queries
• Can we bias PageRank rankings to take into account query keywords? Topic-sensitive PageRank

Topic-sensitive PageRank
• Conventional PageRank computation:

  r^(t+1)(v) = Σ_{u∈N(v)} r^(t)(u)/d(u)

  – N(v): the pages that link to v
  – d(u): the out-degree of u
• In matrix form: r = Mr, where M_vu = 1/d(u) if u links to v, and 0 otherwise
• With the random jump (note: in this section α denotes the jump probability):
  M' = (1−α)M + α[1/n]_{n×n}
• r = M'r = (1−α)Mr + α[1/n]_{n×n} r = (1−α)Mr + αp, where p = [1/n]_{n×1}
  (since the entries of r sum to 1)

Topic-sensitive PageRank
• r = (1−α)Mr + αp
• Conventional PageRank: p is the uniform vector with values 1/n
• Topic-sensitive PageRank uses a non-uniform personalization vector p
• This is not simply a post-processing step of the PageRank computation
• The personalization vector p introduces bias in all iterations of the iterative computation of the PageRank vector

Personalization vector
• In the random-walk model, the personalization vector represents the addition of a set of transition edges, where the probability of an artificial edge (u,v) is αp_v
• Given a graph, the result of the PageRank computation depends only on α and p: PR(α, p)

Topic-sensitive PageRank: Overall approach
• Preprocessing
  – fix a set of k topics
  – for each topic c_j, compute the PageRank scores of page u with respect to the j-th topic: r(u, j)
• Query-time processing
  – for query q, compute the total score of page u with respect to q as score(u, q) = Σ_{j=1…k} Pr(c_j | q) · r(u, j)

Topic-sensitive PageRank: Preprocessing
• Create k different biased PageRank vectors using some pre-defined set of k categories (c1, …, ck)
• T_j: the set of
URLs in the j-th category
• Use a non-uniform personalization vector p = w_j such that

  w_j(v) = 1/|T_j| if v ∈ T_j, and 0 otherwise

Topic-sensitive PageRank: Query-time processing
• D_j: class term vectors consisting of all the terms appearing in the k pre-selected categories
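The preprocessing step above can be sketched in code: run the power method with a personalization vector w_j that is uniform over the topic set T_j and zero elsewhere, as in r = (1−α)Mr + αp. This is a minimal illustration, not the paper's implementation; the 5-node graph (the deck's running example), the topic set {v4, v5}, and the value of α are assumptions chosen for the sketch.

```python
# Sketch of biased (topic-sensitive) PageRank via the power method.
# Here alpha is the random-jump probability, as in r = (1-alpha)*M*r + alpha*p.

def biased_pagerank(out_links, topic_set, alpha=0.15, eps=1e-10):
    nodes = sorted(out_links)
    n = len(nodes)
    # Personalization vector w_j: uniform over T_j, zero elsewhere.
    p = {v: (1.0 / len(topic_set) if v in topic_set else 0.0) for v in nodes}
    r = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    while True:
        r_new = {v: 0.0 for v in nodes}
        for u in nodes:
            if out_links[u]:
                # Distribute rank over out-links: (1-alpha) * r(u)/d(u).
                share = r[u] / len(out_links[u])
                for v in out_links[u]:
                    r_new[v] += (1 - alpha) * share
            else:
                # Sink node: treat its row as the jump vector p.
                for v in nodes:
                    r_new[v] += (1 - alpha) * r[u] * p[v]
        # Random jump, biased toward the topic set T_j.
        for v in nodes:
            r_new[v] += alpha * p[v]
        delta = sum(abs(r_new[v] - r[v]) for v in nodes)
        r = r_new
        if delta < eps:
            return r

# Hypothetical example: the slides' 5-node graph, biased toward {v4, v5}.
graph = {"v1": ["v2", "v3"], "v2": ["v5"], "v3": ["v2"],
         "v4": ["v1", "v2", "v3"], "v5": ["v1", "v4"]}
scores = biased_pagerank(graph, topic_set={"v4", "v5"})
```

Because the jump always lands inside T_j, pages in and near the topic set accumulate more rank than under the uniform jump vector; the scores still sum to 1.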
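The query-time step, score(u, q) = Σ_j Pr(c_j | q) · r(u, j), is just a weighted combination of the precomputed per-topic PageRank vectors. A minimal sketch follows; the topic names, per-topic scores, and topic probabilities are made-up values for illustration, not real computed data.

```python
# Sketch of topic-sensitive PageRank query-time scoring:
#   score(u, q) = sum_j Pr(c_j | q) * r(u, j)

def query_score(page, topic_probs, topic_pagerank):
    """Combine per-topic PageRank scores, weighted by Pr(c_j | q)."""
    return sum(prob * topic_pagerank[topic][page]
               for topic, prob in topic_probs.items())

# Precomputed (offline) PageRank vectors r(., j), one per category c_j.
# These numbers are hypothetical.
topic_pagerank = {
    "sports":  {"pageA": 0.30, "pageB": 0.05},
    "science": {"pageA": 0.02, "pageB": 0.25},
}
# Pr(c_j | q), estimated at query time from the query terms.
topic_probs = {"sports": 0.8, "science": 0.2}

score_a = query_score("pageA", topic_probs, topic_pagerank)
score_b = query_score("pageB", topic_probs, topic_pagerank)
# score_a = 0.8*0.30 + 0.2*0.02 = 0.244
# score_b = 0.8*0.05 + 0.2*0.25 = 0.09
```

Note that only the cheap weighted sum happens at query time; the k expensive PageRank computations were all done in the preprocessing step.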