PageRank
CISC489/689-010, Lecture: Wednesday, April 29th
Ben Carterette
5/4/09

Web Search
• Problem:
  – Web search engines are easily hacked
  – I want to sell something; I'll just add a few popular keywords to my page over and over and over again
  – All the retrieval models we've discussed will score that page higher for those keywords
• Other problems:
  – No hackers, but top-ranked pages come from deep within a site, from pages that change often, or from pages about very obscure topics
  – Not really useful

Possible Solution
• Leverage link structure
• Maybe if many pages link to a page, that page is more "important"
• Idea:
  – Count the number of inlinks to the page
  – Assign it "importance" based on that number
• Any problem with this?

Hacking Link Counts
• I can just make a bunch of pages that link to my spam page
• The inlink count will be high even though my page is not important
• Better idea:
  – Recursively use the importance of the linking pages when calculating the importance of the page

PageRank
• Google's PageRank is probably the best-known algorithm
• Intuitive idea: the "random surfer" model
  – If you start on a random page on the internet and just start clicking links randomly,
  – what is the probability you will land on page u?
  – If one page has a higher landing probability, the pages it links to have higher landing probabilities as well
  – Higher probability = more authority = better PageRank

Illustration
[Figure omitted: random-surfer illustration]

PageRank Definition
    R(u) = c · Σ_{v ∈ B_u} R(v) / N_v
• R(u) is the PageRank of u
• B_u is the set of pages that link to u
• v is one of the pages in B_u
• R(v) is the PageRank of v
• N_v is the number of pages that v links to

Sinks
• Problem:
  – I have two pages that only link to each other, plus one page that links to one of them
  – When the "random surfer" gets to one of those pages, he will just keep alternating between them
  – Their PageRank will dominate everything else
• Solution: once in a while, the random surfer just starts over at a new page
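The sink behavior can be seen concretely with the slide's own example graph: two pages (B and C) that link only to each other, plus one page (A) that links into them. This is a minimal sketch of the plain recursion R(u) = Σ_{v ∈ B_u} R(v)/N_v, with the constant c dropped for simplicity; the page names and iteration count are illustrative choices, not from the slides.

```python
# Rank-sink demonstration: A -> B, and B <-> C link only to each other.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
pages = list(links)

# Initialize every PageRank to 1/N.
ranks = {p: 1.0 / len(pages) for p in pages}

# Plain recursion R(u) = sum over v in B_u of R(v)/N_v, no random restarts.
for _ in range(50):
    new = {p: 0.0 for p in pages}
    for v, outlinks in links.items():
        share = ranks[v] / len(outlinks)   # each outlink gets R(v)/N_v
        for u in outlinks:
            new[u] += share
    ranks = new

print(ranks)  # A's rank drains to 0; B and C trap all of the mass between them
```

After a few iterations A's PageRank is exactly 0 (nothing links to it), and all of the rank mass oscillates between B and C, which is exactly why the random-restart term is needed.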
  – OK, but how do I put that in PageRank?

Random Restarts
    R'(u) = c · Σ_{v ∈ B_u} R'(v) / N_v + c · E(u)
• E(u) is the probability that a random surfer jumps to u

Calculating PageRank
• A simple iterative algorithm:
  – First, assign a PageRank to every page
    • E.g. R_0(u) = 1/N
  – Initialize E(u) = α for all u (0.15/N in the original paper)
  – Over iterations i = 1, 2, ..., do:
    • Update each PageRank as: R_{i+1}(u) = Σ_{v ∈ B_u} R_i(v) / N_v
    • Calculate d as the sum of PageRanks from the previous iteration minus the sum of PageRanks from the current iteration
    • Update each PageRank as: R_{i+1}(u) = R_{i+1}(u) + d·α
    • Calculate δ as the sum over pages of |PageRank from the current iteration minus PageRank from the previous iteration|
    • If δ < ε, the PageRanks have converged

Scalability
• Calculating PageRank requires a vector with an entry for every URL
• Along with a list of the URLs that link to that URL, and a list of the URLs that page links to
• Space usage is O(N²) in the worst case
• Time complexity is also O(N²)
• Even for small collections (like Wikipedia), it is nearly impossible to keep all of this in memory

Calculating PageRank with MapReduce
• First: extract links
  – For every page, I need to know all the pages that link to it
  – MapReduce solution:
    • The map operator takes page u and outputs (v, u) for every URL v in page u
    • The reduce operator takes all tuples with key v and reduces them to the list of unique URLs B_v that link to v
  – (v, (u_1, u_2, u_3, ...))

Calculating PageRank with MapReduce
• Next: calculate PageRank iteratively
  – First, initialize all PageRanks to 1/N
  – Then iteratively MapReduce:
    • The map operator takes a page u_i and the URLs v_j (j = 1..n) that it links to, and outputs (v_j, R(u_i)/n)
    • The reduce operator takes all pairs (v_j, R(u_i)/n) with key v_j and outputs
        (v_j, (1 − d) + d · Σ_{i=1}^{m} R(u_i)/n)
    • Calculate δ and determine whether the PageRanks have converged
    • If not, MapReduce again

PageRank for IR
• PageRank is query-independent
  – The importance of a page is not related to any query
  – We cannot simply rank pages by PageRank
• PageRank can be used to re-rank results that have been retrieved for a query
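A single-machine sketch of the iterative PageRank computation with random restarts described earlier: it uses the common damped update R(u) = (1 − d)/N + d · Σ_{v ∈ B_u} R(v)/N_v, the probability-normalized form of the restart term c·E(u) with uniform E. The four-page graph, damping factor, and tolerance are illustrative assumptions, not from the lecture.

```python
# Power-iteration PageRank with uniform random restarts (a sketch).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
N = len(pages)
d = 0.85          # surfer follows a link with probability d, restarts otherwise
epsilon = 1e-10   # convergence tolerance on the total change delta

ranks = {p: 1.0 / N for p in pages}            # R_0(u) = 1/N
while True:
    new = {p: (1.0 - d) / N for p in pages}    # restart mass (1-d)/N to every page
    for v, outlinks in links.items():
        for u in outlinks:
            new[u] += d * ranks[v] / len(outlinks)
    # delta = sum of |current PageRank - previous PageRank| over all pages
    delta = sum(abs(new[p] - ranks[p]) for p in pages)
    ranks = new
    if delta < epsilon:                        # converged when delta < epsilon
        break

print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```

Since every page here has outlinks, the PageRanks stay a probability distribution (they sum to 1), and page C ends up ranked highest because three of the four pages link to it.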
• It can also be used as a feature in the ranking function
• Or as a weight on anchor-text features

Example Use of PageRank
From "The PageRank Citation Ranking: Bringing Order to the Web", Page et al.

Wikipedia PageRanks
    Page Title        PageRank
    United States     2.9 × 10⁻³
    France            1.3 × 10⁻³
    United Kingdom    1.2 × 10⁻³
    England           1.0 × 10⁻³
    Germany           1.0 × 10⁻³
    Canada            0.9 × 10⁻³
    2007              0.8 × 10⁻³
    World War II      0.8 × 10⁻³
    Australia         0.7 × 10⁻³
    2008              0.7 × 10⁻³
Based on links between Wikipedia pages

A Bit of Theory
• Markov chain:
  – N states with transition probability matrix P
  – At any time we are in exactly one state
  – P_ij indicates the probability of moving from state i to state j
  – For all i, Σ_{j=1}^{N} P_ij = 1

A Bit of Theory
• Ergodic Markov chains:
  – Ergodic means there is a path between any two states
  – No matter what state you start in, the probability of being in any other state after T steps is greater than zero (after a burn-in time T_0)
  – Over many steps T, each state has some "visit rate": starting from any state, we will visit each state according to its visit rate
  – This is the steady state for the chain

A Bit of Theory
• Let x_0 be a vector representing our current state: a 1 in position i, 0s everywhere else
      x_0 = [0 0 0 ··· 1 ··· 0 0]
• What is the probability of each possible state we can transition to from x_0?
      x_1 = x_0 P
• And the probability of each possible state from x_1 (two steps from x_0)?
      x_2 = x_1 P = (x_0 P) P = x_0 P²

A Bit of Theory
• After k steps, x_k = x_0 P^k
• As k goes to infinity, x_k converges to the steady state
• When x_k is the steady state, x_k P = x_k
• The steady state is an eigenvector of P
  – As it turns out, it is the principal eigenvector
  – Its eigenvalue = 1
• If P is a matrix of links between documents, the principal eigenvector holds the PageRanks

PageRank Modifications
• The E(u) quantities solve the sink problem, but can also be used to adjust PageRanks
• Usually E(u) is assigned uniformly
  – Equal probability to jump to any page
• Instead, bias E(u) toward certain pages
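The steady-state argument above (x_k = x_0 P^k converging to the principal eigenvector, with eigenvalue 1) can be checked numerically. The 3-state transition matrix below is an illustrative assumption, not from the lecture; any ergodic chain would do.

```python
import numpy as np

# An assumed 3-state ergodic transition matrix: P[i, j] is the
# probability of moving from state i to state j.
P = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.5, 0.3, 0.2],
])
assert np.allclose(P.sum(axis=1), 1.0)   # every row sums to 1

x = np.array([1.0, 0.0, 0.0])            # x_0: 1 in position 0, 0s elsewhere
for _ in range(200):                     # x_{k+1} = x_k P
    x = x @ P

print(x)                                 # the steady-state visit rates
print(np.allclose(x @ P, x))             # x is a fixed point: x P = x

# Cross-check: x matches the principal eigenvector of P^T (eigenvalue 1),
# normalized to sum to 1.
w, V = np.linalg.eig(P.T)
v = np.real(V[:, np.argmax(np.real(w))])
v = v / v.sum()
print(np.allclose(v, x))
```

Note that the starting state does not matter for an ergodic chain: any initial x_0 converges to the same steady-state vector, which is exactly why PageRank is well defined once the restart term makes the web graph's chain ergodic.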