Unformatted text preview:

Link Analysis RankingHow do search engines decide how to rank your query results?Naïve ranking of query resultsWhy Link Analysis?Link Analysis: IntuitionLink Analysis Ranking AlgorithmsAlgorithm inputQuery-dependent LARQuery-dependent inputQuery-dependent inputQuery dependent inputQuery dependent inputProperties of a good seed set SHow to construct a good seed set SLink FilteringHow do we rank the pages in seed set S?Hubs and Authorities [K98]HITS AlgorithmHITS and eigenvectorsSingular Value DecompositionSingular Value DecompositionHITS and the TKC effectHITS and the TKC effectHITS and the TKC effectHITS and the TKC effectHITS and the TKC effectHITS and the TKC effectHITS and the TKC effectHITS and the TKC effectQuery-independent LARInDegree algorithmPageRank algorithm [BP98]Markov chainsRandom walksAn exampleState probability vectorAn exampleStationary distributionComputing the stationary distributionThe PageRank random walkThe PageRank random walkThe PageRank random walkThe PageRank random walkEffects of random jumpA PageRank algorithmRandom walks on undirected graphsResearch on PageRankPrevious workSocial network analysisCounting paths – Katz 53BibliometricsLink Analysis RankingHow do search engines decide how to rank your query results?• Guess why Google ranks the query results the wa y it does• How would you do it?Naïve ranking of query results• Given query q• Rank the web pages p in the index based on sim(p,q)• Scenarios where this is not such a good idea?Why Link Analysis?• First genera tion search engines– view documents as flat text files– could not cope with size, spamming , user needs• Example: Honda website, keywords: automobile manufacturer• Second generation search engines– Ranking becomes critical– use of Web specific data: Link Analysis– shift from relevance to authoritativeness– a success story for the network analysisLink Analysis: Intuition• A link from page p to page q denotes endorsement– page p considers page q an authority on a subject– mine the web graph of recommendations– assign an authority value to every pageLink Analysis Ranking Algorithms• Start with a collection of web pages• Extract the underlying hyperlink graph• Run the LAR algorithm on the graph• Output: an authority weight for each nodewwwwwAlgorithm input• Query dependent: rank a small subset of pages related to a specific query– HITS (Kleinberg 98) was proposed as query dependent• Query independent: rank the whole Web– PageRank (Brin and Page 98) was proposed as query independentQuer y‐dependent LAR• Given a query q, find a subset of web pages Sthat are related to S• Rank the pages in S based on some ranking criterionQuer y‐dependent inputRoot SetQuer y‐dependent inputRoot SetINOUTQuer y dependent inputRoot SetINOUTQuer y dependent inputRoot SetINOUTBase SetProperties of a good seed set S• S is relatively small.• S is rich in relevant pages.• S contains most (or many) of the strongest authorities.How to construct a good seed set S• For query q first collect the t highest ‐ranked pages for q from a text‐based search engine to form set Γ• S = Γ• Add to S all the pages pointing to Γ• Add to S all the pages that pages from Γ point toLink Filtering• Navigational links: serve the purpose of moving within a site (or to related sites)• www.espn.com → www.espn.com/nba• www.yahoo.com → www.yahoo.it• www.espn.com → www.msn.com• Filter out navigational links– same domain name– same IP address How do we rank the pages in seed set S?• In degree?• Intuition• ProblemsHubs and Authorities [K98]• Authority is not necessarily transferred directly between authorities• Pages have double identity– hub identity– authority identity• Good hubs point to goodauthorities• Good authorities are pointed by good hubshubsauthoritiesHITS Algorithm• Initialize all weights to 1.• Repeat until convergence– O operation : hubs collect the weight of the authorities– I operation: authorities collect the weight of the hubs– Normalize weights under some norm∑→=jijjiah:∑→=ijjjiha:HITS and eigenvectors• The HITS algorithm is a power‐method eigenvector computation– in vector terms at= ATht‐1and ht= Aat‐1– so at= ATAat‐1and ht= AATht‐1– The authority weight vector a is the eigenvector of ATA and the hub weight vector h is the eigenvector of AAT– Why do we need normalization?• The vectors a and h are singular vectors of the matrix ASingular Value Decomposition• r : rank of matrix A• σ1≥σ2≥…≥σr: singular values (square roots of eig‐vals AAT, ATA)• : left singular vectors (eig‐vectors of AAT)• : right singular vectors (eig‐vectors of ATA)•[]⎥⎥⎥⎥⎦⎤⎢⎢⎢⎢⎣⎡⎥⎥⎥⎥⎦⎤⎢⎢⎢⎢⎣⎡==r21r21r21TvvvσσσuuuVΣUArMrrOrLrr[n×r] [r×r] [r×n]r21u,,u,urLrrr21v,,v,vrLrrTrrrT222T111vuσvuσvuσArrLrrrr+++=21Singular Value Decomposition• Linear trend v in matrix A:– the tendency of the row vectors of A to align with vector v– strength of the linear trend: Av• SVD discovers the linear trends in the data• ui, vi: the i‐th strongest linear trends• σi: the strength of the i‐th strongest linear trendσ1σ2v1v2 HITS discovers the strongest linear trend in the authority spaceHITS and the TKC effect• The HITS algorithm favors the most dense community of hubs and authorities– Tightly Knit Community (TKC) effectHITS and the TKC effect•


View Full Document

BU CS 565 - Link Analysis Ranking

Documents in this Course
Load more
Download Link Analysis Ranking
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Link Analysis Ranking and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Link Analysis Ranking 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?