Wright CS 707 - The Use of the Linear Algebra by Web Search Engines

Unformatted text preview:

The Use of the Linear Algebra by Web Search EnginesAmy N. Langville and Carl D. MeyerDepartment of Mathematics, North Carolina State UniversityRaleigh, NC 27695-8205October 19, 20041 IntroductionNearly all the major Web search engines today use link analysis to improve their search results. That’sexciting news for linear algebraists because link analysis, the use of the Web’s hyperlink structure, isbuilt from fundamentals of matrix theory. Link analysis and its underlying linear algebra have helpedrevolutionize Web search, so much so that the pre-link analysis search (before 1998) pales in comparisonto today’s remarkably accurate search.HITS [13] and PageRank [2, 3] are two of the most popular link analysis algorithms. Both weredeveloped around 1998 and both have dramatically improved the search business. In order to appreciatethe impact of link analysis, recall for a minute the state of search prior to 1998. Because of the immensenumber of pages on the Web, a query to an engine often produced a very long list of relevant pages,sometimes thousands of pages long. A user had to sort carefully through the list to find the most relevantpages. The order of presentation of the pages was little help because spamming was so easy then. In orderto trick a search engine into producing rankings higher than normal, spammers used meta-tags liberally,claiming their page used popular search terms that never appeared in the page. Meta-tags became uselessfor search engines. Spammers also repeated popular search terms in invisible text (white text on a whitebackground) to fool engines.2 The HITS AlgorithmHITS [13], a link analysis algorithm developed by Jon Kleinberg from Cornell University during hispostdoctoral studies at IBM Almaden, aimed to focus this long, unruly query list. The HITS algorithmis based on a pattern Kleinberg noticed among Web pages. Some pages serve as hubs or portal pages,i.e., pages with many outlinks. Other pages are authorities on topics because they have many inlinks.Kleinberg noticed that good hubs seemed to point to good authorities and good authorities were pointedto by good hubs. So he decided to give each page i both an hub score hiand an authority score ai.Infact, for every page i he defined the hub score at iteration k, h(k)i, and the authority score, a(k)i,asa(k)i=j:eji∈Eh(k−1)jand h(k)i=j:eij∈Ea(k)jfor k =1, 2, 3,...,where eijrepresents a hyperlink from page i to page j and E is the set of hyperlinks. To compute thescores for a page, he started with uniform scores for all pages, i.e., h(1)i=1/n and a(1)i=1/n where n isthe number of pages in a so-called neighborhood set for the query list. The neighborhood set consists of1all pages in the query list plus all pages pointing to or from the query pages. Depending on the query, theneighborhood set could contain just a hundred pages or a hundred thousand pages. (The neighborhoodset allows latent semantic associations to be made.) The hub and authority scores are iteratively refineduntil convergence to stationary values.Using linear algebra we can replace the summation equations with matrix equations. Let h and a becolumn vectors holding the hub and authority scores. Let L be the adjacency matrix for the neighborhoodset. That is, Lij= 1 if page i links to page j, and 0, otherwise. These definitions show thata(k)= LTh(k−1)and h(k)= La(k).Using some algebra, we havea(k)= LTLa(k−1)h(k)= LLTh(k−1).These equations make it clear that Kleinberg’s algorithm is really the power method applied to thepositive semi-definite matrices LTL and LLT. LTL is called the hub matrix and LLTis the authoritymatrix. Thus, HITS amounts to solving the eigenvector problems LTLa = λ1a and LLTh = λ1h, whereλ1is the largest eigenvalue of LTL (and LLT), and a and h are corresponding eigenvectors.While this is the basic linear algebra required by the HITS method, there are many more issues tobe considered. For example, important issues include convergence, existence, uniqueness, and numericalcomputation of these scores [5, 7, 14]. Several modifications to HITS have been suggested, each bringingvarious advantages and disadvantages [4, 6, 8]. A variation of Kleinberg’s HITS concept is at the base ofthe search engine TEOMA (http://www.teoma.com), which is owned by Ask Jeeves, Inc.3 The PageRank AlgorithmPageRank, the second link analysis algorithm from 1998, is the heart of Google. Both PageRank andGoogle were conceived by Sergey Brin and Larry Page while they were computer science graduate studentsat Stanford University. Brin and Page use a recursive scheme similar to Kleinberg’s. Their original ideawas that a page is important if it is pointed to by other important pages. That is, they decided that theimportance of your page (its PageRank score) is determined by summing the PageRanks of all pages thatpoint to yours. In building a mathematical definition of PageRank, Brin and Page also reasoned that whenan important page points to several places, its weight (PageRank) should be distributed proportionately.In other words, if YAHOO! points to your Web page, that’s good, but you shouldn’t receive the fullweight of YAHOO! because they point to many other places. If YAHOO! points to 999 pages in additionto yours, then you should only get credit for 1/1000 of YAHOO!’s PageRank.This reasoning led Brin and Page to formulate a recursive definition PageRank. They definedr(k+1)i=j∈Iir(k)j|Oj|,where r(k)iis the PageRank of page i at iteration k, Iiis the set of pages pointing into page i and |Oj| isthe number of outlinks from page j. Like HITS, PageRank starts with a uniform rank for all pages, i.e.,r(0)i=1/n and successively refines these scores, where n is the total number of Web pages.Like HITS, we can write this process using matrix notation. Let the row vector π(k)Tbe the PageRankvector at the kthiteration. As a result, the summation equation for PageRank can be written compactlyasπ(k+1)T= π(k)TH,2where H is a row normalized hyperlink matrix, i.e., hij=1/|Oi|, if there is a link from page i to page j,and 0, otherwise. Unfortunately, this iterative procedure has convergence problems—it can cycle or thelimit may be dependent on the starting vector.To fix these problems, Brin and Page revised their basic PageRank concept. Still using the hyperlinkstructure of the Web, they build an irreducible aperiodic Markov chain characterized by a primitive (irre-ducible with only one eigenvalue on the spectral


View Full Document

Wright CS 707 - The Use of the Linear Algebra by Web Search Engines

Download The Use of the Linear Algebra by Web Search Engines
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view The Use of the Linear Algebra by Web Search Engines and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view The Use of the Linear Algebra by Web Search Engines 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?