Unformatted text preview:

Chapter 6: Link AnalysisRoad mapIntroductionIntroduction (cont …)Slide 5Slide 6Slide 7Social network analysisSocial network and the WebCentralityDegree CentralityCloseness CentralityBetweenness CentralityBetweenness Centrality (cont …)Slide 15PrestigeDegree prestigeProximity prestigeProximity prestige (cont …)Rank prestigeRank prestige (cont …)Slide 22Co-citation and Bibliographic CouplingCo-citationSlide 25Bibliographic couplingBibliographic coupling (cont …)Slide 28PageRankPageRank: the intuitive ideaMore specificallyPageRank algorithmMatrix notationSolve the PageRank equationUsing Markov chainRandom surfingTransition probability matrixLet us startBack to the Markov chainState transitionStationary probability distributionPageRank againIs P =  justified?Back to the Web graphA is a not stochastic matrixAn example Web hyperlink graphFix the problem: two possible waysA is a not irreducibleA is a not aperiodicAn example: periodicDeal with irreducible and aperiodicImproved PageRankFollow our exampleThe final PageRank algorithmThe final PageRank (cont …)Compute PageRankAdvantages of PageRankSlide 58HITSAuthorities and HubsExamplesThe key idea of HITSThe HITS algorithm: Grab pagesThe link graph GThe HITS algorithmHITS in matrix formComputation of HITSThe algorithmRelationships with co-citation and bibliographic couplingStrengths and weaknesses of HITSSlide 71SummarySlide 73Chapter 6: Link AnalysisCS583, Bing Liu, UIC2Road mapIntroductionSocial network analysisCo-citation and bibliographic coupling PageRankHITSSummaryCS583, Bing Liu, UIC3IntroductionEarly search engines mainly compare content similarity of the query and the indexed pages. I.e., They use information retrieval methods, cosine, TF-IDF, ... From 1996, it became clear that content similarity alone was no longer sufficient. The number of pages grew rapidly in the mid-late 1990’s. Try “classification technique”, Google estimates: 10 million relevant pages. How to choose only 30-40 pages and rank them suitably to present to the user?Content similarity is easily spammed. A page owner can repeat some words and add many related words to boost the rankings of his pages and/or to make the pages relevant to a large number of queries.CS583, Bing Liu, UIC4Introduction (cont …)Starting around 1996, researchers began to work on the problem. They resort to hyperlinks. In Feb, 1997, Yanhong Li (Robin Li), Scotch Plains, NJ, filed a hyperlink based search patent. The method uses words in anchor text of hyperlinks.Web pages on the other hand are connected through hyperlinks, which carry important information. Some hyperlinks: organize information at the same site. Other hyperlinks: point to pages from other Web sites. Such out-going hyperlinks often indicate an implicit conveyance of authority to the pages being pointed to. Those pages that are pointed to by many other pages are likely to contain authoritative information.CS583, Bing Liu, UIC5Introduction (cont …)During 1997-1998, two most influential hyperlink based search algorithms PageRank and HITS were reported. Both algorithms are related to social networks. They exploit the hyperlinks of the Web to rank pages according to their levels of “prestige” or “authority”. HITS: Jon Kleinberg (Cornel University), at Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998PageRank: Sergey Brin and Larry Page, PhD students from Stanford University, at Seventh International World Wide Web Conference (WWW7) in April, 1998. PageRank powers the Google search engine.CS583, Bing Liu, UIC6Introduction (cont …)Apart from search ranking, hyperlinks are also useful for finding Web communities. A Web community is a cluster of densely linked pages representing a group of people with a special interest. Beyond explicit hyperlinks on the Web, links in other contexts are useful too, e.g., for discovering communities of named entities (e.g., people and organizations) in free text documents, and for analyzing social phenomena in emails..CS583, Bing Liu, UIC7Road mapIntroductionSocial network analysisCo-citation and bibliographic coupling PageRankHITSSummaryCS583, Bing Liu, UIC8Social network analysisSocial network is the study of social entities (people in an organization, called actors), and their interactions and relationships. The interactions and relationships can be represented with a network or graph, each vertex (or node) represents an actor and each link represents a relationship. From the network, we can study the properties of its structure, and the role, position and prestige of each social actor. We can also find various kinds of sub-graphs, e.g., communities formed by groups of actors.CS583, Bing Liu, UIC9Social network and the WebSocial network analysis is useful for the Web because the Web is essentially a virtual society, and thus a virtual social network, Each page: a social actor and each hyperlink: a relationship. Many results from social network can be adapted and extended for use in the Web context. We study two types of social network analysis, centrality and prestige, which are closely related to hyperlink analysis and search on the Web.CS583, Bing Liu, UIC10CentralityImportant or prominent actors are those that are linked or involved with other actors extensively. A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively fewer contacts. The links can also be called ties. A central actor is one involved in many ties.CS583, Bing Liu, UIC11Degree CentralityCS583, Bing Liu, UIC12Closeness CentralityCS583, Bing Liu, UIC13Betweenness Centrality If two non-adjacent actors j and k want to interact and actor i is on the path between j and k, then i may have some control over the interactions between j and k. Betweenness measures this control of i over other pairs of actors. Thus, if i is on the paths of many such interactions, then i is an important actor.CS583, Bing Liu, UIC14Betweenness Centrality (cont …)Undirected graph: Let pjk be the number of shortest paths between actor j and actor k. The betweenness of an actor i is defined as the number of shortest paths that pass i (pjk(i)) normalized by the total number of shortest paths. kjjkjkpip )((4)CS583, Bing Liu,


View Full Document

UIC CS 583 - Chapter 6: Link Analysis

Download Chapter 6: Link Analysis
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Chapter 6: Link Analysis and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Chapter 6: Link Analysis 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?