PowerPoint PresentationToday’s lecture – hypertext and linksLinks are everywhereExample 1: Good/Bad/UnknownSimple iterative logicSlide 6Slide 7Example 2: In-links to pages – unusual patterns Many other examples of link analysisOur primary interest in this courseThe Web as a Directed GraphAssumption 1: reputed sitesAssumption 2: annotation of targetAnchor Text WWW Worm - McBryan [Mcbr94]Indexing anchor textSlide 16Getting at all that link information InexpensivelyConnectivity ServerBoldi and Vigna 2004Adjacency listsAdjaceny list compressionMain ideas of Boldi/VignaBoldi/VignaGap encodingsMain advantages of BVSlide 26Citation AnalysisThe web isn’t scholarly citationPagerank scoringNot quite enoughTeleportingResult of teleportingMarkov chainsSlide 34Ergodic Markov chainsProbability vectorsChange in probability vectorHow do we compute this vector?Slide 39Hyperlink-Induced Topic Search (HITS)Hubs and AuthoritiesThe hopeHigh-level schemeBase setVisualizationDistilling hubs and authoritiesIterative updateScalingHow many iterations?Proof of convergenceHub/authority vectorsRewrite in matrix formIssuesResourcesIntroduction to Information RetrievalIntroduction to Information Retrieval Introduction toInformation RetrievalCS276Information Retrieval and Web SearchChris Manning and Pandu NayakLink analysisIntroduction to Information RetrievalIntroduction to Information Retrieval Today’s lecture – hypertext and linksWe look beyond the content of documentsWe begin to look at the hyperlinks between themAddress questions likeDo the links represent a conferral of authority to some pages? Is this useful for ranking?How likely is it that a page pointed to by the CERN home page is about high energy physicsBig application areasThe WebEmailSocial networksIntroduction to Information RetrievalIntroduction to Information Retrieval Links are everywherePowerful sources of authenticity and authorityMail spam – which email accounts are spammers?Host quality – which hosts are “bad”?Phone call logsThe Good, The Bad and The Unknown3????GoodBadIntroduction to Information RetrievalIntroduction to Information Retrieval Example 1: Good/Bad/UnknownThe Good, The Bad and The UnknownGood nodes won’t point to Bad nodesAll other combinations plausible4????GoodBadIntroduction to Information RetrievalIntroduction to Information Retrieval Simple iterative logicGood nodes won’t point to Bad nodesIf you point to a Bad node, you’re BadIf a Good node points to you, you’re Good5????GoodBadIntroduction to Information RetrievalIntroduction to Information Retrieval Simple iterative logicGood nodes won’t point to Bad nodesIf you point to a Bad node, you’re BadIf a Good node points to you, you’re Good6??GoodBadIntroduction to Information RetrievalIntroduction to Information Retrieval Simple iterative logicGood nodes won’t point to Bad nodesIf you point to a Bad node, you’re BadIf a Good node points to you, you’re Good7GoodBadSometimes need probabilistic analogs – e.g., mail spamIntroduction to Information RetrievalIntroduction to Information Retrieval Example 2:In-links to pages – unusual patterns 8Spammers violating power laws!Introduction to Information RetrievalIntroduction to Information Retrieval Many other examples of link analysisSocial networks are a rich source of grouping behaviorE.g., Shoppers’ affinity – Goel+Goldstein 2010Consumers whose friends spend a lot, spend a lot themselveshttp://www.cs.cornell.edu/home/kleinber/networks-book/9Introduction to Information RetrievalIntroduction to Information Retrieval Our primary interest in this courseLink analysis for most IR functionality thus far based purely on textScoring and rankingLink-based clustering – topical structure from linksLinks as features in classification – documents that link to one another are likely to be on the same subjectCrawlingBased on the links seen, where do we crawl next?10Introduction to Information RetrievalIntroduction to Information Retrieval The Web as a Directed GraphHypothesis 1: A hyperlink between pages denotes a conferral of authority (quality signal)Hypothesis 2: The text in the anchor of the hyperlink on page A describes the target page BPage AhyperlinkPage BAnchorSec. 21.1Introduction to Information RetrievalIntroduction to Information Retrieval Assumption 1: reputed sites12Introduction to Information RetrievalIntroduction to Information Retrieval Assumption 2: annotation of target13Introduction to Information RetrievalIntroduction to Information Retrieval Anchor Text WWW Worm - McBryan [Mcbr94] For ibm how to distinguish between:IBM’s home page (mostly graphical)IBM’s copyright page (high term freq. for ‘ibm’)Rival’s spam page (arbitrarily high term freq.)www.ibm.com“ibm” “ibm.com”“IBM home page”A million pieces of anchor text with “ibm” send a strong signalSec. 21.1.1Introduction to Information RetrievalIntroduction to Information Retrieval Indexing anchor textWhen indexing a document D, include (with some weight) anchor text from links pointing to D.www.ibm.comArmonk, NY-based computergiant IBM announced todayJoe’s computer hardware linksSunHPIBMBig Blue today announcedrecord profits for the quarterSec. 21.1.1Introduction to Information RetrievalIntroduction to Information Retrieval Indexing anchor textCan sometimes have unexpected effects, e.g., spam, miserable failureCan score anchor text with weight depending on the authority of the anchor page’s websiteE.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust (more) the anchor text from themIncrease the weight of off-site anchors (non-nepotistic scoring)Sec. 21.1.1Introduction to Information RetrievalIntroduction to Information Retrieval Getting at all that link informationInexpensivelyConnectivity servers17Introduction to Information RetrievalIntroduction to Information Retrieval Connectivity ServerSupport for fast queries on the web graphWhich URLs point to a given URL?Which URLs does a given URL point to?Stores mappings in memory fromURL to outlinks, URL to inlinksApplicationsLink analysisWeb graph analysisConnectivity, crawl optimizationCrawl controlSec. 20.4Introduction to Information RetrievalIntroduction to Information Retrieval Boldi
View Full Document