CS276B: Text Retrieval and Mining
Winter 2005
Lecture 11

This lecture
- Wrap up PageRank
- Anchor text
- HITS
- Behavioral ranking

PageRank: Issues and Variants
- How realistic is the random surfer model?
  - What if we modeled the back button? [Fagi00]
  - Surfer behavior is sharply skewed towards short paths [Hube98]
  - Search engines, bookmarks, and directories make jumps non-random.
- Biased surfer models
  - Weight edge-traversal probabilities based on match with the topic/query (non-uniform edge selection)
  - Bias jumps to pages on topic (e.g., based on personal bookmarks and categories of interest)

Topic-Specific PageRank [Have02]
- Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
  - Select a category (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories
  - Teleport to a page uniformly at random within the chosen category
- Sounds hard to implement: we can't compute PageRank at query time!

Topic-Specific PageRank [Have02]: Implementation
- Offline: compute PageRank distributions with respect to individual categories
  - Query-independent model as before
  - Each page has multiple PageRank scores: one for each ODP category, with teleportation only to that category
- Online: the distribution of weights over categories is computed by query-context classification
  - Generate a dynamic PageRank score for each page: a weighted sum of the category-specific PageRanks

Influencing PageRank ("Personalization")
- Input:
  - Web graph W
  - Influence vector v: page → degree of influence
- Output:
  - Rank vector r: page → page importance with respect to v
  - r = PR(W, v)

Non-uniform Teleportation
- Example: teleport with 10% probability to a Sports page

Interpretation of Composite Score
- For a set of personalization vectors {v_j}:
    Σ_j [w_j · PR(W, v_j)] = PR(W, Σ_j [w_j · v_j])
- A weighted sum of rank vectors itself forms a valid rank vector, because PR() is linear with respect to v_j

Interpretation
- 10% Sports teleportation gives a sports-biased ranking; 10% Health teleportation gives a health-biased ranking.
- pr = (0.9 PR_sports + 0.1 PR_health) gives you: 9% sports teleportation, 1% health teleportation

The Web as a Directed Graph
- Assumption 1: a hyperlink between pages denotes author-perceived relevance (quality signal)
- Assumption 2: the anchor of the hyperlink describes the target page (textual context)
  (Figure: Page A links to Page B via a hyperlink carrying anchor text)

Assumptions tested: a link is an endorsement (quality signal)
- Except when affiliated
- Can we recognize affiliated links? [Davi00]
  - 1536 links manually labeled
  - 59 binary features (e.g., on-domain, meta-tag overlap, common outlinks)
  - A C4.5 decision tree with 10-fold cross-validation showed 98.7% accuracy
- Additional surrounding text has lower probability but can be useful

Assumptions tested: anchors describe the target
- Topical locality [Davi00b]
  - ~200K pages (query results + their outlinks)
  - Computed "page to page" similarity (TF-IDF measure):
    Link-to-Same-Domain > Co-cited > Link-to-Different-Domain
  - Computed "anchor to page" similarity:
    mean anchor length = 2.69 terms; 0.6 mean probability that an anchor term occurs in the target page

Anchor Text
- WWW Worm, McBryan [Mcbr94]
- For the query [ibm], how do we distinguish between:
  - IBM's home page (mostly graphical)
  - IBM's copyright page (high term frequency for 'ibm')
  - a rival's spam page (arbitrarily high term frequency)?
- A million pieces of anchor text containing "ibm" send a strong signal
  (Figure: many pages link to www.ibm.com with anchors such as "ibm", "ibm.com", "IBM home page")

Indexing anchor text
- When indexing a document D, include anchor text from links pointing to D.
  (Figure: pages with anchors "IBM", "Big Blue", and "computer giant IBM" all pointing to www.ibm.com)
- Can sometimes have unexpected side effects, e.g., "evil empire".
- Can index anchor text with less weight.

Anchor Text: other applications
- Weighting/filtering links in the graph: HITS [Chak98], Hilltop [Bhar01]
- Generating page descriptions from anchor text [Amit98, Amit00]

Hyperlink-Induced Topic Search (HITS) [Klei98]
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
  - Hub pages are good lists of links on a subject, e.g., "Bob's list of cancer-related links."
  - Authority pages occur recurrently on good hubs for the subject.
- Best suited for "broad topic" queries rather than for page-finding queries.
- Gets at a broader slice of common opinion.

Hubs and Authorities
- Thus, a good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition: we will turn this into an iterative computation.

The hope
  (Figure: hubs Alice and Bob point to authorities AT&T, Sprint, and MCI, the long-distance telephone companies)

High-level scheme
- Extract from the web a base set of pages that could be good hubs or authorities.
- From these, identify a small set of top hub and authority pages via an iterative algorithm.

Base set
- Given a text query (say, browser), use a text index to get all pages containing browser.
- Call this the root set of pages.
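Returning to the personalization material above: the claim that PR() is linear in the teleport vector v can be checked numerically with a short power-iteration sketch. This is a minimal illustration, not the paper's implementation; the four-page graph, the 0.9 damping value, and the function name are invented, and it assumes every page has at least one out-link.

```python
def personalized_pagerank(out_links, v, damping=0.9, iters=100):
    """Power iteration: with probability `damping` follow a random out-link
    of the current page; otherwise teleport to page i with probability v[i]
    (the personalization vector). Assumes every page has an out-link."""
    n = len(out_links)
    r = [1.0 / n] * n                          # start from the uniform distribution
    for _ in range(iters):
        nxt = [(1.0 - damping) * v[i] for i in range(n)]
        for page, links in enumerate(out_links):
            share = damping * r[page] / len(links)
            for dest in links:                 # spread rank over out-links
                nxt[dest] += share
        r = nxt
    return r

# Tiny invented 4-page web: pages 0-1 are "Sports", pages 2-3 are "Health".
graph = [[1, 2], [0], [3], [0, 2]]
sports_v = [0.5, 0.5, 0.0, 0.0]                # teleport only to Sports pages
health_v = [0.0, 0.0, 0.5, 0.5]                # teleport only to Health pages

pr_sports = personalized_pagerank(graph, sports_v)
pr_health = personalized_pagerank(graph, health_v)

# Linearity: 0.9*PR(W, v_sports) + 0.1*PR(W, v_health) equals the PageRank
# computed directly with the blended vector 0.9*v_sports + 0.1*v_health.
mix_v = [0.9 * s + 0.1 * h for s, h in zip(sports_v, health_v)]
pr_mix = personalized_pagerank(graph, mix_v)
combined = [0.9 * s + 0.1 * h for s, h in zip(pr_sports, pr_health)]
assert all(abs(a - b) < 1e-9 for a, b in zip(pr_mix, combined))
```

Because the iteration map is linear in v for a fixed graph, category-specific PageRanks computed offline can be blended at query time exactly as the weighted-sum slide describes.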
- Add in any page that either:
  - points to a page in the root set, or
  - is pointed to by a page in the root set.
- Call this the base set.

Visualization
  (Figure: the root set sits inside the larger base set)

Assembling the base set [Klei98]
- The root set typically has 200-1000 nodes; the base set may have up to 5000 nodes.
- How do you find the base set nodes?
  - Follow out-links by parsing the root set pages.
  - Get in-links (and out-links) from a connectivity server.
  - (Actually, it suffices to text-index strings of the form href="URL" to get the in-links to URL.)

Distilling hubs and authorities
- Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
- Initialize: for all x, h(x) ← 1; a(x) ← 1
- Iteratively update all h(x), a(x)  (the key step)
- After the iterations, output the pages with the highest h() scores as top hubs and those with the highest a() scores as top authorities.

Iterative update
- Repeat the following updates, for all x:
    h(x) ← Σ_{x→y} a(y)
    a(x) ← Σ_{y→x} h(y)

Scaling
- To prevent the h() and a() values from getting too big, scale them down after each iteration.
- The scaling factor doesn't really matter: we only care about the relative values of the scores.

How many iterations?
- Claim: the relative values of the scores will converge after a few iterations; in fact, suitably scaled, the h() and a() scores settle into a steady state!
  - The proof of this comes later.
- We only require the relative order of the h() and a() scores, not their absolute values.
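The update-and-scale loop above can be sketched in a few lines of Python. This is an illustrative sketch: the five-page graph is invented, and L2-norm scaling is just one common choice (as noted, the scaling factor doesn't affect the ranking).

```python
import math

def hits(out_links, iters=50):
    """Run the HITS updates for `iters` rounds; return (hub, authority) lists."""
    n = len(out_links)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # h(x) <- sum of a(y) over all pages y that x links to
        h = [sum(a[y] for y in out_links[x]) for x in range(n)]
        # a(x) <- sum of h(y) over all pages y that link to x
        a = [0.0] * n
        for y, targets in enumerate(out_links):
            for x in targets:
                a[x] += h[y]
        # Scale down so the values don't blow up; the factor is irrelevant
        # because only the relative scores matter.
        for vec in (h, a):
            norm = math.sqrt(sum(s * s for s in vec)) or 1.0
            for i in range(n):
                vec[i] /= norm
    return h, a

# Invented base set: pages 0 and 1 act as hubs; pages 2-4 are candidate authorities.
graph = [[2, 3, 4], [2, 3], [], [], [2]]
hub, auth = hits(graph)
# Page 0 links to all three candidate authorities, so it ends up as the top hub;
# page 2, pointed to by the most hubs, ends up as the top authority.
```

Note that the two score vectors reinforce each other across rounds, which is exactly the circular hub/authority definition turned into an iteration.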