CS276B
Text Retrieval and Mining
Winter 2005
Lecture 11

This lecture
- Wrap up pagerank
- Anchor text
- HITS
- Behavioral ranking

Pagerank: Issues and Variants
- How realistic is the random surfer model?
  - What if we modeled the back button? [Fagi00]
  - Surfer behavior sharply skewed towards short paths [Hube98]
  - Search engines, bookmarks & directories make jumps non-random.
- Biased Surfer Models
  - Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection)
  - Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)

Topic Specific Pagerank [Have02]
- Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
  - Selects a category (say, one of the 16 top level ODP categories) based on a query & user-specific distribution over the categories
  - Teleport to a page uniformly at random within the chosen category
- Sounds hard to implement: can’t compute PageRank at query time!

Topic Specific Pagerank [Have02]
- Implementation
- offline: Compute pagerank distributions wrt individual categories
  - Query independent model as before
  - Each page has multiple pagerank scores – one for each ODP category, with teleportation only to that category
- online: Distribution of weights over categories computed by query context classification
  - Generate a dynamic pagerank score for each page - weighted sum of category-specific pageranks

Influencing PageRank (“Personalization”)
- Input:
  - Web graph W
  - influence vector v
    v : (page → degree of influence)
- Output:
  - Rank vector r: (page → page importance wrt v)
- r = PR(W, v)

Non-uniform Teleportation
- Teleport with 10% probability to a Sports page
[Figure; label: Sports]

Interpretation of Composite Score
- For a set of personalization vectors {vj}
    ∑j [wj · PR(W, vj)] = PR(W, ∑j [wj · vj])
- Weighted sum of rank vectors itself forms a valid rank vector, because PR() is linear wrt vj

Interpretation
- 10% Sports teleportation
[Figure; label: Sports]

Interpretation
- 10% Health teleportation
[Figure; label: Health]

Interpretation
- pr = (0.9 PRsports + 0.1 PRhealth) gives you: 9% sports teleportation, 1% health teleportation
[Figure; labels: Sports, Health]

The Web as a Directed Graph
[Figure: Page A, hyperlink with Anchor, Page B]
- Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)
- Assumption 2: The anchor of the hyperlink describes the target page (textual context)

Assumptions Tested
- A link is an endorsement (quality signal)
  - Except when affiliated
  - Can we recognize affiliated links? [Davi00]
    - 1536 links manually labeled
    - 59 binary features (e.g., on-domain, meta tag overlap, common outlinks)
    - C4.5 decision tree, 10 fold cross validation showed 98.7% accuracy
  - Additional surrounding text has lower probability but can be useful

Assumptions tested
- Anchors describe the target
  - Topical Locality [Davi00b]
    - ~200K pages (query results + their outlinks)
    - Computed “page to page” similarity (TFIDF measure)
      - Link-to-Same-Domain > Cocited > Link-to-Different-Domain
    - Computed “anchor to page” similarity
      - Mean anchor len = 2.69
      - 0.6 mean probability of an anchor term in target page

Anchor Text
WWW Worm - McBryan [Mcbr94]
- For [ibm] how to distinguish between:
  - IBM’s home page (mostly graphical)
  - IBM’s copyright page (high term freq. for ‘ibm’)
  - Rival’s spam page (arbitrarily high term freq.)
[Figure: anchors “ibm”, “ibm.com”, “IBM home page” pointing to www.ibm.com]
- A million pieces of anchor text with “ibm” send a strong signal

Indexing anchor text
- When indexing a document D, include anchor text from links pointing to D.
[Figure: three passages linking to www.ibm.com: “Armonk, NY-based computer giant IBM announced today”; “Joe’s computer hardware links: Compaq, HP, IBM”; “Big Blue today announced record profits for the quarter”]

Indexing anchor text
- Can sometimes have unexpected side effects - e.g., evil empire.
- Can index anchor text with less weight.

Anchor Text
- Other applications
  - Weighting/filtering links in the graph
    - HITS [Chak98], Hilltop [Bhar01]
  - Generating page descriptions from anchor text [Amit98, Amit00]

Hyperlink-Induced Topic Search (HITS) - Klei98
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
  - Hub pages are good lists of links on a subject.
    - e.g., “Bob’s list of cancer-related links.”
  - Authority pages occur recurrently on good hubs for the subject.
- Best suited for “broad topic” queries rather than for page-finding queries.
- Gets at a broader slice of common opinion.

Hubs and Authorities
- Thus, a good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition - will turn this into an iterative computation.

The hope
[Figure: hubs (Alice, Bob) linking to authorities (AT&T, Sprint, MCI); caption: Long distance telephone companies]

High-level scheme
- Extract from the web a base set of pages that could be good hubs or authorities.
- From these, identify a small set of top hub and authority pages;
  → iterative algorithm.

Base set
- Given text query (say browser), use a text index to get all pages containing browser.
  - Call this the root set of pages.
- Add in any page that either
  - points to a page in the root set, or
  - is pointed to by a page in the root set.
- Call this the base set.

Visualization
[Figure: the root set contained within the larger base set]

Assembling the base set [Klei98]
- Root set typically 200-1000 nodes.
- Base set may have up to 5000 nodes.
- How do you find the base set nodes?
  - Follow out-links by parsing root set pages.
  - Get in-links (and out-links) from a connectivity server.
  - (Actually, suffices to text-index strings of the form href=“URL” to get in-links to URL.)

Distilling hubs and authorities
- Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
- Initialize: for all x, h(x) ← 1; a(x) ← 1;
- Iteratively update all h(x), a(x);  (Key)
- After iterations
  - output pages with highest h() scores as top hubs
  - highest a() scores as top authorities.

Iterative update
- Repeat the following updates, for all x:
    h(x) ← ∑ a(y), summed over all y such that x → y
    a(x) ← ∑ h(y), summed over all y such that y → x

Scaling
- To prevent the h() and a() values from getting too big, can scale down after each iteration.
- Scaling factor doesn’t really matter:
  - we only care about the relative values of the scores.

How many iterations?
- Claim: relative values of scores will converge after a few iterations:
  - in fact, suitably scaled, h() and a() scores settle into a steady state!
  - proof of this comes later.
- We only require the
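
The hub and authority computation described in the slides above (initialize all scores to 1, repeat the two updates, scale after each iteration) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name hits, the helper _normalize, the choice of Euclidean scaling, the default of 20 iterations, and the edges of the example graph are all assumptions made for the sketch. Only the page names come from the "The hope" slide.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit Euclidean length (one possible scaling choice)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return scores
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a small link graph.

    links maps each page to the set of pages it points to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Initialize: for all x, h(x) <- 1 and a(x) <- 1.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # h(x) <- sum of a(y) over all pages y that x points to.
        hub = {p: sum(auth[y] for y in links.get(p, ())) for p in pages}

        # a(x) <- sum of h(y) over all pages y that point to x.
        auth = {p: 0.0 for p in pages}
        for y, targets in links.items():
            for x in targets:
                auth[x] += hub[y]

        # Scale down so values stay bounded; only relative values matter.
        hub = _normalize(hub)
        auth = _normalize(auth)

    return hub, auth


if __name__ == "__main__":
    # Hypothetical edges; the slide's figure does not survive in this text.
    example = {
        "Alice": {"AT&T", "Sprint", "MCI"},
        "Bob": {"Sprint", "MCI"},
    }
    hub, auth = hits(example)
    print("top hubs:", sorted(hub, key=hub.get, reverse=True)[:2])
    print("top authorities:", sorted(auth, key=auth.get, reverse=True)[:3])
```

In a real setting the links dictionary would hold the base set assembled from the root set, and the number of iterations would be chosen by watching for the scores to stop changing.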


Stanford CS 276B - Lecture 11
