Stanford CS 276B - Lecture 11 - Text Retrieval and Mining


CS276B Text Retrieval and Mining
Winter 2005, Lecture 11

This lecture
- Wrap up PageRank
- Anchor text
- HITS
- Behavior-based ranking

Pagerank: Issues and Variants
- How realistic is the random surfer model?
- What if we modeled the back button?
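Before getting into the variants, the basic model can be made concrete. The following is a minimal sketch of PageRank power iteration with a teleport step, in NumPy; the toy 4-page graph, the function name, and all numbers are illustrative assumptions, not from the lecture. The teleport distribution `v` is the hook that the personalization variants below adjust.

```python
import numpy as np

def pagerank(adj, v, alpha=0.9, iters=100):
    # adj[i][j] = 1 if page i links to page j; assumes every page
    # has at least one out-link (no dangling pages, for simplicity).
    A = np.asarray(adj, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
    r = np.full(len(A), 1.0 / len(A))          # start from the uniform distribution
    for _ in range(iters):
        # follow a link with probability alpha, teleport per v otherwise
        r = alpha * (r @ P) + (1 - alpha) * v
    return r

# Toy 4-page web (hypothetical): 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 0,2
adj = [[0, 1, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [1, 0, 1, 0]]
r = pagerank(adj, np.full(4, 0.25))            # uniform teleportation
print(r, r.sum())                              # a probability distribution over pages

# Linearity wrt the teleport vector (composite personalization):
v_sports = np.eye(4)[0]                        # pretend page 0 is a Sports page
v_health = np.eye(4)[2]                        # pretend page 2 is a Health page
mixed = pagerank(adj, 0.9 * v_sports + 0.1 * v_health)
combo = 0.9 * pagerank(adj, v_sports) + 0.1 * pagerank(adj, v_health)
print(np.allclose(mixed, combo))  # True: a weighted sum of rank vectors is a valid rank vector
```

The second half demonstrates the composite-score property used by topic-specific PageRank: because the update is affine in `v`, mixing teleport vectors and mixing the resulting rank vectors give the same answer, so per-category PageRanks can be combined at query time.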
[Fagi00]Surfer behavior sharply skewed towards short paths [Hube98]Search engines, bookmarks & directories make jumps non-random.Biased Surfer ModelsWeight edge traversal probabilities based on match with topic/query (non-uniform edge selection)Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)Topic Specific Pagerank [Have02]Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:Selects a category (say, one of the 16 top level ODP categories) based on a query & user -specific distribution over the categoriesTeleport to a page uniformly at random within the chosen categorySounds hard to implement: can’t compute PageRank at query time!Topic Specific Pagerank [Have02]Implementationoffline:Compute pagerank distributions wrt to individual categoriesQuery independent model as beforeEach page has multiple pagerank scores – one for each ODP category, with teleportation only to that categoryonline: Distribution of weights over categories computed by query context classificationGenerate a dynamic pagerank score for each page - weighted sum of category-specific pageranksInfluencing PageRank(“Personalization”)Input: Web graph Winfluence vector vv : (page  degree of influence)Output:Rank vector r: (page  page importance wrt v)r = PR(W , v)Non-uniform TeleportationTeleport with 10% probability to a Sports pageSportsInterpretation of Composite ScoreFor a set of personalization vectors {vj} j [wj · PR(W , vj)] = PR(W , j [wj · vj]) Weighted sum of rank vectors itself forms a valid rank vector, because PR() is linear wrt vjInterpretation10% Sports teleportationSportsInterpretationHealth10% Health teleportationInterpretationSportsHealthpr = (0.9 PRsports + 0.1 PRhealth) gives you:9% sports teleportation, 1% health teleportationThe Web as a Directed GraphAssumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)Assumption 2: The anchor of the 
hyperlink describes the target page (textual context)
  [Diagram: Page A --(hyperlink, anchor)--> Page B]

Assumptions Tested
- A link is an endorsement (quality signal)
  - Except when affiliated. Can we recognize affiliated links? [Davi00]
  - 1536 links manually labeled
  - 59 binary features (e.g., on-domain, meta-tag overlap, common outlinks)
  - C4.5 decision tree, 10-fold cross validation showed 98.7% accuracy
- Additional surrounding text has lower probability but can be useful

Assumptions Tested
- Anchors describe the target: Topical Locality [Davi00b]
  - ~200K pages (query results + their outlinks)
  - Computed "page to page" similarity (TF-IDF measure):
    Link-to-Same-Domain > Cocited > Link-to-Different-Domain
  - Computed "anchor to page" similarity:
    mean anchor length = 2.69; 0.6 mean probability of an anchor term occurring in the target page

Anchor Text: WWW Worm - McBryan [Mcbr94]
- For the query [ibm], how do we distinguish between:
  - IBM's home page (mostly graphical)
  - IBM's copyright page (high term frequency for 'ibm')
  - a rival's spam page (arbitrarily high term frequency)?
- www.ibm.com is the target of anchors such as "ibm", "ibm.com", "IBM home page"
- A million pieces of anchor text containing "ibm" send a strong signal

Indexing anchor text
- When indexing a document D, include anchor text from links pointing to D.
  [Example: anchors pointing to www.ibm.com - "IBM" from Joe's computer hardware links (alongside Compaq and HP), "Big Blue today announced record profits for the quarter", "Armonk, NY-based computer giant IBM announced today"]
- Can sometimes have unexpected side effects - e.g., the anchor "evil empire".
- Can index anchor text with less weight.

Anchor Text: Other applications
- Weighting/filtering links in the graph: HITS [Chak98], Hilltop [Bhar01]
- Generating page descriptions from anchor text [Amit98, Amit00]

Hyperlink-Induced Topic Search (HITS) - [Klei98]
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
  - Hub pages are good lists of links on a subject, e.g., "Bob's list of cancer-related links."
  - Authority pages occur recurrently on good hubs for the subject.
- Best suited for "broad topic" queries
rather than for page-finding queries.
- Gets at a broader slice of common opinion.

Hubs and Authorities
- Thus, a good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition - we will turn this into an iterative computation.

The hope
  [Diagram: hubs (Alice, Bob) pointing to authorities (AT&T, Sprint, MCI) - long-distance telephone companies]

High-level scheme
- Extract from the web a base set of pages that could be good hubs or authorities.
- From these, identify a small set of top hub and authority pages via an iterative algorithm.

Base set
- Given a text query (say, browser), use a text index to get all pages containing browser. Call this the root set of pages.
- Add in any page that either points to a page in the root set, or is pointed to by a page in the root set. Call this the base set.

Visualization
  [Diagram: the root set contained within the larger base set]

Assembling the base set [Klei98]
- Root set typically 200-1000 nodes.
- Base set may have up to 5000 nodes.
- How do you find the base set
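The circular hub/authority definition above resolves into a simple alternating iteration on the base-set subgraph. Here is a minimal sketch in NumPy; the tiny five-page "base set", the function name, and the iteration count are illustrative assumptions, not from the lecture.

```python
import numpy as np

def hits(adj, iters=50):
    # adj[i][j] = 1 if page i links to page j (the base-set subgraph).
    A = np.asarray(adj, dtype=float)
    n = len(A)
    h = np.ones(n)                 # hub scores
    a = np.ones(n)                 # authority scores
    for _ in range(iters):
        a = A.T @ h                # authority score: sum of hub scores of pages linking in
        h = A @ a                  # hub score: sum of authority scores of pages linked to
        a /= np.linalg.norm(a)     # scale every iteration so values stay bounded;
        h /= np.linalg.norm(h)     # only the relative order of scores matters
    return h, a

# Toy base set: pages 0 and 1 are link lists pointing at candidate
# authorities 2, 3, 4; page 4 also links back to page 0.
adj = [[0, 0, 1, 1, 1],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]]
h, a = hits(adj)
```

In this toy graph, page 0 (which points at all three candidate authorities) comes out as the top hub, and pages 2 and 3 (each endorsed by both link lists) outrank page 4 as authorities. The scaling step is exactly the normalization the slides call for: without it the scores grow without bound, but the iteration still converges in direction, to the principal eigenvectors of A·Aᵀ (hubs) and Aᵀ·A (authorities).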

