Stanford CS 276B - Lecture 11 - Text Retrieval and Mining


CS276B Text Retrieval and Mining
Winter 2005, Lecture 11

This lecture
- Wrap up PageRank
- Anchor text
- HITS
- Behavior-based ranking

Pagerank: Issues and Variants
- How realistic is the random surfer model?
- What if we modeled the back button?
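Before getting into the variants, the basic model can be made concrete. The following is a minimal sketch of PageRank power iteration with a teleport step, in NumPy; the toy 4-page graph, the function name, and all numbers are illustrative assumptions, not from the lecture. The teleport distribution `v` is the hook that the personalization variants below adjust.

```python
import numpy as np

def pagerank(adj, v, alpha=0.9, iters=100):
    # adj[i][j] = 1 if page i links to page j; assumes every page
    # has at least one out-link (no dangling pages, for simplicity).
    A = np.asarray(adj, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
    r = np.full(len(A), 1.0 / len(A))          # start from the uniform distribution
    for _ in range(iters):
        # follow a link with probability alpha, teleport per v otherwise
        r = alpha * (r @ P) + (1 - alpha) * v
    return r

# Toy 4-page web (hypothetical): 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 0,2
adj = [[0, 1, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [1, 0, 1, 0]]
r = pagerank(adj, np.full(4, 0.25))            # uniform teleportation
print(r, r.sum())                              # a probability distribution over pages

# Linearity wrt the teleport vector (composite personalization):
v_sports = np.eye(4)[0]                        # pretend page 0 is a Sports page
v_health = np.eye(4)[2]                        # pretend page 2 is a Health page
mixed = pagerank(adj, 0.9 * v_sports + 0.1 * v_health)
combo = 0.9 * pagerank(adj, v_sports) + 0.1 * pagerank(adj, v_health)
print(np.allclose(mixed, combo))  # True: a weighted sum of rank vectors is a valid rank vector
```

The second half demonstrates the composite-score property used by topic-specific PageRank: because the update is affine in `v`, mixing teleport vectors and mixing the resulting rank vectors give the same answer, so per-category PageRanks can be combined at query time.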
[Fagi00]Surfer behavior sharply skewed towards short paths [Hube98]Search engines, bookmarks & directories make jumps non-random.Biased Surfer ModelsWeight edge traversal probabilities based on match with topic/query (non-uniform edge selection)Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)Topic Specific Pagerank [Have02]Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:Selects a category (say, one of the 16 top level ODP categories) based on a query & user -specific distribution over the categoriesTeleport to a page uniformly at random within the chosen categorySounds hard to implement: can’t compute PageRank at query time!Topic Specific Pagerank [Have02]Implementationoffline:Compute pagerank distributions wrt to individual categoriesQuery independent model as beforeEach page has multiple pagerank scores – one for each ODP category, with teleportation only to that categoryonline: Distribution of weights over categories computed by query context classificationGenerate a dynamic pagerank score for each page - weighted sum of category-specific pageranksInfluencing PageRank(“Personalization”)Input: Web graph Winfluence vector vv : (page  degree of influence)Output:Rank vector r: (page  page importance wrt v)r = PR(W , v)Non-uniform TeleportationTeleport with 10% probability to a Sports pageSportsInterpretation of Composite ScoreFor a set of personalization vectors {vj} j [wj · PR(W , vj)] = PR(W , j [wj · vj]) Weighted sum of rank vectors itself forms a valid rank vector, because PR() is linear wrt vjInterpretation10% Sports teleportationSportsInterpretationHealth10% Health teleportationInterpretationSportsHealthpr = (0.9 PRsports + 0.1 PRhealth) gives you:9% sports teleportation, 1% health teleportationThe Web as a Directed GraphAssumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)Assumption 2: The anchor of the 
hyperlink describes the target page (textual context)
  [Diagram: Page A --(hyperlink, anchor)--> Page B]

Assumptions Tested
- A link is an endorsement (quality signal)
  - Except when affiliated. Can we recognize affiliated links? [Davi00]
  - 1536 links manually labeled
  - 59 binary features (e.g., on-domain, meta-tag overlap, common outlinks)
  - C4.5 decision tree, 10-fold cross validation showed 98.7% accuracy
- Additional surrounding text has lower probability but can be useful

Assumptions Tested
- Anchors describe the target: Topical Locality [Davi00b]
  - ~200K pages (query results + their outlinks)
  - Computed "page to page" similarity (TF-IDF measure):
    Link-to-Same-Domain > Cocited > Link-to-Different-Domain
  - Computed "anchor to page" similarity:
    mean anchor length = 2.69; 0.6 mean probability of an anchor term occurring in the target page

Anchor Text: WWW Worm - McBryan [Mcbr94]
- For the query [ibm], how do we distinguish between:
  - IBM's home page (mostly graphical)
  - IBM's copyright page (high term frequency for 'ibm')
  - a rival's spam page (arbitrarily high term frequency)?
- www.ibm.com is the target of anchors such as "ibm", "ibm.com", "IBM home page"
- A million pieces of anchor text containing "ibm" send a strong signal

Indexing anchor text
- When indexing a document D, include anchor text from links pointing to D.
  [Example: anchors pointing to www.ibm.com - "IBM" from Joe's computer hardware links (alongside Compaq and HP), "Big Blue today announced record profits for the quarter", "Armonk, NY-based computer giant IBM announced today"]
- Can sometimes have unexpected side effects - e.g., the anchor "evil empire".
- Can index anchor text with less weight.

Anchor Text: Other applications
- Weighting/filtering links in the graph: HITS [Chak98], Hilltop [Bhar01]
- Generating page descriptions from anchor text [Amit98, Amit00]

Hyperlink-Induced Topic Search (HITS) - [Klei98]
- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
  - Hub pages are good lists of links on a subject, e.g., "Bob's list of cancer-related links."
  - Authority pages occur recurrently on good hubs for the subject.
- Best suited for "broad topic" queries
rather than for page-finding queries.
- Gets at a broader slice of common opinion.

Hubs and Authorities
- Thus, a good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition - we will turn this into an iterative computation.

The hope
  [Diagram: hubs (Alice, Bob) pointing to authorities (AT&T, Sprint, MCI) - long-distance telephone companies]

High-level scheme
- Extract from the web a base set of pages that could be good hubs or authorities.
- From these, identify a small set of top hub and authority pages via an iterative algorithm.

Base set
- Given a text query (say, browser), use a text index to get all pages containing browser. Call this the root set of pages.
- Add in any page that either points to a page in the root set, or is pointed to by a page in the root set. Call this the base set.

Visualization
  [Diagram: the root set contained within the larger base set]

Assembling the base set [Klei98]
- Root set typically 200-1000 nodes.
- Base set may have up to 5000 nodes.
- How do you find the base set
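The circular hub/authority definition above resolves into a simple alternating iteration on the base-set subgraph. Here is a minimal sketch in NumPy; the tiny five-page "base set", the function name, and the iteration count are illustrative assumptions, not from the lecture.

```python
import numpy as np

def hits(adj, iters=50):
    # adj[i][j] = 1 if page i links to page j (the base-set subgraph).
    A = np.asarray(adj, dtype=float)
    n = len(A)
    h = np.ones(n)                 # hub scores
    a = np.ones(n)                 # authority scores
    for _ in range(iters):
        a = A.T @ h                # authority score: sum of hub scores of pages linking in
        h = A @ a                  # hub score: sum of authority scores of pages linked to
        a /= np.linalg.norm(a)     # scale every iteration so values stay bounded;
        h /= np.linalg.norm(h)     # only the relative order of scores matters
    return h, a

# Toy base set: pages 0 and 1 are link lists pointing at candidate
# authorities 2, 3, 4; page 4 also links back to page 0.
adj = [[0, 0, 1, 1, 1],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]]
h, a = hits(adj)
```

In this toy graph, page 0 (which points at all three candidate authorities) comes out as the top hub, and pages 2 and 3 (each endorsed by both link lists) outrank page 4 as authorities. The scaling step is exactly the normalization the slides call for: without it the scores grow without bound, but the iteration still converges in direction, to the principal eigenvectors of A·Aᵀ (hubs) and Aᵀ·A (authorities).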

