Link AnalysisEvolution of Search Engines(cont’d)Meta-Search EnginesHTML Structure & Feature WeightingBibliometrics: Citation AnalysisBibliographic Coupling: Similarity MeasureCo-Citation : Similarity MeasureImpact Factor (of a journal)AuthoritiesHubsHubs and AuthoritiesToday’s lecture: Basics of 3rd Generation Search EnginesThe Web as a Directed GraphAnchor Text WWW Worm - McBryan [Mcbr94]Indexing anchor textMotivation and Introduction to PageRankInitial PageRank IdeaSample Stable FixpointPagerank scoring : More detailsNot quite enoughTeleportingResult of teleportingMarkov chainsSlide 29Ergodic Markov chainsSlide 31Probability vectorsChange in probability vectorSteady state exampleHow do we compute this vector?One way of computing aLink Structure of the WebA Simple Version of PageRankAn example of Simplified PageRank (Transposed version)Slide 40Slide 41A Problem with Simplified PageRankAn example of the ProblemSlide 44Slide 45Random Walks in GraphsModified Version of PageRankAn example of Modified PageRankPagerank summaryThe realityPagerank: Issues and VariantsNon-uniform TeleportationSlide 60Hyperlink-Induced Topic Search (HITS) - Klei98Slide 62The hopeHigh-level schemeBase setVisualizationAssembling the base set [Klei98]Distilling hubs and authoritiesIllustrated Update RulesIterative updateScalingOld ResultsSimilar Page Results from HubsJapan Elementary SchoolsThings to noteProof of convergenceHub/authority vectorsRewrite in matrix formIssuesPrasad L21LinkAnalysis 1Link AnalysisAdapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney (UT, Austin)Evolution of Search Engines1st Generation : Retrieved documents that matched keyword-based queries based on boolean model. 2nd Generation : Incorporated content-specific relevance ranking based on vector space model (TF-IDF), to deal with high recall. 3rd Generation: Incorporated content-independent source ranking, to overcome spamming, and to exploit “collective web wisdom”.Prasad L21LinkAnalysis 2(cont’d)3rd Generation: Tried to glean relative semantic emphasis of various words based on syntactic features such as fonts, span of query term hits, etc. to enhance the efficacy of VSM.Future search engines will incorporate context, profile, and past query history associated with a user, to personalize search, and apply additional reasoning and heuristics to improve satisfaction of information need.Prasad L21LinkAnalysis 34Meta-Search EnginesSearch engine that passes query to several other search engines and integrates results.Submit queries to host sites.Parse resulting HTML pages to extract search results.Integrate multiple rankings into a “consensus” ranking.Present integrated results to user.Examples: Metacrawler, SavvySearch ,Dogpile5HTML Structure & Feature WeightingWeight tokens under particular HTML tags more heavily:<TITLE> tokens (Google seems to like title matches)<H1>,<H2>… tokens<META> keyword tokensParse page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.6Bibliometrics: Citation AnalysisMany standard documents include bibliographies (or references), explicit citations to other previously published documents.Using citations as links, standard corpora can be viewed as a graph.The structure of this graph, independent of the content, can provide interesting information about the similarity of documents and the structure of information.7Bibliographic Coupling: Similarity MeasureMeasure of similarity of documents introduced by Kessler in 1963.The bibliographic coupling of two documents A and B is the number of documents cited by both A and B.Size of the intersection of their bibliographies.Maybe want to normalize by size of bibliographies?A B8Co-Citation : Similarity MeasureAn alternate citation-based measure of similarity introduced by Small in 1973.Number of documents that cite both A and B.Maybe want to normalize by total number of documents citing either A or B ?A B9Impact Factor (of a journal)Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.Measure of how often papers in the journal are cited by other scientists.Computed and published annually by the Institute for Scientific Information (ISI).The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y1 or Y2.Does not account for the quality of the citing article.10AuthoritiesAuthorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.In-degree (number of pointers to a page) is one simple measure of authority.However in-degree treats all links as equal.Should links from pages that are themselves authoritative count for more?11HubsHubs are index pages that provide lots of useful links to relevant content pages (topic authorities).Hub pages for IR are included in the course home page:http://www.cs.utexas.edu/users/mooney/ir-course12Hubs and AuthoritiesTogether they tend to form a bipartite graph:Hubs Authorities13Today’s lecture:Basics of 3rd Generation Search EnginesRole of anchor textLink analysis for rankingPagerank and variants (Page and Brin)BONUS: Illustrates how to solve an important problem by adapting topology/statistics of large dataset for mathematically sound analysis and a practical scalable algorithmHITS (Klienberg)Prasad L21LinkAnalysis14The Web as a Directed GraphAssumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)Assumption 2: The anchor text of the hyperlink describes the target page (textual context)Page AhyperlinkPage BAnchorPrasad L21LinkAnalysis15Anchor Text WWW Worm - McBryan [Mcbr94] For ibm how to distinguish between:IBM’s home page (mostly graphical)IBM’s copyright page (high term freq. for ‘ibm’)Rival’s spam page (arbitrarily high term freq.)www.ibm.com“ibm” “ibm.com”“IBM home page”A million pieces of anchor text with “ibm” send a strong signalPrasad L21LinkAnalysis16Indexing anchor textWhen indexing a document D, include anchor text from links pointing to D.www.ibm.comArmonk, NY-based computergiant IBM announced todayJoe’s computer hardware linksCompaqHPIBMBig Blue today
View Full Document