Unformatted text preview:

Link AnalysisEvolution of Search Engines(cont’d)Meta-Search EnginesHTML Structure & Feature WeightingBibliometrics: Citation AnalysisBibliographic Coupling: Similarity MeasureCo-Citation : Similarity MeasureImpact Factor (of a journal)AuthoritiesHubsHubs and AuthoritiesToday’s lecture: Basics of 3rd Generation Search EnginesThe Web as a Directed GraphAnchor Text WWW Worm - McBryan [Mcbr94]Indexing anchor textMotivation and Introduction to PageRankInitial PageRank IdeaSample Stable FixpointPagerank scoring : More detailsNot quite enoughTeleportingResult of teleportingMarkov chainsSlide 29Ergodic Markov chainsSlide 31Probability vectorsChange in probability vectorSteady state exampleHow do we compute this vector?One way of computing aLink Structure of the WebA Simple Version of PageRankAn example of Simplified PageRank (Transposed version)Slide 40Slide 41A Problem with Simplified PageRankAn example of the ProblemSlide 44Slide 45Random Walks in GraphsModified Version of PageRankAn example of Modified PageRankPagerank summaryThe realityPagerank: Issues and VariantsNon-uniform TeleportationSlide 60Hyperlink-Induced Topic Search (HITS) - Klei98Slide 62The hopeHigh-level schemeBase setVisualizationAssembling the base set [Klei98]Distilling hubs and authoritiesIllustrated Update RulesIterative updateScalingOld ResultsSimilar Page Results from HubsJapan Elementary SchoolsThings to noteProof of convergenceHub/authority vectorsRewrite in matrix formIssuesPrasad L21LinkAnalysis 1Link AnalysisAdapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney (UT, Austin)Evolution of Search Engines1st Generation : Retrieved documents that matched keyword-based queries based on boolean model. 2nd Generation : Incorporated content-specific relevance ranking based on vector space model (TF-IDF), to deal with high recall. 3rd Generation: Incorporated content-independent source ranking, to overcome spamming, and to exploit “collective web wisdom”.Prasad L21LinkAnalysis 2(cont’d)3rd Generation: Tried to glean relative semantic emphasis of various words based on syntactic features such as fonts, span of query term hits, etc. to enhance the efficacy of VSM.Future search engines will incorporate context, profile, and past query history associated with a user, to personalize search, and apply additional reasoning and heuristics to improve satisfaction of information need.Prasad L21LinkAnalysis 34Meta-Search EnginesSearch engine that passes query to several other search engines and integrates results.Submit queries to host sites.Parse resulting HTML pages to extract search results.Integrate multiple rankings into a “consensus” ranking.Present integrated results to user.Examples: Metacrawler, SavvySearch ,Dogpile5HTML Structure & Feature WeightingWeight tokens under particular HTML tags more heavily:<TITLE> tokens (Google seems to like title matches)<H1>,<H2>… tokens<META> keyword tokensParse page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.6Bibliometrics: Citation AnalysisMany standard documents include bibliographies (or references), explicit citations to other previously published documents.Using citations as links, standard corpora can be viewed as a graph.The structure of this graph, independent of the content, can provide interesting information about the similarity of documents and the structure of information.7Bibliographic Coupling: Similarity MeasureMeasure of similarity of documents introduced by Kessler in 1963.The bibliographic coupling of two documents A and B is the number of documents cited by both A and B.Size of the intersection of their bibliographies.Maybe want to normalize by size of bibliographies?A B8Co-Citation : Similarity MeasureAn alternate citation-based measure of similarity introduced by Small in 1973.Number of documents that cite both A and B.Maybe want to normalize by total number of documents citing either A or B ?A B9Impact Factor (of a journal)Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.Measure of how often papers in the journal are cited by other scientists.Computed and published annually by the Institute for Scientific Information (ISI).The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y1 or Y2.Does not account for the quality of the citing article.10AuthoritiesAuthorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.In-degree (number of pointers to a page) is one simple measure of authority.However in-degree treats all links as equal.Should links from pages that are themselves authoritative count for more?11HubsHubs are index pages that provide lots of useful links to relevant content pages (topic authorities).Hub pages for IR are included in the course home page:http://www.cs.utexas.edu/users/mooney/ir-course12Hubs and AuthoritiesTogether they tend to form a bipartite graph:Hubs Authorities13Today’s lecture:Basics of 3rd Generation Search EnginesRole of anchor textLink analysis for rankingPagerank and variants (Page and Brin)BONUS: Illustrates how to solve an important problem by adapting topology/statistics of large dataset for mathematically sound analysis and a practical scalable algorithmHITS (Klienberg)Prasad L21LinkAnalysis14The Web as a Directed GraphAssumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)Assumption 2: The anchor text of the hyperlink describes the target page (textual context)Page AhyperlinkPage BAnchorPrasad L21LinkAnalysis15Anchor Text WWW Worm - McBryan [Mcbr94] For ibm how to distinguish between:IBM’s home page (mostly graphical)IBM’s copyright page (high term freq. for ‘ibm’)Rival’s spam page (arbitrarily high term freq.)www.ibm.com“ibm” “ibm.com”“IBM home page”A million pieces of anchor text with “ibm” send a strong signalPrasad L21LinkAnalysis16Indexing anchor textWhen indexing a document D, include anchor text from links pointing to D.www.ibm.comArmonk, NY-based computergiant IBM announced todayJoe’s computer hardware linksCompaqHPIBMBig Blue today


View Full Document

Wright CS 707 - Link Analysis

Download Link Analysis
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Link Analysis and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Link Analysis 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?