Artificial Intelligence 15-381
Web Spidering & HW1 Preparation

Jaime Carbonell
[email protected]
22 January 2002

Today's Agenda
- Finish A*, B*, Macrooperators
- Web Spidering as Search
  - How to acquire the web in a box
  - Graph theory essentials
  - Algorithms for web spidering
  - Some practical issues

Search Engines on the Web
Revising the Total IR Scheme
1. Acquire the collection, i.e. all the documents [off-line process]
2. Create an inverted index (IR lecture, later) [off-line process]
3. Match queries to documents (IR lecture) [on-line process, the actual retrieval]
4. Present the results to the user [on-line process: display, summarize, ...]

Acquiring a Document Collection
Document Collections and Sources
- Fixed, pre-existing document collection, e.g. the classical philosophy works
- Pre-existing collection with periodic updates, e.g. the MEDLINE biomedical collection
- Streaming data with temporal decay, e.g. the Wall Street financial news feed
- Distributed proprietary document collections
- Distributed, linked, publicly-accessible documents, e.g.
the Web.

Properties of Graphs I (1)
Definitions:
- Graph: a set of nodes n and a set of edges (binary links) v between the nodes.
- Directed graph: a graph where every edge has a pre-specified direction.

Properties of Graphs I (2)
- Connected graph: a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.
- The web graph: the directed graph where n = {all web pages} and v = {all HTML-defined links from one web page to another}.

Properties of Graphs I (3)
- Tree: a connected graph without any loops and with a unique path between any two nodes.
- Spanning tree of graph G: a tree constructed by including all n in G, and a subset of v such that G remains connected, but all loops are eliminated.

Properties of Graphs I (4)
- Forest: a set of trees (without inter-tree links).
- k-spanning forest: given a graph G with k connected subgraphs, the set of k trees each of which spans a different connected subgraph.

[Example slides: Graph G = <n, v>; Directed Graph Example; Tree]

Web Graph
[Figure: web pages are nodes; HTML references (<href …>) are links]

More Properties of Graphs
Theorem 1: For every connected graph G, there exists a spanning tree.
Proof: Depth-first search starting at any node in G builds the spanning tree.

Properties of Graphs
Theorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.
Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) defines a k-spanning forest.

Properties of Web Graphs
Additional Observations
- The web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).
- If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."

Graph-Search Algorithms I
PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file
    of URL-page pairs>
    While STACK is not empty,
        URLcurr := pop(STACK)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

What is wrong with the above algorithm?

Depth-first Search
[Figure: nodes numbered 1-7 in the order in which they are visited]

Graph-Search Algorithms II (1)
SPIDER1 is Incorrect
- What about loops in the web graph?
  => Algorithm will not halt.
- What about convergent DAG structures?
  => Pages will be replicated in the collection.
  => Inefficiently large index.
  => Duplicates to annoy the user.

Graph-Search Algorithms II (2)
SPIDER1 is Incomplete
- The web graph has k connected subgraphs.
- SPIDER1 only reaches pages in the connected web subgraph where the ROOT page lives.

A Correct Spidering Algorithm
PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
  |     Do URLcurr := pop(STACK)
  |     Until URLcurr is not in COLLECTION
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
(Lines marked "|" are the changes from the previous version.)

A More Efficient Correct Algorithm
PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
  | Initialize VISITED <big hash-table>
    While STACK is not empty,
  |     Do URLcurr := pop(STACK)
  |     Until URLcurr is not in VISITED
  |     insert-hash(URLcurr, VISITED)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

Graph-Search Algorithms V
A More Complete Correct Algorithm
PROCEDURE SPIDER4(G, {SEEDS})
  | Initialize COLLECTION <big file of URL-page pairs>
  | Initialize VISITED <big hash-table>
  | For every ROOT in SEEDS
  |     Initialize STACK <stack data structure>
  |     Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For
            every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION

Completeness Observations
- Completeness is not guaranteed:
  - In a k-connected web G, we do not know k.
  - It is impossible to guarantee that each connected subgraph is sampled.
  - Better: more seeds, and more diverse seeds.

Completeness Observations
Search Engine Practice
- Wish to maximize the subset of the web indexed.
- Maintain a (secret) set of diverse seeds (grow this set opportunistically, e.g. when X complains that his/her page is not indexed).
- Register new web sites on demand; new registrations are seed candidates.

To Spider or not to Spider? (1)
User Perceptions
- Most annoying: the engine finds nothing (too small an index, but not an issue since 1997 or so).
- Somewhat annoying: obsolete links
  => Refresh
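The seeded, loop-safe crawl described by the SPIDER3/SPIDER4 pseudocode above can be sketched in runnable form. The following is a minimal Python sketch, not the original code: the in-memory dict `WEB` stands in for real HTTP look-ups, and the names `WEB` and `spider` are illustrative assumptions. The toy web deliberately contains a loop (a -> b -> a) and two connected components, so a single seed cannot reach everything.

```python
# Toy "web": each URL maps to the list of URLs its page links to
# (hypothetical data standing in for HTML href extraction).
WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],   # links back to "a": a loop in the web graph
    "c": [],
    "x": ["y"],        # second connected component, unreachable from "a"
    "y": [],
}

def spider(web, seeds):
    """SPIDER4-style crawl: multiple seed URLs, a VISITED set that
    breaks loops and avoids duplicates, and a stack for DFS order."""
    visited = set()
    collection = {}                 # URL -> page contents (here: out-links)
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:      # pop until an unvisited URL (Do/Until)
                continue
            visited.add(url)
            page = web[url]         # stands in for look-up(URLcurr)
            collection[url] = page  # STORE(<URLcurr, PAGE>, COLLECTION)
            for link in page:       # push out-links for later visits
                stack.append(link)
    return collection
```

With a single seed, `spider(WEB, ["a"])` collects only {a, b, c} and terminates despite the loop; adding a seed from the other component, `spider(WEB, ["a", "x"])`, collects all five pages, illustrating why more (and more diverse) seeds improve completeness.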
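Theorem 1 earlier states that depth-first search from any node of a connected graph builds a spanning tree. A minimal Python sketch of that construction, using a hypothetical adjacency-list graph (the names `spanning_tree` and `G` are illustrative): each edge that first discovers a node becomes a tree edge, so every node is reached exactly once and no loop survives.

```python
def spanning_tree(graph, start):
    """DFS from `start`; each first-discovery edge joins the
    spanning tree (sketch of the Theorem 1 proof idea)."""
    visited = {start}
    tree = []                            # edges of the spanning tree
    def dfs(node):
        for nbr in graph[node]:
            if nbr not in visited:       # first time we see nbr:
                visited.add(nbr)
                tree.append((node, nbr)) # keep (node, nbr) as a tree edge
                dfs(nbr)
    dfs(start)
    return tree

# Connected undirected graph with a loop 1-2-3-1 (assumed example);
# each edge listed in both directions.
G = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
```

On this graph, `spanning_tree(G, 1)` returns 3 edges (n - 1 for n = 4 nodes) covering all four nodes; the loop edge that would close the cycle 1-2-3 is skipped because its endpoint is already visited.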