DOC PREVIEW
CMU CS 15381 - Web Spidering

This preview shows page 1-2-3-26-27-28 out of 28 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Artificial Intelligence 15-381 Web Spidering & HW1 PreparationSearch Engines on the WebAcquiring a Document Collection: Properties of Graphs I (1)Properties of Graphs I (2)Properties of Graphs I (3)Properties of Graphs I (4)Graph G = <n, v>Directed Graph ExampleTreeWeb GraphMore Properties of GraphsProperties of GraphsProperties of Web GraphsGraph-Search Algorithms IDepth-first SearchGraph-Search Algorithms II (1)Graph-Search Algorithms II (2)A Correct Spidering AlgorithmA More Efficient Correct AlgorithmGraph-Search Algorithms V A More Complete Correct AlgorithmCompleteness ObservationsSlide 23To Spider or not to Spider? (1)To Spider or not to Spider? (2)Current Status of Web SpidersCurrent Status of Web Spiders )Current Status of Web SpidersArtificial Intelligence 15-381Web Spidering & HW1 PreparationJaime [email protected] January 2002Today's AgendaFinish A*, B*, MacrooperatorsWeb Spidering as SearchHow to acquire the web in a boxGraph theory essentialsAlgorithms for web spideringSome practical issuesSearch Engines on the WebRevising the Total IR Scheme1. Acquire the collection, i.e. all the documents[Off-line process]2. Create an inverted index (IR lecture, later)[Off-line process]3. Match queries to documents (IR lecture)[On-line process, the actual retrieval]4. Present the results to user[On-line process: display, summarize, ...]Acquiring a Document CollectionDocument Collections and SourcesFixed, pre-existing document collectione.g., the classical philosophy worksPre-existing collection with periodic updatese.g., the MEDLINE biomedical collectionStreaming data with temporal decaye.g., the Wall-Street financial news feedDistributed proprietary document collectionsDistributed, linked, publicly-accessible documentse.g. the Web:Properties of Graphs I (1)Definitions:Grapha set of nodes n and a set of edges (binary links) v between the nodes.Directed graph a graph where every edge has a pre-specified direction.Properties of Graphs I (2)Connected graph a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.The web graph the directed graph where n = {all web pages} and v = {all HTML-defined links from one web page to another}.Properties of Graphs I (3)Tree a connected graph without any loops and with a unique path between any two nodes Spanning tree of graph Ga tree constructed by including all n in G, and a subset of v such that G remains connected, but all loops are eliminated.Properties of Graphs I (4)Forest a set of trees (without inter-tree links)k-Spanning forest Given a graph G with k connected subgraphs, the set of k trees each of which spans a different connected subgraphs.Graph G = <n, v>Directed Graph ExampleTreeWeb Graph<href …><href …><href …><href …><href …><href …><href …>HTML references are linksWeb Pages are nodesMore Properties of GraphsTheorem 1: For every connected graph G, there exists a spanning tree.Proof: Depth-first search starting at any node in G builds the spanning tree.Properties of GraphsTheorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) define a k-spanning forest.Properties of Web GraphsAdditional ObservationsThe web graph at any instant of time contains k-connected subgraphs (but we do not know the value of k, nor do we know a-priori the structure of the web-graph).If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."Graph-Search Algorithms IPROCEDURE SPIDER1(G)Let ROOT := any URL from GInitialize STACK <stack data structure>Let STACK := push(ROOT, STACK)Initialize COLLECTION <big file of URL-page pairs>While STACK is not empty,URLcurr := pop(STACK)PAGE := look-up(URLcurr)STORE(<URLcurr, PAGE>, COLLECTION)For every URLi in PAGE,push(URLi, STACK)Return COLLECTIONWhat is wrong with the above algorithm?Depth-first Search1234567numbers = order inwhich nodes arevisitedGraph-Search Algorithms II (1)SPIDER1 is IncorrectWhat about loops in the web graph?=> Algorithm will not haltWhat about convergent DAG structures?=> Pages will replicated in collection=> Inefficiently large index=> Duplicates to annoy userGraph-Search Algorithms II (2)SPIDER1 is IncompleteWeb graph has k-connected subgraphs.SPIDER1 only reaches pages in the the connected web subgraph where ROOT page lives.A Correct Spidering AlgorithmPROCEDURE SPIDER2(G)Let ROOT := any URL from GInitialize STACK <stack data structure>Let STACK := push(ROOT, STACK)Initialize COLLECTION <big file of URL-page pairs>While STACK is not empty,| Do URLcurr := pop(STACK)| Until URLcurr is not in COLLECTIONPAGE := look-up(URLcurr)STORE(<URLcurr, PAGE>, COLLECTION)For every URLi in PAGE,push(URLi, STACK)Return COLLECTIONA More Efficient Correct AlgorithmPROCEDURE SPIDER3(G)Let ROOT := any URL from GInitialize STACK <stack data structure>Let STACK := push(ROOT, STACK)Initialize COLLECTION <big file of URL-page pairs>| Initialize VISITED <big hash-table>While STACK is not empty,| Do URLcurr := pop(STACK)| Until URLcurr is not in VISITED| insert-hash(URLcurr, VISITED)PAGE := look-up(URLcurr)STORE(<URLcurr, PAGE>, COLLECTION)For every URLi in PAGE,push(URLi, STACK)Return COLLECTIONGraph-Search Algorithms VA More Complete Correct AlgorithmPROCEDURE SPIDER4(G, {SEEDS})| Initialize COLLECTION <big file of URL-page pairs>| Initialize VISITED <big hash-table>| For every ROOT in SEEDS| Initialize STACK <stack data structure>| Let STACK := push(ROOT, STACK)While STACK is not empty,Do URLcurr := pop(STACK)Until URLcurr is not in VISITEDinsert-hash(URLcurr, VISITED)PAGE := look-up(URLcurr)STORE(<URLcurr, PAGE>, COLLECTION)For every URLi in PAGE,push(URLi, STACK)Return COLLECTIONCompleteness ObservationsCompleteness is not guaranteedIn k-connected web G, we do not know kImpossible to guarantee each connected subgraph is sampledBetter: more seeds, more diverse seedsCompleteness ObservationsSearch Engine PracticeWish to maximize subset of web indexed.Maintain (secret) set of diverse seeds(grow this set opportunistically, e.g. when X complains his/her page not indexed).Register new web sites on demandNew registrations are seed candidates.To Spider or not to Spider? (1) User PerceptionsMost annoying: Engine finds nothing(too small an index, but not an issue since 1997 or so).Somewhat annoying: Obsolete links=> Refresh


View Full Document

CMU CS 15381 - Web Spidering

Documents in this Course
Planning

Planning

19 pages

Planning

Planning

19 pages

Lecture

Lecture

42 pages

Lecture

Lecture

27 pages

Lecture

Lecture

19 pages

FOL

FOL

41 pages

lecture

lecture

34 pages

Exam

Exam

7 pages

Lecture

Lecture

22 pages

Handout

Handout

11 pages

Midterm

Midterm

14 pages

lecture

lecture

83 pages

Handouts

Handouts

38 pages

mdp

mdp

37 pages

HW2

HW2

7 pages

nn

nn

25 pages

lecture

lecture

13 pages

Handout

Handout

5 pages

Lecture

Lecture

27 pages

Lecture

Lecture

62 pages

Lecture

Lecture

5 pages

Load more
Download Web Spidering
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Web Spidering and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Web Spidering 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?