K-State CIS 764 - Anatomy of a Large-Scale Hyper textual Web Search Engine - D2488831

Home> Schools> Kansas State University> Computing and Information Sciences (CIS) > CIS 764> Anatomy of a Large-Scale Hyper textual Web Search Engine

DOC PREVIEW

K-State CIS 764 - Anatomy of a Large-Scale Hyper textual Web Search Engine

School name Kansas State University

Course Cis 764- Database Design

Pages 21

This preview shows page 1-2-20-21 out of 21 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 21 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Slide 1Presentation OverviewProblemSolution = Search EngineSpecific Design GoalsGoogle Search Engine FeaturesPageRank For LaymanPageRank For TechiesWhy do we need d?Justifications for using PageRankGoogle ArchitecturePreliminaryGoogle Architecture (cont.)Google Architecture (cont.)Slide 15Google Query EvaluationSingle Word Query RankingMulti-word Query RankingScalabilitySummary of Key Optimization TechniquesReferences:Presented By:Vamsee Raja Jarugula for the sake of CIS 764 presentation . Kansas State University.Presentation OverviewProblem Definition.Design GoalsGoogle Search Engine Characteristics.Google ArchitectureScalability Conclusions Vamsee Raja Jarugula CIS 764ProblemWeb is vast and ever expanding. It is getting flooded with data.This data is heterogeneous and consists of all forms TextImagesAsciiJava applets Lists maintained by Humans cannot keep track of this.Human attention is confined to 10-1000 documentsPrevious search methodologies relied on keyword matching producing inferior quality results. Vamsee Raja Jarugula CIS 764Solution = Search Engine•Search engines facilitate users to get the text or documents of their choice within a click of mouse. ”Some examples of Search engines:Google,Altavista,MetaCrawler,Kosmix.For comprehensive list of search engines do visit:http://en.wikipedia.org/wiki/List_of_search_engines Vamsee Raja Jarugula CIS 764Specific Design GoalsDeliver results that have very high precision even at the expense of recallBring search engine technology into academic realm in order to support novel research activities Make search engine technology transparent, i.e. advertising shouldn’t bias resultsMake system user friendly . Vamsee Raja Jarugula CIS 764Google Search Engine FeaturesUses link structure of web (PageRank)Uses text surrounding hyperlinks to improve accurate document retrievalOther features include:Takes into account word proximity in documentsUses font size, word position, etc. to weight wordStorage of full raw html pages Vamsee Raja Jarugula CIS 764PageRank For LaymanImagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. Occasionally, the surfer will get bored and instead of following a link pointing outward from the current page will jump to another random page.At some point, the percentage of time spent at each page will converge to a fixed value. This value is known as the PageRank of the page. Vamsee Raja Jarugula CIS 764PageRank For TechiesN(p): # outgoing links from page pB(p): set of pages that point to pd: tendency to get “bored” .R(p): PageRank of pR(p) = [(1-d)+d*R(q)/N(q)] . Vamsee Raja Jarugula CIS 764Why do we need d?In the real world virtually all web graphs are not connected, i.e. they have dead-ends, islands, etc.If we don’t have d we get “ranks leaks”for graphs that are not connected, i.e. leads to numerical instability. Vamsee Raja Jarugula CIS 764Justifications for using PageRankAttempts to model user behaviorCaptures the notion that the more a page is pointed to by “important” pages, the more it is worth looking atTakes into account global structure of web Vamsee Raja Jarugula CIS 764Google ArchitectureImplemented in C and C++ on Solaris and Implemented in C and C++ on Solaris and LinuxLinux Reference from Anatomy of a large scale search engine –Sergy Brin, Larry Page.Preliminary“Hitlist” is defined as list of occurrences of a particular word in a particular document including additional meta info:- position of word in doc - font size - capitalization - descriptor type, e.g. title, anchor, etc. Vamsee Raja Jarugula CIS 764Google Architecture (cont.)core figure referred from Sery Brin and Larry Page.----Anatomy of a large scale search engine.Google Architecture (cont.)Core figure referred from Sergy Brin and Larry Page.Google Architecture (cont.)Core figure reference from Sergy Brin and Larry page.Google Query Evaluation1.1.Parse the query. Parse the query. 2.2.Convert words into wordIDs. Convert words into wordIDs. 3.3.Seek to the start of the doclist in the short barrel Seek to the start of the doclist in the short barrel for every word. for every word. 4.4.Scan through the doclists until there is a Scan through the doclists until there is a document that matches all the search terms. document that matches all the search terms. 5.5.Compute the rank of that document for the query. Compute the rank of that document for the query. 6.6.If we are in the short barrels and at the end of any If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full doclist, seek to the start of the doclist in the full barrel for every word and go to step 4. barrel for every word and go to step 4. 7.7.If we are not at the end of any doclist go to step If we are not at the end of any doclist go to step 4.4.8.8.Sort the documents that have matched by rank Sort the documents that have matched by rank and return the top k.and return the top k. Vamsee Raja Jarugula CIS 764Single Word Query RankingHitlist is retrieved for single wordEach hit can be one of several types: title, anchor, URL, large font, small font, etc.Each hit type is assigned its own weightType-weights make up vector of weights# of hits of each type is counted to form count vectorDot product of two vectors is used to compute IR scoreIR score is combined with PageRank to compute final rank Vamsee Raja Jarugula CIS 764Multi-word Query RankingSimilar to single-word ranking except now must analyze proximityHits occurring closer together are weighted higherEach proximity relation is classified into 1 of 10 values ranging from a phrase match to “not even close”Counts are computed for every type of hit and proximity Vamsee Raja Jarugula CIS 764ScalabilityCluster architecture combined with Moore’s Law make for high scalability. At time of writing:~ 24 million documents indexed in one week~518 million hyperlinks indexedFour crawlers collected 100 documents/sec Vamsee Raja Jarugula CIS 764Summary of Key Optimization TechniquesEach crawler maintains its own DNS lookup cacheUse flex to generate lexical analyzer with own stack for parsing documentsParallelization of indexing phaseIn-memory lexiconCompression of repositoryCompact encoding of

View Full Document