Discussion Report: The Anatomy of a Large-scale Hypertextual Web Search Engine (part 2)

Azbayar Demberel
Department of Computer Science
[email protected]
September 24, 2007

1 System design

1.1 Data structures

A good search engine is not just a clever search algorithm. It is a system consisting of several parts that collaborate to produce high-quality results efficiently. Google uses several special data structures that are optimized to make its parts, namely crawling, indexing, and searching, work efficiently.

Google was especially designed to avoid disk seeks whenever possible: the time to seek and fetch data from disk has remained nearly constant over the years, whereas processing speed and I/O rates have steadily increased, and they can always be improved further by adding more CPUs or allocating more bandwidth. The design of many of Google's data structures therefore reflects this property.

Bigfile: Bigfiles are virtual files spanning multiple file systems.

Repository: The repository is the database of web pages. Google stores the full HTML of every web page that it crawls in the repository, compressed with zlib.

Document index: The document index stores information about each document. It contains the current document status, a pointer into the repository, a document checksum, and various statistics. Also, depending on whether the document has been crawled or not, a pointer to either the docinfo file or the URL list is stored. The document index is ordered by docID so that information retrieval can be done quickly.

Additionally, there is a file used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs, sorted by checksum. We can therefore quickly find the docID of a particular URL by computing the URL's checksum and doing a binary search on the checksums.
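A minimal sketch of this URL-to-docID lookup in Python, assuming the checksum/docID pairs have been loaded into two parallel lists sorted by checksum; the checksum function and all names here are illustrative assumptions, not the actual implementation:

```python
import bisect
import zlib

def url_checksum(url: str) -> int:
    # Hypothetical checksum; the report does not say which function Google uses.
    return zlib.crc32(url.encode("utf-8"))

# Parallel lists mirroring the on-disk file: sorted by checksum.
checksums: list[int] = []
docids: list[int] = []  # docids[i] corresponds to checksums[i]

def docid_for_url(url: str) -> int | None:
    """Binary-search the sorted checksum list to find a URL's docID."""
    c = url_checksum(url)
    i = bisect.bisect_left(checksums, c)
    if i < len(checksums) and checksums[i] == c:
        return docids[i]
    return None  # URL has not been assigned a docID yet
```

Binary search keeps each lookup to O(log n) comparisons, so no additional in-memory index over the URLs is needed.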
Lexicon: Every possible word (not only English) is assigned a unique ID. IDs are preferred to the actual strings because they occupy less storage space and numbers are faster to compare than strings. For example, suppose the word "Database" is assigned the ID 155,015,109. Although 155,015,109 is a huge number, checking whether a given number equals it takes only one comparison. To check whether two strings are equal, on the other hand, we must compare them character by character, so checking whether a given string equals "Database" takes around 8 comparisons. The lexicon is essentially a dictionary that maps words to their wordIDs. Currently Google's lexicon holds 14 million words and fits in 256MB of main memory.

Hit lists: A hit list records the occurrences of a particular word in a particular document, including position, font, and capitalization information. Because hit lists are used in both the forward and inverted indices, and hence account for most of the space, it was important to store them efficiently.

Every word occurrence in a web page is categorized as either fancy (if it appears in a URL, title, anchor text, or meta tag) or plain (everything else). A plain hit consists of 1 capitalization bit, 3 font size bits, and 12 word position bits. A fancy hit consists of 1 capitalization bit, 3 font size bits set to 7 to indicate that it is a fancy hit, 4 bits encoding the type of fancy hit, and 8 word position bits. For anchor hits, the 8 position bits are split into 4 bits for the position within the anchor and 4 bits for a hash of the docID the anchor occurs in. To save more space, a font size relative to the rest of the document is stored instead of the actual size.
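These field widths fully determine a 16-bit encoding. Here is a minimal sketch in Python; the report gives the widths but not the bit order, so the layout below (capitalization in the high bit, then font size, then type/position fields) is an assumption, as is clamping out-of-range positions to the maximum value:

```python
PLAIN_POS_BITS = 12
FANCY_POS_BITS = 8
FANCY_FONT = 7  # font-size field set to 7 marks a fancy hit

def pack_plain_hit(cap: bool, font: int, pos: int) -> int:
    """1 capitalization bit | 3 font bits (0-6) | 12 position bits."""
    assert 0 <= font < FANCY_FONT, "font value 7 is reserved for fancy hits"
    pos = min(pos, (1 << PLAIN_POS_BITS) - 1)  # clamp positions that don't fit
    return (int(cap) << 15) | (font << 12) | pos

def pack_fancy_hit(cap: bool, hit_type: int, pos: int) -> int:
    """1 capitalization bit | 3 font bits = 7 | 4 type bits | 8 position bits."""
    pos = min(pos, (1 << FANCY_POS_BITS) - 1)
    return (int(cap) << 15) | (FANCY_FONT << 12) | ((hit_type & 0xF) << 8) | pos

def pack_anchor_hit(cap: bool, hit_type: int, anchor_pos: int, docid_hash: int) -> int:
    """Anchor hits split the 8 position bits: 4 for position, 4 for a docID hash."""
    pos = ((anchor_pos & 0xF) << 4) | (docid_hash & 0xF)
    return pack_fancy_hit(cap, hit_type, pos)
```

Either way a hit fits in two bytes, which is what makes the space accounting above work out.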


Barrels: Barrels are used to store the hit lists. Each barrel holds a range of wordIDs. If a document contains a word that falls into a particular barrel, the docID is recorded in the barrel, followed by a hit list for each occurrence of the word in the document. Since words are added to the barrels in document order, each barrel is sorted by docID. You can see an example of how a barrel is stored in the following section.

Forward index: The forward index is a mapping from web pages to words: given a web page, we can find which words it contains using the forward index. It is stored in a number of barrels, and hence is already partially sorted.

Since each barrel covers a certain range of wordIDs, the forward index stores, instead of the whole wordID, its difference from the barrel's minimum wordID. For example, suppose a barrel corresponds to wordIDs ranging from 100,000,000 to 109,999,999. Storing the actual wordIDs would require 27 bits; storing relative wordIDs from 0 to 9,999,999 requires only 24 bits, saving 3 bits. Google uses 24 bits for the wordIDs in the unsorted barrels, leaving 8 bits for the hit list length.
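To make the saving concrete, here is a minimal sketch of the relative encoding, using the example barrel range above and packing one forward-index entry (24-bit relative wordID plus 8-bit hit-list length) into 32 bits; the exact packing order is an assumption:

```python
BARREL_MIN_WORDID = 100_000_000  # start of the example barrel range in the text
WORDID_BITS = 24                 # relative wordIDs 0..9,999,999 fit in 24 bits
HITCOUNT_BITS = 8                # remaining bits hold the hit-list length

def encode_entry(wordid: int, num_hits: int) -> int:
    """Pack a relative wordID and a hit-list length into one 32-bit entry."""
    rel = wordid - BARREL_MIN_WORDID
    assert 0 <= rel < (1 << WORDID_BITS)
    assert 0 <= num_hits < (1 << HITCOUNT_BITS)
    return (rel << HITCOUNT_BITS) | num_hits

def decode_entry(entry: int) -> tuple[int, int]:
    """Recover the absolute wordID and the hit-list length."""
    rel = entry >> HITCOUNT_BITS
    num_hits = entry & ((1 << HITCOUNT_BITS) - 1)
    return BARREL_MIN_WORDID + rel, num_hits
```

For instance, encode_entry(100_000_123, 5) stores only the offset 123 plus the length 5, and decode_entry recovers the full wordID by adding the barrel's base back in.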
Inverted index: The inverted index is the opposite of the forward index: it maps words to web pages, so given a word, we can find which web pages contain it using the inverted index. For every wordID, the lexicon contains a pointer into the barrel that the wordID falls into. It points to a doclist of docIDs together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.

Google uses two sets of inverted barrels: one set for hit lists that include title or anchor hits, and another set for all hit lists. This way, Google first checks the set of barrels containing only title and anchor hits, and only if there are not enough matches within those barrels does it check the larger set (a sketch of this two-tier lookup appears after the architecture overview below).

Anchors file: The anchors file contains all the necessary information about links on web pages: where each link points from and to, and the text of the link.

1.2 Google architecture

Here is a high-level overview of how Google works:

1. First of all, Google needs to obtain the data it will search over, so a URL server creates a list of URLs that need to be fetched and forwards the list to the web crawlers.

2. Several distributed web crawlers receive the list and start downloading web pages. After they finish downloading the web pages, they send them to the store server.

3. The store server then compresses the web pages and stores them in the repository.

4. The indexer then reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. If a hit is anchor text, the indexer parses out its link and stores it in the anchors file, along with other information needed to determine where each link points.
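Returning to the inverted index, here is a minimal sketch of the two-tier barrel lookup described above. The in-memory layout and the "enough matches" threshold are illustrative assumptions; the report specifies neither:

```python
from typing import Dict, List, Tuple

# A doclist maps a word to its occurrences: (docID, packed hits in that doc).
Doclist = List[Tuple[int, List[int]]]

# First tier: only hit lists that include title or anchor hits.
title_anchor_barrels: Dict[int, Doclist] = {}
# Second tier: all hit lists.
full_barrels: Dict[int, Doclist] = {}

ENOUGH_MATCHES = 40  # threshold is an assumption; the report gives no value

def lookup(wordid: int) -> Doclist:
    """Check the small title/anchor barrels first; fall back to the full set."""
    matches = title_anchor_barrels.get(wordid, [])
    if len(matches) >= ENOUGH_MATCHES:
        return matches
    return full_barrels.get(wordid, [])
```

The point of the split is that most queries can be answered from the small, high-signal barrels, so the larger set is only scanned when necessary.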