Distributed Systems
20. Case study: Google Cluster Architecture
Paul Krzyzanowski
[email protected]
11/29/2011
© 2011 Paul Krzyzanowski

A note about relevancy

This describes the Google search cluster architecture in the mid-2000s. The search infrastructure was overhauled in 2010 (we'll get to this at the end). Nevertheless, the lessons are still valid, and this demonstrates how incredible scalability has been achieved using commodity computers by exploiting parallelism.

Needs

• A single Google search query
  – Reads hundreds of megabytes of data
  – Uses tens of billions of CPU cycles
• The environment needs to support tens of thousands of queries per second
• The environment must be
  – Fault tolerant
  – Economical (the price-performance ratio matters)
  – Energy efficient (this affects price; watts per unit of performance matters)
• The workload should be highly parallelizable
  – CPU performance matters less than the price/performance ratio

Key design principles

• Have reliability reside in software, not hardware
  – Use low-cost (unreliable) commodity PCs to build a high-end cluster
  – Replicate services across machines & detect failures
• Design for best total throughput, not peak server response time
  – Response time can be controlled by parallelizing requests
  – Rely on replication: this helps with availability too
• The price/performance ratio is more important than peak performance

Step 1: DNS

• The user's browser must map google.com to an IP address
• "google.com" comprises
  – Multiple clusters distributed worldwide
  – Each cluster containing thousands of machines
• DNS-based load balancing
  – Select a cluster by taking the user's geographic proximity into account
  – Load balance across clusters
  – [similar to Akamai's approach]

[Diagram: the browser's lookup of google.com is resolved by Google's load-balanced DNS servers]

Step 2: Send HTTP request

• The IP address corresponds to a load balancer within a cluster
• The load balancer
  – Monitors the set of Google Web Servers (GWS)
  – Performs local load balancing of requests among available servers
• A GWS machine receives the query
  – Coordinates the execution of the query
  – Formats the results into an HTML response to the user

[Diagram: a hardware load balancer in the data center distributes incoming queries among several Google Web Servers, one of which coordinates the query]

Step 3: Find documents via inverted index

• Index servers
  – Map each query word → {list of documents} (a hit list)
  – The inverted index is generated from web crawls via MapReduce
• Intersect the hit lists of the per-word queries
• Compute a relevance score for each document
• Determine the set of matching documents
• Sort by relevance score (see the sketch below)
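As a concrete (and deliberately tiny) illustration of this step, here is a sketch in Python. The toy index, the additive scoring rule, and the search function are all assumptions made for illustration; they are not Google's code or data layout:

    # A minimal sketch of inverted-index lookup, assuming a toy in-memory
    # index that maps each word to a hit list of {docid: score contribution}.
    from functools import reduce

    index = {
        "distributed": {1: 0.9, 2: 0.4, 5: 0.7},
        "systems":     {1: 0.8, 3: 0.6, 5: 0.2},
    }

    def search(query_words):
        # Fetch the hit list for each query word.
        hit_lists = [index.get(w, {}) for w in query_words]
        # Intersect the hit lists: keep only docids that match every word.
        matching = reduce(lambda a, b: a & b, (set(h) for h in hit_lists))
        # Compute a relevance score per document (here, a simple sum).
        scored = {d: sum(h[d] for h in hit_lists) for d in matching}
        # Sort by relevance score, best match first.
        return sorted(scored, key=scored.get, reverse=True)

    print(search(["distributed", "systems"]))   # -> [1, 5]

In the real cluster the interesting part is not this loop but the fact that the index is far too large for one machine, which is what the next slide addresses.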
Parallelizing the inverted index

• The inverted index is tens of terabytes
• Search is parallelized
  – The index is divided into index shards
    • Each index shard is built from a randomly chosen subset of documents
    • A pool of machines serves requests for each shard
    • Pools are load balanced
  – A query goes to one machine per pool, each responsible for one shard
• The final result is an ordered list of document identifiers (docids)

[Diagram: the Google Web Server sends the query to one replica in each index-server pool, shard 0 through shard N]

Index servers

[Diagram: each shard (0 through N) is served by its own pool of several commodity PCs]

Step 4: Get the title & URL for each docid

• For each docid, the GWS looks up
  – The page title
  – The URL
  – Relevant text: a document summary specific to the query
• This is handled by document servers (docservers)

Parallelizing document lookup

• Like index lookup, document lookup is partitioned & parallelized
• Documents are distributed into smaller shards
  – Each shard = a subset of documents
• A pool of load-balanced servers is responsible for processing each shard
• Together, the document servers access a cached copy of the entire web!

[Diagram: the Google Web Server sends docid lookups to one replica in each docserver pool, shard 0 through shard N]

Additional operations

• In parallel with the search:
  – Send the query to a spell-checking system
  – Send the query to an ad-serving system to generate ads
• When all the results are in, the GWS generates the HTML output:
  – Sorted query results, with page titles, summaries, and URLs
  – Ads
  – Spelling correction suggestions

[Diagram: the GWS fans a query out to the index servers, docservers, spell checker, and ad server behind the hardware load balancer]

Lesson: exploit parallelism

• Instead of looking up matching documents in one large index
  – Do many lookups for documents in smaller indices
  – Merge the results together: a merge is simple & inexpensive (see the sketch after this list)
• Divide the stream of incoming queries
  – Among geographically distributed clusters
  – Load balance among query servers within a cluster
• Linear performance improvement with more machines
  – Shards don't need to communicate with each other
  – Increase the number of shards across more machines to improve performance
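Here is a hedged Python sketch of that scatter-gather pattern. The shard pools, replica names, fake scores, and thread-based fan-out are all illustrative assumptions; the real system issues Google-internal RPCs to real machines:

    # Scatter-gather sketch: query one replica per shard in parallel,
    # then merge the per-shard sorted results. Illustrative only.
    import heapq
    import random
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical pools: shard id -> replicas (stand-ins for commodity PCs).
    shard_pools = {0: ["pc0a", "pc0b"], 1: ["pc1a", "pc1b"], 2: ["pc2a", "pc2b"]}

    def query_shard(shard_id, query):
        # Load balance within the pool: any replica can serve the shard, so a
        # failed machine just means retrying on a sibling replica.
        replica = random.choice(shard_pools[shard_id])
        # A real implementation would RPC to `replica`; we fake scored hits,
        # sorted best-first so they are ready to merge.
        hits = [(random.random(), f"doc{shard_id}-{i}") for i in range(3)]
        return sorted(hits, reverse=True)

    def search_all_shards(query):
        # Scatter: shards are independent, so all lookups run in parallel.
        with ThreadPoolExecutor() as executor:
            partials = executor.map(lambda s: query_shard(s, query), shard_pools)
            # Gather: merging already-sorted lists is simple and inexpensive.
            merged = heapq.merge(*partials, reverse=True)
            return [docid for _score, docid in merged]

    print(search_all_shards("distributed systems"))

Because the shards never communicate with each other, adding a shard (and a pool of machines behind it) adds capacity roughly linearly, which is exactly the lesson of the slide above.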
Change to Caffeine

• In 2010, Google remodeled its search infrastructure
• The old system
  – Was based on MapReduce (on GFS) to generate the index files
  – Was a batch process: the next phase of MapReduce cannot start until the previous one is complete
    • Web crawling → MapReduce → propagation
  – Initially, Google updated its index every 4 months. Around 2000, it began reindexing and propagating changes every month
    • The process took about 10 days
    • Users hitting different servers might get different results
• The new system, named Caffeine
  – Is a fully incremental system, based on BigTable running on GFS2 (contrasted with the old batch model in the sketch after this list)
  – Supports indexing many more documents: ~100 petabytes
  – Offers a high degree of interactivity: web crawlers can update tables dynamically
  – Analyzes the web continuously in small chunks
    • Identifies pages that are likely to change frequently
  – BTW, MapReduce is not dead. Caffeine uses it in some places, as do lots of other services.
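To make the batch-versus-incremental distinction concrete, here is a toy Python sketch. Plain dictionaries stand in for BigTable tables, and every name in it is an assumption for illustration, not Google's implementation:

    # Contrast of the two indexing models. Illustrative only.
    index = {}       # word -> set of docids (toy stand-in for a BigTable table)
    doc_words = {}   # docid -> words currently indexed for that document

    def incremental_update(docid, words):
        # Caffeine-style model: a crawler can update one document's index
        # entries in place, so fresh content is searchable almost immediately.
        for stale in doc_words.get(docid, set()) - set(words):
            index[stale].discard(docid)              # drop stale postings
        for word in words:
            index.setdefault(word, set()).add(docid)
        doc_words[docid] = set(words)

    def batch_rebuild(corpus):
        # Old-style model: rebuild everything as one batch job; nothing new is
        # searchable until the whole pass finishes and the index propagates.
        index.clear()
        doc_words.clear()
        for docid, words in corpus.items():
            incremental_update(docid, words)

    incremental_update("page1", ["google", "cluster"])
    incremental_update("page1", ["google", "caffeine"])   # re-crawl: update in place
    print(index)   # {'google': {'page1'}, 'cluster': set(), 'caffeine': {'page1'}}

The toy preserves the slide's point: with incremental updates, the index never waits for a months-long batch rebuild before new content becomes visible.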

