UCF COT 4810 - The Google File System

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google*

ABSTRACT

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Categories and Subject Descriptors
D.4.3 [Distributed file systems]

General Terms
Design, reliability, performance, measurement

Keywords
Fault tolerance, scalability, data storage, clustered storage

* The authors can be reached at the following addresses: {sanjay,hgobioff,shuntak}@google.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.

1. INTRODUCTION

We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google's data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.

First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.

Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.

Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS's consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.
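To make the record-append contract concrete, here is a minimal sketch in Go. Since the real GFS client library is not public, the toyFile type and its recordAppend method are hypothetical stand-ins; they only mimic the semantics described above: the storage system applies each record atomically at an offset it chooses and returns that offset to the caller, so concurrent producers never coordinate with one another.

    // Toy illustration of GFS-style atomic record append semantics.
    // This is NOT the real GFS client API; toyFile and recordAppend are
    // hypothetical stand-ins that only model the contract: each append is
    // applied atomically at an offset chosen by the system, so concurrent
    // producers need no synchronization among themselves.
    package main

    import (
        "fmt"
        "sync"
    )

    // toyFile mimics a file for which the storage system, not the client,
    // serializes appends and picks the offset.
    type toyFile struct {
        mu   sync.Mutex // stands in for the storage side ordering appends
        data []byte
    }

    // recordAppend appends rec atomically and returns the offset at which
    // it was written, mirroring the shape of a record-append call.
    func (f *toyFile) recordAppend(rec []byte) int64 {
        f.mu.Lock()
        defer f.mu.Unlock()
        off := int64(len(f.data))
        f.data = append(f.data, rec...)
        return off
    }

    func main() {
        f := &toyFile{}
        var wg sync.WaitGroup

        // Several producers append concurrently without coordinating with
        // each other; each simply learns where its record ended up.
        for i := 0; i < 5; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                rec := []byte(fmt.Sprintf("record-from-client-%d;", id))
                off := f.recordAppend(rec)
                fmt.Printf("client %d: record landed at offset %d\n", id, off)
            }(i)
        }
        wg.Wait()
        fmt.Printf("file length: %d bytes\n", len(f.data))
    }

With ordinary writes, the producers themselves would have to agree on non-overlapping offsets, for example through an external lock service; record append moves that burden into the file system.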
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

2. DESIGN OVERVIEW

2.1 Assumptions

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.

• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.

• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a

