Berkeley COMPSCI 252 - Deep scientific computing requires deep data


Deep scientific computing requires deep data

W. T. C. Kramer, A. Shoshani, D. A. Agarwal, B. R. Draney, G. Jin, G. F. Butler, J. A. Hules

Increasingly, scientific advances require the fusion of large amounts of complex data with extraordinary amounts of computational power. The problems of deep science demand deep computing and deep storage resources. In addition to teraflop-range computing engines with their own local storage, facilities must provide large data repositories of the order of 10–100 petabytes, and networking to allow the movement of multi-terabyte files in a timely and secure manner. This paper examines such problems and identifies associated challenges. The paper discusses some of the storage systems and data management methods that are needed for computing facilities to address the challenges and describes some ongoing improvements.

Introduction

Deep scientific computing has evolved to the integration of simulation, theory development, and experimental analysis as equally important components. The integration of these components is facilitating the investigation of heretofore intractable problems in many scientific domains. Often in the past, only two of the components were present: computations were used to analyze theoretical ideas and to assist experimentalists with data analysis. Today, however, beyond each component informing the others, the techniques in each domain are being closely interleaved, so that science investigations increasingly rely on simulations, observational data analyses, and theoretical hypotheses virtually simultaneously in order to make progress.

High-performance computing is now being integrated directly into some experiments, analyzing data while the experiment is in progress, to allow real-time adaptation and refinement of the experiment and to allow the insertion of human intuition into the process, thus making it very dynamic. When computational models operate in concert with experiments, each can be refined and corrected on the basis of the interplay of the two.
The integration of computing with the other investigative methods is improving research productivity and opening new avenues of exploration.

In many cases, investigations have been limited by the computational power and data storage available, and these constraints, rather than the scale of the question being studied, have determined the resolution of a simulation or the complexity of an analysis. As available computational power, memory, and storage capacity increase, investigations can be expanded at a natural scale rather than being constrained by resources. But deep scientific computing can still be constrained by an inadequate capability to cope with massive datasets. In order to handle massive amounts of data, attention must be paid to the management of temporary and long-term storage, at the computing facility and elsewhere, and to networking capabilities to move the data between facilities.

An important aspect of the challenge of deep computing is the fact that today and in the foreseeable future no computational system can hold all needed data using on-line, local disk storage. As discussed later, for many applications, each step of a simulation produces gigabytes (GB) to terabytes (TB) of data. A deep computing system is used by multiple applications and for many time steps, so any delay in being able to move and access the data means under-utilizing the computational resource. Thus, a key subsystem in every facility involved in deep computing is a large data archive or repository that holds hundreds of terabytes to petabytes (PB) of storage. These archives are composed of a hierarchy of storage methods ranging from primary parallel disk storage to secondary robotic tape storage to possibly tertiary shelf-based tape storage. The efficient management of data on such systems is essential to making the computational systems effective.

[© Copyright 2004 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. 0018-8646/04/$5.00 © 2004 IBM. IBM J. Res. & Dev., Vol. 48, No. 2, March 2004, W. T. C. Kramer et al., p. 209.]

While large-scale computational and storage systems have been in place for decades, it is only in the past 20 years that networking began to change the way in which computing is performed. Initially, this was done via remote log-in access over connections that were relatively slow compared to the computing power and storage of the time. Since the mid-1990s, networking capabilities have evolved to the point that they have significantly changed the way in which large-scale resources are used. Indeed, the explosion of raw Internet bandwidth is enabling people to envision new paradigms of computing. One such new paradigm is Grid computing with the Open Grid Services Architecture [1]. Flexible access to computing and storage systems is now being implemented as a part of the Grid. This paper does not deal specifically with Grid issues, but concentrates on the underlying functions and methods required to enable distributed systems to reach their full potential.

Network capabilities have seen manyfold fundamental improvements in hardware, such as the change from copper-based networking to optical-fiber-based networking. Although these hardware improvements are expected to continue into the future, the performance of the networking protocols that were designed to operate on significantly lower-speed networks has not grown with the network capacity.
End host paths from memory to the network also have often not kept pace with the improvements in the network capabilities. These lags have caused serious limitations in the end-to-end efficiency and utilization of applications running on the network. End-to-end networking technology must now keep pace, or it will not be able to match the exponentially increasing computational power of new systems and the dramatic increases in storage capacity. Middleware associated with the Grid introduces even more demands on the underlying data and network infrastructure. Furthermore, the protection of intellectual and physical assets in a networked environment is
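The lag between link speeds and protocol behavior can be quantified with the classic bandwidth-delay product: a TCP sender can keep at most one window of data in flight per round trip, so sustained throughput is bounded by window size divided by round-trip time. The sketch below uses illustrative numbers (a 10 Gb/s path, 60 ms RTT, a 10 TB file) chosen for this example; they are assumptions, not figures from the paper.

```python
# Bandwidth-delay product: why old protocol defaults throttle fast links.
# All link and file sizes here are illustrative assumptions.

def required_window(bandwidth_bps, rtt_s):
    """TCP window (bytes) needed to keep the path full: bandwidth * RTT."""
    return bandwidth_bps * rtt_s / 8  # divide by 8: bits -> bytes

def transfer_time_s(file_bytes, bandwidth_bps):
    """Best-case time to move a file at full line rate."""
    return file_bytes * 8 / bandwidth_bps

# A 10 Gb/s path with 60 ms round-trip time needs a ~75 MB window in flight.
print(f"window needed: {required_window(10e9, 0.060) / 1e6:.0f} MB")

# Even at full line rate, a 10 TB file occupies the link for hours.
print(f"10 TB transfer: {transfer_time_s(10e12, 10e9) / 3600:.1f} h")
```

For comparison, the 64 KB window of classic (unscaled) TCP caps throughput on that same path at roughly 64 KB / 60 ms, under 10 Mb/s, which is why window scaling and end-to-end tuning matter as much as raw fiber capacity.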
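The earlier observation that each simulation step can emit gigabytes to terabytes, multiplied across applications, time steps, and runs, is what pushes facility archives into the petabyte range. A minimal back-of-envelope sketch, using invented example numbers that are assumptions for illustration rather than figures from the paper:

```python
# Back-of-envelope archive sizing for a shared deep-computing facility.
# The workload numbers below are illustrative assumptions.

PB = 1e15  # bytes in a petabyte

def archive_growth(apps, runs_per_year, steps_per_run, bytes_per_step):
    """Yearly archive growth if every step's output is retained."""
    return apps * runs_per_year * steps_per_run * bytes_per_step

# Assume 10 concurrent applications, each with 50 runs a year,
# 1,000 time steps per run, and 100 GB written per step.
yearly = archive_growth(apps=10, runs_per_year=50,
                        steps_per_run=1_000, bytes_per_step=100e9)
print(f"archive growth: {yearly / PB:.0f} PB/year")
```

Even these modest per-step assumptions yield tens of petabytes per year, far beyond any on-line disk pool, which motivates the disk/robotic-tape/shelf-tape hierarchy described above.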

