Data Management in an International Data Grid Project

Wolfgang Hoschek (1,3), Javier Jaen Martinez (1), Asad Samar (1,4), Heinz Stockinger (1,2), and Kurt Stockinger (1,2)

1 CERN, European Organization for Nuclear Research, Geneva, Switzerland
2 Inst. for Computer Science and Business Informatics, University of Vienna, Austria
3 Inst. of Applied Computer Science, University of Linz, Austria
4 California Institute of Technology, Pasadena, CA, USA

To appear in the IEEE/ACM International Workshop on Grid Computing (Grid 2000), 17-20 Dec. 2000, Bangalore, India.

Abstract

In this paper we report on preliminary work and architectural design carried out in the Data Management work package of the International Data Grid project. Our aim, within a time scale of three years, is to provide Grid middleware services supporting the I/O-intensive, world-wide distributed next-generation experiments in High-Energy Physics, Earth Observation, and Bioinformatics. The goal is to specify, develop, integrate, and test tools and middleware infrastructure to coherently manage and share Petabyte-range information volumes in high-throughput, production-quality Grid environments. The middleware will allow secure access to massive amounts of data in a universal namespace, move and replicate data at high speed from one geographical site to another, and manage the synchronisation of remote copies. We pay particular attention to clearly specifying and categorising existing work on the Grid, especially data management in Grid-related projects. We describe challenging use cases and how they map to architectural decisions concerning data access, replication, meta data management, security, and query optimisation.

1 Introduction

In the year 2005 a new particle accelerator, the Large Hadron Collider (LHC), is scheduled to be in operation at CERN, the European Organization for Nuclear Research. Four High-Energy Physics (HEP) experiments will start to produce several Petabytes of data per year over a lifetime of 15 to 20 years. Since this amount of data has never been produced before, special efforts concerning data management and data storage are required.

One characteristic of these data is that most of it is read-only. In general, data are written by the experiment, stored at very high data rates (from 100 MB/s to 1 GB/s), and are normally not changed afterwards. This is true for about 90% of the total amount of data.

Furthermore, since CERN experiments are collaborations of over a thousand physicists from many different universities and institutes, the experiment's data are not only stored locally at CERN; there is also an intention to store parts of the data at world-wide distributed sites in so-called Regional Centres (RCs), and also in some institutes and universities. The computing model of a typical LHC experiment is shown in Figure 1.

[Figure 1 omitted: CERN, Regional Centres RC 1 to RC 3, and universities, each with a data store and computing facilities, plus desktops.]

Fig. 1. Example of the network of one experiment's computing model.

These RCs are part of the distributed computing model and should complement the functionality of the CERN Centre. The aim is to use computing power and data storage in these Regional Centres and to allow physicists to do their analysis work outside of CERN with a reasonable response time, rather than accessing all the data at CERN. This should also help a collaboration whose many scientists work at sites spread around the world. Regional Centres will be set up, for instance, in Italy, France, Great Britain, the USA, and Japan.

By its nature, this is a typical Grid application which combines two aspects of today's Grid technology: Computational Grids and Data Grids. In order to meet this challenge, the HEP community has established a project called "Research and Technological Development for an International Data Grid". The objectives of this project are the following. Firstly, establish a Research Network which will enable the development of the technology components essential for the implementation of a new world-wide Data Grid on a scale not previously attempted. Secondly, demonstrate the effectiveness of this new technology through the large-scale deployment of end-to-end application experiments involving real users. Finally, demonstrate the ability to build, connect, and effectively manage large general-purpose, data-intensive computer clusters constructed from low-cost commodity components. Furthermore, the project covers not only HEP but also other scientific communities such as Earth Observation and Bioinformatics.

The entire project consists of several work packages for middleware development, computing fabric and mass storage management, testbeds, and applications. In this paper we present the data management aspects of the project. The objectives are to implement and compare different distributed data management approaches, including caching, file replication, and file migration. Such middleware is critical for the success of heterogeneous Data Grids, since they rely on efficient, uniform, and transparent access methods. Issues to be tackled within three years include the management of a universal namespace, efficient data transfer between sites, synchronisation of remote copies, wide-area data access and caching, and interfacing to mass storage management systems.

A major aim of the project is to build on existing experience and available software systems. For the startup phase we have chosen the Globus toolkit as the starting point for our middleware research and development. Globus is a promising toolkit and has already proved several times that it is applicable to large Grid projects [12].

The paper is organised as follows. The section on related work gives an overview of data management in current Data Grid projects and discusses related issues of distributed database management systems and distributed file systems. Section 3 emphasises the challenging requirements of data-intensive Grid applications. In Sections 4 and 5 we present the overall architecture of the data management middleware components and give details on the individual components. Finally, conclusions and future work are presented.

2 Survey and Discussion of Related Work

Traditional distributed file systems like the Network File System (NFS) [18] and the Andrew File System (AFS) [15] provide a convenient interface for remote I/O with a uniform file name space. However, this approach does not support multisite

