Data Grids
Christian Almazan, Qin Wang
CMSC 818S: Grid Computing
26 February 2004 and 02 March 2004

Introduction to Data Grids
  - Chapter 22 of The Grid: "Data Access, Integration, and Management" (Atkinson, Chervenak, Kunszt, Narang, Paton, Pearson, Shoshani, Watson)
  - "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets" (Chervenak, Foster, Kesselman, Salisbury, Tuecke)

The Grid Story So Far
  - Why Do We Need a Grid?
      - PVM, MPI
  - Grid Architectures, Services, and Standards
      - Basic Anatomy, OGSA, OGSI
  - General Grid and Grid-like Frameworks
      - Globus, Legion, Condor, PUNCH

So Why Do We Care About Data? Scientific Applications
  - Discoveries Aided by Analyzing Data
  - Terabytes to Petabytes of Distributed Data
  - Who is Involved?
      - experimentalists, theoreticians, data curators, database engineers, computer scientists, developers

Process of Scientific Data Exploration
  - Data Generation
      - running a large and lengthy simulation
      - generates and stores enormous amounts of data
      - without delaying the simulation (1 gigabyte/sec)
  - Postprocessing
      - data is validated, formatted, and associated with metadata
  - Analysis
      - data mining, statistical analysis, simulation and analysis

Business Applications
  - Expected to Emerge with Various Usages
      - scientific in nature: engineering and pharmaceutical
      - financial services: patterns in usage
  - Data Mining
      - need more computational power
  - Integrated Data Access
      - within and between enterprises

Business versus Scientific
  - Data is Different
      - usually structured and stored in databases, not binary (with exceptions)
  - Smaller Datasets
  - Commercial Dataset Updates Frequent
      - harder to replicate
      - need transactional support
      - coherency is a must

Structured Data
  - What?
      - data with an explicitly designed structure
      - easily exploited by software operations
  - Why?
      - structuring leads to understanding data
      - required for sharing data across collaborations
  - Why Not Unstructured Data?
      - harder to understand and perform operations on

Structured Data Representations
  - Relational Model, XML Documents with Schemas
      - stored and managed by database management systems
  - Binary Representations with Explicit Structure
      - software: HDF, NetCDF
  - Semistructured Data, Formatted ASCII Text
      - integrated and queried through wrappers

Types of Structured Data (1)
  - Primary Structured Data
      - observations with metadata recorded directly into databases
  - Ancillary Data
      - structured metadata for bulk binary or structured data
      - metadata is expressed through schemas
          - logical schema: logical structure of the data (data model)
          - physical schema: organization of the database
          - access policies and integrity constraints

More on Ancillary Data
  - Organizational Metadata
      - supports management of Grid operations
  - Technical Metadata
      - organizes interpretation of primary data (indexing)
  - Application Metadata
      - guides interpretation of primary data (searching)
  - Data Products
      - derived information from primary data
      - summaries, catalogs, and indexes

Types of Structured Data (2)
  - Collaboration Data
      - aids in sharing scientific information quickly and precisely
      - useful for legacy data
  - Personal Data
      - data assembled by or about individual users
      - profile data, usually private
  - Service Data
      - used to support Grid operations

We Have Data and a Grid
  - Digital Data is Everywhere
      - science, engineering
      - ubiquitous network connectivity
      - need data to solve modern challenges
  - Dataset Sizes Enormous
      - need teraflops of computing power
      - access to terabytes, now petabytes, of distributed data
  - Existing Architectural Structure and Mechanisms (i.e., OGSA) for the Grid Help

But Managing Data is Hard
  - Diverse Usage Scenarios
      - updateable versus read-only
      - data consistency
  - Heterogeneity at All Levels
      - storage systems, data formats, data access mechanisms, policies
  - Performance Demands
      - access, manipulation, analysis

Why General-Purpose Data Access and Integration Software?
  - Amortize Costs
      - many categories of structured data
  - Different Usage Patterns
      - technology is hard to design
  - Performance
      - different and dynamically varying operations and loads
      - adaptive optimization

Principles for the Data Grid
  - Mechanism Neutrality
      - access to data independent of low-level components
  - Policy Neutrality
      - performance decisions exposed to the users
  - Compatibility with Grid Infrastructure
      - use existing Grid tools (i.e., authentication, resource management, and information)
  - Uniformity of Information Infrastructure
      - uniform and convenient access to everything

Data Management Challenge: Some Factors for the System
  1. Diverse usage: read/write, data consistency, access method
  2. Heterogeneity: storage system, data format, access mechanism and policy, etc.
  3. Performance demands: access, analysis, processing

Data Management Challenge: Data Source
  - The Grid contains diverse data resources from different storage systems, data types, models, and access mechanisms.
  - The goal is to achieve uniform data access methods and the ability to transform data among different data types and models.
  - Security measures are also needed to prevent unauthorized access.

Data Management Challenge: Data Manipulation
  1. Discover the desired data based on metadata attributes such as name and location.
  2. Move data among storage systems and programs; support replication, caching, and bulk access.
  3. Create replicas to reduce latency and to improve reliability and load balancing; provide the ability to locate and select existing replicas.

Data Management Challenge: Data Analysis and Processing
  - These operations impose difficulty in planning, scheduling, and monitoring.
  1. Adopt a virtual data abstraction to specify these operations by name and attributes.
  2. Based on an execution graph, the Grid system maps the operations onto available services and resources for optimization.

Architectural Approaches
  - Current approaches do not yet provide a complete set of reusable data services; instead they define and construct layered, data-oriented services:
  1. Data source services: data access and movement
  2. Collective data management services: primitive data transfer, discovery, transformation and filtering, scheduling
  3. Federation data services: a virtual database; replication, naming, location, mediation, consistency, query management
  4. Domain-specific services: specialized data management

Data Source Service: Data Access
  1. File-oriented: a file, or a subset of a file, is the basic unit. GridFTP provides a uniform interface to access and move data among various storage systems.
  2. Query-oriented: a series of queries. Existing standards: ODBC, JDBC. Proposed standard: OGSA-DAI for GT3. See the architecture graph next.

OGSA-DAI Architecture

Data Source Service: Managing Data Source
  - Functionalities: storage space reservation, monitoring, and auditing services
  1. NeST (by Condor): incorporates these functions into the storage system, providing storage
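
The two access styles listed under Data Source Service Data Access (file-oriented and query-oriented) can be contrasted in a short sketch. This is only an illustration under stated assumptions: the `DataSource`, `FileSource`, and `QuerySource` names are hypothetical, a local file stands in for a storage system, and Python's built-in `sqlite3` module stands in for an ODBC/JDBC-style query interface. It does not use or reproduce the GridFTP or OGSA-DAI APIs.

```python
import os
import sqlite3
import tempfile
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical uniform interface; not a GridFTP or OGSA-DAI API."""

    @abstractmethod
    def fetch(self, request):
        """Return the data selected by `request`."""


class FileSource(DataSource):
    """File-oriented access: the basic unit is a file or a subset of it."""

    def __init__(self, path):
        self.path = path

    def fetch(self, request):
        offset, length = request  # (start byte, number of bytes)
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


class QuerySource(DataSource):
    """Query-oriented access: the basic unit is a query."""

    def __init__(self, connection):
        self.connection = connection

    def fetch(self, request):
        sql, params = request  # parameterized SQL and its arguments
        return self.connection.execute(sql, params).fetchall()


def demo():
    # File-oriented: read a 4-byte subset of a local file (assumed stand-in
    # for a remote storage system).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
        path = tmp.name
    try:
        subset = FileSource(path).fetch((2, 4))
    finally:
        os.remove(path)

    # Query-oriented: select rows from an in-memory database (assumed
    # stand-in for a database reached through ODBC or JDBC).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE runs (id INTEGER, energy REAL)")
    conn.executemany("INSERT INTO runs VALUES (?, ?)", [(1, 0.5), (2, 1.5)])
    rows = QuerySource(conn).fetch(
        ("SELECT id FROM runs WHERE energy > ?", (1.0,))
    )
    conn.close()
    return subset, rows


if __name__ == "__main__":
    print(demo())
```

Both sources sit behind the same `fetch` call, which is one way to read the mechanism-neutrality principle: the caller states what it wants, and the source decides how the bytes or rows are actually retrieved.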