Data Grids
Lidan Wang
April 5, 2007
4/6/2007

Outline
- Data-intensive applications
- Challenges in data access, integration, and management in a Grid setting
- Grid services for these data-intensive applications
- Architectural approaches
- The Data Grid

Data-Intensive Applications
- Bioinformatics: tasks in deciphering genes; large-scale collaboration
  - Creation, management, and exploitation of structured collections
  - Comparisons between species and integration with different protein databases
- Virtual observatory: combining observations from different sources to create one unified view
- Large-scale movement of data and information integration are essential in these tasks

Challenges
- Diverse usage cases, e.g., updatable vs. read-only data, binary-formatted vs. relational data
- Different storage systems and data formats
- Access, manipulation, and analysis of large quantities of data
  - Analysis may require teraflops of computing power and access to data distributed across files and databases
  - Applications may require integration and querying of large quantities of data
- How to discover the relevant data?

Challenges (continued)
- Need an infrastructure
  - Shared data storage and computing resources can be delivered to data analysis
  - Delivery should happen in an integrated and flexible manner
- Types of data
  - Structured and unstructured data
  - We focus on structured data
    - Collaborators know the structure of the data, so operations on the data can be carried out
    - The Grid itself uses structured data collections for its operation and administration

Architectural Approaches
Data-oriented services are partitioned into four classes:
- Resource-level services for data sources
  - Data access service (structured data)
  - Data movement service (oblivious to the structure of the data)
- Collective services for managing data
  - Managing data across more than one data source
  - E.g., data discovery, data transformation, data transfer
- Collective services that federate data sources
  - Integrating two or more data sources at the functionality level
  - E.g., virtual database
- Domain-specific services
  - Data management, processing, and analysis operations for specific application domains

Data Source Services
- Data access
  - Interested in interfaces and data access performance characteristics
- Format of data sources
  - Data sources are either files or databases
  - We want integrated services that accommodate both

Integrating Grid and Databases
- Remote DB access is a challenging task in a Grid setting
  - The Grid has a uniform model for creating tools and services
  - Individual database management systems can be vastly different, so this uniform technology needs to be extended to data management components

Integrating Grid and Databases (continued)
- Need to develop consistent authentication and authorization mechanisms
- A DBMS may use the Grid's management regimes to recover resources
- A DBMS may also use Grid infrastructure to implement distributed databases, and Grid services to expose data and functionality
- DBMS security, management, and diagnostic facilities may in turn influence the development of Grid services

Integrating Grid and Databases (continued)
- Another motivation: the need for functionality not yet provided by simple extensions to DB technology
  - E.g., combining computations with operations on data drives the need for new optimizations
  - Short-term optimizations (data location and movement, scheduling of data operations and computations) can be brought into the same framework as longer-term optimizations on how and where to store data
  - If DBs need to be created or moved as part of a Grid application, then the DBMSs must have their location and lifetime managed by Grid technology

File Access Service: GridFTP
- GridFTP: a data access and data transport service
- Uniform interface to different storage systems, disk systems
- Assumptions: incompatible data access protocols used by different storage systems; partition of the datasets on the Grid
  - As a result, applications need to specify a subset of storage systems or use multiple methods to retrieve data
- GridFTP functionalities
  - GridFTP includes and extends FTP
  - As a data access protocol: user-written code that processes data prior to transmission or storage
  - As a data transport protocol: a third party initiates and monitors data transfer between two other sites

Database-Oriented Access
- GridFTP is file-oriented, but we want a direct query interface that facilitates more complex specifications and more uniform access to data sources
  - Example query: get all temperature readings that are between 10F and 30F only
- Some standard mechanisms
  - JDBC
  - Remote connection
- OGSA-DAI: an open-source implementation of the specifications for accessing and integrating structured data
  - Relational or XML databases can be accessed via web services
  - An OGSA-DAI web service allows data to be queried, updated, transformed, and delivered

More Details on OGSA-DAI
Follow a sequence of steps to get services:
1. The client uses the data registry to locate a Grid data service factory (GDSF)
2. The client activates a GDSF with its Grid service handle (GSH)
3. The client asks the GDSF to produce a Grid data service (GDS) that provides the required access to data resources
4. The client asks the GDS to perform a sequence of operations (update, query, load, etc.)
- The GDS can be the data resource itself or a proxy for the data resource

Collective Data Management Services
- Collective services define functions whose operations can span multiple resources or services, including storage management and computational services
- Examples
  - Data transport services
  - Data discovery services
  - Workflow management, planning, and scheduling

Data Transport Services (some examples)
- Multiple data object transfer service
  - Users can submit and monitor a large number of simultaneous data transfer operations
- Reliable data transfer service
  - Closely monitors the status of data transfer operations and restarts failed transfers
  - Augments the basic data transfer service (GridFTP)
- Globus Toolkit's reliable file transfer service
  - Able to monitor and control third-party data transfer between two GridFTP servers

Data Discovery Services
- The first step is to discover relevant data before we can access, integrate, and analyze it
- Use attributes that describe the data to discover the relevant data
  - Names for the logical/physical data can also be used
  - Some example attributes: creator, data size, how the data were generated
- OGSA mechanisms for publishing and discovering services via registries can be utilized, but this can be complicated
  - How to describe, query, and match service properties?
  - Potentially large number of data objects we may want to discover
  - How to model
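
Attribute-based discovery, as described on the Data Discovery Services slide, can be illustrated with a small in-memory catalog. This is only a sketch under stated assumptions: the class names (DataObject, MetadataCatalog), the attribute keys, the logical names, and the gsiftp URLs are all hypothetical, and nothing here is an OGSA registry API. It shows data being published with descriptive attributes, discovered by matching those attributes, and then resolved from a logical name to physical locations.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A data item known by a logical name, with descriptive attributes."""
    logical_name: str
    attributes: dict
    physical_locations: list = field(default_factory=list)


def _matches(value, wanted):
    # A criterion is either an exact value or a predicate function.
    if callable(wanted):
        return value is not None and wanted(value)
    return value == wanted


class MetadataCatalog:
    """Hypothetical in-memory catalog; a real Grid registry is a remote service."""

    def __init__(self):
        self._objects = {}

    def publish(self, obj):
        self._objects[obj.logical_name] = obj

    def discover(self, **criteria):
        """Return logical names whose attributes satisfy every criterion."""
        return sorted(
            obj.logical_name
            for obj in self._objects.values()
            if all(_matches(obj.attributes.get(key), wanted)
                   for key, wanted in criteria.items())
        )

    def locate(self, logical_name):
        """Map a logical name to its physical copies."""
        return list(self._objects[logical_name].physical_locations)


def build_example_catalog():
    # All entries below are made-up illustration data.
    catalog = MetadataCatalog()
    catalog.publish(DataObject(
        "sky-survey-2006",
        {"creator": "alice", "size_bytes": 5_000_000,
         "generated_by": "telescope-scan"},
        ["gsiftp://site1.example.org/data/sky-2006",
         "gsiftp://site2.example.org/mirror/sky-2006"],
    ))
    catalog.publish(DataObject(
        "protein-db-snapshot",
        {"creator": "bob", "size_bytes": 120_000_000,
         "generated_by": "database-export"},
        ["gsiftp://site3.example.org/data/protein"],
    ))
    catalog.publish(DataObject(
        "sky-survey-2007",
        {"creator": "alice", "size_bytes": 900_000_000,
         "generated_by": "telescope-scan"},
        ["gsiftp://site1.example.org/data/sky-2007"],
    ))
    return catalog
```

For example, build_example_catalog().discover(creator="alice", size_bytes=lambda n: n < 10_000_000) selects only the smaller of the two survey entries, and locate() then returns its two replica locations. The sketch scans every entry, which sidesteps the scaling concern raised on the slide about potentially large numbers of data objects.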
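
The example query from the Database-Oriented Access slide (temperature readings between 10F and 30F) shows why a direct query interface is more expressive than file-oriented access. The sketch below uses Python's built-in sqlite3 module with a local in-memory database as a stand-in for a remote data source; the slides name JDBC and OGSA-DAI as the actual mechanisms, and neither is used here. The table name, column names, and sample rows are hypothetical, and treating the range bounds as inclusive is an assumption.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with made-up temperature readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, temp_f REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("s1", 5.0), ("s2", 10.0), ("s3", 22.5), ("s4", 30.0), ("s5", 41.0)],
    )
    conn.commit()
    return conn


def readings_between(conn, low_f, high_f):
    """Return (station, temp_f) rows with low_f <= temp_f <= high_f."""
    cursor = conn.execute(
        "SELECT station, temp_f FROM readings "
        "WHERE temp_f BETWEEN ? AND ? ORDER BY temp_f",
        (low_f, high_f),  # parameters are bound, not string-formatted
    )
    return cursor.fetchall()
```

Calling readings_between(build_example_db(), 10, 30) returns only the rows inside the range, so the selection happens at the data source and only matching rows need to be moved, rather than a whole file being transferred and filtered by the client.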