1. The Data Grid
Towards an Architecture for the Distributed Management of Large Scientific Databases
by Chervenak, Foster, Kesselman, Salisbury, and Tuecke
Presented by Asad B. Sayeed on this twentieth day of the month of October, anno domini 2005

2. Introduction
■ Large data collections
    Appear in many scientific domains: climate studies, physics, genomics.
    Terabyte, nearly petabyte sizes.
    Users AND resources often widely distributed geographically; large community.
■ Problematic combination: large data, geography, computationally intensive analysis with many queries.
■ Numerous solutions to individual issues exist, but no overarching architecture.
■ Hence, data grid.

3. Data Grid
■ Specialization and extension of “Grid” paradigm:
    Integrating infrastructure for distributed computation.
■ Complements other activities
    High speed disk caches
    Tertiary storage, cache management
    Metadata standards and retrieval mechanisms: from digital library community.
    Etc.
■ Structure of paper
    Discusses principles in designing data grid architecture.
    Basic services: storage systems and metadata management.
    Higher-level services and implementation issues.

4. Data Grid Design
■ Mechanism Neutrality
    DG architecture to be independent of lower-level storage/access systems. Encapsulation of local peculiarities.
■ Policy Neutrality
    Expose design decisions with performance implications to the user; i.e., let user set priorities.
■ Compatibility with Grid infrastructure
    Exploit Grid services like authentication and resource management.
■ Uniformity of information infrastructure
    Use the same data model and interface as the underlying Grid information infrastructure.

5. Data Grid Architecture
■ Layered architecture based on these four principles.
■ Lower layers provide high-performance access without imposing any policy choices, which promotes reuse of basic efficient mechanisms.

6. Core Data Grid Services
■ Two fundamental services: data access vs. metadata access.
■ Data Grid: keep data and metadata separate.
    This maximizes flexibility (applications that want to treat them as the same can still do so).
■ Storage systems
    Heterogeneous systems, locations: mechanism neutral.
    Unit of info: file instances. Can be represented as OS files, DB entries, etc.
    Logical notion of storage systems: can be implemented as HTTP servers and other diverse things.

7. Core Data Grid Services
■ Data access
    Obvious basic functionality: read/write file instances, determine and set file instance characteristics.
    Support replica management optimizations: allow transfer of data directly between storage systems.
    Data grid complications
        ➔ Heterogeneous security environments.
        ➔ Reservation capabilities on storage systems and environments when increased performance is required.
        ➔ Performance monitoring and self-optimization.
        ➔ Error detection and reporting.

8. Core Data Grid Services
■ Metadata service
    Metadata: information about the data grid itself, i.e., storage systems, file instance locations, file instance contents, and so on.
    Types of metadata
        ➔ Application metadata: info from, e.g., scientific applications.
        ➔ Replica metadata: location of data, data replicas, and criteria for selecting storage and access locations.
        ➔ System configuration metadata: information about the grid, e.g., connectivity, storage capacity, usage policy.
    Service: single interface.
    Application queries sent to repository/catalog. Repository consists of references to logical files, which can be mapped to actual replicas.

9. Core Data Grid Services
■ Metadata complications
    Difficult to select uniform representation and interface: numerous existing metadata representations reflecting different needs and philosophies (XML, indexing data structures, etc.).
    Large scale metadata issues: scalability, heterogeneity, distributed environment, ownership and local control over data access, robustness in the face of partial failures.
    Hence: hierarchical and distributed system.
    Similar systems: LDAP, existing Grid metadata systems.

10. Core Data Grid Services
■ Other basic services
    Authorization/authentication: already exists in Grid.
    Resource reservation, co-allocation: predictable performance.
    Performance measurements, estimation techniques.
    Instrumentation services for storage transfers and other operations.

11. Higher-Level Data Grid Components
■ Potentially unlimited number of possible components.
■ Two important components: replica management and replica selection.
■ Replica management
    Replica Manager: creates/deletes replicas of file instances.
    Replicas are a “user-asserted” data correspondence between files: not necessarily byte equality.
    Use of repository/catalog mentioned previously: map logical files to physical replicas.
    Replica Manager does not contain storage location criteria: leave policy matters to the application.

12. Higher-Level Data Grid Components
■ Replica Selection and Data Filtering
    Must furnish applications with the replica that has the best access performance characteristics.
    Criteria: speed, cost, security.
    Can trigger creation of better replicas.
    Use Grid information to obtain network performance data, rank replicas.
    Possible feature: extract important data subsets out of large file instances as replicas of their own.
        ➔ This requires special filtering programs that understand how to analyze the large files.

13. Implementation Experiences
■ LDAP implementation
    Used LDAP to construct catalogs/repositories.
    Catalog: tree structure known as Directory Information Tree (DIT).
    Two applications prototyped: climate modeling, data visualization.
    Climate modeling:
        ➔ DIT: root node, node for each collection (four), node for each logical file per collection.
        ➔ Metadata: XML.
          Data accessed via URLs.
        ➔ Prototype: user manually chooses replica.

14. Implementation Experiences
■ LDAP implementation
    Data visualization application
        ➔ Desktop client streams remote data.
        ➔ Each file: a timestep in a series of, e.g., astronomical data.
            - Can be of different resolutions.
            - Can be of different data layouts (endianness, etc.).
        ➔ Several thousand files.
        ➔ Each file has multiple replicas listed in catalog.
        ➔ Their implementation scales poorly: too many objects in replica catalog, requires distributed catalog implementation.
        ➔ Solution: organize logical files into collections.
            - Associate location info with collection.
            - Single lookup finds path to logical file for collection, which can then be mapped to replica.
            - Greatly reduces storage space.
            - (Apparently
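
The collection-based catalog from the slide above can be illustrated with a small in-memory sketch. This is not the LDAP implementation from the paper: the class names, the example hostnames, and the assumption that every location holds a full copy of the collection are all hypothetical, chosen only to show why storing location info once per collection needs a single lookup and fewer catalog entries.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Collection:
    """A named group of logical files that share replica locations (hypothetical model)."""
    name: str
    logical_files: set[str] = field(default_factory=set)
    # Base URLs of storage systems assumed to hold the whole collection.
    locations: list[str] = field(default_factory=list)


class ReplicaCatalog:
    """Maps (collection, logical file) to replica URLs via one collection lookup."""

    def __init__(self) -> None:
        self._collections: dict[str, Collection] = {}

    def add_collection(self, collection: Collection) -> None:
        self._collections[collection.name] = collection

    def replicas(self, collection_name: str, logical_file: str) -> list[str]:
        # Single lookup: location info lives on the collection, not on each file.
        collection = self._collections[collection_name]
        if logical_file not in collection.logical_files:
            raise KeyError(f"{logical_file!r} is not in collection {collection_name!r}")
        # Derive each replica URL from the collection's base path.
        return [f"{base.rstrip('/')}/{logical_file}" for base in collection.locations]


# Usage with made-up hosts and file names.
catalog = ReplicaCatalog()
catalog.add_collection(
    Collection(
        name="timesteps",
        logical_files={"t0001.dat", "t0002.dat"},
        locations=[
            "http://storage-a.example.org/vis",
            "http://storage-b.example.org/vis/",
        ],
    )
)
urls = catalog.replicas("timesteps", "t0001.dat")
print(urls)
```

With N logical files and M locations, this layout stores N + M entries per collection instead of N × M per-file replica records, which is the storage saving the slide refers to.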
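
Replica selection, as outlined on the Higher-Level Data Grid Components slides (rank replicas by speed, cost, and security, with policy left to the application), might be sketched as below. The `ReplicaInfo` fields, the sample numbers, and the two policy functions are hypothetical. A real system would fill them from Grid information services rather than hard-coded values.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ReplicaInfo:
    """Hypothetical per-replica metadata used for ranking."""
    url: str
    bandwidth_mbps: float  # assumed measured transfer speed
    cost: float            # assumed access cost in arbitrary units
    secure: bool           # assumed flag: meets the application's security needs


# A policy returns a score (higher is better) or None to reject a replica.
Policy = Callable[[ReplicaInfo], Optional[float]]


def rank_replicas(replicas: list[ReplicaInfo], policy: Policy) -> list[ReplicaInfo]:
    """Return the acceptable replicas, best first, according to the caller's policy."""
    scored = [(policy(r), r) for r in replicas]
    kept = [(score, r) for score, r in scored if score is not None]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]


def fastest_secure(replica: ReplicaInfo) -> Optional[float]:
    """Example policy: reject insecure replicas, then prefer higher bandwidth."""
    return replica.bandwidth_mbps if replica.secure else None


def cheapest(replica: ReplicaInfo) -> Optional[float]:
    """Example policy: prefer the lowest cost, ignoring speed and security."""
    return -replica.cost


# Usage with made-up replicas.
candidates = [
    ReplicaInfo("http://a.example.org/t0001.dat", 100.0, 5.0, False),
    ReplicaInfo("http://b.example.org/t0001.dat", 40.0, 1.0, True),
    ReplicaInfo("http://c.example.org/t0001.dat", 80.0, 3.0, True),
]
by_speed = [r.url for r in rank_replicas(candidates, fastest_secure)]
by_cost = [r.url for r in rank_replicas(candidates, cheapest)]
print(by_speed)
print(by_cost)
```

Passing the policy in as a function keeps the ranking mechanism free of built-in criteria, in the spirit of the policy-neutrality principle: the application, not the selection service, decides what "best" means.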