Bloomberg School BIO 751 - the stashR package

Interacting with local and remote data repositories using the stashR package
  Abstract
  Overview
  The stashR package
  Design rationale
  Repositories
  Versioning
  Caching and synchronization
  Interface
  Creating a ‘localDB’ repository or a cache of a ‘remoteDB’ repository
  Accessing a ‘localDB’ or ‘remoteDB’ database
  Examples
  Discussion

Computational Statistics
DOI 10.1007/s00180-008-0124-x

ORIGINAL PAPER

Interacting with local and remote data repositories using the stashR package

Sandrah P. Eckel · Roger D. Peng

Received: 14 March 2007 / Accepted: 8 May 2008
© Springer-Verlag 2008

Abstract The stashR package (a Set of Tools for Administering Shared Repositories) for R implements a basic versioned key-value style database where character string keys are associated with data values. Using the S4 classes ‘localDB’ and ‘remoteDB’, and associated methods, versioned key-value databases can be either created locally on the user’s computer or accessed remotely via the Internet. The stashR package can enhance reproducible research by providing a ‘localDB’ database format for the caching of computations which can subsequently be stored on the Internet. To reproduce a particular computation, a reader can access the ‘remoteDB’ database and obtain the associated R objects.

Keywords Reproducible research · Database · Data distribution · Version control

This work was supported in part by the Johns Hopkins Training Program in the Epidemiology and Biostatistics of Aging (NIA T32 AG00247) and the Faculty Innovation Fund, Johns Hopkins Bloomberg School of Public Health.

S. P. Eckel (✉) · R. D. Peng
Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe Street E3527, Baltimore, MD 21205, USA
e-mail: [email protected]

1 Overview

Scientific research is conducted by collecting data, analyzing and summarizing the evidence in the data, and publishing substantive results in a paper. To verify scientific results we must replicate or, at a minimum, reproduce the findings of a previous study. Replication is the act of collecting an independent dataset in a similar manner to the original study and then using the data to address a similar scientific question of interest. Reproduction of scientific research is the act of using the same data as the original study and performing additional statistical analyses (Gentleman and Temple Lang 2004). Replication is generally considered the highest standard of verification because researchers can address the uncertainty inherent in collecting a data sample from a larger population and improve upon any shortcomings of previous data collection designs. Reproducibility allows researchers to evaluate the sensitivity of results from the initial statistical analysis of a study. Although replication is ideal, in cases such as large epidemiologic studies, reproduction is often the only way to verify the scientific findings of a study (Peng et al. 2006).

For fully reproducible research we need to facilitate prospective analysts’ access to the data used for a statistical analysis. Hence we need to be able to distribute potentially large datasets and computations. To disseminate accurate data, we need a system for physical data distribution that manages data updates by automatically synchronizing each prospective analyst’s local copy of the data to the remote, master copy of the data and that keeps track of version changes to the data. It follows that we need software to help manage both the distribution and synchronization/caching of this data.

2 The stashR package

The stashR package is an extension to local and remote databases of the filehash package (Peng 2006), which uses a key-value database to allow for interactive work with datasets too large to be loaded into R as a single object. A key-value database is a collection of data files, each indexed by a character string key.
One example of a key-value database is a multi-center study consisting of data from four cities (New York, Los Angeles, Chicago and Seattle) where data for each city is stored in a file named ‘ny’, ‘la’, ‘chicago’, and ‘seattle’ respectively. In this case, the key-value database allows a researcher to download data from a particular city of interest instead of downloading the entire four-city dataset at once.

The stashR package adds important functionalities to data handling and distribution in R. These contributions include:

- a set of tools for creating a versioned local database for subsequent exporting to a remotely accessible webserver
- the ability to access versioned remote databases efficiently
- a tool for synchronizing local copies of a database to the remote version
- an abstract interface for interacting with local and remote databases.

The local and remote features of stashR address the need to manage data distribution and synchronize cached data copies for reproducible research.

3 Design rationale

3.1 Repositories

The stashR package is designed for interacting with both local and remote data repositories. A repository is a directory of files containing data and metadata. Each user interface function in stashR is a generic function with specific methods defined for repository objects of the S4 classes ‘localDB’ and ‘remoteDB’. A ‘localDB’ repository is created and stored on a user’s local disk. A ‘remoteDB’ repository is a ‘localDB’ repository that has been stored on a remote webserver. A ‘localDB’ repository consists of a root directory that contains a data directory and a text file ‘version’. The ‘version’ file lists information on the current version of the repository and all previous repository versions. The data directory contains compressed data files labeled according to their corresponding character key and the version number of the data indexed by this key (the key version).
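
The ‘localDB’ layout described so far (a root directory holding a data directory and a ‘version’ file, with one compressed file per key and key version) can be illustrated with a short sketch. It is written in Python purely for illustration and is not the stashR implementation or its R interface. The file naming scheme `<key>.<key version>`, the line format of the ‘version’ file, the use of pickle and zlib, and the function names `insert` and `fetch` are all assumptions made for this example.

```python
import pickle
import tempfile
import zlib
from pathlib import Path


def _current_state(root: Path):
    """Read the last line of 'version': (repository version, {key: key version})."""
    version_file = root / "version"
    if not version_file.exists():
        return 0, {}
    lines = version_file.read_text().splitlines()
    if not lines:
        return 0, {}
    # Assumed line format: "<repository version>: <key>.<key version> ..."
    repo_version, _, entries = lines[-1].partition(":")
    state = {}
    for entry in entries.split():
        key, _, key_version = entry.rpartition(".")
        state[key] = int(key_version)
    return int(repo_version), state


def insert(root: Path, key: str, value) -> int:
    """Store value under key as a new key version and record a new repository version."""
    data_dir = root / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    repo_version, state = _current_state(root)
    state[key] = state.get(key, 0) + 1
    # Older key versions stay on disk; only a new file is added.
    payload = zlib.compress(pickle.dumps(value))
    (data_dir / f"{key}.{state[key]}").write_bytes(payload)
    line = f"{repo_version + 1}:" + "".join(f" {k}.{v}" for k, v in sorted(state.items()))
    with (root / "version").open("a") as handle:
        handle.write(line + "\n")
    return state[key]


def fetch(root: Path, key: str):
    """Return the value stored under key in the current repository version."""
    _, state = _current_state(root)
    payload = (root / "data" / f"{key}.{state[key]}").read_bytes()
    return pickle.loads(zlib.decompress(payload))


root = Path(tempfile.mkdtemp())
insert(root, "ny", [1, 2, 3])
insert(root, "la", [4, 5])
insert(root, "ny", [1, 2, 3, 4])   # second key version of 'ny'
print(fetch(root, "ny"))
print((root / "version").read_text())
```

Because every update appends a line to ‘version’ rather than overwriting it, the sketch keeps enough information to reconstruct which key versions belonged to any earlier repository version.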
Each data file has a corresponding ‘.SIG’ text file that lists the 32-character hexadecimal MD5 checksum obtained by running md5sum() on the data file (see the R package tools for more details), the data file’s character key and key version number. The local cache associated with a ‘remoteDB’ repository consists of a root directory containing a data directory and a text file ‘url’ that lists the ‘remoteDB’ repository’s URL on a webserver. The data directory contains the cached data and ‘.SIG’ files corresponding to user-specified keys. We will not discuss the ‘.SIG’ files further since they will be employed in the future to safeguard against data corruption, but they are
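
The role of a ‘.SIG’ file as a guard against data corruption can be illustrated with a small sketch. This is a Python illustration only, not the stashR code: the on-disk format of the signature file (one field per line), the `<data file>.SIG` naming, and the helper names `write_sig` and `is_intact` are assumptions. The MD5 digest is computed with Python's standard `hashlib` and written as 32 hexadecimal characters.

```python
import hashlib
import tempfile
from pathlib import Path


def write_sig(data_file: Path, key: str, key_version: int) -> Path:
    """Write '<data file>.SIG' listing the MD5 digest, the key and the key version."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    sig_file = data_file.with_name(data_file.name + ".SIG")
    # Assumed format: digest, key and key version on separate lines.
    sig_file.write_text(f"{digest}\n{key}\n{key_version}\n")
    return sig_file


def is_intact(data_file: Path) -> bool:
    """Recompute the MD5 digest and compare it with the one recorded in the .SIG file."""
    sig_file = data_file.with_name(data_file.name + ".SIG")
    recorded = sig_file.read_text().splitlines()[0]
    return hashlib.md5(data_file.read_bytes()).hexdigest() == recorded


data_dir = Path(tempfile.mkdtemp())
data_file = data_dir / "ny.1"
data_file.write_bytes(b"example payload")
write_sig(data_file, "ny", 1)
print(is_intact(data_file))            # digest matches the recorded one

data_file.write_bytes(b"corrupted payload")
print(is_intact(data_file))            # digest no longer matches
```

A check of this kind would let a client detect a damaged or partially downloaded data file before loading it, at the cost of reading the file once more to hash it.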

