Bloomberg School BIO 751 - the stashR package - D2068064

School name Johns Hopkins Bloomberg School of Public Health

Course Bio 751- Teacher’s Corner

Pages 8

Interacting with local and remote data repositories using the stashR package

Contents: Abstract; Overview; The stashR package; Design rationale; Repositories; Versioning; Caching and synchronization; Interface; Creating a ‘localDB’ repository or a cache of a ‘remoteDB’ repository; Accessing a ‘localDB’ or ‘remoteDB’ database; Examples; Discussion

Computational Statistics
DOI 10.1007/s00180-008-0124-x

ORIGINAL PAPER

Sandrah P. Eckel · Roger D. Peng

Received: 14 March 2007 / Accepted: 8 May 2008
© Springer-Verlag 2008

This work was supported in part by the Johns Hopkins Training Program in the Epidemiology and Biostatistics of Aging (NIA T32 AG00247) and the Faculty Innovation Fund, Johns Hopkins Bloomberg School of Public Health.

S. P. Eckel (B) · R. D. Peng
Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe Street E3527, Baltimore, MD 21205, USA
e-mail: [email protected]

Abstract The stashR package (a Set of Tools for Administering Shared Repositories) for R implements a basic versioned key-value style database where character string keys are associated with data values. Using the S4 classes ‘localDB’ and ‘remoteDB’, and associated methods, versioned key-value databases can be either created locally on the user’s computer or accessed remotely via the Internet. The stashR package can enhance reproducible research by providing a ‘localDB’ database format for the caching of computations which can subsequently be stored on the Internet. To reproduce a particular computation, a reader can access the ‘remoteDB’ database and obtain the associated R objects.

Keywords Reproducible research · Database · Data distribution · Version control

1 Overview

Scientific research is conducted by collecting data, analyzing and summarizing the evidence in the data, and publishing substantive results in a paper. To verify scientific results we must replicate or, at a minimum, reproduce the findings of a previous study. Replication is the act of collecting an independent dataset in a similar manner to the original study and then using the data to address a similar scientific question of interest. Reproduction of scientific research is the act of using the same data as the original study and performing additional statistical analyses (Gentleman and Temple Lang 2004). Replication is generally considered the highest standard of verification because researchers can address the uncertainty inherent in collecting a data sample from a larger population and improve upon any shortcomings of previous data collection designs. Reproducibility allows researchers to evaluate the sensitivity of results from the initial statistical analysis of a study. Although replication is ideal, in cases such as large epidemiologic studies, reproduction is often the only way to verify the scientific findings of a study (Peng et al. 2006).

For fully reproducible research we need to facilitate prospective analysts’ access to the data used for a statistical analysis. Hence we need to be able to distribute potentially large datasets and computations. To disseminate accurate data, we need a system for physical data distribution that manages data updates by automatically synchronizing each prospective analyst’s local copy of the data to the remote, master copy of the data and that keeps track of version changes to the data. It follows that we need software to help manage both the distribution and synchronization/caching of this data.

2 The stashR package

The stashR package is an extension to local and remote databases of the filehash package (Peng 2006), which uses a key-value database to allow for interactive work with datasets too large to be loaded into R as a single object. A key-value database is a collection of data files, each indexed by a character string key.
One example of a key-value database is a multi-center study consisting of data from 4 cities (New York, Los Angeles, Chicago and Seattle) where data for each city is stored in a file named ‘ny’, ‘la’, ‘chicago’, and ‘seattle’ respectively. In this case, the key-value database allows a researcher to download data from a particular city of interest instead of downloading the entire four city dataset at once.

The stashR package adds important functionalities to data handling and distribution in R. These contributions include:

− a set of tools for creating a versioned local database for subsequent exporting to a remotely accessible webserver
− the ability to access versioned remote databases efficiently
− a tool for synchronizing local copies of a database to the remote version
− an abstract interface for interacting with local and remote databases.

The local and remote features of stashR address the need to manage data distribution and synchronize cached data copies for reproducible research.

3 Design rationale

3.1 Repositories

The stashR package is designed for interacting with both local and remote data repositories. A repository is a directory of files containing data and metadata. Each user interface function in stashR is a generic function with specific methods defined for repository objects of the S4 classes ‘localDB’ and ‘remoteDB’. A ‘localDB’ repository is created and stored on a user’s local disk. A ‘remoteDB’ repository is a ‘localDB’ repository that has been stored on a remote webserver. A ‘localDB’ repository consists of a root directory that contains a data directory and a text file ‘version.’ The ‘version’ file lists information on the current version of the repository and all previous repository versions. The data directory contains compressed data files labeled according to their corresponding character key and the version number of the data indexed by this key (the key version).
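
The repository layout described so far (a root directory holding a data directory of compressed, per-key, per-version files plus a ‘version’ text file) can be illustrated with a small conceptual sketch. The sketch is written in Python rather than R and is not the stashR implementation: the class name `TinyVersionedStore`, the `<key>.<version>` file naming, and the one-line-per-insert format of the ‘version’ file are all assumptions made for illustration.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


class TinyVersionedStore:
    """Toy versioned key-value store (hypothetical; not stashR's on-disk format)."""

    def __init__(self, root):
        self.root = Path(root)
        self.data = self.root / "data"            # one compressed file per key version
        self.data.mkdir(parents=True, exist_ok=True)
        self.version_file = self.root / "version"  # assumed: one line per insert
        self.version_file.touch()

    def _key_versions(self, key):
        # Key versions already on disk for this key, in ascending order.
        prefix = f"{key}."
        return sorted(
            int(p.name[len(prefix):])
            for p in self.data.iterdir()
            if p.name.startswith(prefix) and p.name[len(prefix):].isdigit()
        )

    def insert(self, key, value):
        # Write a new compressed file instead of overwriting the previous version.
        versions = self._key_versions(key)
        new_version = versions[-1] + 1 if versions else 1
        with gzip.open(self.data / f"{key}.{new_version}", "wb") as fh:
            pickle.dump(value, fh)
        with self.version_file.open("a") as fh:
            fh.write(f"{key}.{new_version}\n")
        return new_version

    def fetch(self, key, version=None):
        # Return the latest key version unless an older one is requested.
        versions = self._key_versions(key)
        if not versions:
            raise KeyError(key)
        chosen = versions[-1] if version is None else version
        with gzip.open(self.data / f"{key}.{chosen}", "rb") as fh:
            return pickle.load(fh)

    def keys(self):
        return sorted({p.name.rsplit(".", 1)[0] for p in self.data.iterdir()})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = TinyVersionedStore(tmp)
        # Placeholder values standing in for the four-city example.
        for city in ("ny", "la", "chicago", "seattle"):
            db.insert(city, {"city": city, "n": 1})
        db.insert("ny", {"city": "ny", "n": 2})   # creates key version 2 of 'ny'
        print(db.keys())
        print(db.fetch("ny"), db.fetch("ny", version=1))
```

Because each insert writes a new file, a reader can fetch a single city without loading the others and can still recover an earlier key version of that city's data.
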
Each data file has a corresponding ‘.SIG’ text file that lists the 32-character hexadecimal MD5 checksum from running md5sum() on the data file (see the R package tools for more details), the data file’s character key and key version number. The local cache associated with a ‘remoteDB’ repository consists of a root directory containing a data directory and a text file ‘url’ that lists the ‘remoteDB’ repository’s URL on a webserver. The data directory contains the cached data and ‘.SIG’ files corresponding to user-specified keys. We will not discuss the ‘.SIG’ files further since they will be employed in the future to safeguard against data corruption, but they are
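
As a rough illustration of how a checksum file of this kind could guard against data corruption, the following Python sketch writes a signature file next to a data file and later compares the recorded MD5 digest with a freshly computed one. It is a conceptual example only: the function names, the `<data file>.SIG` naming, and the three-line file layout (checksum, key, key version) are assumptions, not the format stashR actually writes.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 digest of a file as a 32-character hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sig(data_path, key, key_version):
    """Write an assumed three-line signature file: checksum, key, key version."""
    data_path = Path(data_path)
    sig_path = data_path.with_name(data_path.name + ".SIG")
    sig_path.write_text(f"{md5_hex(data_path)}\n{key}\n{key_version}\n")
    return sig_path


def verify_sig(data_path):
    """Check that the data file still matches the checksum recorded for it."""
    data_path = Path(data_path)
    sig_path = data_path.with_name(data_path.name + ".SIG")
    recorded = sig_path.read_text().splitlines()[0]
    return recorded == md5_hex(data_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "ny.1"
        data.write_bytes(b"example payload")
        write_sig(data, key="ny", key_version=1)
        print(verify_sig(data))              # file unchanged since signing
        data.write_bytes(b"corrupted payload")
        print(verify_sig(data))              # contents no longer match the digest
```

A cache could run such a check after downloading a data file and fetch the file again when the digests disagree.
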
