Ensuring Reliable Datasets for Environmental Models and Forecasts

Emery R. Boose¹*, Aaron M. Ellison¹, Leon J. Osterweil², Lori A. Clarke², Rodion Podorozhny³, Julian L. Hadley¹, Alexander Wise², David R. Foster¹

¹Harvard University, United States
²University of Massachusetts, United States
³Texas State University, United States

*Corresponding author. E-mail address: [email protected]

Article History:
Received 31 October 2006
Received in revised form 10 July 2007
Accepted 30 July 2007

Keywords: Analytic web, Little-JIL, metadata, process, sensor network, water flux

Citation: Boose, E. R., A. M. Ellison, L. J. Osterweil, L. A. Clarke, R. Podorozhny, J. L. Hadley, A. Wise, and D. R. Foster. 2007. Ensuring reliable datasets for environmental models and forecasts. Ecological Informatics 2: 237-247. DOI: 10.1016/j.ecoinf.2007.07.006

Note: Elsevier prohibits others from distributing online the published version of this article. This is an author’s version that contains the same material with different formatting.

ABSTRACT

At the dawn of the 21st century, environmental scientists are collecting more data more rapidly than at any time in the past. Nowhere is this change more evident than in the advent of sensor networks able to collect and process (in real time) simultaneous measurements over broad areas and at high sampling rates. At the same time there has been great progress in the development of standards, methods, and tools for data analysis and synthesis, including a new standard for descriptive metadata for ecological datasets (Ecological Metadata Language) and new workflow tools that help scientists to assemble datasets and to diagram, record, and execute analyses. However these developments (important as they are) are not yet sufficient to guarantee the reliability of datasets created by a scientific process – the complex activity that scientists carry out in order to create a dataset.
We define a dataset to be reliable when the scientific process used to create it is (1) reproducible and (2) analyzable for potential defects. To address this problem we propose the use of an analytic web, a formal representation of a scientific process that consists of three coordinated graphs (a data-flow graph, a dataset-derivation graph, and a process-derivation graph) originally developed for use in software engineering. An analytic web meets the two key requirements for ensuring dataset reliability: (1) a complete audit trail of all artifacts (e.g., datasets, code, models) used or created in the execution of the scientific process that created the dataset, and (2) detailed process metadata that precisely describe all sub-processes of the scientific process. Construction of such metadata requires the semantic features of a high-level process definition language. In this paper we illustrate the use of an analytic web to represent the scientific process of constructing estimates of ecosystem water flux from data gathered by a complex, real-time multi-sensor network. We use Little-JIL, a high-level process definition language, to precisely and accurately capture the analytical processes involved. We believe that incorporation of this approach into existing tools and evolving metadata specifications (such as EML) will yield significant benefits to science. These benefits include: complete and accurate representations of scientific processes; support for rigorous evaluation of such processes for logical and statistical errors and for propagation of measurement error; and assurance of dataset reliability for developing sound models and forecasts of environmental change.

INTRODUCTION

At the dawn of the 21st century, more environmental scientists are collecting more data at more rapid rates than at any time in the past. The pace of environmental change is also accelerating.
Policy analysts and decision-makers require robust and reliable models of ongoing ecological dynamics and forecasts of environmental change (Clark et al., 2001). Uncertainty in these forecasts will decline as more data are accumulated and models are updated and run again (Ellison, 1996). Economic markets are evolving that will take advantage of these forecasts to set prices for scarce natural resources, emissions levels, and restoration efforts (Cacho et al., 2003). The lone ecologist studying obscure organisms in pristine sites who analyzes simple datasets and publishes results in technical journals read by few is quickly being replaced by teams of investigators who are studying “real-world” environmental problems, analyzing terabytes of data in (near) real time, and communicating their results to vastly broader audiences, all with the aid of new information technologies.

Great progress has been made in recent years in the development of standards, methods, and tools for data analysis and synthesis (Jones et al., 2006). Descriptive metadata that provide essential information about the contents of individual or aggregated datasets (Michener et al., 1997) can now be represented using a community standard, Ecological Metadata Language (EML¹). New scientific workflow tools help scientists to assemble datasets and to diagram, record, and execute analyses. Perhaps the most notable of such tools for ecologists is the open-source Kepler², which is able to utilize EML directly (Altintas et al., 2004; Ludäscher et al., 2005). There is also growing interest in developing permanent archives of environmental models to facilitate verification of past results and development of future models (Thornton et al., 2005).

Nevertheless these important developments are not sufficient to ensure the reliability of datasets created through a scientific process – the complex activity that scientists carry out in order to create a dataset.
We define a dataset to be reliable when the scientific process used to create it (both the overall process and the specific execution trace) is (1) reproducible and (2) analyzable for potential defects. By reproducible, we mean that the reported results could be replicated exactly by an independent entity (Schwab et al., 2000; NRC, 2003). By analyzable, we mean that the process could be rigorously evaluated for potential defects such as undesirable outcomes or logical and statistical errors (Dwyer et al., 2004). EML does not have a formal structure for accurately capturing the processes used by scientists to analyze and synthesize data, update existing

