Issues in Automatic Provenance Collection

Uri Braun, Simson Garfinkel, David A. Holland, Kiran-Kumar Muniswamy-Reddy, Margo I. Seltzer
Harvard University, Cambridge, [email protected]

Automatic provenance collection describes systems that observe processes and data transformations inferring, collecting, and maintaining provenance about them. Automatic collection is a powerful tool for analysis of objects and processes, providing a level of transparency and pervasiveness not found in more conventional provenance systems. Unfortunately, automatic collection is also difficult. We discuss the challenges we encountered and the issues we exposed as we developed an automatic provenance collector that runs at the operating system level.

1 Introduction

Today’s provenance management systems usually take one of two approaches to provenance collection: Either users enter it manually or applications explicitly collect provenance and enter it into a database. There is, however, a third model: automatic provenance collection. In automatic collection, the system observes the actions of users and programs and derives provenance, storing it without user or application involvement.

Automatic collection is a powerful approach, because it eliminates user error, consistently collects provenance across all applications, and captures more complete provenance than systems relying on a user’s or application developer’s assumptions about provenance. For example, automatic provenance collection in the operating system allows us to: identify system configuration changes (e.g., new tools or libraries), identify environment variable modifications that alter program behavior, debug faulty builds that are missing dependencies, identify the source and creation of unusual files, and create scripts that produce objects.

In earlier work we described a prototype Provenance-Aware Storage System (PASS), built on Linux, that automatically collects provenance at the operating system level [18].
PASS is similar to systems such as ClearCase [5], GenePattern [9], and Vesta [10]. These systems observe users’ and applications’ activities, recording the provenance captured in these activities. PASS takes this one step further, observing all processes that run on a PASS-enabled operating system, generating provenance for objects that do not have provenance (i.e., are unprovenanced), and attaching complete system-level provenance to objects that are created on a provenance-aware file system. We capture low-level details like the operating system, kernel modules loaded, installed libraries, and process environment.

We found that automatic OS-level provenance collection is useful, complementing existing approaches. However, we exposed a number of challenging issues that arise from automatic collection. This paper introduces automatic provenance collection and discusses the more interesting challenges that arise in building these systems.

In Section 2 we define automatic provenance collection, placing it in the context of existing provenance solutions. In Section 3 we discuss the problems that arise when designing and building systems with automatic provenance collection. In Section 4, we present use cases where disclosed and observed provenance systems together provide more powerful solutions than either independently. In Section 5 we introduce provenance pruning, the deletion of provenance, and discuss strategies for implementing it. In Section 6 we discuss the privacy implications of automatic provenance collection. In Section 7 we discuss automatic provenance systems and technologies from which we can borrow in constructing automatic provenance systems, and in Section 8 we conclude.

2 What is Automatic Collection?

An automatic collecting system transparently records provenance for all activities it performs by observing the sequence of operations executed and translating relevant ones into provenance.
For example, when a process begins running, the operating system identifies several process provenance attributes¹, such as the executable, operating system, loaded kernel modules, libraries, environment, and command line.

¹ We think of processes as having provenance so that we can transfer a process’s provenance to objects it creates.

The system continues to collect provenance about the running process, recording (for example) input sources. Whenever the process creates or modifies an object, the process’s provenance is assigned to the written object. We call this form of provenance collection observed, because the system derives provenance from the events that it observes. An observed provenance system does not necessarily understand the semantics of its observations, so it must record everything that is potentially part of an object’s provenance. This can lead to false provenance if the observing system does not perform detailed information flow analysis. Section 2.1 discusses existing systems that use observed provenance.

Most existing provenance systems use disclosed provenance. In disclosed provenance systems, users or applications present provenance to the system, using the provenance system merely as a storage and query engine. There are several kinds of disclosed provenance. Manual provenance, sometimes called annotation, is entered by users. For example, the provenance of data entered manually by a user must itself be manually entered. Specified provenance describes an object’s intended provenance in a structured way, typically by directing the system to produce the object via various transformations or workflows. For example, workflow-based systems [23, 32] and makefiles capture intended provenance by describing how a target is created from its sources. The workflow systems frequently then generate true provenance from the execution of these workflows, while make does not.
Instead, the canonical software development environment relies on a separate component, a source code control system, to store semantic provenance. In these systems, the difference between successive versions precisely identifies what changed, but the rationale is entered as manual provenance in the form of commit messages.

2.1 Observed-Provenance Systems

There are several domain-specific observed provenance systems. GenePattern [9] is a working environment for computational biology and biomedical research. It tracks provenance for the objects created in the environment. ClearCase [5] and Vesta [10] are environments designed for software
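
To make the observed model of Section 2 concrete, the following Python sketch shows one way a collector might attach a process's accumulated provenance to every object that process writes. It is an illustration under stated assumptions, not the PASS implementation: the names ProcessRecord, ObservedProvenanceStore, on_read, on_write, and ancestors, as well as all file and program names, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ProcessRecord:
    """Hypothetical record of a running process and what it has read so far."""
    executable: str
    command_line: List[str]
    environment: Dict[str, str]
    inputs: Set[str] = field(default_factory=set)


class ObservedProvenanceStore:
    """Toy observer: maps each written object to the provenance of its writer."""

    def __init__(self) -> None:
        self.provenance: Dict[str, dict] = {}

    def on_read(self, proc: ProcessRecord, path: str) -> None:
        # Every read is recorded; the observer cannot tell whether the data
        # actually influences later output.
        proc.inputs.add(path)

    def on_write(self, proc: ProcessRecord, path: str) -> None:
        # The process's provenance is assigned to the written object.
        self.provenance[path] = {
            "executable": proc.executable,
            "command_line": list(proc.command_line),
            "environment": dict(proc.environment),
            "inputs": sorted(proc.inputs),
        }

    def ancestors(self, path: str) -> Set[str]:
        """Return every object reachable through recorded inputs."""
        seen: Set[str] = set()
        stack = [path]
        while stack:
            current = stack.pop()
            for parent in self.provenance.get(current, {}).get("inputs", []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical build: a compiler reads two files, then a linker reads its output.
store = ObservedProvenanceStore()
cc = ProcessRecord("/usr/bin/cc", ["cc", "-c", "main.c"], {"PATH": "/usr/bin"})
store.on_read(cc, "main.c")
store.on_read(cc, "notes.txt")  # read, but possibly irrelevant to the output
store.on_write(cc, "main.o")

ld = ProcessRecord("/usr/bin/ld", ["ld", "main.o", "-o", "a.out"], {"PATH": "/usr/bin"})
store.on_read(ld, "main.o")
store.on_write(ld, "a.out")
```

Because this observer performs no information flow analysis, notes.txt is recorded in the provenance of both main.o and a.out whether or not the compiler used it, which is the kind of false provenance described in Section 2.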

