Providing Provenance Through Workflows and Databases

Executive Summary


Data provenance is a fundamental issue in the processing of scientific information and beyond. Two lines of research have been pursued in recent years with direct bearing on the issues of data provenance. In one of them, provenance in workflows, the emphasis has been on extracting provenance from logs of events marking the execution of different modules over various initial and derived datasets. In the other line of research, provenance in databases, the emphasis has been on the propagation of provenance through the operators that make up database views, or on propagation of provenance through copy/cut-and-paste operations within and among databases.
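To make the workflow side of this concrete, the following is a minimal sketch of how lineage could be recovered from an execution log. The `Event` record, the `lineage` function, and the sample module and dataset names are all hypothetical illustrations. They are not taken from any particular workflow system.

```python
from collections import namedtuple

# Hypothetical log record: one module execution with its input and output datasets.
Event = namedtuple("Event", ["module", "inputs", "outputs"])


def lineage(log, dataset):
    """Return (datasets, modules) that `dataset` transitively depends on."""
    # Map each derived dataset to the event that produced it.
    produced_by = {out: ev for ev in log for out in ev.outputs}
    seen_data, seen_modules = set(), set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        ev = produced_by.get(current)
        if ev is None:
            continue  # an initial dataset: nothing in the log produced it
        seen_modules.add(ev.module)
        for name in ev.inputs:
            if name not in seen_data:
                seen_data.add(name)
                stack.append(name)
    return seen_data, seen_modules


# Illustrative log (assumed names).
log = [
    Event("align", ["reads"], ["aligned"]),
    Event("call", ["aligned", "ref"], ["variants"]),
]
data, modules = lineage(log, "variants")
```

Here the lineage of `variants` reaches back through `aligned` to the initial datasets `reads` and `ref`. Each module remains a black box: the log records which datasets it consumed and produced, not how individual items inside them relate.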

These two bodies of work have employed different techniques and at first glance their results appear quite different. However, in many scientific applications database manipulations co-exist with the execution of workflow modules and the provenance of the resulting data should integrate both kinds of processing into a usable paradigm.

An analysis of existing work on data provenance in workflows and in databases shows that the main difficulties in unifying these two different kinds of data provenance are:

  1. The lack of a data model that is rich enough to capture the interaction
    between the structure of the data and the structure of the workflow.
  2. The lack of a high-level specification framework in which
    database operators and workflow modules can be treated uniformly.

The objective of this work is to provide a framework for overcoming these difficulties, and to provide tools that allow a truly comprehensive approach to defining, manipulating, managing and querying the provenance of scientific data.

The method that will be followed is to use a data model that supports nested collections, and a functional language (the Nested Relational Calculus, NRC) to describe workflow specifications and database transformations over nested collections. Using this model and language, the theoretical underpinnings of a joint framework for defining, manipulating, managing and querying data provenance will be developed, along with algorithms for managing provenance and reducing provenance overload. Techniques for opening up the "black box" style of provenance in workflow systems will also be explored. While the theoretical foundation of the framework will be based on NRC, the results will be transitioned to an analogous foundation based on XML and XQuery, which will be used for the implementation. A prototype will be developed, and the feasibility of the approach evaluated.
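As a rough illustration of treating database operators and workflow modules uniformly over nested collections, the sketch below annotates each collection element with the set of sources it was derived from and propagates those annotations through a flat-map operator in the style of NRC's `ext`. This is an assumed, simplified encoding in Python, not the project's formal framework. The names `Annotated`, `source`, `ext`, and `module` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Annotated:
    """A value paired with the identifiers of everything it was derived from."""
    value: object
    prov: FrozenSet[str]


def source(name, values):
    """Tag each element of an input collection with an id such as 'samples[0]'."""
    return [Annotated(v, frozenset({f"{name}[{i}]"})) for i, v in enumerate(values)]


def ext(f, coll):
    """Flat-map in the style of NRC's ext: apply f to every element and
    concatenate the results. Each output inherits the provenance of the
    element that produced it."""
    out = []
    for item in coll:
        for produced in f(item.value):
            out.append(Annotated(produced.value, produced.prov | item.prov))
    return out


def module(name, fn):
    """Wrap a black-box workflow module so it can be used with ext."""
    def run(value):
        return [Annotated(fn(value), frozenset({f"module:{name}"}))]
    return run


# A nested collection: each element of `samples` is itself a list.
samples = source("samples", [["acgt", "tt"], ["ga"]])
counts = ext(module("count", len), samples)
```

After this runs, `counts[0]` carries the value `2` and the provenance `{"samples[0]", "module:count"}`. The same `ext` operator would serve for a database-style transformation, since `f` can be any function that returns annotated elements.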

The work builds on the PIs' expertise and past work on provenance in workflows and provenance summarization techniques, provenance in databases, and NRC query and update languages.

Software


PDiffView is a software system that takes as input two runs of the same specification and shows how their executions differ. This can be used to understand why the results of workflow runs differ. [prototype] [video]
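The idea of comparing two runs can be sketched in a few lines. The example below is not PDiffView's algorithm. It is a deliberately naive comparison that assumes each run is recorded as a mapping from step name to that step's output, and the function name `diff_runs` and the sample data are hypothetical.

```python
def diff_runs(run_a, run_b):
    """Compare two runs, each given as a dict from step name to step output."""
    steps_a, steps_b = set(run_a), set(run_b)
    return {
        # Steps executed in only one of the two runs.
        "only_in_a": sorted(steps_a - steps_b),
        "only_in_b": sorted(steps_b - steps_a),
        # Steps executed in both runs that produced different outputs.
        "changed": sorted(s for s in steps_a & steps_b if run_a[s] != run_b[s]),
    }


# Illustrative runs of the same specification (assumed data).
run_1 = {"align": "a1", "filter": "f1", "call": "v1"}
run_2 = {"align": "a1", "call": "v2", "annotate": "n1"}
report = diff_runs(run_1, run_2)
```

In this example the report shows that `filter` ran only in the first run, `annotate` ran only in the second, and `call` produced different outputs. That points to the skipped `filter` step as a likely cause of the differing result.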

Project Members

Partner organizations


Arizona State University

University Paris-Sud, Orsay

Funding

This material is based upon work supported by the National Science
Foundation under Grant No. 0803524.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National Science Foundation.