Data provenance is a fundamental issue in the processing of scientific information and beyond. Two lines of research have been pursued in recent years with direct bearing on the issues of data provenance. In one of them, provenance in workflows, the emphasis has been on extracting provenance from logs of events marking the execution of different modules over various initial and derived datasets. In the other line of research, provenance in databases, the emphasis has been on the propagation of provenance through the operators that make up database views, or on propagation of provenance through copy/cut-and-paste operations within and among databases.
These two bodies of work have employed different techniques and at first glance their results appear quite different. However, in many scientific applications database manipulations co-exist with the execution of workflow modules and the provenance of the resulting data should integrate both kinds of processing into a usable paradigm.
An analysis of existing work on data provenance in workflows and in databases shows that the main difficulties in unifying these two different kinds of data provenance are:
- The lack of a data model that is rich enough to capture the interaction between the structure of the data and the structure of the workflow.
- The lack of a high-level specification framework in which database operators and workflow modules can be treated uniformly.
The objective of this work is to provide a framework for overcoming these difficulties, and to provide tools that allow a truly comprehensive approach to defining, manipulating, managing and querying the provenance of scientific data.
The method is to use a data model that supports nested collections, together with a functional language (the Nested Relational Calculus, NRC) to describe workflow specifications and database transformations over nested collections. Using this model and language, the theoretical underpinnings of a joint framework for defining, manipulating, managing and querying data provenance will be developed, along with algorithms for managing provenance and reducing provenance overload. Techniques for opening up the "black box" style of provenance in workflow systems will also be explored. While the theoretical foundation of the framework will be based on NRC, the results will be transitioned to an analogous foundation based on XML and XQuery, which will be used for the implementation. A prototype will be developed, and the feasibility of the approach evaluated.
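As a rough illustration of treating database operators and workflow modules uniformly over nested collections, the following Python sketch attaches a set of source labels to each value. It is not the project's model or NRC itself: the names `Annotated`, `source`, `apply_module` and `flat_map`, the labels `s1` to `s3`, and the choice of plain label sets as the provenance representation are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Annotated:
    """A value paired with the labels of everything it was derived from (hypothetical model)."""
    value: object
    prov: FrozenSet[str]


def source(value, label):
    """Wrap an initial data item with its own label as provenance."""
    return Annotated(value, frozenset({label}))


def apply_module(name, fn, *inputs):
    """Black-box workflow module: the output depends on every input and on the module."""
    out = fn(*(i.value for i in inputs))
    prov = frozenset({name}).union(*(i.prov for i in inputs))
    return Annotated(out, prov)


def flat_map(fn, collection):
    """Collection operator in the spirit of NRC iteration: apply fn to each element, flatten."""
    return [y for x in collection for y in fn(x)]


# A nested collection: two groups of labeled measurements.
groups = [
    [source(1.0, "s1"), source(3.0, "s2")],
    [source(10.0, "s3")],
]

# Workflow step: run a "mean" module once per inner collection.
means = [apply_module("mean", lambda *xs: sum(xs) / len(xs), *g) for g in groups]

# Database-style step: flatten the nesting; each item keeps its own provenance.
flat = flat_map(lambda g: g, groups)
```

Here the module call merges the provenance of its inputs, while the flattening step only restructures the data and leaves per-item provenance untouched. A black-box module records a dependency on all of its inputs, which is the coarse behaviour that finer-grained models aim to refine.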
The work builds on the PIs' expertise and past work on provenance in workflows and provenance summarization techniques, provenance in databases, and NRC query and update languages.
- An Optimal Labeling Scheme for Workflow Provenance Using Skeleton Labels. Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD) (2010). Zhuowei Bao, Susan Davidson, Sanjeev Khanna, Sudeepa Roy.
- Reconcilable Differences. International Conference on Database Theory (ICDT) (2009). Todd J. Green, Zachary Ives, Val Tannen.
- Containment of Conjunctive Queries on Annotated Relations. International Conference on Database Theory (ICDT) (2009). Todd J. Green.
- Differencing Provenance in Scientific Workflows. International Conference on Data Engineering (ICDE) (2009). Zhuowei Bao, Sarah Cohen Boulakia, Susan Davidson, Anat Eyal, Sanjeev Khanna.
- Optimizing User Views for Workflows. International Conference on Database Theory (ICDT) (2009). Olivier Biton, Susan Davidson, Sanjeev Khanna, Sudeepa Roy.
- Detecting and Resolving Unsound Workflow Views for Correct Provenance Analysis. Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD) (2009). Peng Sun, Ziyang Liu, Susan Davidson, Yi Chen.
- WOLVES: Achieving Correct Provenance Analysis by Detecting and Resolving Unsound Workflow Views. International Conference on Very Large Databases (VLDB) (demo) (2009). Peng Sun, Ziyang Liu, Susan Davidson, Yi Chen.
- PDiffView: Viewing the Difference in Provenance of Workflow Results. International Conference on Very Large Databases (VLDB) (demo) (2009). Zhuowei Bao, Sarah Cohen Boulakia, Susan Davidson, Pierrick Girard.
- Annotated XML: Queries and Provenance. Proceedings of ACM Symposium on Principles of Database Systems (PODS) (2008). Nate Foster, Todd J. Green, Val Tannen.
PDiffView is a software system that takes as input two runs of the same workflow specification and shows how their executions differ. This can be used to understand why the results of the two runs differ. [prototype] [video]
Arizona State University
University Paris-Sud, Orsay
This material is based upon work supported by the National Science Foundation under Grant No. 0803524. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.