Data Citation

Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data and takes place through databases rather than conventional journals, how is some part of a database to be cited? More generally, how should data be cited when it is stored in a repository that has a complex internal structure and is subject to change?

The goal of this research is to develop a framework for data citation that takes into account the increasingly large number of possible citations; the need for citations to be both human- and machine-readable; and the need for citations to conform to various specifications and standards. A basic assumption is that citations must be generated, on the fly, from the database. The framework is validated by a prototype system in which citations conforming to pre-specified standards are automatically generated from the data, and it is tested on operational databases of pharmacological data (IUPHAR) and Earth science data (ES3).
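As a rough illustration of what generating a citation on the fly from the database can mean, the following minimal sketch derives the citation for one entry by querying the data itself rather than storing a fixed citation string. The schema (`family`, `contributor`), the sample rows, the database name and version, and the output format are all hypothetical and are not taken from IUPHAR, ES3, or the prototype system.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE family (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE contributor (family_id INTEGER, name TEXT, position INTEGER);
        INSERT INTO family VALUES (1, 'Example receptor family');
        INSERT INTO contributor VALUES (1, 'A. Author', 1);
        INSERT INTO contributor VALUES (1, 'B. Author', 2);
    """)
    return conn


def cite_family(conn, family_id, db_name="Example DB", version="v1"):
    """Assemble citation fields for one entry by querying the database."""
    row = conn.execute(
        "SELECT name FROM family WHERE id = ?", (family_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no family with id {family_id}")
    authors = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM contributor WHERE family_id = ? ORDER BY position",
            (family_id,),
        )
    ]
    # A machine-readable record; the human-readable form is derived from it.
    return {
        "authors": authors,
        "title": row[0],
        "database": db_name,
        "version": version,
    }


def format_citation(c):
    """Render the citation record as a human-readable string (assumed format)."""
    return f"{', '.join(c['authors'])}. {c['title']}. {c['database']}, {c['version']}."


if __name__ == "__main__":
    conn = build_demo_db()
    print(format_citation(cite_family(conn, 1)))
```

Because the citation is computed from the current contents of the database, a change to the contributors or the version is reflected the next time the citation is requested.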

The broader impact of this research is on scientists who publish their findings in organized data collections or databases; on data centers that publish and preserve data; on businesses and government agencies that provide online reference works; and on the various organizations that formulate data citation principles. The research also tackles the issue of how to enrich linked data so that it can be properly cited.

In addition to IUPHAR and ES3, we are working with the following data sources:

  • Eagle-i, a resource discovery dataset for translational science research. Eagle-i has clearly specified data citation requirements and automatically serves up persistent identifiers (Eagle-i IDs) for resources, but it does not automatically generate the citation. We have downloaded the RDF dataset and created an interface that, given an Eagle-i ID, renders the citation in human-readable format, with optional XML/BibTeX/RIS exports. We have hosted this on AWS and are testing it with Eagle-i developers.
  • Reactome, a curated and peer-reviewed pathway database whose goal is to support basic research, genome analysis, modeling, systems biology and education. Reactome also has clearly specified data citation requirements but does not automatically generate the citation. We have downloaded XML versions of the dataset and developed citation rules reflecting these requirements.
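To make the idea of rendering one citation in several formats concrete, here is a minimal sketch that takes a single citation record and produces a human-readable string, a BibTeX entry, and an RIS record. The record fields, the `ResourceCitation` class, the function names, and the choice of BibTeX and RIS fields are assumptions made for illustration; they do not reproduce the actual Eagle-i or Reactome citation requirements or our interface.

```python
from dataclasses import dataclass


@dataclass
class ResourceCitation:
    """Hypothetical citation record; real requirements may need other fields."""
    key: str          # short key used for the BibTeX entry
    resource_id: str  # persistent identifier of the cited resource
    title: str
    provider: str
    year: int
    url: str


def to_text(c):
    """Human-readable rendering (assumed layout)."""
    return f"{c.provider} ({c.year}). {c.title}. {c.resource_id}. {c.url}"


def to_bibtex(c):
    """BibTeX rendering as a @misc entry."""
    fields = [
        ("title", c.title),
        ("howpublished", c.provider),
        ("year", str(c.year)),
        ("url", c.url),
        ("note", c.resource_id),
    ]
    body = ",\n".join("  %s = {%s}" % (name, value) for name, value in fields)
    return "@misc{%s,\n%s\n}" % (c.key, body)


def to_ris(c):
    """RIS rendering: one 'TAG  - value' line per field, closed by ER."""
    lines = [
        ("TY", "DATA"),
        ("TI", c.title),
        ("PB", c.provider),
        ("PY", str(c.year)),
        ("UR", c.url),
        ("ID", c.resource_id),
    ]
    out = ["%s  - %s" % (tag, value) for tag, value in lines]
    out.append("ER  - ")
    return "\n".join(out)


if __name__ == "__main__":
    demo = ResourceCitation(
        key="demo1",
        resource_id="EXAMPLE-ID-0001",
        title="Example resource",
        provider="Example Provider",
        year=2015,
        url="https://example.org/r/1",
    )
    for render in (to_text, to_bibtex, to_ris):
        print(render(demo))
        print()
```

Keeping a single structured record and deriving every output format from it is one way to keep the human-readable and machine-readable citations consistent with each other.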

More details can be found on our project page.