Data and Software Citation

Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data as well as software, the question arises of how these products should be cited and, in particular, how citation can be automated so that a citation is served up along with the extracted data or software.

Data Citation:
Data citation addresses the question of how to cite data that is stored in a repository with complex internal structure and that is subject to change.

The goal of this research is to develop a framework for data citation which takes into account the increasingly large number of possible citations; the need for citations to be both human and machine readable; and the need for citations to conform to various specifications and standards. A basic assumption is that citations must be generated, on the fly, from the database. The framework is validated by a prototype system in which citations conforming to pre-specified standards are automatically generated from the data, and tested on operational databases of pharmacological (IUPHAR) and Earth science data (ES3).
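To illustrate what generating a citation on the fly from the database could look like, here is a minimal sketch. It assumes a hypothetical two-table schema (`family`, `contributor`) and a placeholder database name `ExampleDB`; none of these names come from IUPHAR, ES3, or the prototype itself. The citation is assembled as a dictionary (machine readable) at request time and then rendered as text (human readable).

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE family (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT);
    CREATE TABLE contributor (family_id INTEGER, name TEXT, position INTEGER);
    INSERT INTO family VALUES (1, 'Example receptor family', '2016-05-01');
    INSERT INTO contributor VALUES (1, 'A. Author', 1);
    INSERT INTO contributor VALUES (1, 'B. Curator', 2);
""")


def cite_family(conn, family_id, database_name="ExampleDB"):
    """Build a citation for one entry from the current database state."""
    row = conn.execute(
        "SELECT name, last_modified FROM family WHERE id = ?", (family_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no family with id {family_id}")
    name, last_modified = row
    # Contributors are read at request time, so the citation tracks updates.
    authors = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM contributor WHERE family_id = ? ORDER BY position",
            (family_id,),
        )
    ]
    return {
        "authors": authors,
        "title": name,
        "database": database_name,
        "last_modified": last_modified,
    }


def render_text(citation):
    """Render the machine-readable citation as a human-readable string."""
    authors = ", ".join(citation["authors"])
    return (
        f"{authors}. {citation['title']}. "
        f"{citation['database']}, last modified {citation['last_modified']}."
    )


print(render_text(cite_family(conn, 1)))
```

Because the citation is computed by a query rather than stored, a change to the contributor list or modification date is reflected the next time the entry is cited.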

The broader impact of this research is on scientists who publish their findings in organized data collections or databases; data centers that publish and preserve data; businesses and government agencies that provide on-line reference works; and various organizations that formulate data citation principles. The research also tackles the issue of how to enrich linked data so that it can be properly cited.

In addition to IUPHAR and ES3, we are working with the following data sources:

  • Eagle-i, a resource discovery dataset for translational science research. Eagle-i has clearly specified data citation requirements, and automatically serves up persistent identifiers (Eagle-i IDs) for resources, but does not automatically generate the citation. We have downloaded the RDF dataset and created an interface which, given an Eagle-i ID, renders the citation in human-readable format, with optional XML/BibTeX/RIS exports. We have hosted this on AWS and are testing it with the Eagle-i developers.
  • Reactome, a curated and peer-reviewed pathway database whose goal is to support basic research, genome analysis, modeling, systems biology and education. Reactome also has clearly specified data citation requirements, but does not automatically generate the citation. We have downloaded XML versions of the dataset and have developed citation rules reflecting these requirements.
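The export step mentioned for the citation interface above can be sketched as follows. This is an illustrative sketch only: the `Citation` fields, the citation key, and the `example.org` URL are hypothetical and do not reflect the actual Eagle-i or Reactome citation requirements. It renders one citation record as plain text, as a BibTeX `@misc` entry, and as an RIS record of type `DATA`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    # Hypothetical fields; a real source would dictate its own set.
    key: str
    authors: List[str]
    title: str
    year: str
    url: str


def to_text(c: Citation) -> str:
    """Human-readable rendering."""
    return f"{', '.join(c.authors)} ({c.year}). {c.title}. {c.url}"


def to_bibtex(c: Citation) -> str:
    """BibTeX rendering as a @misc entry."""
    authors = " and ".join(c.authors)
    lines = [
        "@misc{" + c.key + ",",
        "  author = {" + authors + "},",
        "  title = {" + c.title + "},",
        "  year = {" + c.year + "},",
        "  url = {" + c.url + "}",
        "}",
    ]
    return "\n".join(lines)


def to_ris(c: Citation) -> str:
    """RIS rendering; each line is a two-letter tag, two spaces, '- ', value."""
    lines = ["TY  - DATA"]
    lines += [f"AU  - {a}" for a in c.authors]
    lines += [f"TI  - {c.title}", f"PY  - {c.year}", f"UR  - {c.url}", "ER  - "]
    return "\n".join(lines)


example = Citation(
    key="example2016",
    authors=["A. Author", "B. Curator"],
    title="Example resource",
    year="2016",
    url="https://example.org/resource/123",
)
print(to_text(example))
print(to_bibtex(example))
print(to_ris(example))
```

Keeping a single structured record and deriving every export from it means that a change in the citation rules only has to be made in one place.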

Software citation:
Software is another important new form of research product which should be cited. For citation to be effective, we need tools to automatically generate citations. Our model for software citation with version control is based on the notion of a citation function, and its implementation (a browser extension and a local executable tool) integrates with Git and GitHub.

  • The browser extension allows citations to be generated for any file/directory in any version of a software repository, and to be added/modified/deleted in the current version by project collaborators.
  • The local executable tool enables citations to be added/modified/deleted and managed through Git functions such as fork/merge/copy.
  • More details can be found on our project page.
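One way to picture a citation function is as a per-version mapping from repository paths to citations. The sketch below is only an assumption about how such a function might be resolved: it supposes that a file or directory without an explicit citation inherits the citation of its closest ancestor that has one. That rule, the version labels `v1` and `v2`, and all paths and names are hypothetical and are not taken from the tools described above.

```python
import posixpath

# Hypothetical citation functions, one per version of the repository.
# Keys are paths relative to the repository root; "" is the root itself.
versions = {
    "v1": {
        "": {"authors": ["Project Team"], "title": "example-tool"},
    },
    "v2": {
        "": {"authors": ["Project Team"], "title": "example-tool"},
        "src/parser": {"authors": ["C. Contributor"], "title": "example-tool parser"},
    },
}


def resolve_citation(citation_function, path):
    """Return (cited_path, citation) for the closest ancestor of `path`
    (including `path` itself) that carries an explicit citation."""
    current = posixpath.normpath(path).strip("/")
    if current == ".":
        current = ""
    while True:
        if current in citation_function:
            return current, citation_function[current]
        if current == "":
            raise KeyError(f"no citation covers {path!r}")
        # Move one level up; the parent of a top-level entry is "".
        current = posixpath.dirname(current)


# The same file resolves to different citations in different versions.
print(resolve_citation(versions["v1"], "src/parser/lexer.py"))
print(resolve_citation(versions["v2"], "src/parser/lexer.py"))
```

Under this assumption, adding, modifying, or deleting a citation amounts to editing one entry in the mapping for the current version, while earlier versions keep the citations they had.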