Code

Contributors: An T. Nguyen, Tim Whiteaker

Introduction

This document describes best practices for archiving software, code, or scripts, such as a simulation model, data visualization package, or data manipulation scripts. The intention of these recommendations is to make research based on modeling or software more transparent rather than achieve exact reproducibility, i.e., provide sufficient documentation so that a knowledgeable person can understand algorithms, programming decisions, and their ramifications for the results, rather than run the model and obtain the same results.

Examples of candidate archives for code include CoMSES Net, which focuses on sharing models related to social and ecological sciences, and Zenodo, a popular DOI-minting all-purpose repository, that can conveniently archive a specific version of code in a GitHub repository. Alternatively, code may be archived in the EDI repository, either by itself or as part of a data package. The best practices in this document cover both archiving code in EDI and referencing code archived elsewhere.

While metadata for software may be described in detail using the EML <software> tree, there exists a project called CodeMeta which is specifically designed for software metadata. Therefore, one of the key recommendations in this document is to include a CodeMeta file when archiving software or code in EDI.

Recommendations for data packages

Considerations for archiving software or code

  • If it is a model and/or a model-based dataset, please see the best practices for archiving model-based datasets.
  • How likely is it that the code will be well maintained into the future? For example, code packages submitted to established code repositories may stay there only while they comply with all testing requirements and may be removed if not well maintained (e.g., the R package repository CRAN). If that commitment to code maintenance is unlikely, such a package should be archived in a repository without maintenance requirements.
  • Should the code be archived as a separate package or with the data?
    • If the code is used to generate several independent datasets it should be archived as a separate package.
    • The software authors wishing to place it under a different license from that of the associated data, or to obtain a DOI for only the code, may be reasons to separate code and data packages.
    • If deciding to package code separately, it may be archived on EDI or another repository. If archiving code outside of EDI, see section 2.2.4 for instructions on how to reference that code from related data packages in EDI.
    • In most other cases, it is recommended to archive code and data together for context.
  • Large community software packages are usually maintained and available elsewhere. However, they may undergo significant updates and it may make sense to archive the code of a certain version with the data for transparency reasons. Consider whether prior versions of a software package are available wherever that software is distributed.
  • When choosing a repository for the code, consider the ease of the archiving process and how well the code can be described. For example, Zenodo offers an easy pathway to archive code that is currently in GitHub, though metadata requirements are very light. Following the best practices described herein, you would create a CodeMeta file if you were going to archive with EDI. This is more rigorous than Zenodo, but then your code is better described, and in a machine-readable way.

Documenting software/code

When describing the code with EML, include the code as an otherEntity in a data package. Although a well documented human readable text format of the code is preferred, in case of multiple scripts, and/or where directory structure is important, a zip archive may be used. For the formatName and entityType elements in EML, we recommend using format names from the DataONE format list when possible. Some format names are included in examples below. Always check the list for the most up-to-date version of these names.

Example 1: EML otherEntity snippet for a script file.

<otherEntity>
   <entityName>R script to process CTD data</entityName>
   <entityDescription>Annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
   <physical>
      <objectName>BLE_LTER_CTD_QAQC.Rmd</objectName>
      <size unit="byte">9674</size>
      <authentication method="MD5">8547b7a63fcf6c1f0913a5bd7549d9d1</authentication>
      <dataFormat>
         <externallyDefinedFormat>
            <formatName>R Markdown file</formatName>
         </externallyDefinedFormat>
      </dataFormat>
   </physical>
   <entityType>script</entityType>
</otherEntity>
Software License

It is important to include a use license to make it clear how others can use your work. We recommend the Creative Commons “no copyright reserved” (CC0) license, which places the software in the public domain and makes it easiest for end users to adapt and use your work. If a more restrictive license is required, we recommend the Apache License, Version 2.0 license, a permissive license that allows others to reuse, modify, and redistribute your software.

If a mix of data and code needs to be archived, and they each fall under different licenses, then separating them into different packages is advisable to eliminate ambiguity on which license applies to which portion of a data package. When a license other than a public domain dedication is used, then in addition to specifying the license in the metadata (see the “intellectualRights” element in EML), consider including a copy of the license at the beginning of the code files themselves so that the license is readily apparent to end users who peruse the code.

CodeMeta

Include a CodeMeta JSON file for all code that is archived in EDI. The CodeMeta file should be named “codemeta.json” and listed as an EML otherEntity. The formatName should be “JavaScript Object Notation (JSON) file,” the entityType should be “metadata,” and the entityDescription should indicate that this is a CodeMeta file for a given software or script in the data package.

For unnamed projects, e.g., one-off scripts for data processing, analysis, and/or visualisation, a CodeMeta file might appear to be overkill; however, CodeMeta files are simple to generate, and we recommend the below bare minimum. If there are multiple scripts each in their own otherEntity tag, we recommend aggregating information about them into one codemeta.json.

Example 2: Minimum recommended codemeta.json example for unnamed projects.

{
   "@context": ["https://doi.org/10.5063/schema/codemeta-2.0",
      "http://schema.org"
   ],
   "@type": "SoftwareSourceCode",
   "description": "RMarkdown script to calibrate and flag raw CTD data.",
   "author": {
      "@type": "Person",
      "givenName": "Christina",
      "familyName": "Bonsell",
      "email": "cbonsell@utexas.edu",
      "@id": "https://orcid.org/0000-0002-8564-0618"
   },
   "keywords": ["calibration", "CTD", "RMarkdown"],
   "license": "https://unlicense.org/",
   "dateCreated": "2013-10-19",
   "programmingLanguage": {
      "@type": "ComputerLanguage",
      "name": "R",
      "version": "3.6.2",
      "url": "https://r-project.org"
   }
}

Example 3: sample otherEntity metadata for example 2’s codemeta.json.

<otherEntity>
   <entityName>CodeMeta file for BLE_LTER_CTD_QAQC.Rmd</entityName>
   <entityDescription>CodeMeta file for annotated RMarkdown script to process, calibrate, and flag raw CTD data.</entityDescription>
   <physical>
      <objectName>codemeta.json</objectName>
      <size unit="byte">702</size>
      <authentication method="MD5">8547b7a63abc6c1f0913a5bd7549d9d1</authentication>
      <dataFormat>
         <externallyDefinedFormat>
            <formatName>application/json</formatName>
         </externallyDefinedFormat>
      </dataFormat>
   </physical>
   <entityType>CodeMeta</entityType>
</otherEntity>

For named projects, also include the software name, and the version if applicable. The example below shows some additional metadata you can include. See also the more complete codemetar example and the available CodeMeta terms.

Example 4: A more complete CodeMeta example for named projects. Example taken from the CodeMeta project Github with edits for brevity.

{
   "@context": ["https://doi.org/10.5063/schema/codemeta-2.0",
      "http://schema.org"
   ],
   "@type": "SoftwareSourceCode",
   "name": "codemetar: Generate 'CodeMeta' Metadata for R Packages",
   "description": "A JSON-LD format for software metadata",
   "author": [{
         "@type": "Person",
         "givenName": "Carl",
         "familyName": "Boettiger",
         "email": "cboettig@gmail.com",
         "@id": "https://orcid.org/0000-0002-1642-628X"
      },
      {
         "@type": "Person",
         "givenName": "Maëlle",
         "familyName": "Salmon",
         "@id": "https://orcid.org/0000-0002-2815-0399"
      }
   ],
   "codeRepository": "https://github.com/ropensci/codemetar",
   "dateCreated": "2013-10-19",
   "license": "https://spdx.org/licenses/GPL-3.0",
   "version": "0.1.8",
   "programmingLanguage": {
      "@type": "ComputerLanguage",
      "name": "R",
      "version": "3.5.3",
      "url": "https://r-project.org"
   },
   "softwareRequirements": [{
         "@type": "SoftwareApplication",
         "identifier": "R",
         "name": "R",
         "version": ">= 3.0.0"
      },
      {
         "@type": "SoftwareApplication",
         "identifier": "git2r",
         "name": "git2r",
         "provider": {
            "@id": "https://cran.r-project.org",
            "@type": "Organization",
            "name": "Comprehensive R Archive Network (CRAN)",
            "url": "https://cran.r-project.org"
         }
      }
   ],
   "keywords": ["metadata", "codemeta", "ropensci"]
}
Metadata to enable reproducibility

When archiving software, we strongly recommend including a user guide with installation and usage instructions if such would not already be apparent to the typical user. Take into account that the user might not have access to certain inputs that the software/scripts require. Include when feasible at least some example data, and configure the script so that it is ready to run with the example data.

Aside from the software/code itself and its dependencies, other pieces of information may be important should a user wish to reproduce results, such as the operating system and version and the system locale. Include this information in the data package’s methods/methodStep/description. For certain tools, there are ways to easily generate this information, e.g., a call to sessionInfo() in the R console. If the system outputs this information in a standardly formatted plain text file, that might be included as an otherEntity.

Linking code and data

There are a few solutions for providing explicit machine-readable linkages between different entities/packages (the distinction between code/data doesn’t matter too much here). For most cases we recommend the simplest approach, which is to use the methods/methodStep/description element of EML. More advanced users may wish to utilize the other solutions described herein.

Descriptive approach

In the dataset methods/methodStep/description element, include verbal descriptions such as “results.csv was derived from raw_data.csv using script.R” and repeat for all entities. If code and data reside in different packages, be sure to specify that.

The EML dataSource element

Nested under methods/methodStep, dataSource elements describe other data packages that serve as source for the current package. dataSource looks like a mini-EML tree describing the source data. Example: ecocomDP packages list the original packages under dataSource. dataSource does not describe relationships between entities in the same package, and as far as we know there is no explicit way in EML to do so.

ProvONE

ProvONE is a model developed by DataONE affiliates for provenance or denoting relationships between data entities. Each package on DataONE is described by a science metadata document (e.g., EML, ISO, FGDC) and a resource map document following ProvONE. The resource map powers a nice display of data relationships (see this package on the Arctic Data Center). This handles both relationships between entities in the same package and entities residing in different packages. However, note that EDI currently does not utilize this model.

External software

Large community-backed tools or proprietary software such as ArcGIS Pro or Microsoft Excel do not need to be archived. However, if they have had any impact on the final data (e.g., ArcGIS Pro was used to modify spatial rasters), the EML methods section should describe the routines performed. Within the data package, indicate linkage to external software as follows.

  • Briefly describe the software/code and its relationship to the data in EML’s methods/methodStep/description element.
  • Names of all software used. Include both the common acronym and the full spelling.
  • The URL(s) to all models/software used. Stable, persistent URLs pointing to exact version(s) are preferable, rather than generic links such as a project homepage. If the archived model has a DOI, then include a full citation to the model in the methods/methodStep/description text. The exception to this is when referencing tools such as Excel that have achieved global household name status.
  • Broadly, the system setup used, if relevant.
  • Information on exact versions for all code used (including dependencies). This is important, e.g., ArcGIS Pro 2.4.1 is very different from ArcGIS for Desktop 10.7.1. Different systems have methods to easily generate this information, e.g. a call to sessionInfo() in the R console.
  • Consider, if applicable, to archive the “runfile” as its own data entity within the data package, i.e., the script(s) that sets parameters and/or calls on functions imported from external software.

Example 5: EML method description referring to external software.

<methods>
   <methodStep>
      <description>
         <para>           
            The seagrass coverage raster was created in ArcGIS Pro (version
            2.4.3, by Esri) using the IDW geoprocessing tool on
            sampling_points.csv with a power of 2 and the nearest
            12 points.
         </para>
         <para>
            The raster was then refined using the seagrass-refiner package
            with the auto-refine option checked (Smith, 2017).
         </para>
         <para>
            Smith, J. (2017). seagrass-refiner: a package that does the cool
            seagrass stuff, Version 1.2, Zenodo.  https://doi.org/this-is/a-fake-doi,
            2017.
         </para>
      </description>
   </methodStep>
</methods>