6 Data in Other Repositories
Contributors: Greg Maurer (lead), Stace Beaulieu, Renée Brown, Sarah Elmendorf, Hap Garritt, Gastil Gastil-Buhl, Corinna Gries, Li Kui, An Nguyen, John Porter, Margaret O’Brien, Tim Whiteaker
A wide variety of data repositories are available for publishing biological, environmental, and Earth observation data, and the choice of where to publish a particular dataset is determined by many competing factors. For example, a funding agency or journal may require a certain repository (e.g., NSF BCO-DMO, NSF ADC, USDA ADC, DOE ESS-DIVE); the research subject or data type may be best served by a specialized repository (e.g., AmeriFlux, GenBank); or datasets may be submitted to a general purpose repository with minimal metadata requirements to simplify and speed data publishing (e.g., DRYAD, Figshare, Zenodo). For these and other reasons, related datasets are sometimes published in disparate data repositories, the same data may need to be discoverable in more than one repository, or multiple datasets from one or more repositories may be used to create a new, derived dataset. In such cases, it can be advantageous to establish links between datasets in different repositories so that provenance, supplementation, duplication, or other relationships are explicit. This subject goes well beyond any single repository, and better standards and approaches for linking resources and documenting data provenance are being developed elsewhere (e.g., DataONE, ProvONE, WholeTale). Here we concentrate on specific cases in the context of large and multidisciplinary projects, such as LTER sites, that wish to enhance data discovery and preserve data relationships across multiple repositories.
6.2 Recommendations for data packages
6.2.3 Common use cases and their structure in EDI
There are several common use cases for creating a new linked data package in EDI. The new package may establish either a one-to-one link from EDI to a dataset in another repository, or a one-to-many relationship that is more complex. Three possible cases are described below in terms of what entities to publish, where to publish them, metadata elements to be created in EML, and the contents of included data entities. There are likely to be other use cases for linking EDI data packages to other repositories.
Case 1: One dataset needs to be discoverable in more than one data repository. The data remain the same, but the metadata in the new data package at EDI may be upgraded beyond what exists in the other repository.
- The metadata in EDI must clearly state, preferably in the abstract or another obvious location, that this data package is already published in another repository. Include the original unique identifier and instruct users to cite that original data, if appropriate.
- Include instructions on how to access and cite the original data if the original repository is lacking in such guidance.
- If data are duplicated (which is not recommended), metadata should include information on how versions in different repositories are kept synchronized. If such synchronization is not feasible, users should be warned to inspect both sources for the latest data.
- In EML, the <alternateIdentifier> field may be used to store the persistent identifier (DOI) or a link (URL) that refers to the data held in another repository, which makes the link machine readable. Where an external repository supplies both a URL and a DOI, use the DOI, as URLs may not be maintained through time.
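The machine-readable link described in the last point of Case 1 can be sketched in Python with the standard library. This is a minimal illustration, not a complete EML workflow: it assumes the EML `alternateIdentifier` element, which is placed before the other children of `dataset`, and the DOI, the `system` value, and the helper name `add_alternate_identifier` are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def add_alternate_identifier(dataset, identifier, system):
    """Insert an <alternateIdentifier> after any existing ones at the top of <dataset>."""
    alt = ET.Element("alternateIdentifier", {"system": system})
    alt.text = identifier
    # Assumes any existing alternateIdentifier elements already sit at the top of <dataset>.
    position = sum(1 for child in dataset if child.tag == "alternateIdentifier")
    dataset.insert(position, alt)
    return alt


# Minimal stand-in for the <dataset> element of an EML document (placeholder content).
dataset = ET.fromstring("<dataset><title>Example republished dataset</title></dataset>")

# Placeholder DOI for the original copy of the data held in the other repository.
add_alternate_identifier(dataset, "doi:10.0000/example", "https://doi.org")

eml_fragment = ET.tostring(dataset, encoding="unicode")
print(eml_fragment)
```

In a real package the same element would be added to the full EML document, and the human-readable statement in the abstract would still be needed alongside it.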
Case 2: A list of data records held in a specialized repository needs to be linked to ancillary or supporting data that are being published in EDI (for derived data see Case 3).
- This case applies when a collection of datasets, or similar scientific resources, is held in a specialized repository and closely related ancillary or supporting data and metadata need to be archived in a more generalist data repository like EDI. For example, ancillary environmental data or laboratory analyses held in EDI could be linked to collections of sequence reads held in NCBI GenBank or museum voucher specimens archived with Darwin Core metadata. See complete examples in Table 3.
- The new EDI data package should include a ‘data inventory’ (or manifest of holdings) file as a data entity. This is most likely a simple tabular data file, such as a CSV, that lists and describes the repository records held in the specialized data repository and has its column attributes described in EML as a dataTable entity.
- The inventory table must have a row for each outside repository record (or some meaningful grouping of records, e.g., project in NCBI) being linked to, with columns that include persistent unique identifiers of the data in the other repository, and relevant descriptors of the data. The complete content of the inventory will be dictated by the structure of the other repository and the data entities and metadata held there. Suggested columns are presented in Table 1.
- The inventory table may also provide additional contextual information for each individual data resource in another repository. Table 2 presents examples of these contextual columns. They are, however, subject dependent and may vary for different projects. For more examples, see the discussion on sequencing and genomic data later in this document.
Table 1: Suggested columns for identifying the external data in the data inventory table.
|Column|Description|
|---|---|
|External unique ID|Unique identifier for the data resource in the other repository, e.g., an accession number|
|External access URL|A unique, persistent link to the data resource in the other repository|
|Title/description|Title and/or brief description of the data resource|
|Filename(s)|Dataset or file name at the other repository|
|Format|File format of the file(s) named above|
|Repository URL|URL of the repository being linked to|
Table 2: Examples of additional contextual columns in the data inventory table.
|Column|Description|
|---|---|
|Latitude/Longitude|Latitude and longitude in standard format for each data resource in the other repository|
|Location name|Locally used name of collection site|
|Treatment level|Experimental treatment applied to the outside dataset|
|Start/End datetime|Starting/ending datetime of the data resource (NA for End if data collection is ongoing)|
|Reference publication|DOI of publication providing in-depth context for data|
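As an illustration of the data inventory described in Case 2, the following Python sketch writes a one-row inventory as CSV. The snake_case column names are hypothetical renderings of the suggested columns, and every value other than the accession number and NCBI URLs (which appear later in this document) is a placeholder.

```python
import csv
import io

# Illustrative column names mirroring the suggested identifying and contextual columns.
COLUMNS = [
    "external_unique_id",
    "external_access_url",
    "title_description",
    "filenames",
    "format",
    "repository_url",
    "location_name",
    "start_datetime",
    "end_datetime",
]

# One row per linked record; descriptive values here are placeholders, not real metadata.
records = [
    {
        "external_unique_id": "AY741555",
        "external_access_url": "https://www.ncbi.nlm.nih.gov/nuccore/AY741555",
        "title_description": "Placeholder description of the sequence record",
        "filenames": "NA",
        "format": "FASTA",
        "repository_url": "https://www.ncbi.nlm.nih.gov",
        "location_name": "Placeholder site name",
        "start_datetime": "2020-01-01",
        "end_datetime": "NA",
    }
]


def write_inventory(rows, handle):
    """Write inventory rows as CSV with a fixed header."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # raises ValueError if a row has keys outside COLUMNS


buffer = io.StringIO()
write_inventory(records, buffer)
inventory_csv = buffer.getvalue()
print(inventory_csv)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same function could be handed an open file, and each column would then be described as an attribute of a dataTable entity in EML.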
Case 3: One or more datasets in other repositories are used to create derived data products that need to be archived in EDI.
- In this case the new dataset is directly or indirectly derived from the ‘source’ dataset(s) in other repositories. Such derived data may serve a wide range of research purposes, including use in cross-site synthesis, re-analysis, or meta-analysis studies.
- Provenance metadata should be used to describe the relationship between the source and derived datasets, which ensures reproducibility and preserves data lineage. In a new EDI data package that archives derived data, the provenance metadata should be inserted in the EML file using <dataSource> elements. The <dataSource> elements should be nested within a <methodStep> element and will establish the links to any source datasets located in another repository. An example snippet of provenance EML is shown in Example 1.
- Other cross-repository standards for provenance metadata are still being developed and are not widely adopted, e.g., ProvONE.
- The EDI portal interface provides automatic generation of provenance metadata EML snippets for datasets in EDI. The EMLassemblyline and MetaEgress (in connection with LTER-core-metabase) R packages for EML creation will also generate provenance metadata.
Example 1: EML snippet with a data provenance methodStep:
<methodStep>
  <description>
    <para>This methodStep contains data provenance information as specified in the LTER EML Best Practices. Each dataSource element here lists entity-specific information and links to source data used in the creation of this derivative data package.</para>
  </description>
  <dataSource>
    <title>Source dataset title</title>
    <creator>
      <individualName>
        <givenName>first name</givenName>
        <surName>last name</surName>
      </individualName>
      <organizationName>organization name</organizationName>
      <electronicMailAddress>email@example.com</electronicMailAddress>
    </creator>
    <distribution>
      <online>
        <onlineDescription>This is a link to an external online data resource (describe resource and repository location).</onlineDescription>
        <url function="information">https://pasta.lternet.edu/package/metadata/eml/knb-lter-ntl/80/2</url>
      </online>
    </distribution>
    <contact>
      <positionName>Information Manager</positionName>
      <organizationName>organization name</organizationName>
      <electronicMailAddress>firstname.lastname@example.org</electronicMailAddress>
    </contact>
  </dataSource>
</methodStep>
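For projects that assemble EML with scripts, the structure of Example 1 can be generated programmatically. The sketch below uses only the Python standard library and is an illustration under stated assumptions: the function name `build_provenance_step` and all argument values are hypothetical placeholders, and it is not how the EDI portal, EMLassemblyline, or MetaEgress generate their provenance snippets.

```python
import xml.etree.ElementTree as ET


def build_provenance_step(title, creator_org, url, contact_org, contact_email):
    """Return a <methodStep> element holding one <dataSource> for an external dataset."""
    step = ET.Element("methodStep")
    description = ET.SubElement(step, "description")
    ET.SubElement(description, "para").text = (
        "This methodStep contains data provenance information for a source "
        "dataset used to create this derived data package."
    )
    source = ET.SubElement(step, "dataSource")
    ET.SubElement(source, "title").text = title
    creator = ET.SubElement(source, "creator")
    ET.SubElement(creator, "organizationName").text = creator_org
    online = ET.SubElement(ET.SubElement(source, "distribution"), "online")
    ET.SubElement(online, "onlineDescription").text = "Link to the external source dataset."
    ET.SubElement(online, "url", {"function": "information"}).text = url
    contact = ET.SubElement(source, "contact")
    ET.SubElement(contact, "organizationName").text = contact_org
    ET.SubElement(contact, "electronicMailAddress").text = contact_email
    return step


# Placeholder values; the URL is the one used in Example 1.
step = build_provenance_step(
    title="Source dataset title",
    creator_org="organization name",
    url="https://pasta.lternet.edu/package/metadata/eml/knb-lter-ntl/80/2",
    contact_org="organization name",
    contact_email="email@example.com",
)
if hasattr(ET, "indent"):  # pretty-printing helper available in Python 3.9+
    ET.indent(step)
provenance_xml = ET.tostring(step, encoding="unicode")
print(provenance_xml)
```

One call per source dataset yields one methodStep each, which can then be appended to the methods element of the derived package's EML.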
6.3 Nucleotide sequence and genomic data
Nucleotide sequence data consists of the order and arrangement of DNA or RNA bases extracted from individual organisms or environmental samples. Similarly, genomic data refers to the complete genetic information (either DNA or RNA) of an organism, while metagenomic data refers to genomes recovered from environmental samples. Sequencing, genomic, and metagenomic datasets can be very large and complex, and researchers in these fields benefit from particular methods of data access, analysis, and collaboration. Therefore, these data have specialized requirements for data archiving.
Archiving nucleotide sequence and genomic (or other ‘omics’) data is a common use case for creating linked datasets. Data that originate from nucleotide sequencing techniques are most often stored in specialized repositories such as National Center for Biotechnology Information (NCBI) GenBank and the European Nucleotide Archive. However, while sequences or assembled genomes constitute important raw data, ancillary and derived data products related to these raw data are frequently published in repositories specializing in ecological data. For example, data derived from sequence data, such as operational taxonomic units (OTUs) or functional assignments, and ancillary data that describe the environmental, biochemical, or experimental context of the sequencing data, are often included in scientific publications and do not always fit within the scope of a specialized sequence or genome data repository.
6.3.1 Recommendations for sequencing or genomic datasets
Linking to genomics data is an example of Case 2 described above. Summaries or inventories of data records held in a repository like NCBI GenBank are linked to their derived products or additional measurements published in a more generalist repository such as EDI.
In addition to the metadata typically included with any data package published by the site or research group, include metadata that specifically describes sequencing and genomics datasets. We recommend referring to the MIxS templates for standard terminology, especially in the keyword section.
Keywords that can help users discover the sequencing or genomic dataset include:
- General data type descriptions (‘nucleotide sequence’, ‘genomics’, ‘metagenomics’)
- Names of target genes or subfragments (‘16S rRNA’, ‘18S rRNA’, ‘nif’, ‘amoA’, ‘rpo’, ‘ITS’)
- Names of the sequencing technique (‘Sanger’, ‘pyrosequencing’, ‘ABI-solid’)
- Names of the linked repository (‘SRA’, ‘EMBL’, ‘Ensembl’)
- Descriptors of included ancillary data (‘nitrogen’, ‘soil’, ‘drought’)
- Descriptors of derived data products (‘OTU’, ‘functional annotation’, ‘population’)
Inventory tables are of central importance to datasets that index data resources in a sequencing or genomics repository. We recommend that this inventory have the columns described in Table 1. Note that the unique identifiers included will depend on the granularity of the links to the outside repository. For example, in NCBI, there are accession numbers and URLs for a project, samples within the project, and sequence datasets from a given sample.
External unique ID and URL: For NCBI GenBank this would be the accession number for a collection. For most sequence and genomic datasets, an access URL would include an accession number (e.g., https://www.ncbi.nlm.nih.gov/nuccore/AY741555). Referring to a range of accession numbers may involve providing a search URL that returns the desired list (e.g., https://www.ncbi.nlm.nih.gov/popset/?term=AY741555). The recommendation is to link to the broadest level of sequence or genomic granularity that is useful for interpreting the data being archived in the new dataset.
The following are suggestions for additional contextual columns in the inventory table. This information is generally associated with the data in the genomics repository and should only be duplicated if deemed useful for reuse, or if missing in the original data.
- Sequencing method: the name of the sequencing method used; e.g., Sanger, pyrosequencing, ABI-solid. This attribute is used in MIxS templates, where it is called seq_meth.
- Environment (biome, feature, or material) descriptors: These are descriptors of the environmental context and are standardized by the genomics community in the MIxS templates and EnvO.
- Taxon description: If applicable, e.g., Binomial name, or taxonomic group
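The two kinds of access URL described above for the External unique ID and URL columns (a direct record link and a search link) can be built from an accession number as in the following Python sketch. The URL patterns are copied from the examples in this section; the helper names `record_url` and `search_url` are hypothetical, and other NCBI databases may use different patterns.

```python
from urllib.parse import quote, urlencode

NCBI_BASE = "https://www.ncbi.nlm.nih.gov"


def record_url(accession, database="nuccore"):
    """Direct link to a single record, following the nuccore example above."""
    return f"{NCBI_BASE}/{database}/{quote(accession, safe='')}"


def search_url(term, database="popset"):
    """Search link that returns a list of records, following the popset example above."""
    return f"{NCBI_BASE}/{database}/?{urlencode({'term': term})}"


print(record_url("AY741555"))
print(search_url("AY741555"))
```

Generating the URL column from the identifier column in this way keeps the two columns of the inventory table consistent with each other.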
Data packages of metadata and inventory tables will aid in discovering genomic data within an ecological data repository (EDI) and will aid in clarifying the context in which they were collected. Most use cases, however, employ this inventory table to link specific genetic data to derived data. Such products frequently are community or population metrics where species, OTUs or traits have been determined from the sequence data.
6.4 Example data packages in EDI
Each of the EDI data packages below is linked to data in outside repositories. Some contain data inventory tables (as dataTable entities) that link to the datasets held in outside repositories and are described in the EML metadata. The EML abstract and methods elements in each give detailed access and citation instructions.
Table 3: Linked data packages at EDI that provide examples of the best practices in this document.
|Title|Description|EDI package ID|
|---|---|---|
|Mass and energy fluxes from the US-Jo2 AmeriFlux eddy covariance tower in Tromble Weir experimental watershed at the Jornada Basin LTER site, 2010-ongoing|This data package links to eddy covariance data from a Jornada Basin LTER tower. The data are held at the AmeriFlux data repository (https://ameriflux.lbl.gov)|knb-lter-jrn.210338005|
|Catalog of GenBank sequence read archive (SRA) entries of 16S and 18S rRNA genes from bacterial and protistan planktonic communities along the Eastern Beaufort Sea coast, North Slope, Alaska, 2011-2013|Data inventory of runs, samples, and experiments held at GenBank.|knb-lter-ble.10|
|Correlation of native and exotic species richness: a global meta-analysis finds no invasion paradox across scales|This data package re-publishes data held in a package in Dryad. The metadata has been substantially enriched relative to the original dataset.|edi.548.1|
|Vascular Flora of the Harvard Farm at Harvard Forest since 2014|This data package includes an inventory table with information on voucher specimens held in the Harvard Herbarium.|knb-lter-hfr.236.3|
|Biological responses to landscape change in the McMurdo Dry Valleys, Antarctica|This data package links to genomic data in NCBI, and includes additional data from biogeochemical analyses performed on each sample.|knb-lter-mcm.262.1|
6.5 Appendix: Tips and repository information
This section aggregates information that was helpful at the time this document was written, particularly regarding nucleotide sequence and genomic data repositories then in widespread use. Given the rapid rate of change in the field, this information may fall out of date quickly.
6.5.1 Sequence and genomic repository information
It is generally preferable that sequencing and genomic data are archived in community repositories that are specialized for their data type, rather than in a generalist repository such as the Environmental Data Initiative (EDI). There are many such specialized repositories; a fairly comprehensive listing is provided by the journal Nucleic Acids Research (summarized on this page). Metadata standards and collaborative structures among these repositories are governed by the International Nucleotide Sequence Database Collaboration (INSDC, more guidance here). Often these repositories provide, or are accessible through, specialized tools for searching, accessing, and analyzing the data (e.g., BLAST, MG-RAST). Furthermore, some products derived from sequence or genomic data are best archived in another specialized repository (e.g., metagenome-assembled genomes, or MAGs). As a general rule, these specialized repositories assign unique identifiers to projects, samples, and/or single sequences (often referred to as accession numbers) that can be used to locate sequences or genomic data. Note that each repository may have its own mechanism for reverse linking to related data held in another repository (such as EDI), and these mechanisms are beyond the scope of this document.
- NCBI Databases - list of various databases with search capabilities. See also How to submit data to GenBank.
- NCBI Accession Number prefixes - Explanation of accession number prefix codes.
- DNA DataBank of Japan (DDBJ) - list of various databases with search capabilities. See also Submissions.
- European Nucleotide Archive (ENA) - list of various databases with search capabilities. See also Submit and update.
- Integrated Microbial Genomes & Microbiomes (IMG/M) system from the Joint Genome Institute
- MG-RAST: technically an analysis pipeline rather than a primary repository. It replicates to the European Bioinformatics Institute (EMBL-EBI), which in turn replicates to the NCBI Sequence Read Archive, so data submitted to MG-RAST will automatically appear in all three.
- Barcode of Life Data Systems (BOLD): DNA barcoding is a taxonomic method that uses one or more standardized short genetic markers in an organism’s DNA to identify it as belonging to a particular species. Through this method, unknown DNA samples are identified to registered species by comparison to a reference library. The Centre for Biodiversity Genomics in Canada maintains the BOLD public data portal, a cloud-based data storage and analysis platform.
6.5.2 Tips for locating metadata in sequence and genomic data repositories
Where the research group has not supplied the information needed to populate EML metadata directly to the information manager (IM), the metadata that investigators provided when submitting data may be found in the genomics repository.
- For data in NCBI, go to the NCBI website and search using the accession number. Or search by accession number in a specific NCBI Database, for example Genes PopSet (the PopSet database is a collection of related DNA sequences derived from population, phylogenetic, mutation and ecosystem studies that have been submitted to NCBI).
- For sequences submitted to the NCBI Sequence Read Archive, there are some easily accessible online tools for generating tables of linked sequence data and their metadata. For an example, go to the example dataset at https://www.ncbi.nlm.nih.gov/bioproject/305753, and click the number next to SRA Experiments to see a list of all experiments. Then click Send results to Run selector to see a table summarizing geolocations and associated metadata, which could be archived at EDI or used to extract metadata for EML preparation.
- A full Data Carpentry tutorial on accessing data on the NCBI SRA database can be found here: Examining Data on the NCBI SRA Database
- BCO-DMO examples for contributing sequence accession numbers.
6.5.3 Darwin Core standard for sequence data
For sequence data to conform with the Darwin Core standard, a column with the header ‘associatedSequences’ (https://dwc.tdwg.org/terms/#dwc:associatedSequences) may be used in the inventory table. The column is populated with a unique identifier (or list of identifiers) for the sequence data (e.g., SNLBE002-17, a sequence in the Barcode of Life Data System, also known as BOLD) or a full URL (e.g., http://www.boldsystems.org/index.php/Public_RecordView?processid=SNLBE002-17).
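When a record has more than one associated sequence, the identifiers need to be combined into a single cell. The Python sketch below joins them with a space, vertical bar, space separator, which to our understanding is the convention Darwin Core recommends for list values; confirm against the term definition before relying on it. The helper name is hypothetical, and pairing the BOLD and NCBI URLs from this document in one cell is purely illustrative.

```python
def associated_sequences(identifiers):
    """Join sequence identifiers or URLs into one associatedSequences cell value."""
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [item.strip() for item in identifiers if item and item.strip()]
    # Assumed separator for Darwin Core list values: space, vertical bar, space.
    return " | ".join(cleaned)


cell_value = associated_sequences(
    [
        "http://www.boldsystems.org/index.php/Public_RecordView?processid=SNLBE002-17",
        "https://www.ncbi.nlm.nih.gov/nuccore/AY741555",
    ]
)
print(cell_value)
```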