Ecological community surveys

DRAFT DRAFT DRAFT

Introduction

We know from experience that primary research data sets cannot be easily reused until the data are fully understood. Community survey data in particular can be highly complex, often with location-specific methods. Some general principles can be summarized from:

  • synthesis scientists working with this type of data
  • lessons learned from harmonization of primary data, e.g., EDI’s ecocomDP project.

Anticipated use cases

Synthesis

Synthesis requires that primary research data sets be understood and combined into a common format. EDI has examined the needs of synthesis scientists using community survey data, and that feedback is incorporated here.

Harmonization

Although a prescribed format would make data easily reused, the complexity of community surveys generally means that a prescribed format cannot be imposed on research studies. Therefore, EDI is harmonizing this data type in a workflow framework, and the lessons learned from examining raw data are included here. For more information on EDI’s harmonization of community survey data, see:

  • https://github.com/EDIorg/ecocomDP
  • https://edirepository.org/resources/thematic-standardization#ecocomdp

External applications

Community survey data are of great interest to the broader biodiversity community, particularly as contributions to portals such as GBIF and through use of the Darwin Core vocabulary. Harmonization is the first step in that process.

Definitions and conventions

The recommendations and examples below are organized according to how easily the data can be reused. In all cases Best is preferred and recommended. Marginal and Not useable may overlap in some situations.

  • Best: Data are easily understood (do not require manual handling or further investigation)
  • OK: Some manual, custom, or specialized handling is required
  • Marginal: Additional metadata is required, along with custom handling and probably investigation (e.g., questions answered via email)
  • Not useable: The dataset has significant metadata missing, or has too many inconsistencies or layout challenges to be used even with manual handling

Recommendations for datasets

Sampling methods

Methods are generally text metadata.

  • Best
      • explanation of the sampling strategy
      • diagrams of sampling plots and their spatial relationships
  • OK
      • reference included to a paper describing the above
  • Marginal/not useable
      • no description of methods

Dates

Temporal sampling regime is consistent

Synthesis efforts may be able to circumvent the lack of true dates by dropping records (thereby elevating a “not useable” dataset to “marginal”).

The ECC (EML Congruence Checker) currently checks attributes typed as dateTime in two ways, as sketched below:

  • the format is compared to a list of preferred formats
  • if the format is in the preferred list, data values are checked for agreement with that format
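
The sketch below illustrates this kind of check, together with the record-dropping strategy mentioned above. It is a minimal example only, not the ECC's implementation; the preferred-format list, column name, and table are hypothetical.

```python
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for a preferred-format list; the ECC's actual list differs.
PREFERRED_FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYY-MM-DDThh:mm:ss": "%Y-%m-%dT%H:%M:%S",
}

def check_datetime_column(values: pd.Series, declared_format: str) -> pd.Series:
    """Return a boolean Series: True where a value parses with the declared format."""
    if declared_format not in PREFERRED_FORMATS:
        raise ValueError(f"'{declared_format}' is not in the preferred list")
    fmt = PREFERRED_FORMATS[declared_format]

    def parses(value) -> bool:
        try:
            datetime.strptime(str(value), fmt)
            return True
        except ValueError:
            return False

    return values.map(parses)

# Hypothetical table: drop records whose dates do not parse, as a synthesis
# effort might do to elevate a dataset from "not useable" to "marginal".
df = pd.DataFrame({"sample_date": ["2012-06-01", "June 2012", "2012-06-15"]})
ok = check_datetime_column(df["sample_date"], "YYYY-MM-DD")
df_clean = df[ok]
```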

Locations

Locations should be complete, with latitude and longitude.
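
As an illustration only (the column names are assumptions, not a prescribed layout), a basic completeness check for coordinates might look like:

```python
import pandas as pd

def flag_incomplete_coordinates(df: pd.DataFrame,
                                lat_col: str = "latitude",
                                lon_col: str = "longitude") -> pd.DataFrame:
    """Return rows with missing, non-numeric, or out-of-range coordinates."""
    lat = pd.to_numeric(df[lat_col], errors="coerce")
    lon = pd.to_numeric(df[lon_col], errors="coerce")
    bad = lat.isna() | lon.isna() | ~lat.between(-90, 90) | ~lon.between(-180, 180)
    return df[bad]
```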

Site nesting

Sampling site nesting can be understood
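
Nesting is easiest to understand when each level is explicit, either as separate columns or as a location table with a parent identifier. The sketch below, using hypothetical column names, reconstructs a parent-child location table from explicit columns:

```python
import pandas as pd

# Hypothetical survey table where nesting is explicit: quadrats within plots within sites.
obs = pd.DataFrame({
    "site":    ["A", "A", "B"],
    "plot":    ["A-1", "A-2", "B-1"],
    "quadrat": ["A-1-a", "A-2-a", "B-1-a"],
})

# Build a location table in which every level points to its parent.
sites = (obs[["site"]].drop_duplicates()
         .rename(columns={"site": "location_id"})
         .assign(parent=None, level="site"))
plots = (obs[["plot", "site"]].drop_duplicates()
         .rename(columns={"plot": "location_id", "site": "parent"})
         .assign(level="plot"))
quadrats = (obs[["quadrat", "plot"]].drop_duplicates()
            .rename(columns={"quadrat": "location_id", "plot": "parent"})
            .assign(level="quadrat"))

locations = pd.concat([sites, plots, quadrats], ignore_index=True)
```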

Taxa

Taxa can be resolved
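
Resolvable here means that taxon names can be matched to a taxonomic authority. As a minimal sketch (one possible resolver among several, not an EDI requirement), the GBIF backbone name-matching service could be queried like this:

```python
import requests

def match_gbif(name: str) -> dict:
    """Match a taxon name against the GBIF backbone taxonomy."""
    resp = requests.get("https://api.gbif.org/v1/species/match",
                        params={"name": name}, timeout=30)
    resp.raise_for_status()
    return resp.json()

result = match_gbif("Peromyscus maniculatus")
print(result.get("matchType"), result.get("scientificName"), result.get("usageKey"))
```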

Table column names

Metadata can be matched to entity columns

This feature has come up in other discussions. The EML metadata asserts what the content of a column is; however, there is no explicit “key” into that column other than the column header in the data entity itself. If the two do not match (or there is no table header), then there is nothing to go on but trust. That is fine when data are shared only within a tightly knit community, but it is less reliable when data are reused outside the originating group.

The ECC currently checks that the number of columns and their typing match (within the limits of an RDB). For attributeNames, it shows the first line of the table for manual comparison (an info check).
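
A minimal sketch of such a comparison, assuming the attributeName values have already been extracted from the EML (the file name and column names are hypothetical):

```python
import csv

def compare_header(csv_path: str, attribute_names: list[str]) -> dict:
    """Compare a table's header row to the attributeNames declared in metadata."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return {
        "column_count_matches": len(header) == len(attribute_names),
        "missing_from_table": [a for a in attribute_names if a not in header],
        "undeclared_in_metadata": [h for h in header if h not in attribute_names],
    }

# Hypothetical usage
report = compare_header("community_survey.csv",
                        ["site_id", "sample_date", "taxon_name", "count"])
```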

Table linkages

Foreign Key linkages are clear

Table linkages are most important for harmonization; synthesis efforts may be able to circumvent issues (and elevate a “not useable” dataset to “marginal”) by dropping records without referents in key fields.
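
As a minimal sketch of that strategy (the tables and column names are hypothetical), records whose key values have no referent in the related table could be dropped like this:

```python
import pandas as pd

# Hypothetical tables: observations reference sampling locations by location_id.
locations = pd.DataFrame({
    "location_id": ["A-1", "A-2"],
    "latitude": [44.1, 44.2],
    "longitude": [-122.1, -122.2],
})
observations = pd.DataFrame({
    "location_id": ["A-1", "A-2", "A-9"],
    "taxon_name": ["Carex sp.", "Poa sp.", "Salix sp."],
    "count": [3, 5, 2],
})

# Keep only observations whose foreign key resolves; set aside the orphans.
has_referent = observations["location_id"].isin(locations["location_id"])
resolved = observations[has_referent]
orphans = observations[~has_referent]
```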