Ecological community surveys
DRAFT DRAFT DRAFT
Introduction
We know from experience that primary research data sets cannot be easily reused until all data are completely understood. Community survey data in particular can be highly complex, often with location-specific methods. Some general principles can be summarized from:
- synthesis scientists working with this type of data
- lessons learned from harmonization of primary data, e.g., EDI’s ecocomDP project.
Anticipated use cases
Synthesis
Synthesis requires that primary research data sets be understood and combined into a similar format. EDI has examined the needs of synthesis scientists using community survey data, and that feedback is incorporated here.
Harmonization
Although a prescribed format would make data easily reused, the complexity of community surveys generally means that a prescribed format cannot be imposed on research studies. Therefore, EDI is harmonizing this data type in a workflow framework, and the lessons learned from examining raw data are included here. For more information on EDI’s harmonization of community survey data, see:
- https://github.com/EDIorg/ecocomDP
- https://edirepository.org/resources/thematic-standardization#ecocomdp
Definitions and conventions
The recommendations and examples below are organized according to how easily the data can be reused. In all cases, Best is preferred and recommended. Marginal and Not useable may overlap in some situations.
- Best: Data are easily understood (do not require manual handling or further investigation)
- OK: Some manual, custom, or specialized handling is required
- Marginal: Additional metadata is required, along with custom handling and probably investigation (e.g., questions answered via email)
- Not useable: The dataset has significant metadata missing, or has too many inconsistencies or layout challenges for it to be used even with manual handling
Recommendations for datasets
Sampling methods
Methods are generally text metadata
- Best:
  - explanation of the sampling strategy
  - diagrams of sampling plots and their spatial relationships
- OK:
  - reference included to a paper describing the above
- Marginal/Not useable:
  - no description of methods
Dates
Temporal sampling regime is consistent
- Best: A column for dateTime is in the entity, and its format is consistent throughout
- OK: sampling regime changes over time (e.g., YYYY vs. YYYY-MM-DD)
- Not useable: date and time columns are not typed in EML as dateTimes (i.e., they are typed as strings)
Synthesis efforts may be able to circumvent the lack of true dates by dropping records (and elevate a “not useable” dataset to “marginal”).
Note: the ECC currently checks attributes typed as dateTime in two ways:
- the format is compared to a list of preferred formats
- if the format is in the preferred list, the data values are checked for agreement with that format
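The two-step check described above can be sketched roughly as follows. This is not the ECC's implementation: the function name `check_datetime_column`, the short `PREFERRED_FORMATS` list, and the mapping to Python `strptime` patterns are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical mapping from EML-style format strings to strptime patterns.
# This short list is an assumption, not the ECC's actual preferred list.
PREFERRED_FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYY-MM-DD hh:mm:ss": "%Y-%m-%d %H:%M:%S",
    "YYYY": "%Y",
}

def check_datetime_column(format_string, values):
    """Return (format_is_preferred, values_that_do_not_match_the_format)."""
    pattern = PREFERRED_FORMATS.get(format_string)
    if pattern is None:
        # Step 1 failed: the declared format is not in the preferred list,
        # so the values are not examined at all.
        return False, []
    bad = []
    for value in values:
        try:
            parsed = datetime.strptime(value, pattern)
        except ValueError:
            bad.append(value)
            continue
        # strptime accepts unpadded fields such as "2019-6-1"; formatting the
        # parsed value back and comparing makes the check strict.
        if parsed.strftime(pattern) != value:
            bad.append(value)
    return True, bad

preferred, bad_values = check_datetime_column(
    "YYYY-MM-DD", ["2019-06-01", "2019", "2019-13-01"]
)
```

A dataset whose sampling regime changed from yearly to daily would show up here as a list of nonconforming values, which a synthesis workflow could then handle separately or drop.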
Locations
Should be complete, with latitude and longitude
- Best: Columns for decimal lat/lon
- OK (need custom processing):
  - In metadata only
  - Deg-min-sec (strings)
  - Locations in second table
- Not useable: site codes without lat/lon
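As an illustration of the custom processing that deg-min-sec strings require, here is a minimal sketch of a conversion to decimal degrees. The function name `dms_to_decimal` and the assumed input layout (degrees, minutes, seconds, then a hemisphere letter) are hypothetical; real datasets vary, and the pattern would need adjusting for each one.

```python
import re

# Assumed layout: degrees, minutes, seconds, hemisphere letter, separated by
# any non-digit characters, e.g. 34°24'47.5"N or "119 50 32 W".
DMS_PATTERN = re.compile(
    r"^\s*(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*([NSEW])\s*$",
    re.IGNORECASE,
)

def dms_to_decimal(text):
    """Convert a degrees-minutes-seconds string to signed decimal degrees."""
    match = DMS_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized DMS string: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative by convention.
    return -value if hemisphere.upper() in ("S", "W") else value

latitude = dms_to_decimal("34°24'47.5\"N")
longitude = dms_to_decimal("119 50 32 W")
```

Strings that do not fit the assumed layout raise an error rather than being silently converted, so the rows that need investigation are easy to find.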
Taxa
Taxa can be resolved
- Best: Taxon codes in the entity itself, assigned at source by those familiar with these taxonomic groups
- OK: species binomials (IDs will have to be added later, by someone less familiar with these taxa)
  - example: https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33
- Not useable: local codes only
  - example (if all they had included was the column called “sp_code”): https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-sbc.17.33
Table column names
Metadata can be matched to entity columns
- Best: attributeName exactly matches column header
- OK: can be matched by manual examination
- Marginal: no header
- example
This feature has come up in other discussions. The EML metadata asserts what the content of a column is, but there is no explicit “key” into that column except for the column header in the data entity itself. If these do not match (or there is no table header), then there is nothing to go on but trust. That’s fine if data are shared only within a tightly knit community, but is less reliable when data are reused outside the originating group.
The ECC currently checks that the number of columns and their typing match (within the limits of an RDB). For attributeNames, it shows you the first line of the table for manual comparison (an info check).
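A comparison of column headers against attributeNames could be automated along these lines. This is a sketch, not part of the ECC: the function name `header_mismatches` and the sample table are hypothetical, and the attributeName list is assumed to have been extracted from the EML already, in column order.

```python
import csv
import io
from itertools import zip_longest

def header_mismatches(csv_text, attribute_names):
    """List (position, column_header, attributeName) for every disagreement.

    A missing value on either side is reported as None, so a difference in
    the number of columns also shows up in the result.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        (position, column, attribute)
        for position, (column, attribute)
        in enumerate(zip_longest(header, attribute_names))
        if column != attribute
    ]

# Hypothetical table and metadata, for illustration only.
table = "site,date,sp_code,count\nA,2019-06-01,MAPY,3\n"
problems = header_mismatches(table, ["site", "Date", "sp_code"])
```

An empty result corresponds to the Best case. Note that a table with no header row cannot be distinguished from a mismatched one by this check; its first data row is simply reported as disagreeing with the metadata.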
Table linkages
Foreign Key linkages are clear
- Best: EML constraint included, with referential integrity
- OK: FK detected manually, has referential integrity
- url
- Not useable: FK detected manually, but no referential integrity
- url
Table linkages are most important for harmonization; synthesis efforts may be able to circumvent issues (and elevate a “not useable” dataset to “marginal”) by dropping records without referents in key fields.
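Dropping records without referents can be sketched as below, assuming each table has been read into a list of dictionaries (for example with `csv.DictReader`). The function name `split_by_referent` and the sample tables and column names are hypothetical.

```python
def split_by_referent(child_rows, parent_rows, child_key, parent_key):
    """Separate child rows whose foreign key has a referent from those without.

    Returns (kept, orphans). An empty orphans list means the linkage has
    referential integrity.
    """
    parent_ids = {row[parent_key] for row in parent_rows}
    kept = [row for row in child_rows if row[child_key] in parent_ids]
    orphans = [row for row in child_rows if row[child_key] not in parent_ids]
    return kept, orphans

# Hypothetical observation and site tables, for illustration only.
sites = [{"site_id": "A"}, {"site_id": "B"}]
observations = [
    {"site": "A", "count": 3},
    {"site": "C", "count": 1},  # no matching site record
]
kept, orphans = split_by_referent(observations, sites, "site", "site_id")
```

Returning the orphaned rows alongside the kept ones, rather than discarding them silently, lets a synthesis effort report how much data was lost to missing referents.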