2 Evaluate metadata content against FAIR criteria

2.1 Introduction

FAIR - findable, accessible, interoperable, reusable, as first suggested by Wilkinson et al. 20161 is a framework for understanding quality of metadata in terms of making data usable for somebody who was not directly involved in the sampling.

EDI has implemented the EML schema for metadata and a congruence checker which together already assure a high degree of metadata quality. As the community develops more guidelines for metadata quality it may become interesting for a research site or project to evaluate their overall performance in this area and to analyze where improvements may be implemented.

The Research Data Alliance (RDA) has developed guidelines for evaluating FAIR, the FAIR data maturity model. This framework is fairly general and research communities need to expand it with more specific community level criteria. DataONE has taken the initiative to develop such criteria for data in the DataONE community. Specific implementation of the checks for EML metadata can be found in the git repo.

Table 1: Comparison of FAIR criteria and how they are implemented in EDI

FAIR RDA DataONE EDI repository implementation EML schema EDI check
F Metadata is identified by a persistent identifier metadata identifier present required yes
F Data is identified by a persistent identifier entity identifier present yes
F Metadata is identified by a globally unique identifier metadata identifier present required yes
F Data is identified by a globally unique identifier entity identifier present yes
F entity identifier Type present yes
F Rich metadata is provided to allow discovery resource title length sufficient yes
F resource publication date present yes
F resource creator present required yes
F resource creator identifier present no
F resource abstract length sufficient yes
F resource keywords present yes
F resource keywords controlled no
F resource keyword type present no
F resource publication date timeframe no
F resource revision date present not in EML = latest publication date no
F resource spatial extent present yes
F geographic description present required with spatial extent yes
F resource taxonomic extent present yes
F resource temporal extent present yes
F Metadata includes the identifier for the data entity identifier present yes
F Metadata is offered in such a way that it can be harvested and indexed D1 member node, PASTA+ API, schema.org
A Metadata contains information to enable the user to get access to the data entity distribution URL resolvable yes
A resource access control rules present yes no
A resource distribution contact present required yes
A resource distribution contact identifier present no
A resource publisher present is EDI
A resource publisher identifier present EDI’s ROR
A resource service location present PASTA+ API not in EML
A resource service provider present PASTA+ API not in EML
A Metadata can be accessed manually (i.e. with human intervention) resource landing page present yes
A Data can be accessed manually (i.e. with human intervention) entity distribution URL resolvable yes yes
A Metadata identifier resolves to a metadata record metadata identifier resolvable no
A Data identifier resolves to a digital object yes
A Metadata is accessed through standardized protocol yes
A Data is accessible through standardized protocol yes
A Data can be accessed automatically (i.e. by a computer program) yes
A Metadata is accessible through a free access protocol yes
A Data is accessible through a free access protocol yes
A Data is accessible through an access protocol that supports authentication and authorisation yes
A Metadata is guaranteed to remain available after data is no longer available
I Metadata uses knowledge representation expressed in standardised format LTER vocabulary
I Data uses knowledge representation expressed in standardised format EML entity
I entity format present yes
I entity name present required yes
I entity type present encoded yes
I entity checksum present yes
I entity attributeName differs from description yes
I entity attributeNames unique yes
I entity attributeDefinition present no
I entity attributeDefinition sufficient no
I entity attributeStorageType present no
I Metadata uses machine-understandable knowledge representation LTER vocabulary in SKOS
I Data uses machine-understandable knowledge representation data models
I Metadata uses FAIR-compliant vocabularies LTER vocabulary
I Data uses FAIR-compliant vocabularies
I Metadata includes references to other metadata
I Data includes references to other data
I Metadata includes references to other data
I Data includes qualified references to other data
I Metadata includes qualified references to other metadata
I Metadata include qualified references to other data
R Plurality of accurate and relevant attributes are provided to allow reuse entity format nonproprietary
R entity attributeDomain present
R entity attributeUnits present required yes
R entity attributeMeasurementScale present required yes
R entity attributePrecision present
R entity description present required yes
R entity qualityDescription present
R Metadata includes information about the licence under which the data can be reused resource license present yes
R Metadata refers to a standard reuse licence
R Metadata refers to a machine-understandable reuse licence
R Metadata includes provenance information according to community-specific standards provenance processStepCode present
R provenance sourceEntity present
R provenance trace present not in EML
R resource methods present yes
R Metadata includes provenance information according to a cross-community language
R Metadata complies with a community standard EML yes
R Data complies with a community standard
R Metadata is expressed in compliance with a machine-understandable community standard EML yes
R Data is expressed in compliance with a machine-understandable community standard

2.2 Download EML files

For more information see the EDIutils R package, e.g. how to find all package IDs for a site or by keyword

library(EDIutils)
library(xml2)
library(stringr)
library(tidyverse)

scope <- "knb-lter-ntl"
identifier <- 1

#find the newest revision
revision <- list_data_package_revisions(scope = scope,
                                        identifier = identifier,
                                        filter = "newest")
package_id <- paste(scope,identifier,revision, sep = ".")

# Read the EML file for the data package ID and save EML locally.
eml_file <- read_metadata(packageId = package_id)
#write_xml(eml_file, file = paste("./data/", package_id, "xml", sep = "."))

2.3 Analyze EML content

These checks are not comprehensive. Checks for semantic annotations are not implemented here yet.

2.3.1 dataset has ID and ID is resolvable

We know that all datasets in EDI have a metadata ID for valid EML, however, that ID is not resolvable. This checks to see if there is an alternateIdentifier that is resolvable. One possible ID is the EDI inserted DOI, but others are possible as well.

  eml_id <- xml_attr(eml_file, 'packageId')

  eml_alt_id <- xml_text(xml_find_all(eml_file, './dataset/alternateIdentifier'))
  eml_alt_id_syst <- xml_find_all(eml_file, './dataset/alternateIdentifier') %>%
    xml_attr('system')
  eml_alt_id <- str_remove(eml_alt_id, 'doi:')
  
  eml_id_text <- paste(eml_alt_id_syst, eml_alt_id, sep = '/')
  eml_id_resolv <- ifelse(str_detect(eml_id_text, "https://|http://"), 1, 0)

2.3.2 title length

eml_title <- xml_text(xml_find_first(eml_file, './/title'))
title_words <- str_split(eml_title, '\\s+')
title_length <- length(title_words[[1]])

2.3.3 abstract length

eml_abstract <- xml_text(xml_find_first(eml_file, './/abstract'))
eml_abstract <- str_replace_all(eml_abstract, '\\\n', ' ')
eml_abstract <- str_remove_all(eml_abstract, '\\\t')
abstract_words <- str_split(eml_abstract, '\\s+')
abstract_length <- length(abstract_words[[1]])

2.3.4 number of keywords, keyword types, keyword thesaurus

eml_keywordsets <- xml_find_all(eml_file,'.//keywordSet')
eml_keywords <- xml_find_all(eml_keywordsets, './/keyword')
num_keywords <- length(eml_keywords)
eml_keyword_attr <- xml_has_attr(eml_keywords, 'keywordType')
num_keywordtype <- length(which(eml_keyword_attr))
eml_thesaurus <- xml_find_all(eml_keywordsets, './/keywordThesaurus')
num_thesaurus <- length(eml_thesaurus)

2.3.5 pub date

eml_pubdate <- xml_text(xml_find_first(eml_file, './/pubDate'))

#creator and orcid ID

  num_creators <- length(xml_find_all(eml_file, './dataset/creator'))
  num_orcids <- length(xml_find_all(eml_file, './dataset/creator/userId'))

2.3.6 coverages present

eml_geog_num <- length(xml_find_all(eml_file, './/geographicCoverage'))
if (eml_geog_num > 0) {eml_geog <- "yes"} else {eml_geog <- "no"}
eml_geog_descr_num <- length(xml_find_all(eml_file, './/geographicDescription'))
if (eml_geog_descr_num == eml_geog_num) {eml_geog_descr <- "yes"} else {eml_geog_descr <- "no"}
eml_time_num <- length(xml_find_all(eml_file, './/temporalCoverage'))
if (eml_time_num > 0) {eml_time <- "yes"} else {eml_time <- "no"}
eml_taxon_num <- length(xml_find_all(eml_file, './/taxonomicCoverage'))
if (eml_taxon_num > 0 ) {eml_taxon <- "yes"} else {eml_taxon <- "no"}

2.3.7 access is public

eml_access <- xml_text(xml_find_all(eml_file, './access/allow/principal'))
eml_public <- str_detect(eml_access, 'public')
public_num <- length(which(eml_public))
if (public_num > 0) {public <- "yes"} else {public <- "no"}

2.3.8 contact and contact ID present

eml_contact <- length(xml_find_all(eml_file, './dataset/contact/electronicMailAddress'))

eml_contact_id <- length(xml_find_all(eml_file, './dataset/contact/userId'))

2.3.9 publisher and publisher ID present

this is automatically added by EDI

  eml_publisher <- length(xml_find_all(eml_file, './dataset/publisher'))
  eml_publisher_id <- length(xml_find_all(eml_file, './dataset/publisher/userID'))

2.3.11 quality description present

num_qualitydesc <- length(xml_find_all(eml_file, '//qualityControl'))

2.3.12 methods description present and length

eml_methods <- xml_text(xml_find_all(eml_file, '//methods'))
if (length(eml_methods) > 0){
  eml_methods <- str_replace_all(eml_methods, '\\\n', ' ')
  eml_methods <- str_remove_all(eml_methods, '\\\t')
  eml_methods_word <- str_split(eml_methods, '\\s+')
  methods_length <- length(eml_methods_word[[1]])
}

2.3.13 license present

num_license <- length(xml_find_all(eml_file, '//intellectualRights'))
num_license <- num_license + length(xml_find_all(eml_file, '//licensed'))

2.3.14 provenance data source present

num_provdatasource <- length(xml_find_all(eml_file, '//dataSource'))

2.3.15 processing code present and described

This checks for certain file extensions

num_software <- length(xml_find_all(eml_file, '//software'))
eml_script <- xml_text(xml_find_all(eml_file, '//otherEntity/physical/objectName'))
extensions <- character(0)
extensions_to_check <- c('R', 'r', 'py', 'sql')
script <- 'no'
if (length(eml_script) > 0){
  for (j in 1:length(eml_script)) {
    eml_scriptparts <- str_split(eml_script[j], '\\.')
    p <- length(eml_scriptparts[[1]])
    extensions <- append(extensions, eml_scriptparts[[1]][p])
  }
  for (j in 1:length(extensions_to_check)) {
    if(extensions_to_check[j]  %in% extensions){
      script <- 'yes'
    }
  }
}

2.3.16 entity information

eml_entities <- xml_find_all(eml_file, './/entityName')
num_entities <- length(eml_entities)
eml_entity_info <- xml_siblings(eml_entities)
num_entity_url <- length(xml_find_all(eml_entity_info, './distribution/online/url'))
num_checksum <- length(xml_find_all(eml_entity_info, './authentication[1]'))
num_entitydescr <- length(xml_find_all(eml_entity_info, '//entityDescription'))
num_enitydescrsufficient <- 0
if (num_entitydescr > 0){
  eml_entitydescr <- xml_text(xml_find_all(eml_entity_info, '//entityDescription'))
  for (j in 1:num_entitydescr) {
    entitydescr_words <- str_split(eml_entitydescr[j], '\\s+')
    if (length(entitydescr_words[[1]]) > 2){
      num_enitydescrsufficient <- num_enitydescrsufficient + 1
    }
  }
}
num_entity_format <- length(xml_find_all(eml_entity_info, '//physical/dataFormat'))

entity_ids_text <- xml_attr(xml_parent(eml_entities), 'id')
entity_syst_text <- xml_attr(xml_parent(eml_entities), 'system')
entity_ids <- length(entity_ids_text[!is.na(entity_ids_text)])
entity_syst <- length(entity_syst_text[!is.na(entity_syst_text)])
entity_alt_ids <- length(xml_find_all(xml_parent(eml_entity_info), './alternateIdentifier'))
entity_ids <- entity_ids + entity_alt_ids
num_entity_id <- ifelse(entity_ids > num_entities, num_entities, entity_ids)
entity_alt_syst <- length(xml_find_all(xml_parent(eml_entity_info), './alternateIdentifier[@system]'))
entity_syst <- entity_syst + entity_alt_syst
num_entity_syst <- ifelse(entity_syst > num_entities, num_entities, entity_syst)

num_otherentity <- length(xml_find_all(eml_file, '//otherEntity'))

2.3.17 determine file format

eml_entity_filename <- xml_text(xml_find_all(eml_file, './/physical/objectName'))
entity_extensions <- vector()

if(length(eml_entity_filename) > 0) {
  for (j in 1:length(eml_entity_filename)) {
    eml_extension_parts <- str_split(eml_entity_filename[j], '\\.')
    p <- length(eml_extension_parts[[1]])
    if(p>1) {
      entity_extensions[j] <- eml_extension_parts[[1]][p]
    }
  }
}

extensions <- data.frame(entity_extensions)
standard <- read.csv('https://github.com/NCEAS/metadig-checks/raw/main/data/DataONEformats.csv')

standard <- standard %>%
  distinct(File.Extension, isProprietary) %>%
  mutate(entity_extensions = str_trim(File.Extension)) %>%
  mutate(isProprietary = str_trim(isProprietary)) %>%
  filter(nchar(File.Extension) > 1)

format <- left_join(extensions, standard, by = "entity_extensions")

open <- format %>%
  group_by(isProprietary) %>%
  summarize(count = n())

num_file_proprietary <- ifelse(any(open$isProprietary == 'Y'), open$count[open$isProprietary == 'Y'], 0)
num_file_open  <- ifelse(any(open$isProprietary == 'N'), open$count[open$isProprietary == 'N'], 0)

2.3.18 table entity specific information

eml_tableentities <- xml_find_all(eml_file, './/dataTable')
num_tables <- length(eml_tableentities)

2.3.19 attribute information

num_attributes <- 0
num_attributedefs <- 0
num_attrdefsufficient <- 0
num_attributedefdifferent <- 0
num_attributestoragetype <- 0
attr_nameunique <- 0
num_attributeprecision <- 0

if (length(eml_tableentities) > 0){
  for (j in 1:length(eml_tableentities)){
    eml_attributes <- xml_find_all(eml_tableentities[j], './/attribute')
    num_attributes <- num_attributes + length(eml_attributes)
    num_attributedefs <- num_attributedefs + length(xml_find_all(eml_attributes, './attributeDefinition'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/interval/precision'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/ratio/precision'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/dateTime/dateTimePrecision'))
    num_attributestoragetype <- num_attributestoragetype + length(xml_find_all(eml_attributes, './storageType'))
    attr_names <- c('')
    
    if (length(eml_attributes) > 0){
      for (k in 1:length(eml_attributes)) {
        eml_attributename <- xml_text(xml_find_first(eml_attributes[k], './attributeName'))
        eml_attributedef <- xml_text(xml_find_first(eml_attributes[k], './attributeDefinition'))
        if (eml_attributedef != eml_attributename){
          num_attributedefdifferent <- num_attributedefdifferent + 1
        }
        attr_names <- append(attr_names, eml_attributename)
        attrdef_words <- str_split(eml_attributedef, '\\s+')
        attrdef_length <- length(attrdef_words[[1]])
        if (attrdef_length > 2){
          num_attrdefsufficient <- num_attrdefsufficient + 1
        }
      }
    }
    
    if (length(unique(attr_names)) == length(attr_names)){
      attr_nameunique <- 1
    }
  }
  
  
}
test <- c("package ID", "dataset ID resolvable", "number of words in title", "number of words in abstract", "number of keywords", "number of keywords with type", "thesauri identified", "publication date present", "number of creators", "number of creators with ID", "geographic coverage present", "geographic description present", "temporal coverage present", "taxononimc coverage present", "public access granted", "number of dataset contacts", "number of contacts with ID", "number of publishers", "number of publisher with ID", "landing page link", "number of words in methods description", "processing code in software element", "processing code in other entity", "license present", "provenance datasource linked", "number of table entities", "number of other entities", "number of all entities", "number of entity IDs present", "number of entitis downloadable", "number of entities with checksums", "number of entities with descriptions", "number of enity description of sufficient length", "number of entities with format defined", "number of entities open format", "number of entities proprietary format", "attribute names unique within each entity", "number of attributes", "number of attribute definitions", "number of attribute definitions of sufficient length", "attribute definition different then attribute name", "attribute storage type defined", "attribute precision defined", "number of data quality descriptions")

result <- c(eml_id, eml_id_text, title_length, abstract_length, num_keywords, num_keywordtype, num_thesaurus, eml_pubdate, num_creators, num_orcids, eml_geog, eml_geog_descr, eml_time, eml_taxon, public, eml_contact, eml_contact_id, eml_publisher, eml_publisher_id, eml_landing, methods_length, num_software, script, num_license, num_provdatasource, num_tables, num_otherentity, num_entities, num_entity_id, num_entity_url, num_checksum, num_entitydescr, num_enitydescrsufficient, num_entity_format, num_file_open, num_file_proprietary, attr_nameunique, num_attributes, num_attributedefs, num_attrdefsufficient, num_attributedefdifferent, num_attributestoragetype, num_attributeprecision, num_qualitydesc)

evaluation <- data.frame(test, result)

knitr::kable(evaluation, table.attr = "class=\"striped\"",
  format = "html")
test result
package ID knb-lter-ntl.1.52
dataset ID resolvable https://doi.org/10.6073/pasta/8359d27bbd91028f222d923a7936077d
number of words in title 17
number of words in abstract 206
number of keywords 41
number of keywords with type 0
thesauri identified 5
publication date present 2010-09-20
number of creators 4
number of creators with ID 2
geographic coverage present yes
geographic description present yes
temporal coverage present yes
taxononimc coverage present no
public access granted yes
number of dataset contacts 2
number of contacts with ID 0
number of publishers 0
number of publisher with ID 0
landing page link https://test7.limnology.wisc.edu/dataset/north-temperate-lakes-lter-chemical-limnology-primary-study-lakes-nutrients-ph-and-carbon-19
number of words in methods description 635
processing code in software element 0
processing code in other entity no
license present 1
provenance datasource linked 0
number of table entities 1
number of other entities 0
number of all entities 1
number of entity IDs present 0
number of entitis downloadable 1
number of entities with checksums 1
number of entities with descriptions 1
number of enity description of sufficient length 1
number of entities with format defined 1
number of entities open format 1
number of entities proprietary format 0
attribute names unique within each entity 1
number of attributes 59
number of attribute definitions 59
number of attribute definitions of sufficient length 51
attribute definition different then attribute name 59
attribute storage type defined 31
attribute precision defined 2
number of data quality descriptions 0

  1. Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18↩︎