2 Evaluate metadata content against FAIR criteria

2.1 Introduction

FAIR (findable, accessible, interoperable, reusable), first proposed by Wilkinson et al. 2016¹, is a framework for assessing the quality of metadata in terms of how well it makes data usable for somebody who was not directly involved in the sampling.

EDI has implemented the EML schema for metadata and a congruence checker, which together already ensure a high degree of metadata quality. As the community develops more guidelines for metadata quality, a research site or project may want to evaluate its overall performance in this area and identify where improvements can be made.

The Research Data Alliance (RDA) has developed guidelines for evaluating FAIRness, the FAIR data maturity model. This framework is fairly general, and research communities need to extend it with more specific community-level criteria. DataONE has taken the initiative to develop such criteria for data in the DataONE community. Specific implementations of the checks for EML metadata can be found in the git repo.

Table 1: Comparison of FAIR criteria and how they are implemented in EDI

FAIR	RDA	DataONE	EDI repository implementation	EML schema	EDI check
F	Metadata is identified by a persistent identifier	metadata identifier present		required	yes
F	Data is identified by a persistent identifier	entity identifier present	yes
F	Metadata is identified by a globally unique identifier	metadata identifier present		required	yes
F	Data is identified by a globally unique identifier	entity identifier present	yes
F		entity identifier Type present	yes
F	Rich metadata is provided to allow discovery	resource title length sufficient			yes
F		resource publication date present			yes
F		resource creator present		required	yes
F		resource creator identifier present			no
F		resource abstract length sufficient			yes
F		resource keywords present			yes
F		resource keywords controlled			no
F		resource keyword type present			no
F		resource publication date timeframe			no
F		resource revision date present		not in EML = latest publication date	no
F		resource spatial extent present			yes
F		geographic description present		required with spatial extent	yes
F		resource taxonomic extent present			yes
F		resource temporal extent present			yes
F	Metadata includes the identifier for the data	entity identifier present	yes
F	Metadata is offered in such a way that it can be harvested and indexed		D1 member node, PASTA+ API, schema.org

A	Metadata contains information to enable the user to get access to the data	entity distribution URL resolvable			yes
A		resource access control rules present	yes		no
A		resource distribution contact present		required	yes
A		resource distribution contact identifier present			no
A		resource publisher present	is EDI
A		resource publisher identifier present	EDI’s ROR
A		resource service location present	PASTA+ API	not in EML
A		resource service provider present	PASTA+ API	not in EML
A	Metadata can be accessed manually (i.e. with human intervention)	resource landing page present	yes
A	Data can be accessed manually (i.e. with human intervention)	entity distribution URL resolvable	yes		yes
A	Metadata identifier resolves to a metadata record	metadata identifier resolvable			no
A	Data identifier resolves to a digital object		yes
A	Metadata is accessed through standardized protocol		yes
A	Data is accessible through standardized protocol		yes
A	Data can be accessed automatically (i.e. by a computer program)		yes
A	Metadata is accessible through a free access protocol		yes
A	Data is accessible through a free access protocol		yes
A	Data is accessible through an access protocol that supports authentication and authorisation		yes
A	Metadata is guaranteed to remain available after data is no longer available

I	Metadata uses knowledge representation expressed in standardised format		LTER vocabulary
I	Data uses knowledge representation expressed in standardised format			EML entity
I		entity format present			yes
I		entity name present		required	yes
I		entity type present		encoded	yes
I		entity checksum present			yes
I		entity attributeName differs from description			yes
I		entity attributeNames unique			yes
I		entity attributeDefinition present			no
I		entity attributeDefinition sufficient			no
I		entity attributeStorageType present			no
I	Metadata uses machine-understandable knowledge representation			LTER vocabulary in SKOS
I	Data uses machine-understandable knowledge representation		data models
I	Metadata uses FAIR-compliant vocabularies		LTER vocabulary
I	Data uses FAIR-compliant vocabularies
I	Metadata includes references to other metadata
I	Data includes references to other data
I	Metadata includes references to other data
I	Data includes qualified references to other data
I	Metadata includes qualified references to other metadata
I	Metadata include qualified references to other data

R	Plurality of accurate and relevant attributes are provided to allow reuse	entity format nonproprietary
R		entity attributeDomain present
R		entity attributeUnits present		required	yes
R		entity attributeMeasurementScale present		required	yes
R		entity attributePrecision present
R		entity description present		required	yes
R		entity qualityDescription present
R	Metadata includes information about the licence under which the data can be reused	resource license present	yes
R	Metadata refers to a standard reuse licence
R	Metadata refers to a machine-understandable reuse licence
R	Metadata includes provenance information according to community-specific standards	provenance processStepCode present
R		provenance sourceEntity present
R		provenance trace present		not in EML
R		resource methods present			yes
R	Metadata includes provenance information according to a cross-community language
R	Metadata complies with a community standard			EML	yes
R	Data complies with a community standard
R	Metadata is expressed in compliance with a machine-understandable community standard			EML	yes
R	Data is expressed in compliance with a machine-understandable community standard

2.2 Download EML files

For more information, see the EDIutils R package, e.g. for how to find all package IDs for a site or by keyword.

library(EDIutils)
library(xml2)
library(stringr)
library(tidyverse)

scope <- "knb-lter-ntl"
identifier <- 1

#find the newest revision
revision <- list_data_package_revisions(scope = scope,
                                        identifier = identifier,
                                        filter = "newest")
package_id <- paste(scope,identifier,revision, sep = ".")

# Read the EML file for the data package ID and save EML locally.
eml_file <- read_metadata(packageId = package_id)
#write_xml(eml_file, file = paste0("./data/", package_id, ".xml"))

2.3 Analyze EML content

These checks are not comprehensive. Checks for semantic annotations are not implemented here yet.

2.3.1 dataset has ID and ID is resolvable

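Every dataset in EDI has a metadata ID, because valid EML requires one; however, that ID is not resolvable. This check looks for an alternateIdentifier that is resolvable. One possible ID is the DOI inserted by EDI, but others are possible as well. Here, "resolvable" is approximated by testing whether the identifier, combined with its system attribute, forms an http(s) URL.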

  eml_id <- xml_attr(eml_file, 'packageId')

  eml_alt_id <- xml_text(xml_find_all(eml_file, './dataset/alternateIdentifier'))
  eml_alt_id_syst <- xml_find_all(eml_file, './dataset/alternateIdentifier') %>%
    xml_attr('system')
  eml_alt_id <- str_remove(eml_alt_id, 'doi:')
  
  eml_id_text <- paste(eml_alt_id_syst, eml_alt_id, sep = '/')
  eml_id_resolv <- ifelse(str_detect(eml_id_text, "https://|http://"), 1, 0)
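The pattern match above only tests whether the identifier looks like a URL. A stricter check would actually request it. The following is a minimal sketch, assuming network access and an R build with libcurl support; `is_resolvable()` is a hypothetical helper name, not part of EDIutils or the original workflow.

```r
# Hypothetical helper: TRUE if an HTTP(S) request for a single `url`
# completes with a status code below 400, FALSE otherwise.
is_resolvable <- function(url) {
  # Reject missing values and anything that is not an http(s) URL
  if (is.na(url) || !grepl("^https?://", url)) return(FALSE)
  status <- tryCatch(
    # base R: fetch only the headers, following redirects (e.g. from doi.org)
    attr(curlGetHeaders(url, redirect = TRUE), "status"),
    error = function(e) NA_integer_   # DNS failure, timeout, no libcurl, ...
  )
  !is.na(status) && status < 400
}

# No request is made for inputs that are not URLs
is_resolvable("not-a-url")
is_resolvable(NA_character_)
```

For several identifiers, something like `vapply(eml_id_text, is_resolvable, logical(1))` would apply the check to each one in turn.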

2.3.2 title length

eml_title <- xml_text(xml_find_first(eml_file, './/title'))
title_words <- str_split(eml_title, '\\s+')
title_length <- length(title_words[[1]])

2.3.3 abstract length

eml_abstract <- xml_text(xml_find_first(eml_file, './/abstract'))
eml_abstract <- str_replace_all(eml_abstract, '\\\n', ' ')
eml_abstract <- str_remove_all(eml_abstract, '\\\t')
abstract_words <- str_split(eml_abstract, '\\s+')
abstract_length <- length(abstract_words[[1]])
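Splitting on whitespace can inflate the count by one when the text starts with whitespace, because the first element of the split is then an empty string. As an alternative sketch, runs of non-whitespace characters can be counted directly; `count_words()` is a hypothetical helper name introduced here for illustration.

```r
library(stringr)

# Hypothetical helper: count whitespace-delimited words,
# treating missing text as zero words.
count_words <- function(x) {
  n <- str_count(x, "\\S+")   # one match per run of non-whitespace characters
  ifelse(is.na(n), 0L, n)
}

# Leading, trailing, and repeated whitespace do not affect the result
count_words("\n   Chemical limnology\tof primary study lakes  ")
```

The same helper could be applied to the title, the abstract, and the methods text.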

2.3.4 number of keywords, keyword types, keyword thesaurus

eml_keywordsets <- xml_find_all(eml_file,'.//keywordSet')
eml_keywords <- xml_find_all(eml_keywordsets, './/keyword')
num_keywords <- length(eml_keywords)
eml_keyword_attr <- xml_has_attr(eml_keywords, 'keywordType')
num_keywordtype <- length(which(eml_keyword_attr))
eml_thesaurus <- xml_find_all(eml_keywordsets, './/keywordThesaurus')
num_thesaurus <- length(eml_thesaurus)
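The totals above do not show which keyword sets lack a thesaurus. A per-set summary is one way to see that. The sketch below runs on a made-up EML fragment (the keywords and thesaurus name are illustrative only), so it can be tried without downloading anything.

```r
library(xml2)

# Made-up EML fragment for illustration only
eml_demo <- read_xml('<dataset>
  <keywordSet>
    <keyword keywordType="theme">lakes</keyword>
    <keyword>nutrients</keyword>
    <keywordThesaurus>LTER Controlled Vocabulary</keywordThesaurus>
  </keywordSet>
  <keywordSet>
    <keyword>Wisconsin</keyword>
  </keywordSet>
</dataset>')

keyword_sets <- xml_find_all(eml_demo, ".//keywordSet")

# One row per keywordSet; thesaurus is NA when the set does not name one
keyword_summary <- data.frame(
  thesaurus = xml_text(xml_find_first(keyword_sets, "./keywordThesaurus")),
  n_keywords = vapply(seq_along(keyword_sets), function(i)
    length(xml_find_all(keyword_sets[[i]], "./keyword")), integer(1)),
  n_typed = vapply(seq_along(keyword_sets), function(i)
    length(xml_find_all(keyword_sets[[i]], "./keyword[@keywordType]")), integer(1))
)
keyword_summary
```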

2.3.5 pub date

eml_pubdate <- xml_text(xml_find_first(eml_file, './/pubDate'))

# Creators and creator IDs (e.g. ORCID)

  num_creators <- length(xml_find_all(eml_file, './dataset/creator'))
  num_orcids <- length(xml_find_all(eml_file, './dataset/creator/userId'))

2.3.6 coverages present

eml_geog_num <- length(xml_find_all(eml_file, './/geographicCoverage'))
if (eml_geog_num > 0) {eml_geog <- "yes"} else {eml_geog <- "no"}
eml_geog_descr_num <- length(xml_find_all(eml_file, './/geographicDescription'))
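# a description only counts as present if there is at least one geographic coverage
if (eml_geog_num > 0 && eml_geog_descr_num == eml_geog_num) {eml_geog_descr <- "yes"} else {eml_geog_descr <- "no"}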
eml_time_num <- length(xml_find_all(eml_file, './/temporalCoverage'))
if (eml_time_num > 0) {eml_time <- "yes"} else {eml_time <- "no"}
eml_taxon_num <- length(xml_find_all(eml_file, './/taxonomicCoverage'))
if (eml_taxon_num > 0 ) {eml_taxon <- "yes"} else {eml_taxon <- "no"}

2.3.7 access is public

eml_access <- xml_text(xml_find_all(eml_file, './access/allow/principal'))
eml_public <- str_detect(eml_access, 'public')
public_num <- length(which(eml_public))
if (public_num > 0) {public <- "yes"} else {public <- "no"}
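The check above looks for a `public` principal in any allow rule. A slightly stricter variant also requires that the rule grants read permission. This is a sketch on a made-up access block (the first principal is invented for illustration); it does not consider deny rules.

```r
library(xml2)

# Made-up access block for illustration only
access_demo <- read_xml('<eml>
  <access>
    <allow>
      <principal>uid=someuser,o=EDI,dc=edirepository,dc=org</principal>
      <permission>all</permission>
    </allow>
    <allow>
      <principal>public</principal>
      <permission>read</permission>
    </allow>
  </access>
</eml>')

# Select permissions of allow rules whose principal is exactly "public"
# and keep only those that grant "read"
public_read_rules <- xml_find_all(
  access_demo,
  './access/allow[normalize-space(principal)="public"]/permission[normalize-space(.)="read"]'
)
public_read <- length(public_read_rules) > 0
public_read
```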

2.3.8 contact and contact ID present

eml_contact <- length(xml_find_all(eml_file, './dataset/contact/electronicMailAddress'))

eml_contact_id <- length(xml_find_all(eml_file, './dataset/contact/userId'))

2.3.9 publisher and publisher ID present

This is automatically added by EDI.

  eml_publisher <- length(xml_find_all(eml_file, './dataset/publisher'))
  # the EML element is userId; XPath names are case-sensitive
  eml_publisher_id <- length(xml_find_all(eml_file, './dataset/publisher/userId'))

2.3.10 landing page link present

  eml_landing <- xml_text(xml_find_all(eml_file, './dataset/distribution/online/url[@function="information"]'))
  # use the first matching URL, or an empty string if there is none
  eml_landing <- if (length(eml_landing) > 0) eml_landing[1] else ""
  eml_landing_resolv <- ifelse(str_detect(eml_landing, "https://|http://"), 1, 0)

2.3.11 quality description present

num_qualitydesc <- length(xml_find_all(eml_file, '//qualityControl'))

2.3.12 methods description present and length

eml_methods <- xml_text(xml_find_all(eml_file, '//methods'))
# default so that the summary below still works when no methods element exists
methods_length <- 0
if (length(eml_methods) > 0){
  eml_methods <- str_replace_all(eml_methods, '\\\n', ' ')
  eml_methods <- str_remove_all(eml_methods, '\\\t')
  eml_methods_word <- str_split(eml_methods, '\\s+')
  # only the first methods element is counted
  methods_length <- length(eml_methods_word[[1]])
}

2.3.13 license present

num_license <- length(xml_find_all(eml_file, '//intellectualRights'))
num_license <- num_license + length(xml_find_all(eml_file, '//licensed'))

2.3.14 provenance data source present

num_provdatasource <- length(xml_find_all(eml_file, '//dataSource'))

2.3.15 processing code present and described

This counts software elements and checks whether any otherEntity object name ends in a script file extension (R, r, py, sql).

num_software <- length(xml_find_all(eml_file, '//software'))
eml_script <- xml_text(xml_find_all(eml_file, '//otherEntity/physical/objectName'))
extensions <- character(0)
extensions_to_check <- c('R', 'r', 'py', 'sql')
script <- 'no'
if (length(eml_script) > 0){
  for (j in 1:length(eml_script)) {
    eml_scriptparts <- str_split(eml_script[j], '\\.')
    p <- length(eml_scriptparts[[1]])
    extensions <- append(extensions, eml_scriptparts[[1]][p])
  }
  for (j in 1:length(extensions_to_check)) {
    if(extensions_to_check[j]  %in% extensions){
      script <- 'yes'
    }
  }
}
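The two loops above can also be written without explicit iteration. The following sketch uses `tools::file_ext()` from base R on made-up object names; note that, unlike the split-based approach, it returns an empty string for names without a dot.

```r
extensions_to_check <- c("R", "r", "py", "sql")

# Made-up object names for illustration only
object_names <- c("ntl_nutrients.csv", "clean_nutrients.R", "README")

# file_ext() returns the text after the last dot, or "" if there is none
found_extensions <- tools::file_ext(object_names)

script <- if (any(found_extensions %in% extensions_to_check)) "yes" else "no"
script
```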

2.3.16 entity information

eml_entities <- xml_find_all(eml_file, './/entityName')
num_entities <- length(eml_entities)
eml_entity_info <- xml_siblings(eml_entities)
num_entity_url <- length(xml_find_all(eml_entity_info, './distribution/online/url'))
num_checksum <- length(xml_find_all(eml_entity_info, './authentication[1]'))
num_entitydescr <- length(xml_find_all(eml_entity_info, '//entityDescription'))
num_enitydescrsufficient <- 0
if (num_entitydescr > 0){
  eml_entitydescr <- xml_text(xml_find_all(eml_entity_info, '//entityDescription'))
  for (j in 1:num_entitydescr) {
    entitydescr_words <- str_split(eml_entitydescr[j], '\\s+')
    if (length(entitydescr_words[[1]]) > 2){
      num_enitydescrsufficient <- num_enitydescrsufficient + 1
    }
  }
}
num_entity_format <- length(xml_find_all(eml_entity_info, '//physical/dataFormat'))

entity_ids_text <- xml_attr(xml_parent(eml_entities), 'id')
entity_syst_text <- xml_attr(xml_parent(eml_entities), 'system')
entity_ids <- length(entity_ids_text[!is.na(entity_ids_text)])
entity_syst <- length(entity_syst_text[!is.na(entity_syst_text)])
entity_alt_ids <- length(xml_find_all(xml_parent(eml_entity_info), './alternateIdentifier'))
entity_ids <- entity_ids + entity_alt_ids
num_entity_id <- ifelse(entity_ids > num_entities, num_entities, entity_ids)
entity_alt_syst <- length(xml_find_all(xml_parent(eml_entity_info), './alternateIdentifier[@system]'))
entity_syst <- entity_syst + entity_alt_syst
num_entity_syst <- ifelse(entity_syst > num_entities, num_entities, entity_syst)

num_otherentity <- length(xml_find_all(eml_file, '//otherEntity'))

2.3.17 determine file format

eml_entity_filename <- xml_text(xml_find_all(eml_file, './/physical/objectName'))
# character vector, so the join below also works when no object names are found
entity_extensions <- character(0)

if(length(eml_entity_filename) > 0) {
  for (j in 1:length(eml_entity_filename)) {
    eml_extension_parts <- str_split(eml_entity_filename[j], '\\.')
    p <- length(eml_extension_parts[[1]])
    if(p>1) {
      entity_extensions[j] <- eml_extension_parts[[1]][p]
    }
  }
}

extensions <- data.frame(entity_extensions)
standard <- read.csv('https://github.com/NCEAS/metadig-checks/raw/main/data/DataONEformats.csv')

standard <- standard %>%
  distinct(File.Extension, isProprietary) %>%
  mutate(entity_extensions = str_trim(File.Extension)) %>%
  mutate(isProprietary = str_trim(isProprietary)) %>%
  filter(nchar(File.Extension) > 1)

format <- left_join(extensions, standard, by = "entity_extensions")

open <- format %>%
  group_by(isProprietary) %>%
  summarize(count = n())

# count directly on the joined table; extensions missing from the lookup are NA and ignored
num_file_proprietary <- sum(format$isProprietary == 'Y', na.rm = TRUE)
num_file_open  <- sum(format$isProprietary == 'N', na.rm = TRUE)
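To see how the classification behaves without downloading the DataONE format list, the same logic can be tried on a small local lookup table. This is a sketch only: the lookup values below are illustrative assumptions, not copied from `DataONEformats.csv`, and `num_file_unknown` is an extra count introduced here.

```r
# Made-up extensions, including a missing one and one not in the lookup
entity_extensions <- c("csv", "xlsx", "csv", NA, "zzz")

# Illustrative lookup; the real list comes from the DataONE formats CSV
format_lookup <- data.frame(
  ext = c("csv", "xlsx"),
  isProprietary = c("N", "Y")
)

# match() returns NA for extensions that are missing or not in the lookup
is_proprietary <- format_lookup$isProprietary[
  match(tolower(entity_extensions), format_lookup$ext)
]

num_file_open        <- sum(is_proprietary == "N", na.rm = TRUE)
num_file_proprietary <- sum(is_proprietary == "Y", na.rm = TRUE)
num_file_unknown     <- sum(is.na(is_proprietary))
```

Reporting the unknown count separately makes it visible when a file format could not be classified at all.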

2.3.18 table entity specific information

eml_tableentities <- xml_find_all(eml_file, './/dataTable')
num_tables <- length(eml_tableentities)

2.3.19 attribute information

num_attributes <- 0
num_attributedefs <- 0
num_attrdefsufficient <- 0
num_attributedefdifferent <- 0
num_attributestoragetype <- 0
attr_nameunique <- 0
num_attributeprecision <- 0

if (length(eml_tableentities) > 0){
  # assume names are unique until a table with duplicate names is found
  attr_nameunique <- 1
  for (j in seq_along(eml_tableentities)){
    eml_attributes <- xml_find_all(eml_tableentities[j], './/attribute')
    num_attributes <- num_attributes + length(eml_attributes)
    num_attributedefs <- num_attributedefs + length(xml_find_all(eml_attributes, './attributeDefinition'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/interval/precision'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/ratio/precision'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/dateTime/dateTimePrecision'))
    num_attributestoragetype <- num_attributestoragetype + length(xml_find_all(eml_attributes, './storageType'))
    attr_names <- character(0)

    for (k in seq_along(eml_attributes)) {
      eml_attributename <- xml_text(xml_find_first(eml_attributes[k], './attributeName'))
      eml_attributedef <- xml_text(xml_find_first(eml_attributes[k], './attributeDefinition'))
      attr_names <- append(attr_names, eml_attributename)
      # a missing definition is NA: count it as neither different nor sufficient
      if (is.na(eml_attributedef)) next
      if (eml_attributedef != eml_attributename){
        num_attributedefdifferent <- num_attributedefdifferent + 1
      }
      attrdef_words <- str_split(eml_attributedef, '\\s+')
      attrdef_length <- length(attrdef_words[[1]])
      if (attrdef_length > 2){
        num_attrdefsufficient <- num_attrdefsufficient + 1
      }
    }

    # names must be unique within every table, not just within one of them
    if (anyDuplicated(attr_names) > 0){
      attr_nameunique <- 0
    }
  }
}
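The uniqueness check is meant to hold within each table entity. A per-table version makes it easy to see which table fails. The sketch below runs on a made-up fragment with two tables, the second of which repeats an attribute name.

```r
library(xml2)

# Made-up fragment for illustration only
tables_demo <- read_xml('<dataset>
  <dataTable>
    <entityName>table one</entityName>
    <attributeList>
      <attribute><attributeName>lakeid</attributeName></attribute>
      <attribute><attributeName>ph</attributeName></attribute>
    </attributeList>
  </dataTable>
  <dataTable>
    <entityName>table two</entityName>
    <attributeList>
      <attribute><attributeName>depth</attributeName></attribute>
      <attribute><attributeName>depth</attributeName></attribute>
    </attributeList>
  </dataTable>
</dataset>')

tables <- xml_find_all(tables_demo, ".//dataTable")

# One logical value per table: TRUE if no attribute name is repeated
names_unique <- vapply(seq_along(tables), function(i) {
  attr_names <- xml_text(xml_find_all(tables[[i]], ".//attribute/attributeName"))
  anyDuplicated(attr_names) == 0
}, logical(1))

# Overall flag: 1 only if there is at least one table and all tables pass
attr_nameunique <- as.integer(length(names_unique) > 0 && all(names_unique))
```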

test <- c("package ID", "dataset ID resolvable", "number of words in title", "number of words in abstract", "number of keywords", "number of keywords with type", "thesauri identified", "publication date present", "number of creators", "number of creators with ID", "geographic coverage present", "geographic description present", "temporal coverage present", "taxonomic coverage present", "public access granted", "number of dataset contacts", "number of contacts with ID", "number of publishers", "number of publishers with ID", "landing page link", "number of words in methods description", "processing code in software element", "processing code in other entity", "license present", "provenance datasource linked", "number of table entities", "number of other entities", "number of all entities", "number of entity IDs present", "number of entities downloadable", "number of entities with checksums", "number of entities with descriptions", "number of entity descriptions of sufficient length", "number of entities with format defined", "number of entities open format", "number of entities proprietary format", "attribute names unique within each entity", "number of attributes", "number of attribute definitions", "number of attribute definitions of sufficient length", "attribute definition different from attribute name", "attribute storage type defined", "attribute precision defined", "number of data quality descriptions")

result <- c(eml_id, eml_id_text, title_length, abstract_length, num_keywords, num_keywordtype, num_thesaurus, eml_pubdate, num_creators, num_orcids, eml_geog, eml_geog_descr, eml_time, eml_taxon, public, eml_contact, eml_contact_id, eml_publisher, eml_publisher_id, eml_landing, methods_length, num_software, script, num_license, num_provdatasource, num_tables, num_otherentity, num_entities, num_entity_id, num_entity_url, num_checksum, num_entitydescr, num_enitydescrsufficient, num_entity_format, num_file_open, num_file_proprietary, attr_nameunique, num_attributes, num_attributedefs, num_attrdefsufficient, num_attributedefdifferent, num_attributestoragetype, num_attributeprecision, num_qualitydesc)

evaluation <- data.frame(test, result)

knitr::kable(evaluation, table.attr = "class=\"striped\"",
  format = "html")

test	result
package ID	knb-lter-ntl.1.52
dataset ID resolvable	https://doi.org/10.6073/pasta/8359d27bbd91028f222d923a7936077d
number of words in title	17
number of words in abstract	206
number of keywords	41
number of keywords with type	0
thesauri identified	5
publication date present	2010-09-20
number of creators	4
number of creators with ID	2
geographic coverage present	yes
geographic description present	yes
temporal coverage present	yes
taxonomic coverage present	no
public access granted	yes
number of dataset contacts	2
number of contacts with ID	0
number of publishers	0
number of publishers with ID	0
landing page link	https://test7.limnology.wisc.edu/dataset/north-temperate-lakes-lter-chemical-limnology-primary-study-lakes-nutrients-ph-and-carbon-19
number of words in methods description	635
processing code in software element	0
processing code in other entity	no
license present	1
provenance datasource linked	0
number of table entities	1
number of other entities	0
number of all entities	1
number of entity IDs present	0
number of entities downloadable	1
number of entities with checksums	1
number of entities with descriptions	1
number of entity descriptions of sufficient length	1
number of entities with format defined	1
number of entities open format	1
number of entities proprietary format	0
attribute names unique within each entity	1
number of attributes	59
number of attribute definitions	59
number of attribute definitions of sufficient length	51
attribute definition different from attribute name	59
attribute storage type defined	31
attribute precision defined	2
number of data quality descriptions	0
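To evaluate a whole site or project rather than a single package, the individual checks can be wrapped in a function that returns one row per EML document. The following is a minimal sketch with only a few of the checks; `check_eml()` is a hypothetical name, and the two inline documents with package IDs `demo.1.1` and `demo.2.1` are made up so the example runs without network access.

```r
library(xml2)
library(stringr)

# Hypothetical wrapper: a few of the checks above, returned as one row
check_eml <- function(eml) {
  data.frame(
    package_id   = xml_attr(eml, "packageId"),
    title_words  = str_count(xml_text(xml_find_first(eml, ".//title")), "\\S+"),
    num_creators = length(xml_find_all(eml, "./dataset/creator")),
    has_methods  = length(xml_find_all(eml, ".//methods")) > 0
  )
}

# Made-up documents for illustration only
docs <- list(
  read_xml('<eml packageId="demo.1.1"><dataset><title>Lake chemistry of demo lakes</title><creator/><creator/><methods/></dataset></eml>'),
  read_xml('<eml packageId="demo.2.1"><dataset><title>Short title</title><creator/></dataset></eml>')
)

# One row per document, stacked into a single summary table
site_summary <- do.call(rbind, lapply(docs, check_eml))
site_summary
```

In practice, the list of documents could be built by calling `read_metadata()` for each package ID of interest.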

¹ Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18