2 Evaluate metadata content against FAIR criteria
2.1 Introduction
FAIR (findable, accessible, interoperable, reusable), first proposed by Wilkinson et al. (2016), is a framework for assessing metadata quality in terms of how usable the data are for someone who was not directly involved in the sampling.
EDI has implemented the EML schema for metadata and a congruence checker, which together already assure a high degree of metadata quality. As the community develops more guidelines for metadata quality, a research site or project may want to evaluate its overall performance in this area and identify where improvements can be made.
The Research Data Alliance (RDA) has developed guidelines for evaluating FAIRness, the FAIR data maturity model. This framework is fairly general, and research communities need to extend it with more specific community-level criteria. DataONE has taken the initiative to develop such criteria for data in the DataONE community. The specific implementation of the checks for EML metadata can be found in the git repo.
Table 1: Comparison of FAIR criteria and how they are implemented in EDI
FAIR | RDA | DataONE | EDI repository implementation | EML schema | EDI check |
---|---|---|---|---|---|
F | Metadata is identified by a persistent identifier | metadata identifier present | required | yes | |
F | Data is identified by a persistent identifier | entity identifier present | yes | ||
F | Metadata is identified by a globally unique identifier | metadata identifier present | required | yes | |
F | Data is identified by a globally unique identifier | entity identifier present | yes | ||
F | entity identifier Type present | yes | |||
F | Rich metadata is provided to allow discovery | resource title length sufficient | yes | ||
F | resource publication date present | yes | |||
F | resource creator present | required | yes | ||
F | resource creator identifier present | no | |||
F | resource abstract length sufficient | yes | |||
F | resource keywords present | yes | |||
F | resource keywords controlled | no | |||
F | resource keyword type present | no | |||
F | resource publication date timeframe | no | |||
F | resource revision date present | not in EML = latest publication date | no | ||
F | resource spatial extent present | yes | |||
F | geographic description present | required with spatial extent | yes | ||
F | resource taxonomic extent present | yes | |||
F | resource temporal extent present | yes | |||
F | Metadata includes the identifier for the data | entity identifier present | yes | ||
F | Metadata is offered in such a way that it can be harvested and indexed | D1 member node, PASTA+ API, schema.org | |||
A | Metadata contains information to enable the user to get access to the data | entity distribution URL resolvable | yes | ||
A | resource access control rules present | yes | no | ||
A | resource distribution contact present | required | yes | ||
A | resource distribution contact identifier present | no | |||
A | resource publisher present | is EDI | |||
A | resource publisher identifier present | EDI’s ROR | |||
A | resource service location present | PASTA+ API | not in EML | ||
A | resource service provider present | PASTA+ API | not in EML | ||
A | Metadata can be accessed manually (i.e. with human intervention) | resource landing page present | yes | ||
A | Data can be accessed manually (i.e. with human intervention) | entity distribution URL resolvable | yes | yes | |
A | Metadata identifier resolves to a metadata record | metadata identifier resolvable | no | ||
A | Data identifier resolves to a digital object | yes | |||
A | Metadata is accessed through standardized protocol | yes | |||
A | Data is accessible through standardized protocol | yes | |||
A | Data can be accessed automatically (i.e. by a computer program) | yes | |||
A | Metadata is accessible through a free access protocol | yes | |||
A | Data is accessible through a free access protocol | yes | |||
A | Data is accessible through an access protocol that supports authentication and authorisation | yes | |||
A | Metadata is guaranteed to remain available after data is no longer available | ||||
I | Metadata uses knowledge representation expressed in standardised format | LTER vocabulary | |||
I | Data uses knowledge representation expressed in standardised format | EML entity | |||
I | entity format present | yes | |||
I | entity name present | required | yes | ||
I | entity type present | encoded | yes | ||
I | entity checksum present | yes | |||
I | entity attributeName differs from description | yes | |||
I | entity attributeNames unique | yes | |||
I | entity attributeDefinition present | no | |||
I | entity attributeDefinition sufficient | no | |||
I | entity attributeStorageType present | no | |||
I | Metadata uses machine-understandable knowledge representation | LTER vocabulary in SKOS | |||
I | Data uses machine-understandable knowledge representation | data models | |||
I | Metadata uses FAIR-compliant vocabularies | LTER vocabulary | |||
I | Data uses FAIR-compliant vocabularies | ||||
I | Metadata includes references to other metadata | ||||
I | Data includes references to other data | ||||
I | Metadata includes references to other data | ||||
I | Data includes qualified references to other data | ||||
I | Metadata includes qualified references to other metadata | ||||
I | Metadata include qualified references to other data | ||||
R | Plurality of accurate and relevant attributes are provided to allow reuse | entity format nonproprietary | |||
R | entity attributeDomain present | ||||
R | entity attributeUnits present | required | yes | ||
R | entity attributeMeasurementScale present | required | yes | ||
R | entity attributePrecision present | ||||
R | entity description present | required | yes | ||
R | entity qualityDescription present | ||||
R | Metadata includes information about the licence under which the data can be reused | resource license present | yes | ||
R | Metadata refers to a standard reuse licence | ||||
R | Metadata refers to a machine-understandable reuse licence | ||||
R | Metadata includes provenance information according to community-specific standards | provenance processStepCode present | |||
R | provenance sourceEntity present | ||||
R | provenance trace present | not in EML | |||
R | resource methods present | yes | |||
R | Metadata includes provenance information according to a cross-community language | ||||
R | Metadata complies with a community standard | EML | yes | ||
R | Data complies with a community standard | ||||
R | Metadata is expressed in compliance with a machine-understandable community standard | EML | yes | ||
R | Data is expressed in compliance with a machine-understandable community standard |
2.2 Download EML files
For more information see the EDIutils R package, e.g. how to find all package IDs for a site or by keyword.

library(EDIutils)
library(xml2)
library(stringr)
library(tidyverse)

scope <- "knb-lter-ntl"
identifier <- 1

# Find the newest revision
revision <- list_data_package_revisions(scope = scope,
                                        identifier = identifier,
                                        filter = "newest")
package_id <- paste(scope, identifier, revision, sep = ".")

# Read the EML file for the data package ID and save EML locally.
eml_file <- read_metadata(packageId = package_id)
# write_xml(eml_file, file = paste0("./data/", package_id, ".xml"))
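The lookup by site or by keyword mentioned above could be sketched as follows. This is a minimal sketch, not the EDIutils documentation: the helper names `make_package_id()`, `site_package_ids()` and `keyword_package_ids()` are hypothetical, the Solr query string is an assumption about how a keyword search might be phrased, and the two EDIutils-based helpers need network access when called.

```r
# Hypothetical helper: assemble "scope.identifier.revision".
make_package_id <- function(scope, identifier, revision) {
  paste(scope, identifier, revision, sep = ".")
}

# Hypothetical helper: newest package ID for every identifier in a scope.
# Calls the EDI repository, so it only runs with network access.
site_package_ids <- function(scope) {
  identifiers <- EDIutils::list_data_package_identifiers(scope)
  vapply(identifiers, function(identifier) {
    revision <- EDIutils::list_data_package_revisions(
      scope = scope, identifier = identifier, filter = "newest"
    )
    make_package_id(scope, identifier, revision)
  }, character(1))
}

# Hypothetical helper: package IDs matching a single-word keyword.
# The query syntax is an assumption; multi-word keywords would need quoting,
# and the number of rows returned depends on the query.
keyword_package_ids <- function(keyword) {
  query <- paste0("q=keyword:", keyword, "&fl=packageid")
  EDIutils::search_data_packages(query)$packageid
}

# Example calls (not run here):
# site_package_ids("knb-lter-ntl")
# keyword_package_ids("phosphorus")
```

Each returned package ID can then be passed to `read_metadata()` in the same way as the single package above.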
2.3 Analyze EML content
These checks are not comprehensive. Checks for semantic annotations are not implemented here yet.
2.3.1 dataset has ID and ID is resolvable
Every dataset in EDI has a metadata ID (the packageId), which valid EML requires; however, that ID is not resolvable. This check looks for an alternateIdentifier that is resolvable. One candidate is the DOI inserted by EDI, but others are possible as well.

eml_id <- xml_attr(eml_file, 'packageId')

eml_alt_id <- xml_text(xml_find_all(eml_file, './dataset/alternateIdentifier'))
eml_alt_id_syst <- xml_find_all(eml_file, './dataset/alternateIdentifier') %>%
  xml_attr('system')

# Drop the 'doi:' prefix so that system and ID can be combined into a URL
eml_alt_id <- str_remove(eml_alt_id, 'doi:')

eml_id_text <- paste(eml_alt_id_syst, eml_alt_id, sep = '/')
# 1 if the combined identifier looks like a URL, 0 otherwise
eml_id_resolv <- ifelse(str_detect(eml_id_text, "https://|http://"), 1, 0)
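To make the logic of this check easier to follow, here is a small self-contained sketch that runs it on an invented EML fragment. The fragment, the DOI `10.1234/example` and the variable names are assumptions for illustration only. The test is a pattern match on the URL form and does not send an HTTP request.

```r
library(xml2)

# Invented, minimal EML fragment (not a real data package).
eml_doc <- read_xml('<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    packageId="example.1.1" system="example">
  <dataset>
    <alternateIdentifier system="https://doi.org">doi:10.1234/example</alternateIdentifier>
    <alternateIdentifier>local-id-42</alternateIdentifier>
    <title>Example</title>
  </dataset>
</eml:eml>')

alt_nodes  <- xml_find_all(eml_doc, "./dataset/alternateIdentifier")
alt_id     <- sub("^doi:", "", xml_text(alt_nodes))
alt_system <- xml_attr(alt_nodes, "system")   # NA when the attribute is missing

# Combine system and ID only when a system is given.
candidate <- ifelse(is.na(alt_system), alt_id,
                    paste(alt_system, alt_id, sep = "/"))

# An identifier counts as potentially resolvable if it is an http(s) URL.
looks_resolvable <- grepl("^https?://", candidate)
resolvable_ids <- candidate[looks_resolvable]
print(resolvable_ids)
```

Handling a missing `system` attribute separately avoids building strings such as `NA/local-id-42`.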
2.3.2 title length
eml_title <- xml_text(xml_find_first(eml_file, './/title'))
title_words <- str_split(eml_title, '\\s+')
title_length <- length(title_words[[1]])
2.3.3 abstract length
eml_abstract <- xml_text(xml_find_first(eml_file, './/abstract'))
eml_abstract <- str_replace_all(eml_abstract, '\\\n', ' ')
eml_abstract <- str_remove_all(eml_abstract, '\\\t')
abstract_words <- str_split(eml_abstract, '\\s+')
abstract_length <- length(abstract_words[[1]])
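The title and abstract checks both count whitespace-separated words. Splitting a string that starts or ends with whitespace produces empty elements, so such counts can be off by one or two. A possible variant in base R is sketched below; `count_words()` is a hypothetical helper, not part of the original checks.

```r
# Hypothetical helper: count whitespace-separated words in one string.
count_words <- function(text) {
  # Collapse runs of spaces, tabs and newlines, then trim both ends.
  text <- trimws(gsub("\\s+", " ", text))
  # Missing or empty text has zero words.
  if (is.na(text) || !nzchar(text)) return(0L)
  length(strsplit(text, " ", fixed = TRUE)[[1]])
}

count_words("  Lake chemistry\n\tdata  ")
count_words("")
count_words(NA_character_)
```

The `NA` case matters because `xml_text(xml_find_first(...))` returns `NA` when the element is absent.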
2.3.4 number of keywords, keyword types, keyword thesaurus
eml_keywordsets <- xml_find_all(eml_file, './/keywordSet')
eml_keywords <- xml_find_all(eml_keywordsets, './/keyword')
num_keywords <- length(eml_keywords)
eml_keyword_attr <- xml_has_attr(eml_keywords, 'keywordType')
num_keywordtype <- length(which(eml_keyword_attr))
eml_thesaurus <- xml_find_all(eml_keywordsets, './/keywordThesaurus')
num_thesaurus <- length(eml_thesaurus)
2.3.5 pub date
eml_pubdate <- xml_text(xml_find_first(eml_file, './/pubDate'))

# creator and ORCID iD
num_creators <- length(xml_find_all(eml_file, './dataset/creator'))
num_orcids <- length(xml_find_all(eml_file, './dataset/creator/userId'))
2.3.6 coverages present
eml_geog_num <- length(xml_find_all(eml_file, './/geographicCoverage'))
if (eml_geog_num > 0) {eml_geog <- "yes"} else {eml_geog <- "no"}

eml_geog_descr_num <- length(xml_find_all(eml_file, './/geographicDescription'))
if (eml_geog_descr_num == eml_geog_num) {eml_geog_descr <- "yes"} else {eml_geog_descr <- "no"}

eml_time_num <- length(xml_find_all(eml_file, './/temporalCoverage'))
if (eml_time_num > 0) {eml_time <- "yes"} else {eml_time <- "no"}

eml_taxon_num <- length(xml_find_all(eml_file, './/taxonomicCoverage'))
if (eml_taxon_num > 0) {eml_taxon <- "yes"} else {eml_taxon <- "no"}
2.3.7 access is public
eml_access <- xml_text(xml_find_all(eml_file, './access/allow/principal'))
eml_public <- str_detect(eml_access, 'public')
public_num <- length(which(eml_public))
if (public_num > 0) {public <- "yes"} else {public <- "no"}
2.3.8 contact and contact ID present
eml_contact <- length(xml_find_all(eml_file, './dataset/contact/electronicMailAddress'))

eml_contact_id <- length(xml_find_all(eml_file, './dataset/contact/userId'))
2.3.9 publisher and publisher ID present
This is added automatically by EDI.

eml_publisher <- length(xml_find_all(eml_file, './dataset/publisher'))
# The EML element is userId; XPath names are case sensitive
eml_publisher_id <- length(xml_find_all(eml_file, './dataset/publisher/userId'))
2.3.10 landing page link present
eml_landing <- xml_text(xml_find_all(eml_file, './dataset/distribution/online/url[@function="information"]'))
# Keep the first landing page URL, or an empty string if none is given
eml_landing <- ifelse(length(eml_landing) > 0, eml_landing[1], "")
eml_landing_resolv <- ifelse(str_detect(eml_landing, "https://|http://"), 1, 0)
2.3.11 quality description present
num_qualitydesc <- length(xml_find_all(eml_file, '//qualityControl'))
2.3.12 methods description present and length
eml_methods <- xml_text(xml_find_all(eml_file, '//methods'))
# Default to 0 so that methods_length is defined when no methods are present
methods_length <- 0
if (length(eml_methods) > 0){
  eml_methods <- str_replace_all(eml_methods, '\\\n', ' ')
  eml_methods <- str_remove_all(eml_methods, '\\\t')
  eml_methods_word <- str_split(eml_methods, '\\s+')
  # Word count of the first methods element found
  methods_length <- length(eml_methods_word[[1]])
}
2.3.13 license present
num_license <- length(xml_find_all(eml_file, '//intellectualRights'))
num_license <- num_license + length(xml_find_all(eml_file, '//licensed'))
2.3.14 provenance data source present
num_provdatasource <- length(xml_find_all(eml_file, '//dataSource'))
2.3.15 processing code present and described
This check counts software elements and looks for script file extensions (R, r, py, sql) among the object names of otherEntity elements.

num_software <- length(xml_find_all(eml_file, '//software'))
eml_script <- xml_text(xml_find_all(eml_file, '//otherEntity/physical/objectName'))
extensions <- character(0)
extensions_to_check <- c('R', 'r', 'py', 'sql')
script <- 'no'
if (length(eml_script) > 0){
  # Collect the last dot-separated part of each object name
  for (j in 1:length(eml_script)) {
    eml_scriptparts <- str_split(eml_script[j], '\\.')
    p <- length(eml_scriptparts[[1]])
    extensions <- append(extensions, eml_scriptparts[[1]][p])
  }
  # Flag the package if any script extension was found
  for (j in 1:length(extensions_to_check)) {
    if(extensions_to_check[j] %in% extensions){
      script <- 'yes'
    }
  }
}
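The extension test can also be written without explicit loops. The sketch below uses `tools::file_ext()` from base R on invented object names; the file names and variable names are assumptions for illustration.

```r
# Invented object names standing in for otherEntity/physical/objectName values.
object_names <- c("analysis.R", "cleaning_script.py", "site_map.pdf", "README")

# Extensions treated as processing code, compared case-insensitively.
script_extensions <- c("r", "py", "sql")

# file_ext() returns "" for names without an extension.
found_extensions <- tolower(tools::file_ext(object_names))
is_script <- found_extensions %in% script_extensions

script <- if (any(is_script)) "yes" else "no"
print(object_names[is_script])
print(script)
```

A name without a dot, such as `README`, yields an empty extension here rather than the whole file name.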
2.3.16 entity information
eml_entities <- xml_find_all(eml_file, './/entityName')
num_entities <- length(eml_entities)
eml_entity_info <- xml_siblings(eml_entities)
num_entity_url <- length(xml_find_all(eml_entity_info, './distribution/online/url'))
num_checksum <- length(xml_find_all(eml_entity_info, './authentication[1]'))
num_entitydescr <- length(xml_find_all(eml_entity_info, '//entityDescription'))
num_enitydescrsufficient <- 0
if (num_entitydescr > 0){
  eml_entitydescr <- xml_text(xml_find_all(eml_entity_info, '//entityDescription'))
  for (j in 1:num_entitydescr) {
    entitydescr_words <- str_split(eml_entitydescr[j], '\\s+')
    if (length(entitydescr_words[[1]]) > 2){
      num_enitydescrsufficient <- num_enitydescrsufficient + 1
    }
  }
}
num_entity_format <- length(xml_find_all(eml_entity_info, '//physical/dataFormat'))

entity_ids_text <- xml_attr(xml_parent(eml_entities), 'id')
entity_syst_text <- xml_attr(xml_parent(eml_entities), 'system')
entity_ids <- length(entity_ids_text[!is.na(entity_ids_text)])
entity_syst <- length(entity_syst_text[!is.na(entity_syst_text)])
entity_alt_ids <- length(xml_find_all(xml_parent(eml_entity_info), './alternateIdentifier'))
entity_ids <- entity_ids + entity_alt_ids
num_entity_id <- ifelse(entity_ids > num_entities, num_entities, entity_ids)
entity_alt_syst <- length(xml_find_all(xml_parent(eml_entity_info), './alternateIdentifier[@system]'))
entity_syst <- entity_syst + entity_alt_syst
num_entity_syst <- ifelse(entity_syst > num_entities, num_entities, entity_syst)

num_otherentity <- length(xml_find_all(eml_file, '//otherEntity'))
2.3.17 determine file format
eml_entity_filename <- xml_text(xml_find_all(eml_file, './/physical/objectName'))
entity_extensions <- character()

if(length(eml_entity_filename) > 0) {
  for (j in 1:length(eml_entity_filename)) {
    eml_extension_parts <- str_split(eml_entity_filename[j], '\\.')
    p <- length(eml_extension_parts[[1]])
    # Names without a dot have no extension and stay NA
    if(p > 1) {
      entity_extensions[j] <- eml_extension_parts[[1]][p]
    }
  }
}

extensions <- data.frame(entity_extensions)
standard <- read.csv('https://github.com/NCEAS/metadig-checks/raw/main/data/DataONEformats.csv')

standard <- standard %>%
  distinct(File.Extension, isProprietary) %>%
  mutate(entity_extensions = str_trim(File.Extension)) %>%
  mutate(isProprietary = str_trim(isProprietary)) %>%
  filter(nchar(File.Extension) > 1)

format <- left_join(extensions, standard, by = "entity_extensions")

open <- format %>%
  group_by(isProprietary) %>%
  summarize(count = n())

# na.rm = TRUE keeps unmatched extensions (NA) from turning the counts into NA
num_file_proprietary <- ifelse(any(open$isProprietary == 'Y', na.rm = TRUE), open$count[which(open$isProprietary == 'Y')], 0)
num_file_open <- ifelse(any(open$isProprietary == 'N', na.rm = TRUE), open$count[which(open$isProprietary == 'N')], 0)
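The classification step amounts to a lookup of each extension in a format table. The base R sketch below shows that lookup on invented data. The column names `File.Extension` and `isProprietary` follow the code above; the rows of the lookup table are made up and do not reproduce the DataONE list.

```r
# Invented lookup table; real values would come from the DataONE formats file.
formats <- data.frame(
  File.Extension = c("csv", "xlsx", "txt"),
  isProprietary  = c("N", "Y", "N"),
  stringsAsFactors = FALSE
)

# Invented entity extensions; NA stands for an object name without extension.
entity_extensions <- c("csv", "csv", "xlsx", "dat", NA)

# match() gives the row of the lookup table, or NA when there is no match.
status <- formats$isProprietary[match(entity_extensions, formats$File.Extension)]

num_file_open        <- sum(status == "N", na.rm = TRUE)
num_file_proprietary <- sum(status == "Y", na.rm = TRUE)
num_file_unknown     <- sum(is.na(status))

print(c(open = num_file_open, proprietary = num_file_proprietary,
        unknown = num_file_unknown))
```

Counting the unmatched extensions separately shows how many entities could not be classified at all.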
2.3.18 table entity specific information
eml_tableentities <- xml_find_all(eml_file, './/dataTable')
num_tables <- length(eml_tableentities)
2.3.19 attribute information
num_attributes <- 0
num_attributedefs <- 0
num_attrdefsufficient <- 0
num_attributedefdifferent <- 0
num_attributestoragetype <- 0
attr_nameunique <- 0
num_attributeprecision <- 0

if (length(eml_tableentities) > 0){
  for (j in 1:length(eml_tableentities)){
    eml_attributes <- xml_find_all(eml_tableentities[j], './/attribute')
    num_attributes <- num_attributes + length(eml_attributes)
    num_attributedefs <- num_attributedefs + length(xml_find_all(eml_attributes, './attributeDefinition'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/interval/precision'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/ratio/precision'))
    num_attributeprecision <- num_attributeprecision + length(xml_find_all(eml_attributes, './measurementScale/dateTime/dateTimePrecision'))
    num_attributestoragetype <- num_attributestoragetype + length(xml_find_all(eml_attributes, './storageType'))
    attr_names <- c('')

    if (length(eml_attributes) > 0){
      for (k in 1:length(eml_attributes)) {
        eml_attributename <- xml_text(xml_find_first(eml_attributes[k], './attributeName'))
        eml_attributedef <- xml_text(xml_find_first(eml_attributes[k], './attributeDefinition'))
        if (eml_attributedef != eml_attributename){
          num_attributedefdifferent <- num_attributedefdifferent + 1
        }
        attr_names <- append(attr_names, eml_attributename)
        attrdef_words <- str_split(eml_attributedef, '\\s+')
        attrdef_length <- length(attrdef_words[[1]])
        if (attrdef_length > 2){
          num_attrdefsufficient <- num_attrdefsufficient + 1
        }
      }
    }
    if (length(unique(attr_names)) == length(attr_names)){
      attr_nameunique <- 1
    }
  }
}
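In the loop above, attr_nameunique is set to 1 as soon as one table has unique attribute names and is never reset. If the intent is that names are unique within every table, a per-table result is easier to interpret. The following is a sketch on an invented two-table fragment; the fragment and the variable names are assumptions.

```r
library(xml2)

# Invented fragment: the second table repeats the attribute name "depth".
eml_doc <- read_xml('<dataset>
  <dataTable>
    <entityName>lakes</entityName>
    <attributeList>
      <attribute><attributeName>lakeid</attributeName></attribute>
      <attribute><attributeName>depth</attributeName></attribute>
    </attributeList>
  </dataTable>
  <dataTable>
    <entityName>samples</entityName>
    <attributeList>
      <attribute><attributeName>depth</attributeName></attribute>
      <attribute><attributeName>depth</attributeName></attribute>
    </attributeList>
  </dataTable>
</dataset>')

tables <- xml_find_all(eml_doc, ".//dataTable")

# One logical value per table: TRUE when no attribute name is repeated.
names_unique <- vapply(seq_along(tables), function(i) {
  attr_names <- xml_text(xml_find_all(tables[[i]], ".//attributeName"))
  anyDuplicated(attr_names) == 0
}, logical(1))

# Stricter summary: every table must have unique attribute names.
all_tables_unique <- all(names_unique)
print(names_unique)
print(all_tables_unique)
```

Reporting `sum(names_unique)` alongside the number of tables would give a count comparable to the other entity-level results.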
test <- c("package ID", "dataset ID resolvable", "number of words in title",
          "number of words in abstract", "number of keywords",
          "number of keywords with type", "thesauri identified",
          "publication date present", "number of creators",
          "number of creators with ID", "geographic coverage present",
          "geographic description present", "temporal coverage present",
          "taxonomic coverage present", "public access granted",
          "number of dataset contacts", "number of contacts with ID",
          "number of publishers", "number of publishers with ID",
          "landing page link", "number of words in methods description",
          "processing code in software element",
          "processing code in other entity", "license present",
          "provenance datasource linked", "number of table entities",
          "number of other entities", "number of all entities",
          "number of entity IDs present", "number of entities downloadable",
          "number of entities with checksums",
          "number of entities with descriptions",
          "number of entity descriptions of sufficient length",
          "number of entities with format defined",
          "number of entities open format",
          "number of entities proprietary format",
          "attribute names unique within each entity", "number of attributes",
          "number of attribute definitions",
          "number of attribute definitions of sufficient length",
          "attribute definition different from attribute name",
          "attribute storage type defined", "attribute precision defined",
          "number of data quality descriptions")

result <- c(eml_id, eml_id_text, title_length, abstract_length, num_keywords,
            num_keywordtype, num_thesaurus, eml_pubdate, num_creators,
            num_orcids, eml_geog, eml_geog_descr, eml_time, eml_taxon, public,
            eml_contact, eml_contact_id, eml_publisher, eml_publisher_id,
            eml_landing, methods_length, num_software, script, num_license,
            num_provdatasource, num_tables, num_otherentity, num_entities,
            num_entity_id, num_entity_url, num_checksum, num_entitydescr,
            num_enitydescrsufficient, num_entity_format, num_file_open,
            num_file_proprietary, attr_nameunique, num_attributes,
            num_attributedefs, num_attrdefsufficient,
            num_attributedefdifferent, num_attributestoragetype,
            num_attributeprecision, num_qualitydesc)

evaluation <- data.frame(test, result)

knitr::kable(evaluation, table.attr = "class=\"striped\"",
             format = "html")

test | result |
---|---|
package ID | knb-lter-ntl.1.52 |
dataset ID resolvable | https://doi.org/10.6073/pasta/8359d27bbd91028f222d923a7936077d |
number of words in title | 17 |
number of words in abstract | 206 |
number of keywords | 41 |
number of keywords with type | 0 |
thesauri identified | 5 |
publication date present | 2010-09-20 |
number of creators | 4 |
number of creators with ID | 2 |
geographic coverage present | yes |
geographic description present | yes |
temporal coverage present | yes |
taxonomic coverage present | no |
public access granted | yes |
number of dataset contacts | 2 |
number of contacts with ID | 0 |
number of publishers | 0 |
number of publishers with ID | 0 |
landing page link | https://test7.limnology.wisc.edu/dataset/north-temperate-lakes-lter-chemical-limnology-primary-study-lakes-nutrients-ph-and-carbon-19 |
number of words in methods description | 635 |
processing code in software element | 0 |
processing code in other entity | no |
license present | 1 |
provenance datasource linked | 0 |
number of table entities | 1 |
number of other entities | 0 |
number of all entities | 1 |
number of entity IDs present | 0 |
number of entities downloadable | 1 |
number of entities with checksums | 1 |
number of entities with descriptions | 1 |
number of entity descriptions of sufficient length | 1 |
number of entities with format defined | 1 |
number of entities open format | 1 |
number of entities proprietary format | 0 |
attribute names unique within each entity | 1 |
number of attributes | 59 |
number of attribute definitions | 59 |
number of attribute definitions of sufficient length | 51 |
attribute definition different from attribute name | 59 |
attribute storage type defined | 31 |
attribute precision defined | 2 |
number of data quality descriptions | 0 |
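To evaluate a whole site or project rather than a single package, the individual checks can be wrapped in a function that returns one row per package, and the rows can then be stacked. The sketch below does this for a few of the checks on two invented EML fragments; `evaluate_eml()` is a hypothetical function name, and in practice the documents would come from `read_metadata()`.

```r
library(xml2)

# Hypothetical wrapper: a handful of the checks above as one data frame row.
evaluate_eml <- function(eml_doc) {
  data.frame(
    package_id   = xml_attr(eml_doc, "packageId"),
    num_keywords = length(xml_find_all(eml_doc, ".//keywordSet/keyword")),
    num_creators = length(xml_find_all(eml_doc, "./dataset/creator")),
    has_methods  = length(xml_find_all(eml_doc, ".//methods")) > 0,
    stringsAsFactors = FALSE
  )
}

# Two invented, minimal documents standing in for downloaded EML files.
docs <- list(
  read_xml('<eml packageId="example.1.1"><dataset>
              <creator/><creator/>
              <keywordSet><keyword>lakes</keyword></keywordSet>
              <methods/>
            </dataset></eml>'),
  read_xml('<eml packageId="example.2.1"><dataset>
              <creator/>
            </dataset></eml>')
)

# One row per package, ready for summaries across a site.
site_summary <- do.call(rbind, lapply(docs, evaluate_eml))
print(site_summary)
```

Keeping each result in its own typed column, instead of one character vector, makes it straightforward to compute averages or proportions across packages.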
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18