Recurring data publication was traditionally a manual edit-and-upload process that can now be automated with programmatic approaches (e.g. EAL) and data repository APIs (e.g. EDIutils). These approaches not only reduce effort, but also improve metadata accuracy and enable automated workflows. Automated publication and upload are well suited to two use cases: ongoing data collection, and data products derived from sources with ongoing collection. Each use case is demonstrated below.
Ongoing data collection can be the addition of new data to a time series, or data that change as a result of an evolving data processing or analytical method. In either case, automated publication requires stabilizing the package contents and configuring EAL for those contents. If package contents deviate from what can be programmatically input to EAL, metadata accuracy decreases and human and machine understanding are compromised. In an example automated publication for ongoing collection, the workflow updates make_eml()'s temporal.coverage and package.id arguments, and then runs make_eml() for the new data; a minimal sketch of this step follows.
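The details of such a script vary by project, but its core is small. Below is a minimal sketch assuming a single growing data file, a pre-built EAL template directory, and an EDIutils-style upload via login() and update_data_package(); the file names, package identifiers, and revision bookkeeping are illustrative, make_eml() arguments are abbreviated, and the EDIutils function names and signatures should be checked against your installed version.

```r
# Minimal sketch: republish a growing time series with EAL + EDIutils.
# File names, the package identifier ("edi.1001"), and the revision
# bookkeeping are illustrative assumptions; make_eml() arguments are abbreviated.
library(EMLassemblyline)
library(EDIutils)

data_path <- "./data"                                   # holds streamflow.csv (assumed)
new_data  <- read.csv(file.path(data_path, "streamflow.csv"))

# Derive temporal coverage from the data themselves
temporal_coverage <- as.character(range(as.Date(new_data$date)))

# Increment the revision of the package identifier (scope.identifier.revision)
last_revision <- 3                                      # e.g. read from a log or the repository
package_id    <- paste0("edi.1001.", last_revision + 1)

# Rebuild EML for the new data (other make_eml() arguments omitted for brevity)
make_eml(
  path                    = "./metadata_templates",
  data.path               = data_path,
  eml.path                = "./eml",
  dataset.title           = "Daily streamflow, ongoing collection",
  temporal.coverage       = temporal_coverage,
  maintenance.description = "ongoing",
  data.table              = "streamflow.csv",
  data.table.description  = "Daily streamflow measurements",
  user.id                 = "my_user",
  user.domain             = "EDI",
  package.id              = package_id
)

# Upload the new revision with EDIutils (authenticate first; function names
# and signatures are assumed and may differ across EDIutils versions)
login()
update_data_package(eml = file.path("./eml", paste0(package_id, ".xml")),
                    env = "staging")                    # switch to "production" when ready
logout()
```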
Another valuable use case of automated data publication is the creation of data packages that are derived from sources with ongoing collection. The derived package may be a stand-alone product, or part of a larger science workflow with additional downstream processes. Derived package publication relies on event notifications, a service some data repositories provide: the repository sends a notification to the subscriber whenever a source data package is updated. Building upon the previous workflow, the subscribed workflow updates the temporal.coverage and package.id arguments and runs make_eml() for the derived data product; a sketch of this event-driven step follows.
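Below is a minimal sketch of the derived-product side under the same assumptions as above, plus one more: that the repository's event manager can notify a subscriber-provided URL and that EDIutils exposes a subscription helper such as create_event_subscription() (verify against your EDIutils version). The derivation helper derive_product(), the revision helper next_revision(), the subscriber URL, and all identifiers are hypothetical.

```r
# Minimal sketch: publish a derived data product when its source is updated.
# derive_product(), next_revision(), the subscriber URL, and all identifiers
# are hypothetical; create_event_subscription() is assumed to wrap the
# repository's event manager and may differ by EDIutils version.
library(EMLassemblyline)
library(EDIutils)

# One-time setup: ask the repository to notify the synthesis team's service
# whenever the source package is updated.
login()
create_event_subscription(
  packageId = "edi.1001",                                # source package with ongoing collection
  url       = "https://synthesis.example.org/hook"       # endpoint run by the synthesis team
)
logout()

# Handler run by that service when a notification for a new source revision arrives
on_source_update <- function(source_package_id) {
  # 1. Recreate the derived data from the new source revision (hypothetical helper)
  derive_product(source_package_id, out_file = "./data/derived_summary.csv")

  # 2. Refresh the metadata fields that change between revisions
  derived <- read.csv("./data/derived_summary.csv")
  new_id  <- next_revision("edi.2002")                   # hypothetical revision-increment helper

  make_eml(
    path                    = "./metadata_templates",
    data.path               = "./data",
    eml.path                = "./eml",
    dataset.title           = "Derived summary of daily streamflow",
    temporal.coverage       = as.character(range(as.Date(derived$date))),
    maintenance.description = "ongoing",
    data.table              = "derived_summary.csv",
    data.table.description  = "Summary statistics derived from the source time series",
    user.id                 = "my_user",
    user.domain             = "EDI",
    package.id              = new_id
  )

  # 3. Upload the new revision of the derived package
  login()
  update_data_package(eml = file.path("./eml", paste0(new_id, ".xml")), env = "staging")
  logout()
}
```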
At this point the synthesis science team's workflow ends, but it is the beginning of another automated workflow that uses the published data product and serves the information to the public.
Automated data publication coupled with repository event notifications enables efficient, reproducible, and valuable science workflows.