Data interoperability is improved by the adoption of standard date and time formats. To facilitate use of the ISO 8601 standard, we’ve developed functions for converting to ISO 8601 formats, for reporting ISO 8601 format specifiers, and for reading ISO 8601 strings into POSIXct and POSIXt. This vignette provides an overview of these functions and demonstrates their capabilities.

# Install and load dataCleanr

# remotes::install_github('EDIorg/dataCleanr')
library(dataCleanr)

Conversion

iso8601_convert converts date and time strings to ISO 8601 formatted strings. The temporal resolution of the output matches that of the input, and time zone offsets are supported. Not all ISO 8601 formats are covered: currently supported formats include calendar dates, times, time zones, and valid combinations of these. Week dates, ordinal dates, and time intervals are not yet supported.

iso8601_convert uses lubridate::parse_date_time to parse dates and times, then applies regular expressions to the orders argument to identify the temporal resolution of the input data and format the output accordingly. Most of the arguments available to lubridate::parse_date_time can be used with iso8601_convert.
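
The idea of reading the temporal resolution off the orders string can be illustrated with a small base R sketch. The helper `orders_to_format` is hypothetical and is not part of dataCleanr; it assumes the date portion is always a full calendar date and only checks for the finest time component (`H`, `M`, or `S`) in a lubridate-style orders string.

```r
# Hypothetical helper (not part of dataCleanr): infer an ISO 8601 format
# string from a lubridate-style orders string by looking for the finest
# time component it contains. Matching is case sensitive, so the month
# code 'm' is not confused with the minute code 'M'.
orders_to_format <- function(orders) {
  date_part <- "YYYY-MM-DD"
  if (grepl("S", orders, fixed = TRUE)) {
    paste0(date_part, "Thh:mm:ss")
  } else if (grepl("M", orders, fixed = TRUE)) {
    paste0(date_part, "Thh:mm")
  } else if (grepl("H", orders, fixed = TRUE)) {
    paste0(date_part, "Thh")
  } else {
    date_part
  }
}

# One format string per orders string
formats <- vapply(c("mdy", "mdy H", "mdy HM", "dmy HMS"),
                  orders_to_format, character(1))
formats
```

Under these assumptions, 'mdy H' maps to an hour-level format and 'mdy HM' to a minute-level format, which mirrors the mixed resolutions seen in the example that follows.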

A common data management issue is converting date and time data into a consistent format. The following example illustrates some issues encountered in this process and how to solve them with iso8601_convert.

# Load the example data and view the first few lines to identify datetime orders

x <- data_iso8601$datetime
head(x)
#> [1] "5/15/12 13:00" "5/15/12 14:00" "5/15/12 15:00" "5/15/12 16:00"
#> [5] "5/15/12 17:00" "5/15/12 18:00"

# Convert the data using 'mdy HM' orders

x_cnv <- iso8601_convert(x, orders = 'mdy HM')
#> Warning in iso8601_convert(x, orders = "mdy HM"): Some data failed to parse.
#> Consider updating your list of orders.
x_cnv
#>  [1] "2012-05-15T13:00" "2012-05-15T14:00" "2012-05-15T15:00" "2012-05-15T16:00"
#>  [5] "2012-05-15T17:00" "2012-05-15T18:00" "2012-05-15T19:00" "2012-05-15T20:00"
#>  [9] "2012-05-15T21:00" "2012-05-15T22:00" "2012-05-15T23:00" "2012-05-16T00:00"
#> [13] NA                 NA                 NA                 NA                
#> [17] NA                 "2012-05-16T06:00" "2012-05-16T07:00" "2012-05-16T08:00"
#> [21] "2012-05-16T09:00" NA                 NA                 NA                
#> [25] NA                 NA                 NA                 NA                
#> [29] NA                 NA                 NA

Looks like the orders didn’t fully describe the input data. NAs are returned where parsing failed. We can use these to view what wasn’t processed and why.

# View data that wasn't parsed

x[is.na(x_cnv)]
#>  [1] "5/16/12 1"     "5/16/12 2"     "5/16/12 3"     "5/16/12 4"    
#>  [5] "5/16/12 5"     "16/5/12 10:00" "16/5/12 11:00" "16/5/12 12:00"
#>  [9] "16/5/12 13:00" "16/5/12 14:00" "16/5/12 15:00" "16/5/12 16:00"
#> [13] "16/5/12 17:00" "16/5/12 18:00" "16/5/12 19:00"

# There are two additional orders present in these data, 'mdy H' and 'dmy HM'.
# Try converting with an updated list of orders.

x_cnv <- iso8601_convert(x, orders = c('mdy H', 'mdy HM', 'dmy HM'))
#> Warning in iso8601_convert(x, orders = c("mdy H", "mdy HM", "dmy HM")):
#> Converted data contains multiple levels of temporal resolution. Use the argument
#> "return.format = T" to see where.
x_cnv
#>  [1] "2012-05-15T13:00" "2012-05-15T14:00" "2012-05-15T15:00" "2012-05-15T16:00"
#>  [5] "2012-05-15T17:00" "2012-05-15T18:00" "2012-05-15T19:00" "2012-05-15T20:00"
#>  [9] "2012-05-15T21:00" "2012-05-15T22:00" "2012-05-15T23:00" "2012-05-16T00:00"
#> [13] "2012-05-16T01"    "2012-05-16T02"    "2012-05-16T03"    "2012-05-16T04"   
#> [17] "2012-05-16T05"    "2012-05-16T06:00" "2012-05-16T07:00" "2012-05-16T08:00"
#> [21] "2012-05-16T09:00" "2012-05-16T10:00" "2012-05-16T11:00" "2012-05-16T12:00"
#> [25] "2012-05-16T13:00" "2012-05-16T14:00" "2012-05-16T15:00" "2012-05-16T16:00"
#> [29] "2012-05-16T17:00" "2012-05-16T18:00" "2012-05-16T19:00"

All the input data have been converted to ISO 8601, but the output contains multiple temporal resolutions (as indicated by the warning message). To output a consistent resolution, we can use the return.format argument to see where the output resolution differs, then adjust the arguments passed to iso8601_convert.

# Return a data frame containing x, x_converted, and the format of x_converted.

x_cnv <- iso8601_convert(x, orders = c('mdy H', 'mdy HM', 'dmy HM'), return.format = TRUE)
x_cnv
#>                x      x_converted           format
#> 1  5/15/12 13:00 2012-05-15T13:00 YYYY-MM-DDThh:mm
#> 2  5/15/12 14:00 2012-05-15T14:00 YYYY-MM-DDThh:mm
#> 3  5/15/12 15:00 2012-05-15T15:00 YYYY-MM-DDThh:mm
#> 4  5/15/12 16:00 2012-05-15T16:00 YYYY-MM-DDThh:mm
#> 5  5/15/12 17:00 2012-05-15T17:00 YYYY-MM-DDThh:mm
#> 6  5/15/12 18:00 2012-05-15T18:00 YYYY-MM-DDThh:mm
#> 7  5/15/12 19:00 2012-05-15T19:00 YYYY-MM-DDThh:mm
#> 8  5/15/12 20:00 2012-05-15T20:00 YYYY-MM-DDThh:mm
#> 9  5/15/12 21:00 2012-05-15T21:00 YYYY-MM-DDThh:mm
#> 10 5/15/12 22:00 2012-05-15T22:00 YYYY-MM-DDThh:mm
#> 11 5/15/12 23:00 2012-05-15T23:00 YYYY-MM-DDThh:mm
#> 12  5/16/12 0:00 2012-05-16T00:00 YYYY-MM-DDThh:mm
#> 13     5/16/12 1    2012-05-16T01    YYYY-MM-DDThh
#> 14     5/16/12 2    2012-05-16T02    YYYY-MM-DDThh
#> 15     5/16/12 3    2012-05-16T03    YYYY-MM-DDThh
#> 16     5/16/12 4    2012-05-16T04    YYYY-MM-DDThh
#> 17     5/16/12 5    2012-05-16T05    YYYY-MM-DDThh
#> 18  5/16/12 6:00 2012-05-16T06:00 YYYY-MM-DDThh:mm
#> 19  5/16/12 7:00 2012-05-16T07:00 YYYY-MM-DDThh:mm
#> 20  5/16/12 8:00 2012-05-16T08:00 YYYY-MM-DDThh:mm
#> 21  5/16/12 9:00 2012-05-16T09:00 YYYY-MM-DDThh:mm
#> 22 16/5/12 10:00 2012-05-16T10:00 YYYY-MM-DDThh:mm
#> 23 16/5/12 11:00 2012-05-16T11:00 YYYY-MM-DDThh:mm
#> 24 16/5/12 12:00 2012-05-16T12:00 YYYY-MM-DDThh:mm
#> 25 16/5/12 13:00 2012-05-16T13:00 YYYY-MM-DDThh:mm
#> 26 16/5/12 14:00 2012-05-16T14:00 YYYY-MM-DDThh:mm
#> 27 16/5/12 15:00 2012-05-16T15:00 YYYY-MM-DDThh:mm
#> 28 16/5/12 16:00 2012-05-16T16:00 YYYY-MM-DDThh:mm
#> 29 16/5/12 17:00 2012-05-16T17:00 YYYY-MM-DDThh:mm
#> 30 16/5/12 18:00 2012-05-16T18:00 YYYY-MM-DDThh:mm
#> 31 16/5/12 19:00 2012-05-16T19:00 YYYY-MM-DDThh:mm

# Return unique formats

frmts <- unique(x_cnv$format)
frmts
#> [1] "YYYY-MM-DDThh:mm" "YYYY-MM-DDThh"

# View input data corresponding to 'YYYY-MM-DDThh' outputs

x_cnv$x[x_cnv$format == frmts[2]]
#> [1] "5/16/12 1" "5/16/12 2" "5/16/12 3" "5/16/12 4" "5/16/12 5"

# To output data at a consistent temporal resolution, change 'mdy H' to 'mdy HM'
# and set `truncated` to the number of trailing format components that may be
# missing from the original data (here 1, the minutes).

x_cnv <- iso8601_convert(x, orders = c('mdy HM', 'mdy HM', 'dmy HM'), truncated = 1)
x_cnv
#>  [1] "2012-05-15T13:00" "2012-05-15T14:00" "2012-05-15T15:00" "2012-05-15T16:00"
#>  [5] "2012-05-15T17:00" "2012-05-15T18:00" "2012-05-15T19:00" "2012-05-15T20:00"
#>  [9] "2012-05-15T21:00" "2012-05-15T22:00" "2012-05-15T23:00" "2012-05-16T00:00"
#> [13] "2012-05-16T01:00" "2012-05-16T02:00" "2012-05-16T03:00" "2012-05-16T04:00"
#> [17] "2012-05-16T05:00" "2012-05-16T06:00" "2012-05-16T07:00" "2012-05-16T08:00"
#> [21] "2012-05-16T09:00" "2012-05-16T10:00" "2012-05-16T11:00" "2012-05-16T12:00"
#> [25] "2012-05-16T13:00" "2012-05-16T14:00" "2012-05-16T15:00" "2012-05-16T16:00"
#> [29] "2012-05-16T17:00" "2012-05-16T18:00" "2012-05-16T19:00"

# The output resolution is now consistent.

Time zone offsets are specified as a two-digit hour preceded by ‘+’ or ‘-’, relative to UTC.

# Add a time zone offset

x_cnv <- iso8601_convert(x, orders = c('mdy HM', 'mdy HM', 'dmy HM'), truncated = 1, tz = '-05')
x_cnv
#>  [1] "2012-05-15T13:00-05" "2012-05-15T14:00-05" "2012-05-15T15:00-05"
#>  [4] "2012-05-15T16:00-05" "2012-05-15T17:00-05" "2012-05-15T18:00-05"
#>  [7] "2012-05-15T19:00-05" "2012-05-15T20:00-05" "2012-05-15T21:00-05"
#> [10] "2012-05-15T22:00-05" "2012-05-15T23:00-05" "2012-05-16T00:00-05"
#> [13] "2012-05-16T01:00-05" "2012-05-16T02:00-05" "2012-05-16T03:00-05"
#> [16] "2012-05-16T04:00-05" "2012-05-16T05:00-05" "2012-05-16T06:00-05"
#> [19] "2012-05-16T07:00-05" "2012-05-16T08:00-05" "2012-05-16T09:00-05"
#> [22] "2012-05-16T10:00-05" "2012-05-16T11:00-05" "2012-05-16T12:00-05"
#> [25] "2012-05-16T13:00-05" "2012-05-16T14:00-05" "2012-05-16T15:00-05"
#> [28] "2012-05-16T16:00-05" "2012-05-16T17:00-05" "2012-05-16T18:00-05"
#> [31] "2012-05-16T19:00-05"
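
The offset convention described above (a sign followed by a two-digit hour) can be sketched in base R. The helper `append_offset` is hypothetical and only illustrates the string format; it is not how iso8601_convert is implemented.

```r
# Hypothetical helper (not part of dataCleanr): append a two-digit UTC
# offset such as "-05" or "+02" to ISO 8601 datetime strings.
append_offset <- function(x, tz) {
  # Require a sign followed by exactly two digits
  if (!grepl("^[+-][0-9]{2}$", tz)) {
    stop("tz must be a sign followed by a two-digit hour, e.g. '-05'")
  }
  paste0(x, tz)
}

with_offset <- append_offset(c("2012-05-15T13:00", "2012-05-16T01:00"), "-05")
with_offset
```

Validating the offset before pasting it on keeps malformed values such as '5' or '-5:00' out of the output.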

Voila! Dates and times in an ISO 8601 standard format.

Get the datetime format string

Another common data management task is to identify and report the format string specifier (e.g. ‘YYYY-MM-DD’) for a vector of date and time data. While not particularly difficult to do, automating this task can improve workflow efficiency and accuracy.

iso8601_get_format_string uses regular expressions to identify the most common format string in a vector of ISO 8601 dates and times formatted as output by iso8601_convert.

# Get the format of the processed date and time data

iso8601_get_format_string(x_cnv)
#> [1] "YYYY-MM-DDThh:mm-hh"

# If more than one datetime format is present, the mode is returned with a
# warning message.

x_different_formats <- c('2012-05-15T13:45:00',
                         '2012-06-15T13:45:00',
                         '2012-07-15T13:45:00',
                         '2012-08-15T13:45',
                         '2012-09-15T13:45',
                         '2012-10-15T13')

iso8601_get_format_string(x_different_formats)
#> Warning in iso8601_get_format_string(x_different_formats): More than one format
#> was found. The returned value is the mode of the detected formats. Use the
#> argument "return.format = T" to see all detected fomats.
#> [1] "YYYY-MM-DDThh:mm:ss"

# Use the return.format argument to see where formats differ

iso8601_get_format_string(x_different_formats, return.format = TRUE)
#> Warning in iso8601_get_format_string(x_different_formats, return.format = TRUE):
#> More than one format was found. The returned value is the mode of the detected
#> formats. Use the argument "return.format = T" to see all detected fomats.
#>                     x              format
#> 1 2012-05-15T13:45:00 YYYY-MM-DDThh:mm:ss
#> 2 2012-06-15T13:45:00 YYYY-MM-DDThh:mm:ss
#> 3 2012-07-15T13:45:00 YYYY-MM-DDThh:mm:ss
#> 4    2012-08-15T13:45    YYYY-MM-DDThh:mm
#> 5    2012-09-15T13:45    YYYY-MM-DDThh:mm
#> 6       2012-10-15T13       YYYY-MM-DDThh
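
The approach of matching each value against a regular expression and reporting the mode can be sketched in base R. The names `patterns` and `detect_format` are hypothetical, and the pattern list is a small illustrative subset without time zone offsets; this is not dataCleanr's implementation.

```r
# Hypothetical sketch (not dataCleanr's implementation): classify each
# string against a few anchored ISO 8601 patterns, then report the most
# common format.
patterns <- c(
  "YYYY-MM-DDThh:mm:ss" = "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}$",
  "YYYY-MM-DDThh:mm"    = "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}$",
  "YYYY-MM-DDThh"       = "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}$",
  "YYYY-MM-DD"          = "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
)

detect_format <- function(x) {
  # Values matching no pattern stay NA
  out <- rep(NA_character_, length(x))
  for (fmt in names(patterns)) {
    out[grepl(patterns[[fmt]], x)] <- fmt
  }
  out
}

x_demo <- c("2012-05-15T13:45:00", "2012-06-15T13:45:00", "2012-07-15T13:45:00",
            "2012-08-15T13:45", "2012-09-15T13:45", "2012-10-15T13")
detected <- detect_format(x_demo)

# The mode is the format with the highest count
counts <- table(detected)
most_common <- names(counts)[which.max(counts)]
most_common
```

Because each pattern is anchored at both ends, a value can match at most one format, so the order of the patterns does not affect the result.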

Reading into POSIXct POSIXt

iso8601_read provides a lightweight option for reading ISO 8601 data created with iso8601_convert into POSIXct POSIXt. This function uses regular expressions to extract the orders and time zone offset, then passes them to lubridate::parse_date_time.

# Read the converted data into POSIXct POSIXt

x_pos <- iso8601_read(x_cnv)
attributes(x_pos)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] "Etc/GMT+5"
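
The `$tzone` value may look surprising next to the '-05' offset used earlier. Time zone names of the form "Etc/GMT+5" follow the POSIX sign convention, in which a positive number means hours behind UTC. A minimal base R check, assuming the system time zone database includes the "Etc" zones and using a made-up timestamp rather than the vignette data:

```r
# "Etc/GMT+5" follows the POSIX sign convention, so it is 5 hours behind UTC
local_time <- as.POSIXct("2012-05-15 13:00", tz = "Etc/GMT+5")

# The same instant expressed in UTC is 5 hours later on the clock
utc_string <- format(local_time, "%Y-%m-%d %H:%M", tz = "UTC")

# The numeric offset from UTC, as +hhmm or -hhmm
offset <- format(local_time, "%z")

utc_string
offset
```

So a '-05' offset in the ISO 8601 string and a `tzone` of "Etc/GMT+5" describe the same offset from UTC.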

Now that the data are stored as POSIXct, the many date and time functions of lubridate are available to you.