All you need to know about data/time in R and SAS

How time zone works in SAS
  • SAS datetime value (when it’s represented as double) does not store timezone!
    • SAS datetime values are measured in seconds, beginning with 01Jan1960 00:00:00 local time. For example, the SAS datetime value for 01Jan1960 00:00:00 in England is 0. In Japan, the SAS datetime value for 01Jan1960 00:00:00 is also 0. There is a nine-hour difference between England and Japan. At the same point in time, the datetime in England and the datetime in Japan cannot be 0. In order to work with absolute time and datetime values, SAS supports the UTC date, time, and datetime values in time zones. SAS Help Center: Specifying Time Zones in SAS
  • In SAS, time zone is handled by the options timezone options. After setting the system-level time zone, then a datetime format will display time zone like 2013-03-15T04:15:00-05:00
[Rule of thumb] How to handle time zone in SAS
  • For the most compatibility with other software/programming languages, don’t use datetime format!
  • When you want to represent datetime and its time zone, first, convert the datetime to UTC. Then store the date part and time part separately, e.g., date_utc and time_utc.
  • This is how WRDS handles time zone in SAS.

date

  • Number of days since Jan 1, 1960.
  • e.g., 0 for 1960-1-1

time

  • The number of seconds since midnight (12:00 am)

datetime

  • The number of seconds since the midnight of Jan 1, 1960
  • The time zone is handled by the options timezone option

In base R, there’s no dedicated format for “time.” Date and POSIXct are the only available choices.

Date

  • The number of days since 1970-01-01.
  • No time zone info is stored!
  • The difference of two Date is number of days.

POSIXct

  • The number of seconds since 1970-01-01
  • It’s the only format that stores time zone in R.
  • The difference of two POSIXcts is number of seconds.
How time zone works in POSIXct?
  • When created without specifying a time zone, the local time zone is used. However, even in this case, no tz attribute is created. R implicitly treats a no-time-zone POSIXct as if it has the local time zone.
  • When a POSIXct has no time zone, R will use local time zone as default (“MST” in the following case)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
# create a POSIXct without specifying timezone
# you'll see local time zone is implicitly used
d1 = as.POSIXct('1970-01-01 08:00:00')
d1
#> [1] "1970-01-01 08:00:00 MST"

# but you'll see no `tz` attribute
attributes(d1)
#> $class
#>   'POSIXct''POSIXt'
#> $tzone
#>   ''

# the following case again shows that local time zone is implicitly used
# note the time is converted to 10:00:00
d1 %>% with_tz('America/New_York')
#> [1] "1970-01-01 10:00:00 EST"
  • When created with time zone specified, the time zone is explicitly used/displayed.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# create a POSIXct with timezone speficied
# you'll see the time zone is explicitly displayed (EST)
d1 = as.POSIXct('1970-01-01 08:00:00', tz = 'America/New_York')
d1
#> [1] "1970-01-01 08:00:00 EST"

# now you see the `tz` attribute
attributes(d1)
#> $class
#>   'POSIXct''POSIXt'
#> $tzone
#>   'America/New_York'

# the following case again shows that local time zone is implicitly used
d1 %>% with_tz('Asia/Shanghai')
#> [1] "1970-01-01 21:00:00 CST"
How to process time with "hms"?

With the help of the hms package, we can easily handle time now.

In hms, time is stored as the number of seconds since 00:00:00.

as_hms is used to convert a POSIXct to hms.

1
2
3
4
5
# Say we want to search a vector of `POSIXct` for any observations that are after 3pm

datetime = ymd_hms('2000-01-01 17:00:00')
datetime[as_hms(datetime)<as_hms('15:00:00')]
#> TRUE
How datetime is stored in WRDS
  • WRDS converts datetime to UTC and then stores the date and time part separately, e.g., date_utc and time_utc.
  • The advantage is that one doesn’t need to figure out how to store time zone.
How to combine a `Date` and `double` time into a POSIXct?

As discussed above, WRDS likes to separately store a UTC datetime into a date and a time. To combine them, use the following snippet:

1
2
# Don't use `as.POSIXct`. You must use `as_datetime`!
data[, ':='(datetime=lubridate::as_datetime(date_utc, tz='UTC') + time_utc)][]

Results:

The rational behind is that POSIXct stores data as the number of elapsed seconds. Since our time is already in double, we just need to add it.

You may be tempted to use the following code:

1
data[, ':='(datetime_gold=as.POSIXct(as.numeric(date)*86400+time, origin='1970-01-01', tz='UTC'))]

This is problematic because of leap seconds. Multiplying the number of elapsed days with 86400 not always equals to the elapsed seconds in UTC.

How to avoid unwanted time zone converting when reading WRDS datetime?

If the original downloaded data is in SAS format (sas7bdat):

  • When we convert the downloaded sas7bdat into feather, pandas/haven will sometimes convert a date object to a datetime object, and hence involves unwanted time zone.
  • The solution is simple: during the conversion, force datatime64[ns] to be a date and then save, e.g.,
1
2
# force datetime64[ns] to date
df['mostimportantdateutc'] = df['mostimportantdateutc'].dt.date
  • Sometimes the original column is of datatime. In this case, we need to explicitly add a time zone:
1
2
# add a time zone
df['mostimportantdateutc'] = df['mostimportantdateutc'].dt.tz_localize('UTC')
  • difftime is the result of arithmetic operation on POSIXct or Date
  • By default, difftime is in seconds (Date) or days (POSIXct).
  • A very useful trick is to convert difftime to numeric. For example, if we want to convert a difftime into hours, we can do as.difftime(x, units='hours').

You may be tempted to do:

1
2
3
4
datetime = ymd_hms('2005-10-24 20:00:00')  # a POSIXct
as.Date(datetime)

#> 2005-10-25

Surprise. We expected “2005-10-24” but we get “2005-10-25.” The reason is that as.Date requires you to specify the time zone. When no time zone is available, it defaults to UTC.

One (not the best) solution is:

1
as.Date(datetime, 'America/New_York')

But a better solution is to use lubridate::as_date

1
as_date(dateime)  # No time zone needed!

The base R has no dedicated class for time (recall it only has Date and POSIXct). We use the hms package to format and manage time. Particularly, it represent a time in pretty format (e.g., 12:21:12) but internally still stores it as the number of elapsed seconds.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# 1) Create a time in double
x = data.table(time1=46768)

#>    time1 
#> 1: 46768

# 2) Use `as_hms` to convert time
x[, time2 := as_hms(time1)]

#>    time1     time2
#> 1: 46768  12:59:28

# 3) Internally, they're identical
x$time1 == x$time2

#> TRUE