All you need to know about data/time in R and SAS
1 How SAS stores datetime
- SAS
datetime
value (when it’s represented as double) does not store timezone!- SAS datetime values are measured in seconds, beginning with 01Jan1960 00:00:00 local time. For example, the SAS datetime value for 01Jan1960 00:00:00 in England is 0. In Japan, the SAS datetime value for 01Jan1960 00:00:00 is also 0. There is a nine-hour difference between England and Japan. At the same point in time, the datetime in England and the datetime in Japan cannot be 0. In order to work with absolute time and datetime values, SAS supports the UTC date, time, and datetime values in time zones. SAS Help Center: Specifying Time Zones in SAS
- In SAS, time zone is handled by the
options timezone
options. After setting the system-level time zone, then adatetime
format will display time zone like2013-03-15T04:15:00-05:00
- For the most compatibility with other software/programming languages, don’t use
datetime
format! - When you want to represent datetime and its time zone, first, convert the datetime to UTC. Then store the date part and time part separately, e.g.,
date_utc
andtime_utc
. - This is how WRDS handles time zone in SAS.
date
- Number of days since Jan 1, 1960.
- e.g.,
0
for 1960-1-1
time
- The number of seconds since midnight (12:00 am)
datetime
- The number of seconds since the midnight of Jan 1, 1960
- The time zone is handled by the
options timezone
option
2 How R stores datetime
In base R, there’s no dedicated format for “time.”
Date
andPOSIXct
are the only available choices.
Date
- The number of days since 1970-01-01.
- No time zone info is stored!
- The difference of two
Date
is number of days.
POSIXct
- The number of seconds since 1970-01-01
- It’s the only format that stores time zone in R.
- The difference of two
POSIXct
s is number of seconds.
- When created without specifying a time zone, the local time zone is used. However, even in this case, no
tz
attribute is created. R implicitly treats a no-time-zone POSIXct as if it has the local time zone. - When a POSIXct has no time zone, R will use local time zone as default (“MST” in the following case)
# create a POSIXct without specifying timezone
# you'll see local time zone is implicitly used
d1 = as.POSIXct('1970-01-01 08:00:00')
d1
#> [1] "1970-01-01 08:00:00 MST"
# but you'll see no `tz` attribute
attributes(d1)
#> $class
#> 'POSIXct''POSIXt'
#> $tzone
#> ''
# the following case again shows that local time zone is implicitly used
# note the time is converted to 10:00:00
d1 %>% with_tz('America/New_York')
#> [1] "1970-01-01 10:00:00 EST"
- When created with time zone specified, the time zone is explicitly used/displayed.
# create a POSIXct with timezone speficied
# you'll see the time zone is explicitly displayed (EST)
d1 = as.POSIXct('1970-01-01 08:00:00', tz = 'America/New_York')
d1
#> [1] "1970-01-01 08:00:00 EST"
# now you see the `tz` attribute
attributes(d1)
#> $class
#> 'POSIXct''POSIXt'
#> $tzone
#> 'America/New_York'
# the following case again shows that local time zone is implicitly used
d1 %>% with_tz('Asia/Shanghai')
#> [1] "1970-01-01 21:00:00 CST"
With the help of the hms
package, we can easily handle time now.
In hms
, time is stored as the number of seconds since 00:00:00.
as_hms
is used to convert a POSIXct
to hms
.
# Say we want to search a vector of `POSIXct` for any observations that are after 3pm
datetime = ymd_hms('2000-01-01 17:00:00')
datetime[as_hms(datetime)<as_hms('15:00:00')]
#> TRUE
3 WRDS Guide
- WRDS converts datetime to UTC and then stores the date and time part separately, e.g.,
date_utc
andtime_utc
. - The advantage is that one doesn’t need to figure out how to store time zone.
As discussed above, WRDS likes to separately store a UTC datetime into a date and a time. To combine them, use the following snippet:
# Don't use `as.POSIXct`. You must use `as_datetime`!
data[, ':='(datetime=lubridate::as_datetime(date_utc, tz='UTC') + time_utc)][]
Results:
The rational behind is that POSIXct stores data as the number of elapsed seconds. Since our time is already in double
, we just need to add it.
You may be tempted to use the following code:
data[, ':='(datetime_gold=as.POSIXct(as.numeric(date)*86400+time, origin='1970-01-01', tz='UTC'))]
This is problematic because of leap seconds. Multiplying the number of elapsed days with 86400 not always equals to the elapsed seconds in UTC.
If the original downloaded data is in SAS format (sas7bdat):
- When we convert the downloaded
sas7bdat
into feather, pandas/haven will sometimes convert adate
object to adatetime
object, and hence involves unwanted time zone. - The solution is simple: during the conversion, force
datatime64[ns]
to be adate
and then save, e.g.,
# force datetime64[ns] to date
df['mostimportantdateutc'] = df['mostimportantdateutc'].dt.date
- Sometimes the original column is of
datatime
. In this case, we need to explicitly add a time zone:
# add a time zone
df['mostimportantdateutc'] = df['mostimportantdateutc'].dt.tz_localize('UTC')
4 Other tips
4.1 Time interval to numeric: difftime
difftime
is the result of arithmetic operation onPOSIXct
orDate
- By default,
difftime
is in seconds (Date
) or days (POSIXct
). - A very useful trick is to convert
difftime
tonumeric
. For example, if we want to convert adifftime
into hours, we can doas.difftime(x, units='hours')
.
4.2 Convert POSIXct
to Date
You may be tempted to do:
datetime = ymd_hms('2005-10-24 20:00:00') # a POSIXct
as.Date(datetime)
#> 2005-10-25
Surprise. We expected “2005-10-24” but we get “2005-10-25.” The reason is that as.Date
requires you to specify the time zone. When no time zone is available, it defaults to UTC.
One (not the best) solution is:
as.Date(datetime, 'America/New_York')
But a better solution is to use lubridate::as_date
as_date(dateime) # No time zone needed!
4.3 Manage time
The base R has no dedicated class for time (recall it only has Date
and POSIXct
). We use the hms
package to format and manage time. Particularly, it represent a time in pretty format (e.g., 12:21:12) but internally still stores it as the number of elapsed seconds.
# 1) Create a time in double
x = data.table(time1=46768)
#> time1
#> 1: 46768
# 2) Use `as_hms` to convert time
x[, time2 := as_hms(time1)]
#> time1 time2
#> 1: 46768 12:59:28
# 3) Internally, they're identical
x$time1 == x$time2
#> TRUE