How to query an interval? Understand `foverlap` and `roll` in data.table
Query the interval: foverlap
We often need to query overlaps. For example, given a table of daily trading volume, we want to find for each day t, the trading volume of its surrounding days, i.e., [t-N,t+N].
Example:
Suppose we have an event table. The event table contains the event date. Events are infrequent. In our example, events only take place on day 1, 3, 5:
# event: infrequent events of interest
event = data.table(date=c(1,3,5))
event
#> date
#> 1: 1
#> 2: 3
#> 3: 5There’s another table, called history. The history table records values of interest on each day. In the example, the history table has a value for each day from 1 to 5:
# history: chronological records for each date.
history = data.table(date=1:5, value=sample(letters, 5, replace=TRUE))
history
#> date value
#> 1: 1 v
#> 2: 2 s
#> 3: 3 j
#> 4: 4 p
#> 5: 5 pOur goal: For each date t, find the value in the range of [t-1,t+1]:
# our goal:
#> date value
#> 1: 1 v,s
#> 2: 3 s,j,p
#> 3: 5 p,pTo achieve our goal, follow these steps:
# Step 1: add start/end to event
event[, ':='(start=date-1, end=date+1)]
setkey(event, start, end) # event must be keyed!
event
#> date start end
#> 1: 1 0 2
#> 2: 3 2 4
#> 3: 5 4 6
# Step 2: add start/end to history
history[, ':='(start=date, end=date)]
setkey(history, start, end) # history is recommended to be keyed
history
#> date value start end
#> 1: 1 v 1 1
#> 2: 2 s 2 2
#> 3: 3 j 3 3
#> 4: 4 p 4 4
#> 5: 5 p 5 5
# Step 3: run foverlaps!
out = foverlaps(history, event, nomatch=NULL)
out[, .(value=list(value)), keyby=.(date)]
#> date value
#> 1: 1 v,s
#> 2: 3 s,j,p
#> 3: 5 p,pQuery the fuzzy point: roll
Sometimes we do not need to find all surrounding days for the event date. Instead, we only need to find the nearest day for the event date. This is when roll comes in.
Example:
Let’s say this time we only have one event at day 2:
event = data.table(date=c(2))
#> date
#> 1: 2The history table contains five observations:
history = data.table(date=c(1, 3, 5, 8, 9), value=sample(letters, 5, replace=TRUE))
#> date value
#> 1: 1 r
#> 2: 3 b
#> 3: 5 g
#> 4: 8 e
#> 5: 9 gOur goal: find the matched observation 4 days after the event (i.e., day 6)
Step 1: Add the join_date to event and history:
# Add join_date to the event table
event[, ':='(join_date=date+4)]
#> date join_date
#> 1: 2 6
# Add join_date to the history table
history[, ':='(join_date=date)] # the join_date is the same as date!
#> date value join_date
#> 1: 1 r 1
#> 2: 3 b 3
#> 3: 5 g 5
#> 4: 8 e 8
#> 5: 9 g 9Step 2: Use event to query history.
The results vary depending on the roll options:
# Case 1: closest: the closest to "6" is "5"
history[event, on=.(join_date), roll='nearest'][, .(date, value)]
#> date value
#> 1: 5 g
# Case 2: Inf: match the previous value (LOCF). The previous value of "6" is "5"
history[event, on=.(join_date), roll=Inf][, .(date, value)]
#> date value
#> 1: 5 g
# Case 3: -Inf: match the next value (NOCB). The next value of "6" is "8"
history[event, on=.(join_date), roll=-Inf][, .(date, value)]
#> date value
#> 1: 8 e
# Case 4: Positive: match the previous at no more than "roll" days
# The previous value of "6" is "5", which is within 1 day
history[event, on=.(join_date), roll=1][, .(date, value)]
#> date value
#> 1: 5 g
# Case 5: Negative: match the next at no more than "roll" days
# The next no-more-than-one-day value of "6" is "7", match failed
history[event, on=.(join_date), roll=-1][, .(date, value)]
#> date value
#> 1: NA <NA>The unit of roll (when specified as a number) depends on the unit of join_date.
- If
join_dateisDate, the unit is days, i.e.,roll=5means LOCF no more than 5 days. - If
join_dateisPOSIXct, the unit is seconds, i.e.,roll=5means LOCF no more than 5 seconds.
Both methods are fast.
foverlaps
- Pros:
- allows for precise matching
- can match multiple values
- Cons:
- cannot do fuzzy matching
- more complicated
roll
- Pros:
- can do fuzzy matching
- less code
- Cons:
- cannot do precise matching (can only specify a range)
- only matches one value