Fast overlap joins. Usually, df
is a very large data.table
(e.g. insurance portfolio) with small interval ranges, and dates
is much
smaller with (e.g.) claim dates.
rows_per_date(
df,
dates,
df_begin,
df_end,
dates_date,
...,
nomatch = NULL,
mult = "all"
)
data.frame with portfolio (df should include time period)
data.frame with dates to join
column name with begin dates of time period in df
column name with end dates of time period in df
column name with dates in dates
additional column names in dates
to join by
When a row (with interval say, [a,b]
) in x has no match in
y, nomatch=NA means NA is returned for y's non-by.y columns for that row of
x. nomatch=NULL (default) means no rows will be returned for that row of x.
When multiple rows in y match to the row in x, mult
controls
which values are returned - "all" (default), "first" or "last".
returned class is equal to class of df
library(lubridate)
portfolio <- data.frame(
begin1 = ymd(c("2014-01-01", "2014-01-01")),
end = ymd(c("2014-03-14", "2014-05-10")),
termination = ymd(c("2014-03-14", "2014-05-10")),
exposure = c(0.2025, 0.3583),
premium = c(125, 150),
car_type = c("BMW", "TESLA"))
## Find active rows on different dates
dates0 <- data.frame(active_date = seq(ymd("2014-01-01"), ymd("2014-05-01"),
by = "months"))
rows_per_date(portfolio, dates0, df_begin = begin1, df_end = end,
dates_date = active_date)
#> begin1 end termination exposure premium car_type index_df
#> 1 2014-01-01 2014-03-14 2014-03-14 0.2025 125 BMW 1
#> 2 2014-01-01 2014-05-10 2014-05-10 0.3583 150 TESLA 2
#> 3 2014-01-01 2014-03-14 2014-03-14 0.2025 125 BMW 1
#> 4 2014-01-01 2014-05-10 2014-05-10 0.3583 150 TESLA 2
#> 5 2014-01-01 2014-03-14 2014-03-14 0.2025 125 BMW 1
#> 6 2014-01-01 2014-05-10 2014-05-10 0.3583 150 TESLA 2
#> 7 2014-01-01 2014-05-10 2014-05-10 0.3583 150 TESLA 2
#> 8 2014-01-01 2014-05-10 2014-05-10 0.3583 150 TESLA 2
#> active_date index_dates
#> 1 2014-01-01 1
#> 2 2014-01-01 1
#> 3 2014-02-01 2
#> 4 2014-02-01 2
#> 5 2014-03-01 3
#> 6 2014-03-01 3
#> 7 2014-04-01 4
#> 8 2014-05-01 5
## With extra identifiers (merge claim date with time interval in portfolio)
claim_dates <- data.frame(claim_date = ymd("2014-01-01"),
car_type = c("BMW", "VOLVO"))
### Only rows are returned that can be matched
rows_per_date(portfolio, claim_dates, df_begin = begin1,
df_end = end, dates_date = claim_date, car_type)
#> begin1 end termination exposure premium car_type index_df
#> 1: 2014-01-01 2014-03-14 2014-03-14 0.2025 125 BMW 1
#> claim_date index_dates
#> 1: 2014-01-01 1
### When row cannot be matched, NA is returned for that row
rows_per_date(portfolio, claim_dates, df_begin = begin1,
df_end = end, dates_date = claim_date, car_type, nomatch = NA)
#> begin1 end termination exposure premium car_type index_df
#> 1: 2014-01-01 2014-03-14 2014-03-14 0.2025 125 BMW 1
#> 2: 2014-01-01 2014-05-10 2014-05-10 0.3583 150 TESLA 2
#> claim_date index_dates
#> 1: 2014-01-01 1
#> 2: <NA> NA