The goal of
insurancerating is to give analytic techniques that can be used in insurance rating. It helps actuaries to implement GLMs within all relevant steps needed to construct a risk premium from raw data. It provides a data driven strategy for the construction of tariff classes in P&C insurance. The goal is to bin the continuous factors such that categorical risk factors result which capture the effect of the covariate on the response in an accurate way, while being easy to use in a generalized linear model (GLM).
insurancerating also provides recipes on how to easily perform univariate analyses on an insurance portfolio. In addition it adds functionality to include reference categories in the levels of the coefficients in the output of a generalized linear regression analysis.
Install insurancerating from CRAN:
Or the development version from GitHub:
# install.packages("remotes") remotes::install_github("MHaringa/insurancerating")
This is a basic example which shows the techniques provided in insurancerating.
The first part shows how to fit a GAM for the variable age_policyholder in the MTPL dataset:
library(insurancerating) # Claim frequency age_policyholder_frequency <- fit_gam(data = MTPL, nclaims = nclaims, x = age_policyholder, exposure = exposure) # Claim severity age_policyholder_severity <- fit_gam(data = MTPL, nclaims = nclaims, x = age_policyholder, exposure = exposure, amount = amount, model = "severity")
autoplot(age_policyholder_frequency, show_observations = TRUE)
Determine classes for the claim frequency (the points show the ratio between the observed number of claims and exposure for each age):
clusters_freq <- construct_tariff_classes(age_policyholder_frequency) clusters_sev <- construct_tariff_classes(age_policyholder_severity) autoplot(clusters_freq, show_observations = TRUE)
In this example the term exposure is a measure of what is being insured. Here an insured vehicle is an exposure. If the vehicle is insured as of July 1 for a certain year, then during that year, this would represent an exposure of 0.5 to the insurance company.
The figure shows that younger policyholders have a higher risk profile. The fitted GAM is lower than might be expected from the observed claim frequency for policyholders of age 19. This is because there are very few young policyholders of age 19 present in the portfolio.
The GAM for the claim severity :
age_policyholder_severity %>% autoplot(., show_observations = TRUE, remove_outliers = 100000)
The second part adds the constructed tariff classes for the variable
age_policyholder to the dataset, and sets the base level of the factor
age_policyholder to the level with the largest exposure. In this example for claim frequency the class for ages (39,50], which contains the largest exposure.
## Rows: 32,731 ## Columns: 5 ## $ age_policyholder <int> 43, 21, 54, 44, 20, 38, 68, 45, 76, 30, 28,… ## $ nclaims <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0… ## $ exposure <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000,… ## $ amount <dbl> 0, 0, 0, 57540, 2057, 0, 0, 6510, 0, 0, 0, … ## $ age_policyholder_freq_cat <fct> "(39,50]", "[18,25]", "(50,57]", "(39,50]",…
The last part is to fit a generalized linear model.
rating_factors() prints the output including the reference group.
model_freq1 <- glm(nclaims ~ age_policyholder_freq_cat, offset = log(exposure), family = "poisson", data = dat) model_freq2 <- glm(nclaims ~ age_policyholder_freq_cat + age_policyholder, offset = log(exposure), family = "poisson", data = dat) x <- rating_factors(model_freq1, model_freq2) x
## [34mSignificance levels: *** p < 0.001; ** p < 0.01; * p < 0.05; . p < 0.1[39m risk_factor level est_model_freq1 est_model_freq2 ## 1 (Intercept) (Intercept) 0.136818 *** 0.321029 *** ## 2 age_policyholder_freq_cat (39,50] 1.000000 1.000000 ## 3 age_policyholder_freq_cat [18,25] 1.943823 *** 1.290960 ## 4 age_policyholder_freq_cat (25,32] 1.323499 *** 0.980283 ## 5 age_policyholder_freq_cat (32,39] 1.056854 0.892347 ## 6 age_policyholder_freq_cat (50,57] 0.891970 * 1.053587 ## 7 age_policyholder_freq_cat (57,64] 0.742400 *** 1.005661 ## 8 age_policyholder_freq_cat (64,71] 0.737936 *** 1.138351 ## 9 age_policyholder_freq_cat (71,83] 0.702135 *** 1.248675 ## 10 age_policyholder_freq_cat (83,95] 0.693338 1.510083 ## 11 age_policyholder age_policyholder NA 0.981193 **
autoplot.riskfactor() creates a figure. The base level of the factor
age_policyholder_freq_cat is the group with the largest exposure and is shown first.
model_data to sort the clustering in the original order. Ordering the factor
age_policyholder_freq_cat only works when
biggest_reference() is used to set the base level of the factor to the level with the largest exposure.
The following graph includes the exposure as a bar graph and shows some more options:
rating_factors(model_freq1, model_freq2, model_data = dat, exposure = exposure) %>% autoplot(., linetype = TRUE)
Add predictions to the data set:
## Rows: 32,731 ## Columns: 7 ## $ age_policyholder <int> 43, 21, 54, 44, 20, 38, 68, 45, 76, 30, 28,… ## $ nclaims <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0… ## $ exposure <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000,… ## $ amount <dbl> 0, 0, 0, 57540, 2057, 0, 0, 6510, 0, 0, 0, … ## $ age_policyholder_freq_cat <fct> "(39,50]", "[18,25]", "(50,57]", "(39,50]",… ## $ pred_nclaims_model_freq1 <dbl> 0.13681815, 0.26595024, 0.12203763, 0.13681… ## $ pred_nclaims_model_freq2 <dbl> 0.14190382, 0.27816629, 0.12132927, 0.13923…
Compute indices of model performance for GLMs. The RMSE is the square root of the average of squared differences between prediction and actual observation and indicates the absolute fit of the model to the data. It can be interpreted as the standard deviation of the unexplained variance, and is in the same units as the response variable.
To test the stability of the predictive ability of the fitted model it might be helpful to determine the variation in the computed RMSE. The variation is calculated by computing the root mean squared errors from generated bootstrap replicates.
For claim severity models it might be helpful to test the variation in the RMSE in case the portfolio contains large claim sizes. The figure below shows that the variation in the RMSE of the frequency model is quite low (as expected). The dashed line shows the RMSE of the fitted (original) model, the other lines represent the 95% confidence interval.
Check Poisson GLM for overdispersion. A dispersion ratio larger than one indicates overdispersion, this occurs when the observed variance is higher than the variance of the theoretical model. If the dispersion ratio is close to one, a Poisson model fits well to the data. A p-value < .05 indicates overdispersion. Overdispersion > 2 probably means there is a larger problem with the data: check (again) for outliers.
Misspecifications in GLMs cannot reliably be diagnosed with standard residual plots, and GLMs are thus often not as thoroughly checked as LMs. One reason why GLMs residuals are harder to interpret is that the expected distribution of the data changes with the fitted values. As a result, standard residual plots, when interpreted in the same way as for linear models, seem to show all kind of problems, such as non-normality, heteroscedasticity, even if the model is correctly specified.
check_residuals() aims at solving these problems by creating readily interpretable residuals for GLMs that are standardized to values between 0 and 1, and that can be interpreted as intuitively as residuals for the linear model. This is achieved by a simulation-based approach, similar to the Bayesian p-value or the parametric bootstrap, that transforms the residuals to a standardized scale. This explanation is adopted from the vignette for DHARMa.
Detect deviations from the expected distribution, and produce a uniform quantile-quantile plot. The simulated residuals in the QQ plot below show no clear deviation from a Poisson distribution. Note that formal tests almost always yield significant results for the distribution of residuals and visual inspections (e.g. Q-Q plots) are preferable.
It might happen that in the fitted model for a data point all simulations have the same value (e.g. zero), this returns the error message Error in approxfun: need at least two non-NA values to interpolate. If that is the case, it could help to increase the number of simulations (e.g. n = 1000).
This is a basic example which shows how to easily perform an univariate analysis on a MTPL portfolio using
An univariate analysis consists in the evaluation of overall claim frequency, severity and risk premium. Its main purpose lies in verifying the experience data reasonableness using previous experience comparison and professional judgement.
univariate() shows the basic risk indicators split by the levels of the discrete risk factor:
library(insurancerating) univariate(MTPL2, x = area, # discrete risk factor nclaims = nclaims, # number of claims exposure = exposure, premium = premium, severity = amount) # loss
## area amount nclaims exposure premium frequency average_severity ## 1 2 4063270 98 818.53973 51896 0.11972540 41461.94 ## 2 3 7945311 113 764.99178 49337 0.14771401 70312.49 ## 3 1 6896187 146 1065.74795 65753 0.13699299 47234.16 ## 4 0 6922 1 13.30685 902 0.07514927 6922.00 ## risk_premium loss_ratio average_premium ## 1 4964.0474 78.296400 63.40071 ## 2 10386.1390 161.041632 64.49350 ## 3 6470.7486 104.880188 61.69658 ## 4 520.1832 7.674058 67.78464
The following indicators are calculated:
Here the term exposure is a measure of what is being insured. For example, an insured vehicle is an exposure. If the vehicle is insured as of July 1 for a certain year, then during that year, this would represent an exposure of 0.5 to the insurance company. The term risk premium is used here as an equivalent of pure premium and burning cost.
univariate() ignores missing input arguments, for instance only the claim frequency is calculated when
severity are unknown:
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure)
However, the above table is small and easy to understand, the same information might be presented more effectively with a graph, as shown below.
show_plots defines the plots to show and also the order of the plots. The following plots are available:
For example, to show the exposure and claim frequency plots:
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) %>% autoplot(., show_plots = c(6,1))
To check whether claim frequency is consistent over the years:
MTPL2 %>% mutate(year = sample(2016:2019, nrow(.), replace = TRUE)) %>% univariate(., x = area, nclaims = nclaims, exposure = exposure, by = year) %>% autoplot(., show_plots = 1)
To remove the bars from the plot with the line graph, add
background = FALSE:
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) %>% autoplot(., show_plots = c(6,1), background = FALSE)
sort orders the levels of the risk factor into descending order by exposure:
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) %>% autoplot(., show_plots = c(6,1), background = FALSE, sort = TRUE)
autoplot.univariate() can be used to sort the levels of the discrete risk factor into your own ordering. This makes sense when the levels of the risk factor has a natural order, or when not all levels of the risk factor are desired in the output.
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) %>% autoplot(., show_plots = c(6,1), background = FALSE, sort_manual = c("2", "3", "1", "0"))
The following graph shows some more options:
univariate(MTPL2, x = area, nclaims = nclaims, exposure = exposure) %>% autoplot(., show_plots = c(6,1), background = FALSE, sort = TRUE, ncol = 2, color_bg = "dodgerblue", color = "blue")
Or create a bar graph for the number of claims:
univariate(MTPL2, x = area, nclaims = nclaims) %>% autoplot(., show_plots = 8, coord_flip = TRUE, sort = TRUE)
For continuous variables a histogram can be created:
Two ways of displaying numerical data over a very wide range of values in a compact way are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way is to create one bin for all the outliers. This yields both the original distribution, and also gives a feel for the number of outliers.
histbin(MTPL2, premium, right = 110, line = TRUE)