GLM for Cottage Insurance Data

Master's in Modelling and Data Analysis

SHANJIDA AKHTER

Master’s Thesis, Spring 2015


A range of specialised methods is available for solving problems in the insurance industry. The appropriate technique depends on the situation, and each technique rests on its own theory and logic. We therefore choose the distribution families for a given data set on the basis of that theory, and then fit statistical models for the claim frequencies and the claim sizes.

In this thesis we use data from Norwegian holiday cottage insurance, where claim frequencies are lower than in other types of insurance. We have mainly focused on identifying the factors, or variables, that contribute most to the occurrence of claims and losses. After identifying those variables with the generalized linear modeling (GLM) approach, we checked the accuracy of the models. We also used two different distribution families to see whether they give different results for this particular data set. Before that, in Chapter 3, we compared the accuracy of three different ways of modeling claim frequency data in a simple simulation example.

The results from the models using the two distribution families did not differ much for this data set. As for the three methods of modeling, they appear to converge in the long run.

Claim frequency is not the only quantity of interest: companies also need to know how much a customer may claim for a future loss, which is very difficult to predict. We have therefore fitted some distributions without covariates to get an idea of future losses, and we have fitted models with covariates as well. It was difficult to identify the significant variables for the claim sizes. However, when the Inverse Gaussian family was used under GLM, the results showed that some covariates may affect the size of the claims.

After modeling the claim frequencies and the claim sizes, we were able to identify the factors that influence the occurrence of claims and the size of the claims.


This thesis was written as part of my master's degree in modeling and data analysis during the period August 2014 to May 2015, alongside various master's courses at the University of Oslo.

I really enjoyed working on this thesis and have learned many new things that I could not have imagined when I started. First and foremost, I would like to thank my kind supervisors, Anders Rygh Swensen and Nils Fridthjov Haavardsson. Thank you both so much for the many good conversations, the helpful instructions and your valuable time. Special thanks to Anders for always keeping your door open and tolerating my "silly" questions, and to Nils Haavardsson for taking the time to meet with me even after your work at DNB.

I would especially like to thank all the people in B800 for many interesting conversations and a friendly environment.

Last but not least, I would like to thank my family and friends for their endless support. Special thanks go to my elder brother Syed, who has always been my idol. His motivation and guidance were precious to me. In addition, special thanks go to my parents back home in Dhaka, who keep supporting me and inspiring me to stay the course. I really appreciate it.

Shanjida

May, 2015


1 Introduction
2 Description of The Data
  2.1 Descriptive Statistics: Individual Variables
    2.1.1 Claim Frequency With Respect to Age Groups
    2.1.2 Box Plots for Cottage Age Against Claims Due To Water Damage
    2.1.3 Claims and Cellar: Water Damages
    2.1.4 Age Groups Of Policy Holders Covering Only Water Claims
  2.2 Descriptive Statistics: Association Between Variables
    2.2.1 Distance From The Road and Surface Category: Water Claims
    2.2.2 Surface Category and Water Distribution: Water Claims
    2.2.3 Water and Cottage Age Category: Water Claims
3 Theory for Claim Frequency Modeling
  3.1 Generalized Linear Modeling
    3.1.1 Poisson Family
    3.1.2 Negative Binomial Due to Poisson Over Dispersion
    3.1.3 Data Truncation: Complimentary Log Link
  3.2 Model Validation and Model Checking Tools
4 Selection of Important Risk Drivers in The Model
  4.1 Model Selection Process
  4.2 Variable Selection Criteria
  4.3 Modeling: Forward Selection
    4.3.1 Selection Criterion AIC
    4.3.2 Selection Criterion P Value
  4.4 Model Validation Using Diagnostics Plots
    4.4.1 Water Claims
    4.4.2 Fire, Theft and Others Claims
    4.4.3 Summary Discussion
5 Modeling The Size of The Claims
  5.1 Modeling Theory For Claim Sizes
  5.2 Analyzing Water Claims
    5.2.1 Descriptive Statistics
    5.2.2 Modeling Using Different Distributions: Without Covariates
    5.2.3 Modeling Using Different Distributions: With Covariates
  5.3 Analyzing Claims Due to Fire, Theft and Other Damages
    5.3.1 Descriptive Statistics
    5.3.2 Modeling Using Different Distribution: Without Covariates
    5.3.3 Modeling Using Different Distribution: With Covariates
6 Summary and Conclusion
  6.1 Summary Of The Most Important Topics in This Thesis
  6.2 Challenges and Further Work
APPENDICES
A Additional Tables For Forward Selection in Chapter 4
  A.1 Tables For Steps in Forward Selection: Water Claims
B R codes
  B.1 Chapter 2
  B.2 Chapter 3
  B.3 Chapter 4
    B.3.1 AIC: Poisson
    B.3.2 P Value: Negative Binomial
  B.4 Chapter 5

2.1 Filtration of data
2.2 Claim frequency (due to water damages) against age groups
2.3 Claim frequency for damages due to fire, theft and others
2.4 Box plot of cottage age against number of claims
2.5 Comparison of cellar in claims and no claims groups
2.6 Age category distribution of policy holders
2.7 Distance road and surface category as presented in table 2.9
2.8 Water and surface category bar plot
4.1 Deviance residual plot
4.2 Anscombe residuals for fitted model
4.3 Leverage and Cook's distance
4.4 Deviance residuals plot
4.5 Leverage and Cook's distance for fitted model
5.1 Percentiles plot of the claim sizes (n = 271)
5.2 Density plot for claim sizes (n = 271)
5.3 Density plot for claim size below and over 90% (n = 244 and n = 27 respectively)
5.4 Box plot of alarms regarding claim sizes
5.5 QQ and density plot for log-normal (data below 90%, n = 244)
5.6 QQ and density plot for log-normal (data over 90%, n = 27)
5.7 QQ plot for gamma distribution (n = 244 and n = 27)
5.8 QQ plot for Pareto distribution (n = 271 and n = 27)
5.9 Density plot for damages regarding fire, theft and other
5.10 Percentile plot for damages regarding fire, theft and other (n = 330)
5.11 QQ plot and density plot for log-normal (below 90%)
5.12 QQ plot and density plot for log-normal (over 90%)
5.13 Gamma distribution QQ plot
5.14 QQ plot for Pareto

2.1 Variables to find out different coverage of the policies
2.2 Data set overview
2.3 Variable summary
2.4 Water claims frequency table for different age groups
2.5 Fire, theft and others claims frequency table for different age groups
2.6 Distribution of claims over different age groups of the cottages
2.7 Claim percentage on cellar and no cellar groups
2.8 Distribution of claims over different age groups of policy holders
2.9 Cross-table showing association between surface of the houses, distance from the main road and number of claims
2.10 Cross tabulation for showing the association between the availability of water in the cottages, surface of the cottages and the number of claims: Water damages
2.11 Cross tabulation of association between age of the cottages, the availability of water and number of claims
3.1 Evaluating three different ways of modeling and how they converge
3.2 Number of claims
4.1 Summaries from models
4.4 Summaries from the fitted models on claims covering fire, theft and others elements
4.5 AIC values at different steps of forward selection
4.6 Model summary for negative binomial family
4.7 Comparing different models for water claims
4.8 Comparing different models for fire, theft and other claims
4.9 Comparing results by Hosmer Lemeshow Test
5.2 Summaries from the variable "Tube"
5.3 Summaries from variable "Distance Road"
5.4 AIC values for modeling claim size (below 90%)
5.5 AIC values for modeling claim size under gamma family (over 90%)
5.6 Inverse Gaussian modeling for claim sizes (below 90%)
5.7 Second step in forward selection in claim size modeling (below 90%)
5.8 Percentiles of the claim sizes of fire, theft and other types of damages (n = 330)
A.2 Step Three
A.3 Step Four
A.4 Step Five
A.5 Step Six

Introduction

The most convenient way to protect ourselves against unexpected financial loss is to insure our valuable possessions, for instance houses, cars, boats and other types of personal property. That is why people buy insurance, and why there are so many insurance companies nowadays. Insurance companies take over the risk attached to our valuable property. To control or deal with these risks in property insurance, we need to know the factors behind the losses. Although the companies always aim to serve their customers, it is also important for them to know how to deal with the customers in a way that creates a win-win situation for both parties.

Modeling in the insurance industry is mainly divided into two parts: claim frequency and claim size [1]. Most models for claim frequencies are based on Poisson regression, but in special cases the negative binomial family may be used for modeling the claim frequencies instead.

Knowledge of claim frequencies alone is not enough for insurance practitioners. They also need to know how the losses are distributed, how extreme they are and how to handle different categories of loss amount; for example, there may be some very large losses that need special treatment.

For this report we have a data set describing insurance policies covering around 20000 cabins ¹ during the period August 2011 to August 2013. The policies cover four categories: water, fire, theft and a residual category. The data contain the number of claims as well as the claim sizes.

While modeling the claims we have focused on individual policies over different time periods, but not on the individual policy holders. The chapters are organized to present different types of analysis.

¹ The popular Norwegian vacation houses known as "Hytte".

Chapter Two

This chapter describes the original data file, explains how we filtered the data set for our analysis, and then presents some descriptive analysis to see whether one or more variables are reasonable to include in a GLM.

Chapter Three

This chapter mainly deals with the background and theory for the modeling used in later chapters. One important question here is whether to add the exposure to the right-hand side of the model as an explanatory variable or simply to divide the response by the exposure value.

Chapter Four

AIC and the p-value are both criteria for selecting variables in linear as well as non-linear regression models. In this chapter we try to find the best-fitting model using several selection criteria in the model selection process, and we also use several families of distributions under GLM.

Chapter Five

For the claim sizes it is difficult to find out how the losses depend on the covariates. We can try several options, for example the log-normal, Inverse Gaussian or gamma family under GLM. However, these distribution families often fail to explain how the covariates are related to the response (the claim sizes), which leads us to consider a model without covariates for the claim sizes.

Chapter Six

This chapter gives a brief summary of the whole work. Although we tried to perform an accurate data analysis, some limitations remain. Those limitations are also discussed in this chapter.

Appendix A: Tables displaying the steps in the forward selection procedure.

Appendix B: R Codes


Description of The Data

The main data file contains 415888 rows and 33 columns; the 33 columns are the variables in the data set. In total we have 21923 policies. Only the policies for the main building were selected, which means that we discarded the policies for external parts of the cottages, for instance a garage or any other small building on the property.

Further, we have 24 rows (2 × 4 × 3 = 24) of information for each policy holder in most cases. For some of them the number is less than 24, perhaps because the policy holder withdrew before the end of the period or for some other reason.

There are 12 (3 × 4) rows for buildings and 12 (3 × 4) rows for furniture. Within these rows we have four categories of policies (water, fire, theft and others) and three rows for each of them, because we have data for three different time periods: 1 August 2011 to 20 January 2012, 21 January 2012 to 20 January 2013, and 21 January 2013 to 1 August 2013 (table 2.1). We have considered the policies over time in our analysis; that is, the same policy at different points in time appears in our data set.

This data set contains a lot of redundancy: the same information appears in more than one variable. In total we have 33 variables, which can be divided into two parts. Some of them are used to separate and group the data (table 2.1), and the others are used for analyzing the risk factors and claim patterns (table 2.3).

Variables For Identifying The Types of Claims

The variable "Cover" is the same as "Cover Name" but coded numerically (1 or 2). Similarly, "Element" is the coded version of "Element Name" and takes the values 1, 2, 3 or 4 for water, fire, theft and others respectively (Table 2.1). This explains the 2 × 4 = 8 rows per period and, over three periods of time, the 8 × 3 = 24 rows for most of the policies.

Based on the information about the variables in table 2.1, a sub-file was selected from the large data set, containing the insurance elements that have claims more frequently or have a higher number of claims.
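The row structure described above (2 covers × 4 elements × 3 time periods) can be checked by enumerating the combinations. The following is a minimal Python sketch for illustration only; the thesis's own code (Appendix B) is written in R, and the labels and variable names used here are assumptions, not names from the data file.

```python
from itertools import product

# Illustrative labels; in the data these are coded as Cover (1, 2) and Element (1-4).
covers = ["Buildings", "Furniture"]
elements = ["Water", "Fire", "Theft", "Others"]
periods = [
    "2011-08-01 to 2012-01-20",
    "2012-01-21 to 2013-01-20",
    "2013-01-21 to 2013-08-01",
]

# One row per (cover, element, period) combination for a complete policy.
rows_per_policy = list(product(covers, elements, periods))
rows_buildings = [row for row in rows_per_policy if row[0] == "Buildings"]

print(len(rows_per_policy))  # 24
print(len(rows_buildings))   # 12
```

A policy that was withdrawn early would simply have fewer of these combinations present.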

Table 2.1: Variables to find out different coverage of the policies

| Variable Name | Possible Values | Type of Variable |
|---|---|---|
| ID | Any number | Numerical |
| Valid From | Date between 01.08.2011 and 31.07.2013 | Numerical |
| Valid To | Date between 01.08.2011 and 31.07.2013 | Numerical |
| Cover | 1, 2 | Numerical |
| Cover Name | Innbo (Household) | Character |
| Element | 1, 2, 3, 4 | Numerical |
| Element Name | Vann (Water), Brann (Fire), Tyveri (Theft), Annet (Others) | Character |
| Number Of Cabins | 0, 1 | Numerical |

Here we chose the first three rows of the policies, which means that we consider the policies for the main house and set aside the furniture insurance data. After that, we analyzed the data for damages due to water, fire, theft and others. This leaves 215960 rows, so this part of the data is 215960/415888 × 100 ≈ 52% of the full data set.

Within this selection we divided our new file into two parts before modeling. In one we considered the claims under the element name "Water", and in the other the claims under "Fire, Theft and Others" together. The observations in the "Water" selection make up about 1/8 of the whole data set, and those in "Fire, Theft and Others" about 3/8.

The first selection experienced 358 + (2 × 5) = 368 claims: 358 policies had a single claim and 5 others had two claims each. The second selection reported 462 + (6 × 2) + 4 = 478 claims: 462 policies had only one claim, 6 policies had two claims and one policy had four claims. The data set initially looked like table 2.2; later we divided the four policy coverages into two groups. "Water" was the largest of the four, so it was treated as a single group, and the other three coverages were combined into one group for modeling.
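The arithmetic behind these row shares and claim counts can be reproduced directly from the figures quoted in the text. This is a small Python sketch, not the author's code; the variable names are assumptions, and the split of the building rows into four equal parts is inferred from the fact that 215960 / 4 = 53990, the row total reported for the water selection in table 2.6.

```python
TOTAL_ROWS = 415_888      # rows in the main data file
BUILDING_ROWS = 215_960   # rows left after setting aside the furniture cover

# Share of the full data set retained for the analysis.
building_share = BUILDING_ROWS / TOTAL_ROWS

# Assumption: the four elements contribute equally many rows.
water_rows = BUILDING_ROWS // 4
other_rows = BUILDING_ROWS - water_rows   # fire, theft and others together

# Claim counts: (number of policies) x (claims per policy), summed.
water_claims = 358 * 1 + 5 * 2
other_claims = 462 * 1 + 6 * 2 + 1 * 4

print(f"{100 * building_share:.1f}% of all rows")
print(water_rows, f"{water_rows / TOTAL_ROWS:.3f}")   # close to 1/8 = 0.125
print(other_rows, f"{other_rows / TOTAL_ROWS:.3f}")   # close to 3/8 = 0.375
print(water_claims, other_claims)
```

The shares come out slightly above 1/8 and 3/8 because the building rows are a little more than half of the file.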

Table 2.2: Data set overview

| Cover Name | Element Name | Time Variables |
|---|---|---|
| Buildings | Water | 01 August 2011-20 January 2012; 21 January 2012-20 January 2013; 21 January 2013-01 August 2013 |
| Buildings | Fire | 01 August 2011-20 January 2012; 21 January 2012-20 January 2013; 21 January 2013-01 August 2013 |
| Buildings | Theft | 01 August 2011-20 January 2012; 21 January 2012-20 January 2013; 21 January 2013-01 August 2013 |
| Buildings | Others | 01 August 2011-20 January 2012; 21 January 2012-20 January 2013; 21 January 2013-01 August 2013 |

The whole data set has experienced 1013 claims for "Buildings" and "Furniture" together. As mentioned in the previous paragraph, there were in total 368 and 478 claims for the two categories of claims under the "Building" coverage, and a further 181 claims were reported for the "Furniture" coverage.

As figure 2.1 shows, we fitted one model for the water claims only and another for the combined claims due to fire, theft and others.

Figure 2.1: Filtration of data

Variables for Modeling The Claims

These variables may be considered the actual covariates. "Customer Age" and "Cottage Age" tell us how old the policy holders and the cottages are. There is also information about the availability of fire alarms, theft alarms, a water stop and a cellar. "Fire Alarm" and "Theft Alarm" each have three categories: no alarm, a local alarm (the cabin itself has an alarm against fire or theft), and an alarm that is monitored from an alarm station. Losses can vary with the distance of the cabin from the road, so we have a variable called "Distance Road", which gives information about the distance of the cabin from the road. More information about the variables is provided in table 2.3.

Since the data cover August 2011 to August 2013, the whole period has been divided into three sub-periods, with information on each of them in a separate row. Thus we have three rows for each "Element Name". One important variable is "Exposure Days", which gives the number of days a policy holder held the policy. The variable "Number of Cabins" takes the value 1 once every 24 rows (with some exceptions), which indicates that a new policy holder enters the data set.

Table 2.3: Variable summary

| Variable Name | Possible Values | Type of Variable | Number of Missing Observations | Proportion of Missing Observations in the Data |
|---|---|---|---|---|
| Customer Age | any number | numerical | 696 | 0.167% |
| Customer Region | 01-21 | character | 0 | 0% |
| Customer District | 0100-2100 | character | 0 | 0% |
| Cottage Age | any number | numerical | 0 | 0% |
| Insurance Sum | any number | numerical | 0 | 0% |
| Surface | any number | numerical | 0 | 0% |
| Cellar | J (Yes), N (No) | character | 0 | 0% |
| Fire Alarm | 1 (no alarm), 2 (local alert), 3 (alert from alarm station) | character | 69628 | 16.7% |
| Theft Alarm | 1 (no alarm), 2 (local alert), 3 (alert from alarm station) | character | 69628 | 16.7% |
| Distance Road | 1 (over 300 meter), 2 (under 300 meter) | character | 69628 | 16.7% |
| Water | 1 (not included), 2 (included) | character | 69628 | 16.7% |
| WaterStop | J (Yes), N (No) | character | 0 | 0% |
| Tube | J (Yes), N (No) | character | 0 | 0% |
| Cottage Region | 01-21 | character | 0 | 0% |
| Cottage District | 0100-2100 | character | 0 | 0% |
| Excess | 1-15 | numerical | 0 | 0% |
| Annualized Premium | any number | numerical | 0 | 0% |
| Claim Per Exposure | any number | numerical | 414875 | 99.7% |
| Claim District | 0100-2100 | character | 414875 | 99.7% |
| Number Of Elements | | numerical | 0 | 0% |
| Policy | a number identifying the policies, similar to the variable ID | character | 0 | 0% |
| Number of Claim Events | | numerical | 0 | 0% |
| Earned Premium | | numerical | 0 | 0% |
| Exposure Days | 0-365 days | numerical | 0 | 0% |

The variables in table 2.3 were considered first through their descriptive statistics and then, based on that, for inclusion in the GLM.

The variable "Excess" in table 2.3 describes the policy holder's own risk (the deductible), which is deducted from the company's payment if any damage occurs. So if a person claims a loss of 9000 kroner and the deductible is 1000 kroner, that person will be paid 8000 kroner.
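The deductible example above amounts to a single subtraction. A minimal Python sketch, under the assumption (not stated in the text) that the payment cannot be negative when the loss is smaller than the deductible; the function name is hypothetical.

```python
def payout(claimed_loss, deductible):
    """Amount paid by the insurer after subtracting the policy holder's deductible.

    Assumption: a loss below the deductible results in no payment.
    """
    return max(claimed_loss - deductible, 0)

print(payout(9000, 1000))  # 8000, the example from the text
```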

There are some missing values in the data set, which fall into two categories. First, some are ordinary missing values that occur for some random reason. Second, the variable "Claim Per Exposure" has 414875 missing values in the main data file. The reason is simple: this variable only contains information when a claim or loss occurs; otherwise its value is 0 or NA. The same applies to the variable "Claim District".
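The missing-value proportions in table 2.3 follow from the row count of the main data file. The Python sketch below recomputes three of them from the figures in the table; the dictionary and variable names are illustrative assumptions.

```python
TOTAL_ROWS = 415_888  # rows in the main data file

# Number of missing observations, copied from table 2.3.
missing = {
    "Customer Age": 696,
    "Fire Alarm": 69_628,
    "Claim Per Exposure": 414_875,
}

missing_pct = {name: 100 * n / TOTAL_ROWS for name, n in missing.items()}

# Rows where "Claim Per Exposure" is present, i.e. rows with a claim.
rows_with_claim_info = TOTAL_ROWS - missing["Claim Per Exposure"]

for name, pct in missing_pct.items():
    print(f"{name}: {pct:.3f}%")
print(rows_with_claim_info)  # 1013
```

The 1013 rows with claim information agree with the total number of claims quoted earlier for "Buildings" and "Furniture" together.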


2.1 Descriptive Statistics: Individual Variables

Our main aim is to model the claim frequencies and to learn how different variables affect the occurrence of claims. Before modeling, we need a general overview of the variables and their descriptive statistics. This gives closer insight into the data.

2.1.1 Claim Frequency With Respect to Age Groups

We have grouped the data set based on the age of the cottages and then plotted the claim frequency (in percent) against the age groups, where

$$\text{Claim Frequency}_i = \frac{N_i}{T_i} \times 100$$

$T_i$ = number of days exposed in group $i$

$N_i$ = number of claims in group $i$
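The claim frequency defined above can be computed by summing claims and exposure days within each group and then taking the ratio. The following Python sketch uses a few made-up records purely for illustration (they are not from the thesis data), and the function name and record layout are assumptions.

```python
from collections import defaultdict

def claim_frequency_by_group(records):
    """Return 100 * N_i / T_i per group.

    records: iterable of (group, number_of_claims, exposure_days) tuples.
    Groups with zero total exposure are left out to avoid division by zero.
    """
    claims = defaultdict(int)
    exposure = defaultdict(float)
    for group, n_claims, days in records:
        claims[group] += n_claims
        exposure[group] += days
    return {g: 100 * claims[g] / exposure[g] for g in claims if exposure[g] > 0}

# Hypothetical toy records: (cottage age group, claims, exposure days).
toy_records = [
    ("0-10", 0, 365),
    ("0-10", 1, 173),
    ("11-20", 0, 365),
    ("11-20", 0, 200),
]
freq = claim_frequency_by_group(toy_records)
print(freq)
```

Because the exposure is measured in days, the resulting percentages are small even when claims are not unusually rare.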

[Plot: claim frequencies (about 0.0020 to 0.0040) against cottage age groups (agegroup1 to agegroup7)]

Figure 2.2: Claim frequency (due to water damages) against age groups

The table below presents the same information numerically.

Table 2.4: Water claims frequency table for different age groups

| Age groups | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 60+ |
|---|---|---|---|---|---|---|---|
| Claim frequency (%) | 0.004096738 | 0.003581013 | 0.0036776147 | 0.002768758 | 0.001968600 | 0.003238116 | 0.002249520 |

The previous plot and table cover only claims due to water damage. We also have the other part of the data set, with claims due to fire, theft and other reasons. The claim frequencies for that group are shown in the plot below.

[Plot: claim frequencies (about 0.0011 to 0.0015) against cottage age groups (agegroup1 to agegroup7)]

Figure 2.3: Claim frequency for damages due to fire, theft and others

Table 2.5: Fire, theft and others claims frequency table for different age groups

| Age groups | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 60+ |
|---|---|---|---|---|---|---|---|
| Claim frequency (%) | 0.001459295 | 0.001308447 | 0.001225871 | 0.001490870 | 0.001100100 | 0.001248024 | 0.001326640 |

The claim frequency is very low in this data set. Moreover, the frequencies fluctuate across the age groups, which suggests that "Cottage Age" may affect the occurrence of claims.

2.1.2 Box Plots for Cottage Age Against Claims Due To Water Damage

The box plot of the age of the cottages against the number of claims is shown below, together with the counts for each claim group:

[Box plot: cottage age (0 to 100) by number of claims (0, 1, 2); 0 = no claim (n1 = 0), 1 = one claim (n2 = 358), 2 = two claims (n3 = 5)]

Figure 2.4: Box plot of cottage age against number of claims

The box plot shows that the median age of the cottages with one claim is 30 years, although there are some outliers. It is also visible that some of the old cottages have no claims. The third box consists of only five observations, since only 5 of the policies have two claims; those policies have cottage ages close to 40 years.

We have also categorized the variable "Cottage Age"; the table below shows the number of claims by cottage age category:

Table 2.6: Distribution of claims over different age groups of the cottages ¹

| Age Groups of Cottages | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 60+ | Total |
|---|---|---|---|---|---|---|---|---|
| No Claims | 11249 | 6514 | 5314 | 10378 | 7767 | 4482 | 7923 | 53627 |
| One Claim | 100 | 50 | 44 | 61 | 32 | 32 | 39 | 358 |
| More Than One Claim | 1 | 1 | 0 | 2 | 1 | 0 | 0 | 5 |
| Total | 11350 | 6565 | 5358 | 10441 | 7800 | 4514 | 7962 | 53990 |
| Proportion (%) | 0.89 | 0.78 | 0.82 | 0.60 | 0.42 | 0.71 | 0.49 | 0.66 |

The table shows that cottages aged 0-10, 11-20 and 31-40 years have the highest numbers of claims. However, these age groups also contain a large number of cottages in the data set as a whole.
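Because the age groups differ in size, the proportion row of table 2.6 is the fairer comparison. It can be recomputed from the counts as follows; this is a Python sketch with assumed variable names, using only the numbers printed in the table.

```python
age_groups = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "60+"]
no_claims = [11249, 6514, 5314, 10378, 7767, 4482, 7923]
one_claim = [100, 50, 44, 61, 32, 32, 39]
more_than_one = [1, 1, 0, 2, 1, 0, 0]

# Share of rows with at least one claim, in percent, per cottage age group.
proportion_pct = {}
for group, n0, n1, n2 in zip(age_groups, no_claims, one_claim, more_than_one):
    total = n0 + n1 + n2
    proportion_pct[group] = round(100 * (n1 + n2) / total, 2)

print(proportion_pct)
```

The per-group values reproduce the proportion row of the table, with the highest share in the youngest group (0-10 years).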


2.1.3 Claims and Cellar: Water Damages

We can compare this variable between the claims and no claims groups by looking at the table of proportions:

Table 2.7: Claim percentage on cellar and no cellar groups ¹

| Cellar Categories | Cellar | No Cellar |
|---|---|---|
| Claims | 0.006056 | 0.006867 |
| No Claims | 0.99394 | 0.9932 |
| Total | 1 (approx.) | 1 (approx.) |

The graph below makes the comparison easier to see.

[Two bar plots of cellar (J = included, N = not included): frequency in the claims group (0 to 300) and frequency in the no claims group (0 to 40000)]

Figure 2.5: Comparison of cellar in claims and no claims groups

The visual impression corresponds well to the proportions, which are fairly equal. This may indicate that the variable is not very important when modeling the water claim frequencies.

2.1.4 Age Groups Of Policy Holders Covering Only Water Claims

As with the age of the cottages, the age of the policy holders is also important. Customers in different age groups behave differently, and that can affect the occurrence of damage. For example, a young man may take more care of his belongings than a middle-aged or older person. We can explore this idea by looking at the table that displays the distribution of claims over the age groups.

[Histogram of Policy Holders Age: frequency (0 to 8000) against policy holders' age (20 to 100)]

Figure 2.6: Age category distribution of policy holders

The categorization was chosen so that all groups have a nearly equal number of claims. Most of the policy holders are in the age group 46-60 years; that is, most of them are between fifty and sixty years old. The second largest age group is 60+, and the smallest is the group 16-30 years. People below 16 rarely own a cottage, and the same holds for those between 16 and 30. Moreover, when we plot the actual ages of the customers instead of the groups, the distribution looks approximately normal. The counts for the different groups are shown in table 2.8.

Table 2.8: Distribution of claims over different age groups of policy holders

| Age Groups of Policy Holders | 16-30 | 31-45 | 46-60 | 60+ | Total |
|---|---|---|---|---|---|
| Number of Claims Per Group | 5 | 88 | 165 | 110 | 368 |

2.2 Descriptive Statistics: Association Between Variables

Alongside the analysis of each individual variable, we have also looked at two or more variables together with the number of claims. Here we consider, for example, the variables "Surface", "Distance Road" and "Water". As in the previous section, we have categorized the variable "Surface" to understand the differences between cottages of different sizes.

2.2.1 Distance From The Road and Surface Category: Water Claims

Here we look at the surface categories by distance from the road and number of claims:

[Bar Plot of Surface and Distance Road: counts of the policies (0 to 15000) by surface category (1 to 6), for distance over 300 meter and distance below 300 meter]

Figure 2.7: Distance road and surface category as presented in table 2.9

The table corresponds to the figure and gives a more complete picture, including the actual number of claims.

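Table 2.9: Cross-table showing association between surface of the houses, distance from the main road and number of claims

| Distance from road | Claims | 0-30 | 31-60 | 61-90 | 91-120 | 121-150 | 150+ | Total |
|---|---|---|---|---|---|---|---|---|
| Over 300 meter | No claim | 495 | 4220 | 4785 | 2056 | 996 | 695 | 13247 |
| Over 300 meter | One claim | 0 | 7 | 24 | 13 | 9 | 10 | 63 |
| Over 300 meter | More than one | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| Under 300 meter | No claim | 648 | 7287 | 11761 | 6028 | 3132 | 2461 | 31317 |
| Under 300 meter | One claim | 2 | 35 | 78 | 63 | 39 | 39 | 256 |
| Under 300 meter | More than one | 0 | 1 | 1 | 1 | 0 | 1 | 4 |
| Total without missing | | 1145 | 11550 | 16649 | 8161 | 4177 | 3206 | 44888 |
| Missing observations | | | | | | | | 9102 |
| Total | | | | | | | | 53990 |

Column headings give the surface categories in square meter.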

Policies with no claim are by far the most numerous, and we know from before that our data set has few claims. Moreover, claim occurrence is higher when a cottage is near the main road. A possible reason is that a cottage near the main road usually has modern facilities such as water and electricity. If we look at the surface categories, middle-sized cottages have the highest number of claims. However, middle-sized cottages are also more numerous than the others, so a higher number of claims is to be expected for them.
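Because raw counts depend on how many policies fall into each cell, it can help to convert Table 2.9 into shares of policies with at least one claim. The following is a minimal Python sketch of that arithmetic; the counts are copied from Table 2.9, while the variable and function names are illustrative and not part of the original analysis.

```python
# Counts copied from Table 2.9; names are illustrative only.
surface = ["0-30", "31-60", "61-90", "91-120", "121-150", "150+"]
counts = {
    "over 300 m": {
        "no claim": [495, 4220, 4785, 2056, 996, 695],
        "one claim": [0, 7, 24, 13, 9, 10],
        "more than one": [0, 0, 0, 0, 1, 0],
    },
    "under 300 m": {
        "no claim": [648, 7287, 11761, 6028, 3132, 2461],
        "one claim": [2, 35, 78, 63, 39, 39],
        "more than one": [0, 1, 1, 1, 0, 1],
    },
}

def claim_share(rows):
    """Share of policies with at least one claim, per surface category."""
    shares = []
    for j in range(len(surface)):
        with_claim = rows["one claim"][j] + rows["more than one"][j]
        total = with_claim + rows["no claim"][j]
        shares.append(with_claim / total)
    return shares

def overall_share(rows):
    """Share of policies with at least one claim, all surface categories pooled."""
    with_claim = sum(rows["one claim"]) + sum(rows["more than one"])
    return with_claim / (with_claim + sum(rows["no claim"]))

shares_by_distance = {d: claim_share(r) for d, r in counts.items()}
overall_by_distance = {d: overall_share(r) for d, r in counts.items()}

for d in counts:
    print(d, "overall: %.2f%%" % (100 * overall_by_distance[d]))
    for s, p in zip(surface, shares_by_distance[d]):
        print("  %-8s %.2f%%" % (s, 100 * p))
```

With these counts the pooled share of policies with a claim is about 0.48% for cottages more than 300 meter from the road and about 0.82% for cottages closer to the road.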

2.2.2 Surface Category and Water Distribution: Water Claims

The cross tabulation of "Surface Category" with respect to "Water" and "Number of Claims" is shown in table 2.10:

Table 2.10: Cross tabulation showing the association between the availability of water in the cottages, surface of the cottages and the number of claims: Water damages

Surface categories in square meter
                             0-30    31-60    61-90   91-120  121-150     150+    Total
Water not included (=1)
  No claim                    989     7820     7118     1560      346      146    17979
  One claim                     2       13       19        4        3        1       42
  More than one                 0        0        0        0        0        0        0
Water included (=2)
  No claim                    154     3687     9428     6524     3782     3010    26585
  One claim                     0       29       83       72       45       48      277
  More than one                 0        1        1        1        1        1        5
Total without missing        1145    11550    16649     8161     4177     3206    44888
Missing observations                                                               9102
Total                                                                             53990

Figure 2.8 gives the graphical representation of "Surface Category" and the "Water" variable. Here, the variable "Water" means the availability of water in the cottages.

[Figure: bar plot of surface and water. Horizontal axis: surface category (1 to 6); vertical axis: counts of the policies (0 to 15000); legend: water not included, water included.]

Figure 2.8: Water and surface category bar plot

For now we are dealing with claims for damages due to water. We also have a variable which tells us whether an individual cabin has a water connection or not. From table 2.10 we can see that some cabins without a water connection nevertheless have damages (claims) due to water. There may be several reasons for that, for example heavy rainfall or floods.
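To put the two groups of Table 2.10 on a comparable scale, one can compute the share of policies with at least one water claim and the corresponding odds ratio. The snippet below is only a sketch of that calculation in Python; the row totals are taken from Table 2.10 and the function names are made up for illustration.

```python
# Row totals copied from Table 2.10 (policies with known surface only).
no_water = {"no claim": 17979, "one claim": 42, "more than one": 0}
water = {"no claim": 26585, "one claim": 277, "more than one": 5}

def odds_of_claim(row):
    """Odds of at least one water claim versus no claim."""
    with_claim = row["one claim"] + row["more than one"]
    return with_claim / row["no claim"]

def share_with_claim(row):
    """Share of policies with at least one water claim."""
    with_claim = row["one claim"] + row["more than one"]
    return with_claim / (with_claim + row["no claim"])

odds_ratio = odds_of_claim(water) / odds_of_claim(no_water)

print("share with claim, no water connection: %.3f%%" % (100 * share_with_claim(no_water)))
print("share with claim, water connection:    %.3f%%" % (100 * share_with_claim(water)))
print("odds ratio (water vs. no water): %.2f" % odds_ratio)
```

The shares come out at about 0.23% without and 1.05% with a water connection, an odds ratio of roughly 4.5. This is a purely descriptive figure that ignores exposure and all other covariates.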

2.2.3 Water and Cottage Age Category: Water Claims

Information from the cross tabulation of "Water" and "Cottage Age Groups" could also be interesting to look at. This is what we have in table 2.11.

Table 2.11: Cross tabulation of association between age of the cottages, the availability of water and number of claims

Age groups of the cottages
                             0-10    11-20    21-30    31-40    41-50    51-60      60+
Water Not Included
  No Claims                  2012     1878     2072     3649     3643     1838     2887
  One Claim                    12        3        2        7        8        4        6
  More Than One Claim           1        0        0        0        0        0        0
Water Included
  No Claims                  7259     3514     2276     5174     2881     1869     3612
  One Claim                    79       42       35       49       20       24       28
  More Than One Claim           1        1        0        2        1        0        0

Having looked at the descriptive statistics in section 2.1 and section 2.2, we have some idea about the variables and can now start modeling with the variables which we found reasonable. Some of the variables showed different results in different groups, meaning that they may have an effect on the response. For example, the number of claims differs with the surface of the cottages, and the same is true for the different categories of the variable "Distance Road". We therefore consider these variables when we start modeling.


Theory for Claim Frequency Modeling

In this chapter we shall present the possibilities of using several statistical models and will explain how the parameters are estimated.

3.1 Generalized Linear Modeling

Generalized Linear Models (GLMs) are commonly used to model the relationship between one response and one or more covariates.

A GLM has three parts.

The first part is called the linear predictor,

$$\eta = \beta_1 + \beta_2 x$$

and the second part is the link function, where we assume that $\mu = E(y)$:

$$g(\mu) = \eta$$

where $g$ is a smooth, monotonic function. The linear predictor builds the relationship between $\eta$ and the covariate $x$. Here the assumption is that there exists a linear relationship between $\eta$ and $x$, where $\beta_1$ is the intercept and $\beta_2$ is the slope. The link function links the expected value $\mu$ of the response variable to the linear predictor $\eta$. The third component is the random or stochastic component, which specifies the distribution of the response variable $y$. The observations $y_1, \dots, y_n$ are assumed to be independent, and the density of $y_i$ is assumed to come from the exponential family ¹.

¹ Gaussian, Poisson, Binomial, Gamma, Beta, etc.

In our case we have considered the "Number of Claims" or the "Occurrence of Claims" as a count variable, which means that the observations are non-negative integers that come from counting rather than ranking. However, our data is special in that most of the values are 0 or 1, with only a few policies having more than one claim. Since at least some of the counts are other than zero, it makes sense to use a Poisson or Negative Binomial family.

Moreover, in the end it was hard to make interpretations and forecasts when treating the data as counts. So, some of the interpretations were done under the binary or odds ratio concept.

For this report we will mainly focus on the Poisson and Negative Binomial families for modeling the claim frequencies.

3.1.1 Poisson Family

We want to model claim frequency, and we know that claim frequency is a count variable. Thus, a possible model for this purpose is a Poisson regression ¹ for the count dependent variable named "number of claims". In Poisson regression, the mean $\mu_i$ is explained in terms of explanatory variables through an appropriate link function. So, we can write the Poisson regression model as

$$y_i \sim P(\mu_i); \quad g(\mu_i) = x_i'\beta$$

Popular choices for $g(\mu_i)$ are the identity link $\mu_i = x_i'\beta$ and the log link $\log \mu_i = x_i'\beta$. With the log link, $\mu_i = \exp(x_i'\beta)$ is always positive, whereas with the identity link positivity is not guaranteed. In this section we will mostly deal with the log link.
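The difference between the two links can be seen on a toy example. The sketch below, in Python, evaluates the mean under the identity link and under the log link for one covariate; the coefficient values are hypothetical and were chosen only so that the linear predictor changes sign.

```python
import math

def mean_identity(beta1, beta2, x):
    """Identity link: mu equals the linear predictor itself."""
    return beta1 + beta2 * x

def mean_log(beta1, beta2, x):
    """Log link: mu is the exponential of the linear predictor."""
    return math.exp(beta1 + beta2 * x)

# Hypothetical coefficients, chosen so that the linear predictor changes sign.
beta1, beta2 = 1.0, -0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0]

identity_means = [mean_identity(beta1, beta2, x) for x in xs]
log_means = [mean_log(beta1, beta2, x) for x in xs]

for x, mi, ml in zip(xs, identity_means, log_means):
    print("x=%.1f  identity mu=%+.3f  log-link mu=%.3f" % (x, mi, ml))
```

For the larger values of x the identity link returns a negative "mean", which is not admissible for a Poisson count, while the log link stays positive for every x.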

In addition, there is another approach to modeling count data when the Poisson model does not fit the data well because of overdispersion, which will be discussed briefly in a later part of this chapter.

¹ Using the Poisson family under GLM.

While building the models for claim frequency we need to take into account the exposure, that is, the number of policy years to which the customers were exposed. There are two ways of building the model. First, we can treat the exposure as an offset on the right-hand side of the model. Second, we can move the exposure to the left-hand side by dividing the response by it. In equation form (assuming only one covariate), let

$y_i$ = dependent variable
$x_i$ = independent variable
$T_i$ = the exposure
$\beta_1$ = intercept (coefficient 1)
$\beta_2$ = coefficient 2

So the Poisson model with log link and exposure $T_i$ can be formulated as

$$\log(\mu_i) = \beta_1 + \beta_2 x_i + \log(T_i)$$

or we can write it as

$$E(y_i \mid x_i) = \mu_i = T_i e^{\beta_1 + \beta_2 x_i} \qquad (3.1)$$

Now the log-likelihood can be written as

$$
\begin{aligned}
L = \log(l) &= \log\left(\prod_{i=1}^{n} \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}\right) \\
&= \sum_{i=1}^{n} \left( y_i \log \mu_i - \mu_i - \log(y_i!) \right) \qquad (3.2) \\
&= \sum_{i=1}^{n} \left( y_i \log\left(T_i e^{\beta_1 + \beta_2 x_i}\right) - T_i e^{\beta_1 + \beta_2 x_i} - \log(y_i!) \right) \\
&= \sum_{i=1}^{n} \left( y_i \left(\beta_1 + \beta_2 x_i + \log(T_i)\right) - T_i e^{\beta_1 + \beta_2 x_i} - \log(y_i!) \right)
\end{aligned}
$$

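Now, differentiating with respect to $\beta_1$ and equating to zero, we get

$$
\begin{aligned}
\frac{d}{d\beta_1}\left[\sum_{i=1}^{n} \left( y_i \left(\beta_1 + \beta_2 x_i + \log(T_i)\right) - T_i e^{\beta_1 + \beta_2 x_i} - \text{constant} \right)\right] &= 0 \\
\Rightarrow \sum_{i=1}^{n} \left( y_i - T_i e^{\beta_1 + \beta_2 x_i} \right) &= 0 \\
\Rightarrow \sum_{i=1}^{n} y_i &= \sum_{i=1}^{n} T_i e^{\beta_1 + \beta_2 x_i} \qquad (3.3)
\end{aligned}
$$

To get the value of the parameter $\beta_2$, we differentiate $L$ with respect to $\beta_2$ and equate this to zero:

$$
\begin{aligned}
\frac{d}{d\beta_2}\left[\sum_{i=1}^{n} \left( y_i \left(\beta_1 + \beta_2 x_i + \log(T_i)\right) - T_i e^{\beta_1 + \beta_2 x_i} - \log(y_i!) \right)\right] &= 0 \\
\Rightarrow \sum_{i=1}^{n} \left( x_i y_i - T_i x_i e^{\beta_1 + \beta_2 x_i} \right) &= 0 \\
\Rightarrow \sum_{i=1}^{n} x_i y_i &= \sum_{i=1}^{n} T_i x_i e^{\beta_1 + \beta_2 x_i} \qquad (3.4)
\end{aligned}
$$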
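The derivatives behind equations 3.3 and 3.4 can be checked numerically. The following Python sketch codes the log-likelihood of equation 3.2 and its two partial derivatives, then compares them with central finite differences. The data set and the trial coefficients are made up for the check, and `math.lgamma(y + 1)` is used for $\log(y_i!)$.

```python
import math

def loglik(beta1, beta2, x, y, T):
    """Poisson log-likelihood (3.2) with log link and exposure T_i."""
    total = 0.0
    for xi, yi, Ti in zip(x, y, T):
        eta = beta1 + beta2 * xi + math.log(Ti)
        total += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return total

def score(beta1, beta2, x, y, T):
    """Partial derivatives of the log-likelihood; both are zero at the maximum."""
    s1 = s2 = 0.0
    for xi, yi, Ti in zip(x, y, T):
        mu = Ti * math.exp(beta1 + beta2 * xi)
        s1 += yi - mu            # left side minus right side of (3.3)
        s2 += xi * (yi - mu)     # left side minus right side of (3.4)
    return s1, s2

# Tiny made-up data set, used only to check the algebra.
x = [9.1, 10.4, 10.0, 11.2, 9.7]
y = [2, 4, 3, 5, 1]
T = [8.0, 12.5, 10.0, 11.0, 7.5]
b1, b2 = -1.0, 0.01

# Central finite differences of the log-likelihood.
h = 1e-6
num1 = (loglik(b1 + h, b2, x, y, T) - loglik(b1 - h, b2, x, y, T)) / (2 * h)
num2 = (loglik(b1, b2 + h, x, y, T) - loglik(b1, b2 - h, x, y, T)) / (2 * h)
s1, s2 = score(b1, b2, x, y, T)

print("d/d beta1: analytic %.6f, numeric %.6f" % (s1, num1))
print("d/d beta2: analytic %.6f, numeric %.6f" % (s2, num2))
```

The analytic and numeric derivatives should agree to several decimals, which supports the form of the two estimating equations.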

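Further, if we consider the offsets on the left side of the model, we can write it as

$$E\left(\frac{y_i}{T_i}\right) = \mu_i = e^{\beta_1 + \beta_2 x_i}$$

We can then write the log of the function to be maximized, which can be considered as an alternative to equation 3.2, as

$$
\begin{aligned}
L &= \sum_{i=1}^{n} \left( \frac{y_i}{T_i} \log \mu_i - \mu_i - \log\left(\left(\frac{y_i}{T_i}\right)!\right) \right) \\
&= \sum_{i=1}^{n} \left( \frac{y_i}{T_i} \left(\beta_1 + \beta_2 x_i\right) - e^{\beta_1 + \beta_2 x_i} - \log\left(\left(\frac{y_i}{T_i}\right)!\right) \right)
\end{aligned}
$$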

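Now, differentiating it with respect to $\beta_1$ and equating to zero, we get

$$
\begin{aligned}
\frac{dL}{d\beta_1} = \sum_{i=1}^{n} \left( \frac{y_i}{T_i} - e^{\beta_1 + \beta_2 x_i} \right) &= 0 \\
\Rightarrow \sum_{i=1}^{n} \frac{y_i}{T_i} &= \sum_{i=1}^{n} e^{\beta_1 + \beta_2 x_i} \qquad (3.5)
\end{aligned}
$$

Again, differentiating with respect to $\beta_2$ and equating to zero, we get

$$
\begin{aligned}
\frac{dL}{d\beta_2} = \sum_{i=1}^{n} \left( \frac{x_i y_i}{T_i} - x_i e^{\beta_1 + \beta_2 x_i} \right) &= 0 \\
\Rightarrow \sum_{i=1}^{n} \frac{x_i y_i}{T_i} &= \sum_{i=1}^{n} x_i e^{\beta_1 + \beta_2 x_i} \qquad (3.6)
\end{aligned}
$$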

In the calculations above we are trying to find the values of the coefficients of a "GLM" where the family is Poisson. If we compare equations 3.3 and 3.5, or equations 3.4 and 3.6, we can see that the expressions are not the same unless $T_i = T$ for all $i$. In one case we use the actual observations $y_i$, and in the other we use the converted observations $y_i / T_i$, so it makes sense that they will not produce the same result for the coefficients either.
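The special case of equal exposure can be checked directly. The short derivation below is a sketch that uses only equations 3.3 to 3.6 together with the assumption $T_i = T$ for every policy; under that assumption the two sets of estimating equations coincide.

```latex
% Assumption: every policy has the same exposure, T_i = T.
\begin{aligned}
\text{(3.3):}\quad \sum_{i=1}^{n} y_i = T \sum_{i=1}^{n} e^{\beta_1 + \beta_2 x_i}
  &\iff \sum_{i=1}^{n} \frac{y_i}{T} = \sum_{i=1}^{n} e^{\beta_1 + \beta_2 x_i}
  \quad \text{(3.5)} \\
\text{(3.4):}\quad \sum_{i=1}^{n} x_i y_i = T \sum_{i=1}^{n} x_i e^{\beta_1 + \beta_2 x_i}
  &\iff \sum_{i=1}^{n} \frac{x_i y_i}{T} = \sum_{i=1}^{n} x_i e^{\beta_1 + \beta_2 x_i}
  \quad \text{(3.6)}
\end{aligned}
```

When the exposures differ, $T_i$ cannot be taken outside the sums. The second method then weights each residual $y_i - T_i e^{\beta_1 + \beta_2 x_i}$ by $1/T_i$, which is why the two solutions differ in finite samples.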

Simulation Experiment

We simulate a data set from known distributions and then see how the regression coefficients are estimated by different methods.

Say we have $\beta_1 = -1.2$ and $\beta_2 = 0.005$. After simulating data with R, we obtain the coefficients from the usual GLM method available in R, and we also use the "optim" function in R on two different expressions of the function to be optimized. One of them is the likelihood. The other is not a likelihood but rather a maximization function for the transformed response, since we divide the response by the offsets.

In the first one we optimize the function

$$L = \sum_{i=1}^{n} \left( y_i \left(\beta_1 + \beta_2 x_i + \log(T_i)\right) - T_i e^{\beta_1 + \beta_2 x_i} - \log(y_i!) \right)$$

In contrast, in the second method we use a function of the $y_i$'s, namely $f(y_i) = y_i / T_i$, as the observations, and we then need to maximize the function

$$L = \sum_{i=1}^{n} \left( \frac{y_i}{T_i} \left(\beta_1 + \beta_2 x_i\right) - e^{\beta_1 + \beta_2 x_i} - \log\left(\left(\frac{y_i}{T_i}\right)!\right) \right)$$

We have simulated data in "R" and analyzed the simulated data to see how the methods work differently, or whether they converge in the long run.

To ease the understanding of how the data has been simulated, a simple example of a single simulation with sample size n = 200 is shown in the algorithm below.

Algorithm 1 Simple Example of Data Simulation

1: Choose $n = 200$, $\beta_1 = -1.2$, $\beta_2 = 0.005$
2: Generate a random sample of $x_i$ from $N(10, 1)$
3: Consider $T_i \leftarrow$ gamma(10)
4: To get $y$ (the response variable):
5:   $y_i$ = rpois($n$, $\lambda_i = T_i e^{\beta_1 + \beta_2 x_i}$)
6: Model 1 $\leftarrow$ Fit the first model using the GLM function in R
7: Use the optim function in R for the other two models:
8:   Model 2 $\leftarrow$ Maximize $\left[ \sum_{i=1}^{n} \left( y_i (\beta_1 + \beta_2 x_i + \log(T_i)) - T_i e^{\beta_1 + \beta_2 x_i} - \log(y_i!) \right) \right]$
9:   Model 3 $\leftarrow$ Maximize $\left[ \sum_{i=1}^{n} \left( \frac{y_i}{T_i} (\beta_1 + \beta_2 x_i) - e^{\beta_1 + \beta_2 x_i} - \log\left(\left(\frac{y_i}{T_i}\right)!\right) \right) \right]$
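The steps of Algorithm 1 can be sketched in code. The Python version below is not the R code used for the thesis: it relies only on the standard library, draws Poisson variates with Knuth's multiplication method, assumes a gamma scale of 1 for the exposure (the algorithm only states gamma(10)), and replaces `optim` by a two-parameter Newton iteration. Model 1 is not fitted separately, since a Poisson GLM with $\log(T_i)$ as offset maximizes the same likelihood as Model 2. The sample size, seed and function names are choices made for this illustration.

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson variate by Knuth's multiplication method (small lam only)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate(n, beta1, beta2, seed=1):
    """Steps 1 to 5 of Algorithm 1; the gamma scale of 1 is an assumption."""
    rng = random.Random(seed)
    x = [rng.gauss(10.0, 1.0) for _ in range(n)]
    T = [rng.gammavariate(10.0, 1.0) for _ in range(n)]
    y = [rpois(Ti * math.exp(beta1 + beta2 * xi), rng) for xi, Ti in zip(x, T)]
    return x, y, T

def fit(x, r, m, iters=50, tol=1e-10):
    """Maximise sum_i r_i*(b1 + b2*x_i) - m_i*exp(b1 + b2*x_i) by Newton's method."""
    b1, b2 = math.log(sum(r) / sum(m)), 0.0   # intercept-only starting value
    for _ in range(iters):
        g1 = g2 = a11 = a12 = a22 = 0.0
        for xi, ri, mi in zip(x, r, m):
            mu = mi * math.exp(b1 + b2 * xi)
            g1 += ri - mu                      # gradient, first component
            g2 += xi * (ri - mu)               # gradient, second component
            a11 += mu                          # negative Hessian entries
            a12 += mu * xi
            a22 += mu * xi * xi
        det = a11 * a22 - a12 * a12
        d1 = (a22 * g1 - a12 * g2) / det       # Newton step = (-Hessian)^-1 * gradient
        d2 = (a11 * g2 - a12 * g1) / det
        b1, b2 = b1 + d1, b2 + d2
        if abs(d1) + abs(d2) < tol:
            break
    return b1, b2

beta1, beta2 = -1.2, 0.005
x, y, T = simulate(5000, beta1, beta2)

# Model 2: y_i as response, exposure T_i as multiplier of the mean (offset).
model2 = fit(x, y, T)
# Model 3: y_i / T_i as response, no exposure on the right-hand side.
model3 = fit(x, [yi / Ti for yi, Ti in zip(y, T)], [1.0] * len(x))

print("true values:", (beta1, beta2))
print("Model 2:", model2)
print("Model 3:", model3)
```

The terms $\log(y_i!)$ and $\log((y_i/T_i)!)$ are left out of `fit` because they do not depend on the coefficients and so do not affect the maximizer.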

From equation 3.1, taking expectations over $T_i$ and $x_i$, which are generated independently, we have $\mu_i = E(T_i)\,E(e^{\beta_1})\,E(e^{\beta_2 x_i})$, where $\mu_i$ is the mean of the $y_i$'s (the response) by definition under the GLM method.

In our case, we have simulated data using $\lambda_i = T_i e^{\beta_1 + \beta_2 x_i}$ (see algorithm 1) as the mean of our simulated random sample of responses. So we expect the sample mean of the $y_i$ and the expression above to be approximately equal if our simulation method is correct.

Now we have

$$
\begin{aligned}
E(T_i)\,E(e^{\beta_1})\,E(e^{\beta_2 x_i}) &= E(T_i)\,E(e^{\beta_1})\,e^{\operatorname{mean}(x_i)\beta_2 + \frac{1}{2}\sigma^2\beta_2^2} \\
&= 10.09281 \times 0.3011942 \times 1.221574 \\
&= 3.713459
\end{aligned}
$$

where the last factor uses the moment generating function of the normal distribution ¹. On the other hand, $\frac{1}{n}\sum_{i=1}^{n} y_i = 3.55$.

So we can see that both sides of equation 3.1 are approximately equal for our simulated data set.
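The closed form used for $E(e^{\beta_2 x_i})$ is the moment generating function of a normal variable. As a sanity check of that identity, the Python sketch below compares the closed form with a numerical integration of $e^{tx}$ against the normal density. The value of $t$ is arbitrary and chosen for illustration only, and the function names are hypothetical.

```python
import math

def normal_mgf(t, mu, sigma):
    """Closed-form moment generating function of N(mu, sigma^2) at t."""
    return math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mgf_by_quadrature(t, mu, sigma, steps=20000):
    """E[exp(t*X)] by the midpoint rule on mu +/- 12 sigma."""
    lo, hi = mu - 12.0 * sigma, mu + 12.0 * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += math.exp(t * x) * normal_pdf(x, mu, sigma)
    return total * h

t = 0.1                  # arbitrary illustrative value, playing the role of beta_2
mu, sigma = 10.0, 1.0    # x_i ~ N(10, 1) as in Algorithm 1
closed_form = normal_mgf(t, mu, sigma)
numeric = mgf_by_quadrature(t, mu, sigma)
print("closed form: %.6f, quadrature: %.6f" % (closed_form, numeric))
```

In the text the population mean and variance are replaced by the sample mean and variance of the simulated $x_i$, which is why the reported factor depends on the particular sample.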

We now fit the three different modeling methods to simulated data with several sample sizes and show the results in table 3.1:

Table 3.1: Evaluating three different ways of modeling to see how they converge

Model                  Coefficient   n=2000                  n=20000                 n=1000000
GLM                    β₁            -1.079978 (0.125479)    -1.181537 (0.039702)    -1.217676 (0.017905)
                       β₂            -0.008214 (0.012474)     0.002994 (0.003953)     0.006819 (0.001780)
Y_i as response        β₁            -1.080438518            -1.184160382            -1.239030499
                       β₂            -0.008180521             0.004204871             0.008942016
Y_i/T_i as response    β₁            -1.07851133             -1.232840499            -1.21513325
                       β₂            -0.00914358              0.007884979             0.00648902

In the above table the standard errors of the GLM estimates are shown in brackets. We have the values of the parameters from three different models, and they converge as the sample size increases.

¹ Since $x_i \sim N(10, 1)$, $E(e^{\beta_2 x_i})$ can be written as the moment generating function of the normal distribution, i.e. $M_x(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}$, with $t = \beta_2$.

3.1.2 Negative Binomial Due to Poisson Overdispersion

In statistics, by overdispersion of count data we mean that the ratio between the variance and the mean is greater than one. If the variance of the data set is larger than the mean, then there is a possibility of overdispersion, and we can use negative binomial regression in this case. Thus suppose $\lambda$ is a positive continuous random variable with probability function $g(\lambda)$. Given $\lambda$, the variable $y$ is distributed as $P(\lambda)$, where $y$ is a count variable. Then the probability function of $y$ is

$$f(y) = \int_{0}^{\infty} \frac{e^{-\lambda}\lambda^{y}}{y!}\, g(\lambda)\, d\lambda$$

A convenient choice for $g(\lambda)$ is the gamma probability function $G(\mu, \nu)$, implying that $f(y)$ is $NB(\mu, \kappa)$ where $\kappa = 1/\nu$.

Now, if we take the gamma density

$$g(\lambda) = \frac{\lambda^{-1}}{\Gamma(\nu)} \left(\frac{\lambda\nu}{\mu}\right)^{\nu} e^{-\lambda\nu/\mu}, \quad \lambda > 0,$$

then $f(y)$ becomes [2]

$$f(y) = \frac{\Gamma(\nu + y)}{y!\,\Gamma(\nu)} \left(\frac{\nu}{\nu + \mu}\right)^{\nu} \left(\frac{\mu}{\nu + \mu}\right)^{y}, \quad y = 0, 1, 2, \dots$$

So we can say that the negative binomial arises when risks are divided into different groups, each group characterized by a separate Poisson mean, and these means are distributed according to the gamma distribution.

As an illustration, consider the number $y$ of claims involving a randomly chosen customer from a population. If the mean accident rate $\lambda$ is homogeneous over the population, then perhaps $y \sim P(\lambda)$. However, individuals may have different levels of damage tendency, which implies that $\lambda$ is heterogeneous across individuals. If $\lambda \sim G(\mu, \nu)$, so that the claim rate is gamma distributed over the population, then $y \sim NB(\mu, \kappa)$:

$$f(y) = \frac{\Gamma\left(y + \frac{1}{\kappa}\right)}{y!\,\Gamma\left(\frac{1}{\kappa}\right)} \left(\frac{1}{1 + \kappa\mu}\right)^{1/\kappa} \left(\frac{\kappa\mu}{1 + \kappa\mu}\right)^{y}$$

with $E(y) = \mu$ and $V(y) = \mu(1 + \kappa\mu)$. Moreover, $f(y)$ is a member of the exponential family of distributions with $\phi = 1$, $\theta = \ln\left(\frac{\mu}{1 + \kappa\mu}\right)$ and $a(\theta) = -\frac{1}{\kappa} \ln\left(1 - \kappa e^{\theta}\right)$. The variance function for the negative binomial is given by

$$V(\mu) = \frac{Var(y)}{\phi} = \mu(1 + \kappa\mu)$$
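The stated mean and variance can be verified numerically from the probability function. The Python sketch below evaluates $f(y)$ in the $(\mu, \kappa)$ parametrisation on the log scale with `math.lgamma` and sums over a long range of $y$. The parameter values are hypothetical and serve only as an example.

```python
import math

def nb_pmf(y, mu, kappa):
    """Negative binomial probability f(y) in the (mu, kappa) parametrisation."""
    r = 1.0 / kappa
    log_f = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
             + r * math.log(1.0 / (1.0 + kappa * mu))
             + y * math.log(kappa * mu / (1.0 + kappa * mu)))
    return math.exp(log_f)

mu, kappa = 0.5, 2.0     # hypothetical values for illustration

# Sum over enough terms that the remaining tail is negligible.
probs = [nb_pmf(y, mu, kappa) for y in range(200)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print("sum of probabilities: %.6f" % total)
print("mean: %.6f (mu = %.6f)" % (mean, mu))
print("variance: %.6f (mu * (1 + kappa * mu) = %.6f)" % (var, mu * (1 + kappa * mu)))
```

For any $\kappa > 0$ the variance exceeds the mean, which is the overdispersion that the Poisson family cannot represent.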

3.1.3 Data Truncation: Complementary Log-Log Link

Consider a situation where the number of occurrences of claims is not of interest, and we are interested only in whether an event has occurred or not.

From the data set overview in the previous chapter, we can see that only a few policies have more than one claim, and nearly all policies with a claim have exactly one. This opens the possibility of a categorical/binomial response. That is, if we only want to know whether there is any claim, instead of looking at the number of claims, this leads us to a binomial response model. Let us define the response $y_i'$ as

$$y_i' = \begin{cases} 0, & \text{if } y_i = 0 \\ 1, & \text{otherwise} \end{cases}$$

Then, clearly, the $y_i'$ are independent $b(1, 1 - e^{-\mu_i})$ [6].

Let $\pi_i = Pr(y_i' = 1)$ be the probability that the new categorical variable equals 1. The distribution of $y_i'$ is

$$f(y_i') = \frac{N!}{(N - y_i')!\,y_i'!}\, \pi_i^{y_i'} (1 - \pi_i)^{(1 - y_i')}, \quad N = 1.$$

So we can find $\pi_i$ as

$$
\begin{aligned}
\pi_i &= Pr(y_i > 0) \\
&= 1 - Pr(y_i \le 0) \\
&= 1 - \frac{\exp(-\mu_i)\,\mu_i^{0}}{0!} \\
&= 1 - \exp(-\mu_i) \\
&= 1 - \exp\left(-\exp(\alpha + \beta x_i)\right)
\end{aligned}
$$

Now we can find a new link function for this. We have

$$
\begin{aligned}
\pi_i &= 1 - \exp\left(-\exp(\alpha + \beta x_i)\right) \\
\Rightarrow 1 - \pi_i &= \exp\left(-\exp(\alpha + \beta x_i)\right) \\
\Rightarrow -\log(1 - \pi_i) &= \exp(\alpha + \beta x_i) \\
\Rightarrow \log\left(-\log(1 - \pi_i)\right) &= \alpha + \beta x_i
\end{aligned}
$$

and this link function is known as the "complementary log-log link". This way of converting the observations to binomial is called "group testing" (also see [5]). More on this can be found in [6].
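The relation between the Poisson mean and the probability of at least one claim can be illustrated with a few lines of code. The Python sketch below implements the complementary log-log link and its inverse and checks that applying the link to $1 - e^{-\mu_i}$ returns the linear predictor. The coefficients and covariate values are illustrative only.

```python
import math

def cloglog(p):
    """Complementary log-log link: log(-log(1 - p)), for 0 < p < 1."""
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    """Inverse link: probability of at least one claim for linear predictor eta."""
    return 1.0 - math.exp(-math.exp(eta))

alpha, beta = -1.2, 0.005    # illustrative values only
results = []
for x in (8.0, 10.0, 12.0):
    eta = alpha + beta * x
    mu = math.exp(eta)                  # Poisson mean under the log link
    p_any = 1.0 - math.exp(-mu)         # Pr(y > 0) for a Poisson count
    results.append((x, p_any, cloglog(p_any) - eta))
    print("x=%.1f  Pr(y>0)=%.5f  cloglog(p)-eta=%.2e" % results[-1])
```

The last column is zero up to rounding, so a binary model with this link has the same linear predictor $\alpha + \beta x_i$ as the underlying Poisson model with log link.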