for Cottage Insurance Data
Master i Modellering og dataanalyse
SHANJIDA AKHTER
Master’s Thesis, Spring 2015
A number of specialized methods and techniques are available for solving problems in the insurance industry. The appropriate technique depends on the situation, and each technique has its own theory and logic. We therefore need to choose the distribution families for a particular data set based on that theory, and then fit statistical models for the claim frequencies and claim sizes.
In this thesis we have data from Norwegian holiday cottage insurance, where claim frequencies are comparatively lower than in other types of insurance. We have mainly focused on finding the factors or variables that contribute most to the occurrence of claims and losses. After identifying those variables using the generalized linear modeling (GLM) approach, we checked the accuracy of the models. Further, we used two different distribution families to see whether they give different results for this particular data set. Prior to that, in chapter three, we also worked with some theory, comparing the accuracy of three different ways of modeling claim frequency data in a simple simulation example.
Results from the models using the two distribution families did not differ much for this data set. As for the three methods of modeling, they appear to converge in the long run.
Moreover, claim frequency is not the only quantity of interest: companies also need to know how much a customer may claim for a future loss, which is very difficult to predict. Nonetheless, we have fitted some distributions without covariates to get an idea about future losses. We have fitted models with covariates as well. Here it was difficult to identify significant variables for the claim sizes. However, when using the Inverse Gaussian family under GLM, the results showed that some covariates may affect the size of the claims.
After modeling the claim frequencies and the claim sizes we were able to find
out the factors that influence the occurrence of claims and the size of the
claims.
This thesis was written as a part of my master’s degree in modeling and data analysis during the period August 2014 to May 2015, with various master courses at the University of Oslo.
I really relished working on this thesis and have learned many new things which I did not imagine when I started. First and foremost, I would like to thank my benevolent supervisors Anders Rygh Swensen and Nils Fridthjov Haavardsson. Thank you both so much for many good conversations, helpful instructions and your valuable time. A special thanks to Anders for always keeping your door open and tolerating my "silly" questions, and to Nils Haavardsson for taking the time to meet with me even after your work at DNB.
I would especially like to thank all the people in B800 for a lot of interesting conversations and a friendly environment.
Last but not least, I would like to thank my family and friends for their endless support. A special thanks goes to my elder brother Syed, who has always been my idol. His motivation and guidance were precious to me. In addition, special thanks are dedicated to my parents back home in Dhaka, who keep supporting me and inspiring me to stay the course. I really appreciate it.
Shanjida
May, 2015
1 Introduction
2 Description of The Data
2.1 Descriptive Statistics: Individual Variables
2.1.1 Claim Frequency With Respect to Age Groups
2.1.2 Box Plots for Cottage Age Against Claims Due To Water Damage
2.1.3 Claims and Cellar: Water Damages
2.1.4 Age Groups Of Policy Holders Covering Only Water Claims
2.2 Descriptive Statistics: Association Between Variables
2.2.1 Distance From The Road and Surface Category: Water Claims
2.2.2 Surface Category and Water Distribution: Water Claims
2.2.3 Water and Cottage Age Category: Water Claims
3 Theory for Claim Frequency Modeling
3.1 Generalized Linear Modeling
3.1.1 Poisson Family
3.1.2 Negative Binomial Due to Poisson Over Dispersion
3.1.3 Data Truncation: Complimentary Log Link
3.2 Model Validation and Model Checking Tools
4 Selection of Important Risk Drivers in The Model
4.1 Model Selection Process
4.2 Variable Selection Criteria
4.3 Modeling: Forward Selection
4.3.1 Selection Criterion AIC
4.3.2 Selection Criterion P Value
4.4 Model validation Using Diagnostics Plots
4.4.1 Water Claims
4.4.2 Fire, Theft and Others Claims
4.4.3 Summary Discussion
5 Modeling The Size of The Claims
5.1 Modeling Theory For Claim Sizes
5.2 Analyzing Water Claims
5.2.1 Descriptive Statistics
5.2.2 Modeling Using Different Distributions: Without Covariates
5.2.3 Modeling Using Different Distributions: With Covariates
5.3 Analyzing Claims Due to Fire, Theft and Other Damages
5.3.1 Descriptive Statistics
5.3.2 Modeling Using Different Distribution: Without Covariates
5.3.3 Modeling Using Different Distribution: With Covariates
6 Summary and Conclusion
6.1 Summary Of The Most Important Topics in This Thesis
6.2 Challenges and Further Work
APPENDICES
A Additional Tables For Forward Selection in Chapter 4
A.1 Tables For Steps in Forward Selection: Water Claims
B R codes
B.1 Chapter 2
B.2 Chapter 3
B.3 Chapter 4
B.3.1 AIC: Poisson
B.3.2 P Value: Negative Binomial
B.4 Chapter 5
2.1 Filtration of data
2.2 Claim frequency (due to water damages) against age groups
2.3 Claim frequency for damages due to fire, theft and others
2.4 Box plot of cottage age against number of claims
2.5 Comparison of cellar in claims and no claims groups
2.6 Age category distribution of policy holders
2.7 Distance road and surface category as presented in table 2.9
2.8 Water and surface category bar plot
4.1 Deviance residual plot
4.2 Anscombe residuals for fitted model
4.3 Leverage and Cook's distance
4.4 Deviance residuals plot
4.5 Leverage and Cook's distance for fitted model
5.1 Percentiles plot of the claim sizes (n = 271)
5.2 Density plot for claim sizes (n = 271)
5.3 Density plot for claim size below and over 90% (n = 244 and n = 27 respectively)
5.4 Box plot of alarms regarding claim sizes
5.5 QQ and density plot for log-normal (data below 90%, n = 244)
5.6 QQ and density plot for log-normal (data over 90%, n = 27)
5.7 QQ plot for gamma distribution (n = 244 and n = 27)
5.8 QQ plot for Pareto distribution (n = 271 and n = 27)
5.9 Density plot for damages regarding fire, theft and other
5.10 Percentile plot for damages regarding fire, theft and other (n = 330)
5.11 QQ plot and density plot for log-normal (below 90%)
5.12 QQ plot and density plot for log-normal (over 90%)
5.13 Gamma distribution QQ plot
5.14 QQ plot for Pareto
2.1 Variables to find out different coverage of the policies
2.2 Data set overview
2.3 Variable summary
2.4 Water claims frequency table for different age groups
2.5 Fire, theft and others claims frequency table for different age groups
2.6 Distribution of claims over different age groups of the cottages
2.7 Claim percentage on cellar and no cellar groups
2.8 Distribution of claims over different age groups of policy holders
2.9 Cross-table showing association between surface of the houses, distance from the main road and number of claims
2.10 Cross tabulation for showing the association between the availability of water in the cottages, surface of the cottages and the number of claims: Water damages
2.11 Cross tabulation of association between age of the cottages, the availability of water and number of claims
3.1 Evaluating three different ways of modeling and seeing how they converge
3.2 Number of claims
4.1 Summaries from models
4.4 Summaries from the fitted models on claims covering fire, theft and others elements
4.5 AIC values at different steps of forward selection
4.6 Model summary for negative binomial family
4.7 Comparing different models for water claims
4.8 Comparing different models for fire, theft and other claims
4.9 Comparing results by Hosmer Lemeshow Test
5.2 Summaries from the variable "Tube"
5.3 Summaries from variable "Distance Road"
5.4 AIC values for modeling claim size (below 90%)
5.5 AIC values for modeling claim size under gamma family (over 90%)
5.6 Inverse Gaussian modeling for claim sizes (below 90%)
5.7 Second step in forward selection in claim size modeling (below 90%)
5.8 Percentiles of the claim sizes of fire, theft and other types of damages (n = 330)
A.2 Step Three
A.3 Step Four
A.4 Step Five
A.5 Step Six
Introduction
The most convenient way to protect ourselves against unexpected financial devastation is to insure our valuable possessions, for instance houses, cars, boats and other types of personal property. That is why people buy insurance, and why there are so many insurance companies nowadays. Insurance companies take over the risk on valuable property from us. To control or manage these risks in property insurance, we need to know the factors behind the losses. Although the companies always provide service to their customers, it is also important for them to know how to deal with customers and create a win-win situation for both parties.
Modeling in the insurance industry is mainly divided into two parts: claim frequency and claim size [1]. Most models for claim frequencies are based on Poisson regression, but in special cases the negative binomial family may be used instead.
Knowledge about claim frequencies is not enough for insurance practitioners. They also need to know how the losses are distributed, how extreme they are, and how to work with different categories of loss amounts; for example, there may be some very large losses which need special treatment.
For this report we have a data set describing insurance policies covering around 20000 cabins (the popular Norwegian vacation houses known as "Hytte") during the period August 2011-August 2013. The policies cover four categories: Water, Fire, Theft and a rest category. The data contain descriptions of the number of claims as well as the claim sizes.
While modeling the claims we have focused on individual policies over different time periods, but not on the individuals. The chapters are organized to present different types of analysis.
Chapter Two
In this chapter we describe the actual data file and how we filtered the data set for our analysis. We then go through some descriptive analysis to see whether one or more variables are reasonable to include in a GLM.
Chapter Three
This chapter mainly deals with the background and theory for the modeling used in later chapters. One important question here is whether to add the exposure to the right-hand side of the model as an explanatory variable, or simply to divide the response by the exposure value.
Chapter Four
AIC and the p-value are both selection criteria for variables, in linear as well as non-linear regression models. In this chapter we try to find the best-fitting model using several selection criteria in the model selection process, and we also use several distribution families under GLM.
Chapter Five
For the claim sizes it is difficult to find out how the losses depend on the covariates. We can try several options, for example the log-normal, Inverse Gaussian or Gamma family under GLM. However, these distribution families commonly fail to explain how the covariates are related to the response (the claim sizes), which leads us to consider a model without covariates for the claim sizes.
Chapter Six
This chapter gives a brief summary of the whole work. Although we tried to perform an accurate data analysis, some limitations remain. Those limitations are also discussed in this chapter.
Appendix A: The tables for displaying steps in forward selection procedure.
Appendix B: R Codes
Description of The Data
The main data file contains 415888 rows and 33 columns; the 33 columns are the variables in the data set. In total we have 21923 policies. Only the policies regarding the main building were selected, which means that we discarded the policies for the external parts of the cottages, for instance a garage or any other type of small house within the cottage property.
Further, we have 24 rows (2 × 4 × 3 = 24) of information for each policy holder in most cases. For some of them the number is less than 24, for example because the policy holder withdrew early, or for some other reason.
There are 12 (3 × 4) rows for buildings and 12 (3 × 4) rows for furniture.
Within these rows we have four categories of policies (water, fire, theft and others), with three rows for each of them because we have data for three different time periods: 1 August 2011 to 20 January 2012, 21 January 2012 to 20 January 2013, and 21 January 2013 to 1 August 2013 (table 2.1). We have considered the policies over time in our analysis, that is, the same policy at different time points is included in our data set.
In this data set there is a lot of redundancy: the same information appears in more than one variable. In total we have 33 variables, which can be divided into two parts. Some of them are used to separate and group the data, presented in table 2.1, and the others are for analyzing the risk factors and claim patterns, shown in table 2.3.
Variables For Identifying The Types of Claims
The variable "Cover" is the same as "Cover Name", but coded as 1 or 2. Similarly, the variable "Element" is the coded version of "Element Name", with values 1, 2, 3 and 4 for water, fire, theft and others respectively (Table 2.1). This gives 2 × 4 = 8 rows per time period, and for three time periods 8 × 3 = 24 rows for most of the policies.
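The row arithmetic above can be checked with a minimal Python sketch that enumerates every combination of cover, element and time period for one complete policy. The labels are taken from the text; the variable names are hypothetical and the snippet is only an illustration, not part of the original analysis.

```python
from itertools import product

# Two cover types (coded 1 or 2 in the data) and four elements (coded 1-4).
covers = ["Buildings", "Furniture"]
elements = ["Water", "Fire", "Theft", "Others"]
# The three time periods described in the text.
periods = [
    "2011-08-01 to 2012-01-20",
    "2012-01-21 to 2013-01-20",
    "2013-01-21 to 2013-08-01",
]

# A complete policy has one row per (cover, element, period) combination.
rows_per_policy = list(product(covers, elements, periods))

rows_per_period = len(covers) * len(elements)   # rows within one time period
rows_per_cover = len(elements) * len(periods)   # rows for buildings (or furniture)

print(len(rows_per_policy), rows_per_period, rows_per_cover)
```

Policies that were withdrawn early simply contribute fewer of these combinations, which is why some policies have fewer rows.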
Based on the information about the variables in table 2.1, a sub-file was selected from the large data set, containing the insurance elements for which claims occur more frequently or in higher numbers.
Table 2.1: Variables to find out different coverage of the policies
Variable Name | Possible Values | Type of Variable
ID | Any number | Numerical
Valid From | Date between 01.08.2011 and 31.07.2013 | Numerical
Valid To | Date between 01.08.2011 and 31.07.2013 | Numerical
Cover | 1, 2 | Numerical
Cover Name | Innbo (Household) | Character
Element | 1, 2, 3, 4 | Numerical
Element Name | Vann (Water), Brann (Fire), Tyveri (Theft), Annet (Others) | Character
Number Of Cabins | 0, 1 | Numerical
Here we chose the first three rows of the policies, which means that we consider the policies for the main house and leave out the furniture insurance data. After that, we analyzed the data for damages due to water, fire, theft and others. This leaves 215960 rows, which is 215960/415888 × 100 ≈ 52% of the data.
Within this selection we divided the new file into two parts before modeling. In one we considered the claims under the element name "Water", and in the other the claims under "Fire, Theft and Others" together. The observations in the "Water" selection make up about 1/8 of the whole data set, and those in "Fire, Theft and Others" about 3/8.
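As a rough check of these shares, the sketch below recomputes them in Python from row counts quoted in the thesis. The water row count of 53990 is taken from the total of Table 2.6; treating the remaining building rows as the "Fire, Theft and Others" selection is an assumption of this sketch.

```python
total_rows = 415888      # rows in the main data file
building_rows = 215960   # rows kept for the building cover
water_rows = 53990       # rows in the "Water" selection (total of Table 2.6)

# Assumption: every remaining building row belongs to fire, theft or others.
other_rows = building_rows - water_rows

share_building = building_rows / total_rows
share_water = water_rows / total_rows
share_other = other_rows / total_rows

print(f"buildings: {share_building:.3f}")
print(f"water: {share_water:.3f} (1/8 = {1 / 8:.3f})")
print(f"fire, theft and others: {share_other:.3f} (3/8 = {3 / 8:.3f})")
```

The computed shares are slightly above 1/8 and 3/8, consistent with the text describing them as approximate.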
The first selection has experienced 368 (358 + (2 × 5)) claims: 358 policies had a single claim and 5 others had two claims each. The second selection has reported 462 + (6 × 2) + 4 = 478 claims: 462 policies have only one claim, 6 policies have two, and one policy has four claims. So the data set first looked like table 2.2; later we divided the 4 policy coverages into two groups. "Water" was the largest group among the four, so it was treated as a single group, and the other three coverages were combined into one group for modeling.
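The claim totals follow from multiplying each claim count by the number of policies that reported it. A small Python sketch of that arithmetic, using only the counts quoted in the paragraph above (the function and variable names are hypothetical):

```python
# Number of policies by number of claims, as quoted in the text.
water_policies_by_claims = {1: 358, 2: 5}
other_policies_by_claims = {1: 462, 2: 6, 4: 1}


def total_claims(policies_by_claims):
    """Sum of (claims per policy) * (number of such policies)."""
    return sum(k * n for k, n in policies_by_claims.items())


water_total = total_claims(water_policies_by_claims)
other_total = total_claims(other_policies_by_claims)
print(water_total, other_total)
```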
Table 2.2: Data set overview
Cover Name | Element Name | Time Variables
Buildings | Water | 01 August 2011-20 January 2012
Buildings | Water | 21 January 2012-20 January 2013
Buildings | Water | 21 January 2013-01 August 2013
Buildings | Fire | 01 August 2011-20 January 2012
Buildings | Fire | 21 January 2012-20 January 2013
Buildings | Fire | 21 January 2013-01 August 2013
Buildings | Theft | 01 August 2011-20 January 2012
Buildings | Theft | 21 January 2012-20 January 2013
Buildings | Theft | 21 January 2013-01 August 2013
Buildings | Others | 01 August 2011-20 January 2012
Buildings | Others | 21 January 2012-20 January 2013
Buildings | Others | 21 January 2013-01 August 2013
The whole data set has experienced 1013 claims for "Buildings" and "Furniture" together. As mentioned in the previous paragraph, there were 368 and 478 claims in total for the two categories of claims under the "Buildings" coverage, and a further 181 claims were reported for the "Furniture" coverage.
As figure 2.1 shows, we fitted one model for the water claims only and another for the combined claims due to fire, theft and others.
Figure 2.1: Filtration of data
Variables for Modeling The Claims
These variables may be considered the actual covariates. "Customer Age" and "Cottage Age" tell us how old the policy holders and the cottages are. There is also information about the availability of fire alarms, theft alarms, water stop and cellar. "Fire Alarm" and "Theft Alarm" have three categories: no alarm; a local alarm, meaning that the cabin itself has an alarm against fire or theft; and an alarm that is monitored from an alarm station. Losses can vary with the distance of the cabin from the road, so we have a variable called "Distance Road", which gives information about the distance of the cabin from the road. More information about the variables is provided in table 2.3.
Since we have data from August 2011 to August 2013, the whole period has been divided into three sub-periods, with information about each on a separate row. Thus we have three rows for each "Element Name". One important variable is "Exposure Days", which gives the number of days a policy holder held the policy. The variable "Number of Cabins" takes the value 1 once every 24 rows (with some exceptions), which indicates that a new policy holder has entered the data set.
Table 2.3: Variable summary
Variable Name | Possible Values | Type of Variable | Number of Missing Observations | Proportion of Missing Observations in the Data
Customer Age | any number | numerical | 696 | 0.167%
Customer Region | 01-21 | character | 0 | 0%
Customer District | 0100-2100 | character | 0 | 0%
Cottage Age | any number | numerical | 0 | 0%
Insurance Sum | any number | numerical | 0 | 0%
Surface | any number | numerical | 0 | 0%
Cellar | J (Yes), N (No) | character | 0 | 0%
Fire Alarm | 1 (no alarm), 2 (local alert), 3 (alert from alarm station) | character | 69628 | 16.7%
Theft Alarm | 1 (no alarm), 2 (local alert), 3 (alert from alarm station) | character | 69628 | 16.7%
Distance Road | 1 (over 300 meters), 2 (under 300 meters) | character | 69628 | 16.7%
Water | 1 (not included), 2 (included) | character | 69628 | 16.7%
WaterStop | J (Yes), N (No) | character | 0 | 0%
Tube | J (Yes), N (No) | character | 0 | 0%
Cottage Region | 01-21 | character | 0 | 0%
Cottage District | 0100-2100 | character | 0 | 0%
Excess | 1-15 | numerical | 0 | 0%
Annualized Premium | any number | numerical | 0 | 0%
Claim Per Exposure | any number | numerical | 414875 | 99.7%
Claim District | 0100-2100 | character | 414875 | 99.7%
Number Of Elements | - | numerical | 0 | 0%
Policy | a number identifying the policies, similar to the variable ID | character | 0 | 0%
Number of Claim Events | - | numerical | 0 | 0%
Earned Premium | - | numerical | 0 | 0%
Exposure Days | 0-365 days | numerical | 0 | 0%
The variables in table 2.3 were first examined through their descriptive statistics, and on that basis considered for inclusion in the GLM.
The variable "Excess" in table 2.3 describes the self risk (deductible) of the policy holder, which is deducted from the company's payment if any damage occurs. So if a person claims a loss of 9000 kroner and the deductible is 1000 kroner, that person will be paid 8000 kroner.
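The deductible example can be written as a one-line rule. The Python sketch below is only an illustration; the function name is hypothetical, and flooring the payout at zero when the loss is smaller than the excess is an assumption that the text does not state.

```python
def payout(claimed_loss, excess):
    """Amount paid by the insurer after subtracting the policy holder's excess.

    Assumption of this sketch: the payout is never negative, so a loss
    smaller than the excess results in no payment.
    """
    return max(claimed_loss - excess, 0)


# The example from the text: a loss of 9000 kroner with an excess of 1000 kroner.
print(payout(9000, 1000))
```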
There are some missing values in the data set, which fall into two categories. First, some are ordinary missing values that occur for some random reason. Second, the variable "Claim Per Exposure" has 414875 missing values in the main data file. The reason is simple: this variable only contains information when a claim or loss occurs; otherwise its value is 0 or NA. The same holds for the variable "Claim District".
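The missing-value proportions in Table 2.3 are the missing counts divided by the 415888 rows of the main data file. A minimal Python sketch of that calculation for three of the variables (the dictionary names are hypothetical):

```python
total_rows = 415888  # rows in the main data file

# Missing counts for three variables, as listed in Table 2.3.
missing_counts = {
    "Customer Age": 696,
    "Fire Alarm": 69628,
    "Claim Per Exposure": 414875,
}

# Percentage of rows in which each variable is missing.
missing_pct = {name: 100 * n / total_rows for name, n in missing_counts.items()}

# Rows in which "Claim Per Exposure" is actually recorded.
rows_with_value = total_rows - missing_counts["Claim Per Exposure"]

for name, pct in missing_pct.items():
    print(f"{name}: {pct:.3f}% missing")
print("rows with a recorded claim value:", rows_with_value)
```

The last figure shows how few rows carry claim information compared with the size of the file.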
2.1 Descriptive Statistics: Individual Variables
Our main aim is to model the claim frequencies and learn how different variables affect the occurrence of claims. Before modeling, we need a general overview of the variables and their descriptive statistics. This gives closer insight into the data.
2.1.1 Claim Frequency With Respect to Age Groups
We have grouped the data set based on the age of the cottages and then plotted the claim frequency (in percent) against the age groups, where

$$\text{Claim frequency}_i = \frac{N_i}{T_i} \times 100,$$

with $T_i$ the number of days exposed in group $i$ and $N_i$ the number of claims in group $i$.
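To make the definition concrete, here is a minimal Python sketch that sums claims and exposure days within each age group and then applies the formula. The rows are invented purely for illustration and are not taken from the thesis data; the function and variable names are hypothetical.

```python
def claim_frequency(n_claims, exposure_days):
    """Claim frequency in percent: claims per exposure day, times 100."""
    if exposure_days <= 0:
        raise ValueError("exposure_days must be positive")
    return 100.0 * n_claims / exposure_days


# Hypothetical rows: (cottage age group, number of claims, exposure days).
rows = [
    ("0-10", 0, 172),
    ("0-10", 1, 365),
    ("11-20", 0, 365),
    ("11-20", 0, 193),
    ("11-20", 1, 365),
]

# Accumulate N_i (claims) and T_i (exposure days) per group.
claims, exposure = {}, {}
for group, n, days in rows:
    claims[group] = claims.get(group, 0) + n
    exposure[group] = exposure.get(group, 0) + days

frequency = {g: claim_frequency(claims[g], exposure[g]) for g in claims}
for group, value in frequency.items():
    print(group, round(value, 4))
```

Because the denominator is measured in days, the resulting percentages are small numbers, which matches the order of magnitude seen in the frequency tables.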
Figure 2.2: Claim frequency (due to water damages) against age groups (x-axis: cottage age groups 1 to 7; y-axis: claim frequencies)
We also have the table below presenting the numerical values.
Table 2.4: Water claims frequency table for different age groups
Age group | Claim frequency (%)
0-10 | 0.004096738
11-20 | 0.003581013
21-30 | 0.0036776147
31-40 | 0.002768758
41-50 | 0.001968600
51-60 | 0.003238116
60+ | 0.002249520
The previous plot and table covered only claims due to water damages. We also have the other part of the data set, with claims due to fire, theft and other reasons. The claim frequencies for that group are shown in the plot below.
Figure 2.3: Claim frequency for damages due to fire, theft and others (x-axis: cottage age groups 1 to 7; y-axis: claim frequencies)
Table 2.5: Fire, theft and others claims frequency table for different age groups
Age group | Claim frequency (%)
0-10 | 0.001459295
11-20 | 0.001308447
21-30 | 0.001225871
31-40 | 0.001490870
41-50 | 0.001100100
51-60 | 0.001248024
60+ | 0.001326640
The claim frequency is very low in this data set. Moreover, there are some fluctuations across the age groups in the frequency tables, which gives an idea of the effect of "Cottage Age" on the occurrence of claims.
2.1.2 Box Plots for Cottage Age Against Claims Due To Water Damage
The box plot of the age of the cottages against the number of claims is shown below, together with the claim counts in each group:
Figure 2.4: Box plot of cottage age against number of claims (x-axis: number of claims, where 0 = no claim, 1 = one claim (n = 358), 2 = two claims (n = 5); y-axis: cottage age, 0 to 100)
From the box plot it is clear that the median age of the cottages with a claim is about 30, though there are some outliers. It is also visible that some of the old cottages do not have a claim. The third box consists of only five observations, since only 5 policies have two claims; those policies have cottage ages close to 40 years.
We have also categorized the variable "Cottage Age"; the number of claims per age category of the cottages is shown in the table below:
Table 2.6: Distribution of claims over different age groups of the cottages
Age Groups of Cottages | 0-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 60+ | Total
No Claims | 11249 | 6514 | 5314 | 10378 | 7767 | 4482 | 7923 | 53627
One Claim | 100 | 50 | 44 | 61 | 32 | 32 | 39 | 358
More Than One Claim | 1 | 1 | 0 | 2 | 1 | 0 | 0 | 5
Total | 11350 | 6565 | 5358 | 10441 | 7800 | 4514 | 7962 | 53990
Proportion (%) | 0.89 | 0.78 | 0.82 | 0.60 | 0.42 | 0.71 | 0.49 | 0.66
From the table we see that the cottages aged 0-10, 11-20 and 31-40 years have the highest numbers of claims. However, the data set also contains large numbers of cottages in those age groups.
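Because the age groups differ in size, the proportion row of Table 2.6 is more informative than the raw counts. The Python sketch below recomputes that row from the counts in the table, taking the proportion to be the share of rows with at least one claim; this reading of the proportion row is an assumption, although it reproduces the per-group values shown.

```python
age_groups = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "60+"]
no_claims = [11249, 6514, 5314, 10378, 7767, 4482, 7923]
one_claim = [100, 50, 44, 61, 32, 32, 39]
multi_claim = [1, 1, 0, 2, 1, 0, 0]

# Percentage of rows in each age group with at least one claim.
proportion_pct = {}
for group, n0, n1, n2 in zip(age_groups, no_claims, one_claim, multi_claim):
    group_total = n0 + n1 + n2
    proportion_pct[group] = round(100 * (n1 + n2) / group_total, 2)

for group, pct in proportion_pct.items():
    print(group, pct)
```

Seen this way, the 31-40 group has many claims mainly because it contains many cottages, while its proportion is below that of the three youngest groups.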
2.1.3 Claims and Cellar: Water Damages
We can compare this variable between the claim and no-claim groups. The table of proportions is:
Table 2.7: Claim percentage on cellar and no cellar groups
Cellar Categories | Cellar | No Cellar
Claims | 0.006056 | 0.006867
No Claims | 0.99394 | 0.9932
Total | 1 (approx) | 1 (approx)
We have the graph below to understand the scenario easily.
Figure 2.5: Comparison of cellar in claims and no claims groups (two bar plots of frequency by cellar category, J = included, N = not included: one for the claims group and one for the no claims group)
The visual impression corresponds well to the proportions, which are fairly equal. This may indicate that the variable is not so important when modeling the water claim frequencies.
2.1.4 Age Groups Of Policy Holders Covering Only Water Claims
Similar to the age of the cottages, the age of the policy holders is also important. Customers in different age groups have different personalities, and that can affect the occurrence of damage. For example, a young man may take more care of his belongings than a middle-aged or old person. We can explore this idea by looking at the table which displays the distribution of claims over different age groups.
Figure 2.6: Age category distribution of policy holders (histogram of policy holders' age; x-axis: policy holders' age, 20 to 100; y-axis: frequency, 0 to 8000)
The categorization was made so that all groups have nearly equal numbers of claims. We can see that most of the policy holders are in the age group 46-60 years, that is, most of them are roughly between fifty and sixty years old. The second largest age group is 60+, and the smallest is the group 16-30 years. People below 16 years rarely own a cottage, and the same holds for those between 16 and 30. Moreover, when we plot the actual ages of the customers instead of the groups, the distribution looks approximately normal. The counts for the different groups are given in table 2.8.
Table 2.8: Distribution of claims over different age groups of policy holders
Age Groups of Policy Holders | 16-30 | 31-45 | 46-60 | 60+ | Total
Number of Claims Per Group | 5 | 88 | 165 | 110 | 368
2.2 Descriptive Statistics: Association Between Variables
Alongside analyzing each individual variable, we have also looked at two or more variables together with the number of claims. Here we have considered the variables "Surface", "Distance Road" and "Water".
As in the previous section, we have categorized the variable "Surface" to understand differences between cottages of different sizes.
2.2.1 Distance From The Road and Surface Category: Water Claims
Here we look at the surface categories by distance from the road and number of claims:
Bar Plot of Surface and Distance Road (x-axis: surface category, 1 to 6; y-axis: counts of the policies, 0 to 15000; legend: distance over 300 meter, distance below 300 meter)