Investigating the relation between climatic variables and springflood using statistical analysis

Odin Midbrød Sanner

Master thesis in hydrology at the Department of Geosciences, the Faculty of Mathematics and Natural Sciences

60 ECTS credits

University of Oslo

01.06.2021

Title: Investigating the relationship between climatic variables and springflood using statistical analysis

Author: Odin Midbrød Sanner

Supervisors: Chong-Yu Xu (UIO) and Kolbjørn Engeland (UIO/NVE)

http://www.duo.uio.no/

Print production: Reprosentralen, University of Oslo

Summary

In Norway, the threat of geohazards is high, and one geohazard that causes great damage every year is the springflood. As the climate changes, significant changes are expected in the trend, variability and seasonality of precipitation, as well as in its snow/rain ratio. In combination with increased temperature, these changes will alter the hydrological regime in Norway and strongly affect the yearly springfloods. The ability to foresee and forecast the yearly floods relies on our understanding of the hydrological regime and the hydroclimatic drivers of the floods.

In this study, the climatic drivers of the yearly springflood have been investigated with regard to their relationship with, and effect on, the flood. Twenty-five catchments in central and eastern Norway have been studied to identify, and if possible quantify, how the interaction between the climatic variables contributes to controlling the yearly springflood.

Two main methods were used to carry out this study. Correlation analysis was used to identify the relationship between the climatic variables and the flood characteristics. A multiple linear regression approach was used to better understand the interaction between the climatic variables and the yearly springflood. With this approach, regression models were developed for predicting future springflood events with the climatic variables as input. The results suggest that trends in the flood characteristics of the springfloods are highly dependent on the climatic drivers, the catchment properties and the flood regime to which the catchments belong. The regression models developed to predict the timing of the springflood depended mostly on the snow cover in the catchments, the groundwater table and the temperature prior to the springflood. The regression models developed to predict the peak and volume (sum of the maximum seven-day discharge) of the springflood depended mostly on the amount of snow, the number of frost days, the snow cover, the snow-water equivalent and the groundwater table in the catchments. When the performance of the regression models was evaluated, the models developed to predict the timing and volume of the springflood gave the best results.

The trends in how the twenty-five studied catchments are affected by changes in precipitation patterns and in the snow/rain ratio due to climate change were also investigated. Negative trends in the peak and volume of the springflood were found more often than positive trends. This can be explained by a shift in flood generating processes, as the snow/rain ratio of precipitation in Norway is expected to shift. The results of this thesis suggest that the role of snowmelt as a flood generating factor for springfloods tends to decrease, with rain becoming a more dominant factor in generating the yearly springflood.

Foreword

This master thesis is the result of the work done to complete the master program in hydrology at the Department of Geosciences at the University of Oslo.

Norway is a country faced with many geohazards, and floods, specifically springfloods, are among the most destructive. One of the available master topics this year was to investigate which climatic factors determine the characteristics of the springflood. This immediately caught my eye as something I wanted to explore. Understanding how these climatic variables interplay with the yearly springflood is important for building models that improve seasonal forecasting of the yearly springfloods. I want to contribute to this work with this thesis, and the work has been very exciting and highly educational.

I would like to thank my supervisors Chong-Yu Xu and Kolbjørn Engeland for their immense help through the entire process from start to finish. They have contributed professional support and good advice whenever needed, which has helped me a lot in my progress.

Table of contents

Summary
Foreword
1.0 Introduction
1.1 Motivation
1.2 Background
1.3 Objective
1.4 Study Design
2.0 Study Area & Data
2.1 Climate in Norway
2.2 Flood season and flood generating processes
2.3 Study catchments
2.4 Hydrological and climatic data
2.5 Catchment properties
2.6 Future scenarios
3.0 Methodology
3.1 Method selection
3.2 Correlation analysis
3.2.1 Pearson correlation coefficient
3.2.2 Spearman correlation
3.2.3 Kendall correlation coefficient
3.3 Regression Analysis
3.3.1 Multiple Regression
3.3.2 Model selection
3.3.3 Local regression model
3.3.4 Regional regression model
3.4 Predicting future flood characteristics
3.5 Evaluation Criteria
3.5.1 Evaluation of model assumptions
3.5.2 Evaluation of model fit
3.5.3 Evaluation of model predictions
3.6 R-studio
4.0 Results
4.1 Correlation Analysis
4.2 Regression analysis
4.2.1 Model selection
4.2.2 Evaluation criteria of regression
4.2.3 Station regression
4.2.4 Regional regression
4.3 Future scenarios
5.0 Discussion
5.1 Evaluation results for three stations
5.2 Predictions
5.3 Effect of future emission scenarios
5.4 Limitations
5.5 Implications
6.0 Conclusion
7.0 Reference
7.1 Literature references
7.2 Figure list
8.0 Appendix
Appendix: Diagnostic plots

Table of figures

FIGURE 1: FLOWCHART DESCRIBING THE STRUCTURE OF THE THESIS
FIGURE 2: MAP OF HOW THE DIFFERENT CATCHMENTS ARE CATEGORIZED (STENIUS ET AL., 2014)
FIGURE 3: THE STUDIED CATCHMENTS PLOTTED ON A MAP OF NORWAY, WITH EACH STATION'S FLOODROSE
FIGURE 4: ILLUSTRATION OF DIFFERENT DEGREES OF CORRELATION (LONG, 2007)
FIGURE 5: HOW DATA POINTS COULD BE DISTRIBUTED AROUND THE REGRESSION LINE. THE LEFT GRAPH SHOWS A POOR R², WHILE THE GRAPH TO THE RIGHT SHOWS A HIGH R² (FROST, 12.04.21)
FIGURE 6: THE FLOW CHART OF HOW A 5K-FOLD CROSS VALIDATION IS CARRIED OUT (DATA DRIVEN INVESTOR, 2018)
FIGURE 7: HEATMAP OF THE CORRELATION COEFFICIENTS BETWEEN THE CLIMATIC VARIABLES AND THE TIMING OF THE SPRINGFLOOD
FIGURE 8: HEATMAP OF THE CORRELATION COEFFICIENTS BETWEEN THE CLIMATIC VARIABLES AND THE PEAK OF THE SPRINGFLOOD
FIGURE 9: HEATMAP OF THE CORRELATION COEFFICIENTS BETWEEN THE CLIMATIC VARIABLES AND THE PEAK OF THE SPRINGFLOOD
FIGURE 10: SELECTED REGRESSORS FOR THE REGRESSION MODELS THAT PREDICT THE TIMING OF THE SPRINGFLOOD
FIGURE 11: SELECTED REGRESSORS FOR THE REGRESSION MODELS THAT PREDICT THE PEAK OF THE SPRINGFLOOD
FIGURE 12: SELECTED REGRESSORS FOR THE REGRESSION MODELS THAT PREDICT THE VOLUME OF THE SPRINGFLOOD
FIGURE 13: EVALUATION CRITERIA OF THE TIMING MODELS
FIGURE 14: EVALUATION CRITERIA OF THE PEAK MODELS
FIGURE 15: EVALUATION CRITERIA OF THE VOLUME MODELS
FIGURE 16: FITTED VS RESIDUALS PLOT FOR ATNASJØ
FIGURE 17: FITTED VS RESIDUALS PLOT FOR LENGLINGEN
FIGURE 18: FITTED VS RESIDUALS PLOT FOR KRINGLERDAL
FIGURE 19: QQ-PLOT OF STANDARDIZED RESIDUALS FOR ATNASJØ
FIGURE 20: QQ-PLOT OF STANDARDIZED RESIDUALS FOR LENGLINGEN
FIGURE 21: QQ-PLOT OF STANDARDIZED RESIDUALS FOR KRINGLERDAL
FIGURE 22: HISTOGRAM AND CDF OF RESIDUALS FOR ATNASJØ
FIGURE 23: HISTOGRAM AND CDF OF RESIDUALS FOR KRINGLERDAL
FIGURE 24: HISTOGRAM AND CDF OF RESIDUALS FOR LENGLINGEN
FIGURE 25: PREDICTION INTERVAL FOR KRINGLERDAL WITH CORRELATION COEFFICIENT OF 0.86
FIGURE 26: PREDICTION INTERVAL FOR BRUSTUEN WITH CORRELATION COEFFICIENT OF 0.40
FIGURE 27: PREDICTION INTERVAL FOR HALLEDALSVATN WITH CORRELATION COEFFICIENT OF 0.73
FIGURE 29: PREDICTION INTERVAL FOR TANNSVATN WITH CORRELATION COEFFICIENT OF 0.31
FIGURE 30: PREDICTION INTERVAL FOR ETNA WITH CORRELATION COEFFICIENT OF 0.52
FIGURE 32: PREDICTION INTERVAL FOR TANNSVATN WITH CORRELATION COEFFICIENT OF 0.44
FIGURE 33: PREDICTION INTERVAL FOR ETNA WITH CORRELATION COEFFICIENT OF 0.73
FIGURE 34: CORRELATION BETWEEN PREDICTED AND OBSERVED TIMING OF THE SPRINGFLOOD
FIGURE 35: CORRELATION BETWEEN PREDICTED AND OBSERVED PEAK OF THE SPRINGFLOOD
FIGURE 36: CORRELATION BETWEEN PREDICTED AND OBSERVED VOLUME OF THE SPRINGFLOOD
FIGURE 37: PREDICTION INTERVAL FOR THE MEAN MODEL WITH CORRELATION COEFFICIENT OF 0.99
FIGURE 38: PREDICTION INTERVAL FOR THE TYPICAL EVENT OF 1995 WITH CORRELATION COEFFICIENT OF 0.95
FIGURE 46: RELATIVE CHANGE IN PEAK VALUE FOR SCENARIO RCP45
FIGURE 47: RELATIVE CHANGE IN PEAK VALUE FOR SCENARIO RCP85
FIGURE 48: RELATIVE CHANGE IN VOLUME FOR SCENARIO RCP45
FIGURE 49: RELATIVE CHANGE IN VOLUME FOR SCENARIO RCP85
FIGURE 50: BOXPLOT ILLUSTRATIONS OF CHANGES IN PRECIPITATION, SNOW AND RAIN AT NEDRE SJØDALSVATN FOR THE DIFFERENT FUTURE EMISSION SCENARIOS. THE STATION WITH THE LARGEST INCREASE IN PEAK SPRINGFLOOD.
FIGURE 51: BOXPLOT ILLUSTRATIONS OF CHANGES IN PRECIPITATION, SNOW AND RAIN AT AULESTAD FOR THE DIFFERENT FUTURE EMISSION SCENARIOS. THE STATION WITH THE LARGEST DECREASE IN PEAK SPRINGFLOOD.
FIGURE 52: BOXPLOT ILLUSTRATIONS OF CHANGES IN PRECIPITATION, SNOW AND RAIN AT ORSJOREN FOR THE DIFFERENT FUTURE EMISSION SCENARIOS. THE STATION WITH THE LARGEST INCREASE IN THE VOLUME OF THE SPRINGFLOOD.
FIGURE 53: BOXPLOT ILLUSTRATIONS OF CHANGES IN PRECIPITATION, SNOW AND RAIN AT KNAPPOM FOR THE DIFFERENT FUTURE EMISSION SCENARIOS. THE STATION WITH THE LARGEST INCREASE IN THE VOLUME OF THE SPRINGFLOOD.

Table of tables

TABLE 1: FLOOD SEASONS (STENIUS ET AL., 2014)
TABLE 2: CATCHMENT STATIONS, AVAILABLE DATA AND CATCHMENT PROPERTIES
TABLE 3: RESULT FROM THE KOLMOGOROV-SMIRNOV TEST
TABLE 4: CORRELATION OF OBSERVED AND PREDICTED VALUES FOR THE REGIONAL MODELS
TABLE 5: THE NUMBER OF STATIONS IN THIS STUDY THAT HAVE A CORRELATION COEFFICIENT BETWEEN PREDICTED AND OBSERVED VALUES OF ABOVE 0.5

1.0 Introduction

1.1 Motivation

Due to a changing climate, robust climate models are of high importance. Climate change studies indicate that a changing climate can be expected to have a large impact on the hydrology in Norway.

As the temperature rises, the yearly precipitation will increase and the snow/rain ratio will shift, which in turn will affect the seasonal floods in Norway. Rain-dominated floods are predicted to increase and snowmelt-dominated floods to decrease (Hanssen-Bauer et al., 2015).

How different climatic variables interact and work to control the springflood is yet to be fully understood. The intention of this master thesis is to help fill this knowledge gap.

This will be done by improving the understanding of how different climatic variables interact and contribute to controlling or accelerating the yearly springflood. Regression models will be developed on the basis of how, and to what degree, the climatic variables affect the springflood. The main motivation for this master thesis is that the regression models should be able to predict future springfloods with high accuracy. Improved knowledge on this topic, together with good and complex hydrological models, will in turn help to reduce the impact of springfloods in the future.

In Norway, the threat of geohazards is high, and floods are a common geohazard that initiates other geohazards such as landslides and rockslides. Floods are a natural hazard that causes great destruction and losses for society. One example is “Vesleofsen” in 1995, a 200–300-year flood that occurred in Eastern Norway. It followed a cold winter and spring with little snowmelt and approximately 40% more snow in the catchments. Prior to the springflood there was a rapid increase in temperature in combination with heavy precipitation. Vesleofsen caused extensive destruction, cost 1.8 billion kroner and resulted in one fatality (Hella, 2008). This single event shows the importance of understanding floods and how the climatic variables and the flood interact. For this reason, and because floods are predicted to increase in magnitude and frequency, it is important to be able to predict flood characteristics accurately. Complex climate and hydrological models are needed to prepare for the future and for the challenges that come with a changing climate. Another reason why this is important is that the models can be extrapolated for use in ungauged catchments, that is, catchments where no discharge data are available.

1.2 Background

Springfloods in many regions of Norway are dominated by snowmelt, but as the climate warms, the dominant contribution is expected to shift from snowmelt towards rain. A warmer climate will have a large impact on the hydrological regime, on both a local and a global scale. For instance, it will change the precipitation pattern, shorten the winter season and reduce the snow cover in the mountains (Hanssen-Bauer et al., 2015). Changes in precipitation seasonality will affect the climatic variables that contribute to and control the springflood. Once it has been identified how the climatic variables affect the characteristics of the springflood, they can be used to build predictive models that improve seasonal forecasting and climate projections. Such models will in turn help to reduce the impact of floods in the years and centuries to come.

The role of snowmelt and rainfall as flood generating processes is addressed in several studies. Vormoor et al. (2015) investigated observed trends in floods in Norway, linked these trends to flood generating processes (rain and snowmelt contributions to floods), and found that negative trends can be linked to snowmelt floods. Investigating climate change impacts on floods, Lawrence (2020) identified a variable response of flood magnitudes in snow-dominated catchments. The study suggests that this reflects the competing effects of precipitation and temperature, both of which are projected to increase during the winter season throughout Norway.

Rosenberg et al. (2011) investigated seasonal streamflow forecasting with statistical methods, using SWE (snow-water equivalent) and precipitation as independent variables. The study uses a hybrid approach that combines physically based predictors with statistically based predictors, and it gave good predictions. It concludes that this approach adds value to statistical forecasting and should be explored further. The study focuses only on SWE as a climatic factor, and it notes that the model could be improved by adding more climatic variables to make it more complex.

The background presented above identifies a knowledge gap in how changes in temperature and precipitation affect the springflood. This master thesis attempts to fill that gap.

Building robust forecasting models requires a better understanding of which climatic variables contribute to controlling or accelerating the springflood, and of how they do so. Once this is understood, a complex and robust model that predicts future springfloods accurately will help in many aspects of society. Models that predict floods accurately will help NVE in forecasting future floods and will help hydropower companies to foresee how much hydropower they can produce each year. Last but not least, accurate models will help to reduce the immense damage inflicted every year by floods in Norway.

1.3 Objective

The main objective of this master thesis is to identify, and if possible, quantify, how the interaction between the climatic variables contributes to controlling the springflood. The main questions to be answered in this thesis are:

(i) What generates the springflood?

(ii) Which of the climatic variables contribute the most to the different springflood characteristics?

(iii) How accurately can the developed regression models predict springflood characteristics? And do the results vary depending on which springflood characteristic is being predicted?

(iv) How will a changing climate affect the yearly springfloods and the hydrological regimes in Norway?

1.4 Study Design

The initial work for this study was to select which catchments to include and which climatic variables to explore. As the main objective was to identify, and if possible quantify, how the interaction between the climatic variables contributes to controlling the yearly springflood, twenty-five discharge stations in central and eastern Norway were selected based on their hydrological regime. The climatic variables were also selected at this initial stage of the study. The selected climatic variables are:

• temperature

• accumulated snow

• number of frost days

• snow coverage

• snow-water equivalent

• groundwater table

• soil moisture deficit

The next stage of the study was to investigate the correlation between the climatic variables and the springflood characteristics. The springflood characteristics that have been investigated in this study are:

• The timing of the springflood

• The peak of the springflood

• The volume of the springflood

These are all important springflood characteristics for several different industries and government departments in Norway, such as NVE which works with flood prediction, the hydropower industry and local communities. The catchments that have been selected to be included in this thesis have springfloods that are dominated by snowmelt and have high runoff during the spring season due to snowmelt.

The results of the correlation analysis were then used in the development of the regression models. Correlation analysis was used to check whether the association between the climatic variables and the flood characteristics is useful and significant for predicting future springflood characteristics. During the development of the regression models, model diagnostics were performed in order to build the best and most robust regression model possible for each catchment. Once the regression models for each catchment and the regional regression models were built, they could be used to predict the flood characteristics from the climatic variables. The climatic variables from two different emission scenarios were used as input to the regression models to make predictions; these results are discussed in detail in the discussion chapter. Together, these stages helped to understand which climatic variables control and/or accelerate the yearly springflood, and to build robust and accurate regression models that can predict future flood characteristics from climatic variables. The structure of the work is shown in the flow chart in Figure 1.

Figure 1: Flowchart describing the structure of the thesis. Stages shown: Catchment selection, Correlation analysis, Station regression, Regional regression, Prediction, Future scenarios.
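The correlation, regression and prediction stages of this workflow can be illustrated with a minimal Python sketch. It is not the code used in this thesis: the function names, the two example predictors and all numbers are invented for illustration, and the regression is solved with plain normal equations rather than with a statistics package.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_mlr(rows, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp (normal equations)."""
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(xr[i] * xr[j] for xr in X) for j in range(p)]
         + [sum(xr[i] * yk for xr, yk in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coefs, row):
    """Evaluate the fitted model for one set of predictor values."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))


# Invented toy data: (snow-water equivalent, temperature) per year -> flood peak
predictors = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
peak = [2.0 + 3.0 * swe - 1.0 * temp for swe, temp in predictors]

r_swe = pearson([swe for swe, _ in predictors], peak)   # correlation stage
coefs = fit_mlr(predictors, peak)                        # regression stage
scenario_peak = predict(coefs, (6.0, 4.0))               # prediction stage
```

Because the toy response is constructed as an exact linear function of the two predictors, the fitted coefficients recover that function; real flood data would of course leave residuals that need the diagnostics described above.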

2.0 Study Area & Data

In this chapter the selected catchments, their climate and the data used to make the regression models are presented.

2.1 Climate in Norway

Norway is a Nordic country. It is located at the same latitude as Greenland, Siberia and Alaska, but its climate is much milder because of its location by the Atlantic Ocean, where the relatively warm ocean current known as the Gulf Stream brings warm water from lower latitudes and contributes to a more stable climate.

The climate varies widely and is controlled by, among other things, distance from the coast, elevation and latitude. The northern part of Norway, which lies within the polar circle, experiences long periods with little or no solar energy in winter and midnight sun in the summer months. The coastal area of Norway has a temperate rainy climate with mild winters (C-climate according to Köppen's climate classification), and the eastern part of Norway has a cold-temperate climate (D-climate according to Köppen's climate classification) with seasonal snow cover. The high relief, with deep valleys and high mountains, gives Norway large local variations in climate over short distances.

The temperature variation in the country reflects these topographic and geographic factors. Along the coast the temperatures are mild all year round, with an annual average of around 8 °C, while in the northern parts and in the mountains the annual average air temperature is below freezing. The yearly variation is highest inland, at 25 to 30 degrees, while the coastal areas have variations of 10 to 15 degrees.

Norway can be categorized as a wet country, with relatively high precipitation and comparatively little evapotranspiration. This is closely connected with the wind systems and the mountainous topography. The dominant kind of precipitation in Norway is orographic precipitation, which is created when saturated air moves inland from the ocean and meets the mountains. The air rises and cools, and the moisture is released as precipitation. This type of precipitation is very common in the coastal areas, which have an annual precipitation of 2000 to 3000 mm, while the rain shadow behind the mountains can receive as little as 300 mm of precipitation per year (Dannevig et al., 2021).

Hanssen-Bauer et al. (2015) investigated how the climate in Norway is expected to change until the end of the century, based on two emission scenarios. These two emission scenarios are also used in this study when the developed regression models are used to predict future flood characteristics. The report concluded that the climate in Norway is expected to be strongly affected if emissions continue to rise. The main findings of the report were:

• Yearly temperature will rise by 4.5 °C (interval of 3.3 to 6.4 °C)

• Yearly precipitation will rise by 18% (interval of 7 to 23%)

• Extreme precipitation events will occur more frequently and be more powerful

• Rain floods will get bigger and occur more frequently

• Snowmelt floods will get smaller and occur less frequently

• In low-lying areas the snow will almost disappear, while mountainous areas will get more snow

• Glaciers will melt more

• Sea levels will rise by between 15 and 55 cm, depending on the location

However, if global greenhouse gas emissions are significantly reduced, the expected changes will also be greatly reduced.

2.2 Flood season and flood generating processes

The Norwegian climate is the main factor determining the character of the floods. In Norway, the largest floods are the spring and summer floods, which are dominated by snowmelt in combination with rain, and the autumn floods, which are dominated by rain. In the study by Stenius et al. (2014), the flood regimes in Norway are categorized using two approaches. In the first approach, floods are classified by the season in which the largest flood events occur; for example, if the largest flood event occurs in April/May, it is assigned to the spring season. The second approach uses the main flood generating process, for example whether the flood is dominated by snowmelt. This means that if the maximum flood occurs in April/May and is snowmelt dominated, it is characterized as a springflood.

In the study of Stenius et al. (2014), measuring stations around Norway were characterized by flood season and dominant contributing factor. Flood season was defined as the part of the year in which the largest flood occurred, and dominant factor as the meteorological factor that contributed most to the largest flood. The largest yearly floods were first assigned to months and seasons, as shown in Table 1.

Table 1: Flood seasons (Stenius et al., 2014)

Season | Spring | Summer | Autumn | Winter
Month | April-June | July-August | September-November | December-March
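As an illustration only, the month-to-season mapping of Table 1 can be written as a small Python lookup. The function names are hypothetical, and taking the most frequent season of the annual maxima as the flood season of a station is an assumption made for this sketch, not the procedure of Stenius et al. (2014).

```python
from collections import Counter

# Month number -> flood season, following Table 1
SEASON_BY_MONTH = {
    4: "Spring", 5: "Spring", 6: "Spring",
    7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
    12: "Winter", 1: "Winter", 2: "Winter", 3: "Winter",
}


def flood_season(month):
    """Return the flood season for a calendar month (1-12)."""
    return SEASON_BY_MONTH[month]


def dominant_season(annual_max_months):
    """Most frequent season among the months of the annual maximum floods.

    Assumption for this sketch: the station's flood season is the season
    in which most annual maxima occur.
    """
    counts = Counter(flood_season(m) for m in annual_max_months)
    return counts.most_common(1)[0][0]
```

For a station whose annual maxima fall mostly in May and June, `dominant_season` returns "Spring", which corresponds to the first step of the classification described above.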

Then it was divided into season and contributing factors:

• Spring/Snowmelt: In the spring months the largest floods are normally dominated by snowmelt, but they can also be a combination of snowmelt and rain.

• Autumn/Rain: In the autumn months the largest floods are normally dominated by rain. Autumn floods can also be a combination of snowmelt and rain, especially in northern Norway and in the mountains.

• Year/Combination: The largest floods, with contributions from both snowmelt and rain, can occur year-round; no particular season stands out.

• Glacier: Catchments with more than 5% glacier. In summer and early autumn there is more melting of ice and snow, so the largest glacier floods occur in this period.

Catchments located at high altitudes in the mountains, with lots of snow, have flood characteristics similar to those of glacier floods.

Catchments that have their largest floods in the summer months are characterized as glacier catchments or autumn catchments, depending on the location of the catchment and the characteristics of the floods. In areas with many surrounding mountains and no glaciers, the springflood can be characterized as a snowmelt-dominated flood. Catchments that have their largest floods in the winter months usually experience a combination of rain and snowmelt and are therefore characterized as year/combination catchments (Stenius et al., 2014).

These characterizations of catchments do not mean that floods occur only in the given periods, but that the largest floods are dominated by the stated generating factors and that large floods in these catchments rarely happen in other seasons.

Figure 2: Map of how the different catchments are categorized (Stenius et al., 2014)

In Tollan (2006), Nordic hydrological regimes were established by classifying the flood periods and the low water flow periods. The system is as follows:

Flood periods

• H1: Snowmelt as dominating flood contributor. The three months with highest discharge are during the spring/early summer.

• H2: Transition to secondary rain flood. The month with the second or third highest discharge occurs during autumn.

• H3: Rainfall as dominating flood contributor. The highest monthly discharge occurs during autumn.

Low water flow periods

• L1: Dominated by winter low flow, due to snow accumulation. The month with the lowest discharge occurs during winter/early spring.

• L2: Transition. The two months with the lowest discharge occur in different seasons, typically February and July.

• L3: Dominated by summer low flow, due to high evaporation. The month with the lowest discharge occurs during summer/early autumn.

Based on this, five flood regimes were defined:

• H1L1: Mountain regime

• H2L1: Inland regime

• H2L2 (H3L2): Transition regime

• H2L3: Baltic regime

• H3L3: Atlantic regime

This system of flood regimes helps to identify and understand the hydrological regime of the catchments. The variation in discharge within the catchment is used as the criterion for establishing the system, and this variation results from differences in the climatic and physical properties of the catchments. The regime of each catchment studied in this thesis is presented in Table 2 in subchapter 2.4.
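The flood-period part of this system can be sketched in Python from twelve mean monthly discharges. This is a rough illustration, not Tollan's (2006) procedure: the month sets used for "autumn" and "spring/early summer" are assumptions, the function name is hypothetical, and the low water flow classes are left out because their season boundaries are not specified here.

```python
# Regime names as listed above, keyed by the combined class code
REGIME_NAMES = {
    "H1L1": "Mountain regime",
    "H2L1": "Inland regime",
    "H2L2": "Transition regime",
    "H3L2": "Transition regime",
    "H2L3": "Baltic regime",
    "H3L3": "Atlantic regime",
}

AUTUMN = {9, 10, 11}                 # assumed autumn months
SPRING_EARLY_SUMMER = {4, 5, 6, 7}   # assumed spring/early summer months


def high_flow_class(monthly_q):
    """Classify the flood period (H1, H2 or H3) from 12 monthly discharges.

    monthly_q: mean discharge for January..December.
    Returns None when none of the three descriptions applies.
    """
    ranked = sorted(range(1, 13), key=lambda m: monthly_q[m - 1], reverse=True)
    first, second, third = ranked[:3]
    if first in AUTUMN:
        return "H3"                  # highest monthly discharge in autumn
    if second in AUTUMN or third in AUTUMN:
        return "H2"                  # secondary rain flood in autumn
    if {first, second, third} <= SPRING_EARLY_SUMMER:
        return "H1"                  # snowmelt-dominated flood period
    return None
```

Combining the returned class with a low water flow class gives a key into `REGIME_NAMES`, for example "H1" + "L1" for the mountain regime.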

2.3 Study catchments

For this study, 25 catchments located in central and eastern Norway were selected. The selection was based on the findings of the report “flomdata” by Engeland et al. (2016) and “Norwegian hydrological reference dataset for climate change studies” by Fleig (2013). In Engeland et al. (2016), 716 discharge stations were evaluated and quality tested; for 530 of these, yearly maximum floods were extracted and the contributing factors were determined.

Fleig (2013) established a hydrological reference dataset (HRD) of streamflow stations to be included in climate change studies. The aim was a better understanding of climate change in relation to hydrology; an HRD therefore needs to consist of long, high-quality streamflow series that are not affected by human activity causing non-climate-related variability in the data. The final HRD consisted of 189 streamflow series, 28 groundwater series, 68 SWE series, 9 series of glacier mass balance and 11 of glacier length, 22 lake ice and 2 river ice duration series, as well as 9 water temperature series for rivers and 37 for lakes. This report was also very helpful for my selection.

Using the findings of these reports as a reference, I was able to find 25 catchments that met my criteria. The criteria for selection were:

• Stations with at least 30 years of continuous flood data, with no more than 10% of the years missing. A missing year is here defined as a year with more than 5 consecutive days without data.

• The catchments should be snowmelt dominated, meaning that at least two-thirds of the contribution factors are from snowmelt.
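The data-availability criterion can be expressed as a short Python check. This is a sketch under stated assumptions: the function names and the data layout (one list of daily values per year, with None marking a day without data) are invented here, and the snowmelt criterion is not covered.

```python
def is_missing_year(daily_values, max_gap=5):
    """True if the year has more than max_gap consecutive days without data."""
    run = 0
    for value in daily_values:
        run = run + 1 if value is None else 0
        if run > max_gap:
            return True
    return False


def station_qualifies(years, min_years=30, max_missing_fraction=0.10):
    """Check record length and share of missing years for one station.

    years: dict mapping year -> list of daily discharge values (None = no data).
    """
    n_years = len(years)
    if n_years < min_years:
        return False
    n_missing = sum(is_missing_year(values) for values in years.values())
    return n_missing / n_years <= max_missing_fraction
```

With these defaults, a 30-year record with two missing years passes, while the same record with four missing years, or a 29-year record, does not.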

To identify snowmelt-dominated catchments, I specifically looked for stations in the central and eastern parts of Norway, preferably at higher altitude. When a catchment was selected, a floodrose was made from the flood data to make sure that it was a spring/snowmelt-flood-dominated catchment. The majority of floods in these catchments occurred in the spring season.

The 25 catchments that were selected and studied are shown on a map in Figure 3, which also includes a floodrose for each catchment. A floodrose is a reliable tool for identifying the flood regime of a catchment, that is, whether the dominant floods are driven by rain, snowmelt or a combination. It is a visual representation of the flood season and shows the maximum monthly flood peaks graphically.

Figure 3: The studied catchments plotted on a map of Norway, with each station's floodrose
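The numbers behind a floodrose can be prepared with a few lines of Python. This sketch only summarises annual maxima per calendar month; it assumes that a floodrose is built from one (date, peak) pair per year, and the function names and example record are invented for illustration rather than taken from the thesis data.

```python
from collections import defaultdict
from datetime import date


def floodrose(annual_maxima):
    """Summarise annual maximum floods per calendar month.

    annual_maxima: iterable of (date, peak discharge) pairs, one per year.
    Returns {month: (number of annual maxima, largest peak)} for months 1-12.
    """
    peaks = defaultdict(list)
    for day, peak in annual_maxima:
        peaks[day.month].append(peak)
    return {m: (len(peaks[m]), max(peaks[m], default=0.0)) for m in range(1, 13)}


def spring_share(rose, spring_months=(4, 5, 6)):
    """Fraction of annual maxima that fall in the spring months."""
    total = sum(count for count, _ in rose.values())
    return sum(rose[m][0] for m in spring_months) / total


# Invented example record (dates and peaks are not from the thesis)
record = [(date(2001, 5, 14), 21.0), (date(2002, 5, 30), 17.5),
          (date(2003, 6, 2), 25.0), (date(2004, 10, 9), 12.0)]
rose = floodrose(record)
```

A high value of `spring_share(rose)` indicates that most annual maxima occur in April to June, which is the pattern looked for when confirming a spring/snowmelt-dominated catchment.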

2.4 Hydrological and climatic data

Daily discharge data from the Hydra II database hosted by the Norwegian Water Resources and Energy Directorate (NVE) were used. Hydra II is the national hydrological database for Norway. The discharge data were used to find the yearly flood peak, the flood volume and the timing of each yearly maximum flood.

Dependent variables:

• Peak springflood timing: The timing of the peak springflood event, given as the number of days from the start of the year. Unit: days

• Peak springflood: The largest springflood, calculated from the discharge data for each station. Unit: mm/day

• Peak springflood volume: Calculated as a 7-day sliding sum of discharge of the peak springflood. Unit: mm/day
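The three dependent variables can be derived from one year of daily discharge as in the following Python sketch. Several choices are assumptions made for illustration and are not confirmed by the text: the springflood is searched for in April to June, the volume is taken as the largest 7-day sliding sum within that window, and the function name and example series are invented.

```python
from datetime import date, timedelta


def springflood_characteristics(dates, discharge, months=(4, 5, 6), window=7):
    """Timing, peak and volume of the springflood for one year of daily data.

    dates: list of datetime.date; discharge: daily values in mm/day.
    months: assumed springflood search window (April-June).
    """
    idx = [i for i, d in enumerate(dates) if d.month in months]
    peak_i = max(idx, key=lambda i: discharge[i])
    timing = dates[peak_i].timetuple().tm_yday      # days from start of year
    peak = discharge[peak_i]
    spring_q = [discharge[i] for i in idx]
    # Largest 7-day sliding sum within the window (one reading of "volume")
    volume = max(sum(spring_q[s:s + window])
                 for s in range(len(spring_q) - window + 1))
    return timing, peak, volume


# Invented example: base flow of 1 mm/day with a flood wave around 15 May 2001
dates = [date(2001, 1, 1) + timedelta(days=k) for k in range(365)]
discharge = [1.0] * 365
start = dates.index(date(2001, 5, 12))
discharge[start:start + 7] = [2.0, 4.0, 6.0, 10.0, 6.0, 4.0, 2.0]
timing, peak, volume = springflood_characteristics(dates, discharge)
```

In this example the peak falls on 15 May, so the timing is day 135 of the year, the peak is 10 mm/day and the 7-day sum centred on the peak is 34.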

Table 2: Catchment stations, available data and catchment properties

Station | Available data period | Area (km²) | Slope (%) | Effective lake percentage | Height above sea level (m) | Flood regime
--- | --- | --- | --- | --- | --- | ---
2.11 Narsjø | 1930-2019 | 119 | 6.1 | 2.5 | 737 | H1L1
2.13 Nedre Sjødalsvatn | 1665-2019 | 474 | 16.1 | 9.2 | 940 | H1L1
2.28 Aulestad | 1961-2019 | 870 | 8.2 | 2.4 | 200 | H1L1
2.32 Atnasjø | 1916-2019 | 465 | 14.1 | 1.7 | 701 | H1L1
2.142 Knappom | 1916-2019 | 1640 | 4.4 | 1.8 | 166 | H2L1
2.268 Akslen | 1962-2019 | 789 | 18.0 | 1.9 | 480 | H1L1
2.280 Kringlerdal | 1967-2019 | 266 | 7.4 | 5.3 | 180 | H2L1
2.290 Brustuen | 1667-2019 | 254 | 15.6 | 3.9 | 685 | H1L1
2.291 Tora | 1967-2019 | 260 | 10.7 | 6.1 | 700 | H1L1
2.607 Vålåsjø | 1980-2015 | 125 | 7.4 | 1.6 | 936 | H1L1
2.614 Rosten | 1917-2019 | 1830 | 10.6 | 1.4 | 320 | H1L1
12.13 Rysna | 1974-2019 | 50.3 | 13.5 | 3.9 | 615 | H1L1
12.70 Etna | 1919-2019 | 569 | 6.7 | 4.4 | 400 | H1L1
12.171 Hølervatn | 1969-2019 | 79 | 6.1 | 6.3 | 780 | H1L1
12.178 Eggedal | 1972-2019 | 311 | 12.5 | 3.2 | 170 | H1L1
15.49 Halledalsvatn | 1962-2019 | 59 | 6.2 | 5.1 | 846 | H1L1
15.53 Borgåi | 1967-2016 | 94 | 9.2 | 4.3 | 770 | H1L1
15.79 Orsjoren | 1955-2019 | 1180 | 4.6 | 14 | 950 | H1L1
16.66 Grosettjern | 1949-2019 | 6.5 | 8.6 | 7.1 | 939 | H2L1
16.75 Tannsvatn | 1955-2019 | 118 | 9.1 | 6.7 | 697 | H1L1
307.5 Murusjø | 1958-2019 | 346 | 7.2 | 8.9 | 311 | H1L1
308.1 Lenglingen | 1925-2019 | 450 | 7.4 | 8.1 | 354 | H1L1
311.4 Femundsenden | 1896-2019 | 1790 | 4.4 | 18 | 662 | H1L1
311.6 Nybergsund | 1908-2019 | 4430 | 5.5 | 9.8 | 350 | H1L1
311.460 Engeren | 1911-2019 | 395 | 7.1 | 4 | 472 | H1L1

The meteorological data used in this thesis are gridded precipitation and temperature observations from seNorge, provided by the Norwegian Meteorological Institute. seNorge is an open online service that provides daily data on snow, weather, water conditions and climate in Norway.

Below is an explanation of each variable that is used in this thesis.

Independent variables:

• Accumulated snow: Accumulated snow is the total amount of snow in the catchments. The unit of accumulated snow is mm/season.

• Temperature: Temperature is the mean temperature of the five days prior to the peak springflood. The unit of temperature is °C.

• Frost days: The number of frost days is calculated by counting the number of days with temperatures below freezing in the winter season prior to the springflood. The unit of frost days is days/season.

• Snow cover: Percentage of snow cover on specific dates. The unit of snow cover is percent.

• Soil moisture deficit: SMD is the amount of rain needed to bring the soil moisture back to saturation; the soil is saturated when SMD=0. In this study, SMD is used on specific dates to check its effect on the springflood. The unit of SMD is mm.

• Groundwater table: GWT is the upper level below which the ground is saturated; a high groundwater table means that the ground is more saturated. In this study, the GWT is investigated on specific dates to check its effect on the springflood. The unit of GWT is cm.

• Snow-water equivalent: SWE is the amount of water stored in a snowpack. The unit of SWE is mm.
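Two of the definitions above, frost days and the five-day mean temperature, translate directly into short calculations. The sketch below is only an illustration under stated assumptions: temperatures are given as a list of daily values in °C, "below freezing" is taken as strictly below 0 °C, and `peak_index` is the list position of the peak springflood day. The function names are hypothetical.

```python
def frost_days(daily_temps):
    """Count days with temperature strictly below 0 degrees C."""
    return sum(1 for t in daily_temps if t < 0.0)


def mean_temp_before_peak(daily_temps, peak_index, n_days=5):
    """Mean temperature of the n_days days before the peak day.

    The peak day itself is excluded (an assumption). If fewer than
    n_days values exist before the peak, the available ones are used.
    """
    start = max(0, peak_index - n_days)
    window = daily_temps[start:peak_index]
    if not window:
        raise ValueError("no temperature values before the peak day")
    return sum(window) / len(window)
```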

Every dependent variable is checked for correlation against all the independent variables and is used in the regression analysis. The regression models are used to predict springflood events from known independent variables, and the predictions are compared with future scenarios.

2.5 Catchment properties

Climatic and physiographic catchment properties were used as additional independent variables for regression analysis. These properties have been obtained from NVE's database. The catchment properties and their effect on the flood are listed below.

• Catchment area: A large catchment has the ability to self-regulate, which reduces and delays the flood.

• Effective lake percentage: A high effective lake percentage increases the catchment's ability to self-regulate, which reduces and delays the flood.

• Catchment slope: In a steep catchment the speed of the runoff increases and the flood arrives more quickly following a melting period or heavy rain event.

• Height above sea level: At higher altitude the precipitation increases while the temperature decreases. These catchments often have larger springfloods because the majority of the winter precipitation falls as snow. The springflood in catchments situated at higher altitudes also tends to occur later than in low-lying catchments.

2.6 Future scenarios

The future scenarios used in this thesis are based on different emission scenarios from global climate models. The results from the global climate models are downscaled and processed to fit local stations (Beldring, 2016). Two emission scenarios were used: “RCP45”, in which there are small changes in emissions up to 2050 and then a complete stop in emissions, and “RCP85”, in which the emissions continue to increase up to 2100 (Hanssen-Bauer et al., 2015). For both scenarios, the climatic variables used in the regression analysis were available for all catchments and could therefore be used to predict the timing, peak and volume of the yearly springflood.


3.0 Methodology

This chapter presents the methods used to develop regression models for making spring flood predictions on future scenarios. The development of regression models is explained and all the components that can contribute to improve the models are discussed.

3.1 Method selection

Floods are governed by many different factors, and the main dominant factors are rainfall, snowmelt, glacier melt or combinations of these. These factors are mainly determined by the location of the catchments, as explained in chapter 2.0.

The yearly springfloods in the catchments included in this study are dominated by snowmelt. Many other climatic factors also contribute to the spring floods, but the exact role of each of these variables has not yet been studied thoroughly. The methods selected for this study should preferably be able to identify, with great accuracy, which factors have an impact on the spring floods, which variable has the highest contribution and which variables can be ignored. Different methods and models are available for this purpose, and two of them were considered the most relevant for this kind of study. The first method is correlation analysis, with the purpose of identifying the most influential factors for the springflood in the study region. The second method is multiple regression analysis of the springflood, with relevant climatic and geographic factors as independent variables and flood characteristics as dependent variables. Regression analysis is a good method for identifying the effect of one or more variables on a target variable and is also a very good tool for forecasting and prediction.

Its ability to identify and quantify the effect of the climatic variables on the spring flood, and to predict future floods, makes the regression method the best suited method for this study.

3.2 Correlation analysis

Correlation analysis in this study is used to identify the contributing factors for the spring floods in the study region. A positive correlation means that the dependent and independent variables ascend or descend together, whereas a negative correlation means that one variable ascends as the other descends.

In this study, the strength of correlation is divided into five categories:

• Perfect: When the correlation coefficient is close to ±1.

• High degree: When the correlation coefficient lies between ±0.50 and ±1.

• Moderate degree: When the correlation coefficient lies between ±0.30 and ±0.49.

• Low degree: When the correlation coefficient lies below ±0.29.

• No correlation: When the correlation coefficient is 0.

(Statistic solution, 12.04.21)
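The five categories can be expressed as a small lookup on the absolute value of the coefficient. The sketch below is an illustration only. Two choices are assumptions, not taken from the cited source: the coefficient is rounded to two decimals so that values between 0.49 and 0.50 fall into a category, and "close to ±1" is read as an absolute value that rounds to 1.00. The function name `correlation_strength` is hypothetical.

```python
def correlation_strength(r):
    """Classify a correlation coefficient into the five categories.

    Assumption: |r| is rounded to two decimals before classification.
    """
    a = round(abs(r), 2)
    if a == 0.0:
        return "no correlation"
    if a < 0.30:
        return "low"
    if a < 0.50:
        return "moderate"
    if a < 1.0:
        return "high"
    return "perfect"
```

The sign of the coefficient is ignored here because the categories describe strength only; the direction of the relationship is read from the sign separately.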

Figure 4: Illustration of different degree of correlation (Long, 2007)

Figure 4 shows different degrees of correlation between x and y values. As shown, the correlation can be both negative and positive.

Many correlation coefficients are available. In this study, the methods described in 3.2.1, 3.2.2 and 3.2.3 were used. In the end, the Pearson correlation coefficient was selected as the best for this study; it is mainly used when looking for linear correlation and is based on covariance.

3.2.1 Pearson correlation coefficient

The Pearson correlation coefficient is the most common criterion for checking for linear correlation between two variables. The Pearson correlation test measures only linear association and can be strongly affected by outliers. The Pearson r correlation is expressed by the following formula:

\[ r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} \qquad (1) \]

Where r_xy is the Pearson correlation coefficient, n is the number of observations and x and y are the variables.

For Pearson correlation both variables should be normally distributed, and their relationship should be linear and homoscedastic (Minitab express, 2021). This was tested when the datasets were explored, prior to performing the correlation test.
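Equation (1) can be evaluated directly from the sums it contains. The following is a minimal sketch that uses only the Python standard library; the function name `pearson_r` is hypothetical and the code is meant to illustrate the formula, not to reproduce the software used in the thesis.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient from the sums in equation (1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```

The denominator is zero when either variable is constant, in which case the coefficient is undefined and the sketch raises a `ZeroDivisionError`.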

3.2.2 Spearman correlation

Spearman rank correlation is a non-parametric test that measures the rank-based degree of association between two variables. No assumption about the distribution of the data is needed in this type of correlation test, and it is most suitable when the variables are measured on a scale that is at least ordinal. The following formula is used to find the Spearman rank correlation:

\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \qquad (2) \]

Where ρ is the Spearman rank correlation, d_i is the difference between the ranks of the two variables for observation i and n is the number of observations (Statistic solution, 12.04.21).
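Equation (2) can be sketched in a few lines once each variable has been converted to ranks. The sketch assumes that there are no tied values, since the simple formula in equation (2) is exact only in that case. The function names `ranks` and `spearman_rho` are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation from equation (2), assuming no ties."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    # d_i is the difference between the two ranks of observation i.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
```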

3.2.3 Kendall correlation coefficient

Kendall rank correlation is a non-parametric test that measures a rank-based dependency between two variables. Such a test is used when the data is not necessarily normally distributed.

The total number of possible pairings of observations is n(n-1)/2, where n is the number of observations of x and y.

The test begins by placing the pairs in order of the x-value. If x and y are perfectly positively correlated, the y-values will then appear in the same rank order. The number of concordant pairs (nC) and the number of discordant pairs (nD) are then counted.

It uses the following formula to find the Kendall rank coefficient:

\[ \tau = \frac{n_C - n_D}{\tfrac{1}{2} n(n - 1)} \qquad (3) \]

Where, τ is Kendall rank correlation coefficient, nC is number of concordants and nD is the number of discordants (Statistic solution, 12.04.21).
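The counting of concordant and discordant pairs can be sketched as a double loop over all pairs of observations. This is an illustration of equation (3) only: tied pairs are counted as neither concordant nor discordant, which is an assumption, and the function name `kendall_tau` is hypothetical.

```python
def kendall_tau(x, y):
    """Kendall rank correlation from equation (3)."""
    n = len(x)
    n_concordant = 0
    n_discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant when x and y differ in the same direction.
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                n_concordant += 1
            elif direction < 0:
                n_discordant += 1
    return (n_concordant - n_discordant) / (0.5 * n * (n - 1))
```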

3.3 Regression Analysis

Regression analysis is a statistical tool where the relationship between a dependent variable and one or more independent variables is established. This method explains the co-variance between the dependent variable and the independent variables. There are two types of regression, simple and multiple regression:

• for simple regression analysis, the relationship between one dependent and one independent variable is studied,

• for multiple regression analysis, the relationship between one dependent variable and multiple independent variables is studied.

For this master thesis, multiple independent variables are used and therefore multiple regression analysis is the method used (Moore et al., 2009).

3.3.1 Multiple Regression

In multiple regression, the relevant variables are used in the regression equation as the independent variables to describe how the dependent variable changes. The multiple regression equation is specified by the following linear function, which shows how the dependent variable y depends on the independent variables x1, x2,…,xp:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \qquad (4) \]

Where β₀ is the intercept, β₁ to β_p are the regression coefficients for each independent variable and ε is the residual (error term), which is assumed to be normally distributed.

Certain assumptions should be made in a multiple regression model:

1. For the model to be reliable, only relevant variables should be included (stepwise regression was used to select the significant variables to be included in the regression).

2. Normality: the variables shall be normally distributed.

3. Homoscedasticity: the residual variance is constant (i.e., independent of the predicted value y).

To find the best estimators for the regression coefficients β, the least squares method is used.

The estimators of the regression coefficients β are denoted by b, and the following equation shows how the predicted response is estimated for the ith observation:

\[ \hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip} \qquad (5) \]

To find the ith residual, the predicted response is subtracted from the observed response.

\[ e_i = y_i - \hat{y}_i \qquad (6) \]

\[ e_i = y_i - \left(b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip}\right) \qquad (7) \]

This method chooses the values of the b's that make the sum of the squared residuals as small as possible.

\[ \sum e_i^2 = \sum \left(y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_p x_{ip}\right)^2 \qquad (8) \]


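To calculate the regression coefficients, the method of least squares is used:

\[ b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{bmatrix} = (X'X)^{-1} X'Y \qquad (9) \]

Here X is the matrix of independent variables with a leading column of ones for the intercept, and Y is the vector of observed responses.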

To measure the residual variance the following estimator is used:

\[ s^2 = \frac{\sum e_i^2}{n - p - 1} = \frac{\sum (y_i - \hat{y}_i)^2}{n - p - 1} \qquad (10) \]

The denominator is the degrees of freedom associated with the standard deviation s, n is the sample size and p is the number of independent variables, so that p + 1 regression coefficients (including the intercept) must be estimated to fit the model. From the standard deviation s and sample size n, the standard error of the estimated parameters can be calculated.

\[ SE = \frac{s}{\sqrt{n}} \qquad (11) \]

(Moore et al., 2009).
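The least squares estimate in equation (9) and the residual variance in equation (10) can be sketched without any external library by forming the normal equations (X'X)b = X'Y and solving them with Gaussian elimination. This is a minimal illustration under stated assumptions, not the software used in the thesis: the helper names `solve_linear_system` and `fit_ols` are hypothetical, and solving the normal equations directly is chosen for readability rather than numerical robustness.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return (b, residuals, s2) for y regressed on the columns of rows.

    rows: one tuple of independent variables per observation.
    b[0] is the intercept b_0; s2 is the residual variance, equation (10).
    """
    X = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(X[0])                        # k = p + 1 coefficients
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = solve_linear_system(xtx, xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]  # e_i = y_i - y_hat_i
    n, p = len(y), k - 1
    s2 = sum(e * e for e in residuals) / (n - p - 1)
    return b, residuals, s2
```

For strongly correlated independent variables the matrix X'X becomes nearly singular, and statistical software typically uses more stable decompositions instead.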

Dependent variables

The purpose of this study is to examine the association between flood characteristics and different climatic variables. The main goal is then to be able to use the climatic variables to make predictions of the flood characteristics with the help of regression equations. The following flood characteristics are the dependent variables in this study.

• Observed peak springflood (m³/s) is the maximum daily discharge in the spring season each year and is found from the available data for each catchment.

• Observed springflood timing is the date on which the peak springflood occurs, expressed in this thesis as the number of days since 1 January. This is found for each year of the available data for each catchment.

• Observed springflood volume (m³/s) is the sum of the maximum seven-day discharge. This is found for each year of the available data for each catchment.

Independent variables

The independent variables are the predictors in the regression analysis. Their effect is studied by looking at how the dependent variables change in association with the independent variables. The independent variables are then used in the regression equation to predict the dependent variables described in the section above. The independent variables are accumulated snow, frost days, air temperature, snow cover, snow water equivalent, soil moisture deficit and groundwater table. These variables are explained in subchapter 2.4.

3.3.2 Model selection

Stepwise regression is a tool for selecting the best multiple regression model. It builds the regression model by adding and/or removing potential independent variables and checking for statistical significance after each iteration. The purpose of stepwise regression is, through a series of iterations and tests, to find the set of independent variables with the most statistically significant influence on the dependent variable, to be used in the final regression model (Hayes, 02.02.21).

Stepwise regression can be performed by either the backward or the forward method. In the backward method, the model starts off with all the independent variables and removes those that are not statistically significant. In the forward method, the model starts off empty and adds, one at a time, the independent variable that is most statistically significant. If the backward method is used, the variable with the lowest “F-to-remove” statistic is removed from the model; in the forward method, the variable with the highest “F-to-add” statistic is added. The F-statistic is calculated as:

1. From the estimated coefficient of each variable a t-statistic is calculated.

2. The F-statistic is found by squaring the t-statistic.

In this thesis a combination of both is used, called bidirectional elimination. Here the model starts off with one independent variable, as in forward selection, and then the next most significant variable is added. After each new variable is added, any variables that are no longer of importance are removed from the model. This continues until all independent variables have been assessed (Statistics how to, 12.04.21).
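The relation between the t-statistic and the F-statistic can be illustrated for the simplest case: the first forward step, in which a single candidate variable is added to a model that contains only the intercept. The sketch below is an assumption-laden illustration, not the stepwise procedure of the thesis. It computes the t-statistic of the slope and, independently, the F-to-add statistic from the reduction in the residual sum of squares, so that F = t² can be checked numerically. The function name `t_and_f_to_add` is hypothetical.

```python
import math


def t_and_f_to_add(x, y):
    """t-statistic of the slope and F-to-add for one candidate variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept

    # Residual sum of squares with and without the candidate variable.
    rss_with = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    rss_without = sum((b - mean_y) ** 2 for b in y)

    s2 = rss_with / (n - 2)                 # residual variance, p = 1
    t_stat = b1 / math.sqrt(s2 / sxx)       # coefficient / standard error
    f_to_add = (rss_without - rss_with) / s2
    return t_stat, f_to_add
```

In a forward step, this calculation would be repeated for every candidate variable and the one with the largest F-to-add would be entered, provided that it passes the chosen significance threshold.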

3.3.3 Local regression model

The first part of the development of regression models is to make models for each station. For each station three regression models are developed, one for each of the flood characteristics. In total 75 regression models were developed, three for each of the 25 stations that are included in this study.

These models were developed to predict local flood characteristics in the future for each catchment individually.

3.3.4 Regional regression model

The second part of the development of regression models is to make one regional regression model that can make predictions across catchments. Such models are an essential tool for predicting flood characteristics for ungauged sites or sites with flood data of poor quantity and quality. For developing these regional models, new variables were introduced as independent variables. The newly introduced variables are the catchment properties shown in table 2. Four different regional models were made and assessed. The first model was based on mean values of all the catchments, and the other three models were each based on a typical event. Two of the typical events were years with large springfloods across all catchments, and the third was a year with small springfloods across all stations. The developed models were assessed and analyzed to find which model gave the best result.

3.4 Predicting future flood characteristics

When predicting the flood characteristics, two future emission scenarios were used; these are explained in subchapter 2.6. The climatic variables were given for both emission scenarios and were put into the regression equations to make predictions of future flood characteristics.

3.5 Evaluation Criteria

To evaluate the performance of the regression models, it was assessed how well the regression model met the basic assumptions for linear regression (normality, homoscedasticity, linearity).

Further, the model fit was evaluated using R², RMSNE and p-values. All these criteria are described below. Finally, the predictive power of the model was evaluated using k-fold cross-validation.
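The idea of k-fold cross-validation can be sketched as follows: the observations are split into k folds, the model is fitted k times with one fold held out each time, and the prediction error is computed on the held-out observations only. The sketch below rests on several assumptions that are not taken from the thesis: the folds are contiguous blocks, the model is a simple regression with one independent variable, and the error measure is the plain root mean square error rather than the RMSNE. All function names are hypothetical.

```python
import math


def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of similar size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(x, y):
    """Least squares intercept and slope for one independent variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def cross_validated_rmse(x, y, k):
    """Root mean square error of predictions on the held-out folds."""
    squared_errors = []
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train = [i for i in range(len(x)) if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        squared_errors.extend((y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx)
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because every observation is predicted by a model that did not see it during fitting, the resulting error gives a more honest indication of predictive power than the error on the fitted data.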

3.5.1 Evaluation of model assumptions

Scatter plot

A scatterplot is a way to visualize how two variables are connected. It quickly shows whether there is a good or bad association between the dependent and independent variables. Every data point on the scatterplot represents an observation, where one variable is on the x-axis and the other variable is on the y-axis. When the dependent variable is plotted on one axis and the independent variable on the other, the connection, pattern and possible deviations can be evaluated. The pattern of the scatterplot can be described by its direction, strength and shape, and by how well the points follow that pattern. Individual data points that fall outside the pattern can be described as outliers. An outlier may lead to an increased MSE (Moore et al., 2009). In this thesis, the scatterplot is used to compare observed and predicted values of the regression models. This type of scatterplot is an effective way to evaluate the goodness of the model.

Diagnostic plot

Diagnostic plots are another way of evaluating the regression model. In the diagnostic plots the residuals are investigated to check whether the linear regression assumptions (linearity, normality and constant variance) are met and to look, in an exploratory way, for possible improvements to the model. The diagnostic plots that have been used in this thesis are:

• Residuals vs fitted values scatterplot

• Normal Q-Q plot

• Scale-location plot

• Histogram and cumulative distribution plot of residuals.

Residuals vs fitted values

The residuals are plotted against the fitted values in a scatterplot. The purpose of this type of scatterplot is to check whether there is a linear relationship and whether the residuals are homoscedastic and unbiased (STHDA, 12.04.21). Three characteristics are checked in order to make these assumptions about the model:

• The residuals are equally distributed around the zero-line. This suggests that the model is homoscedastic with little bias.

• The residuals do not form any patterns. This suggests a linear relationship.

• In the distribution of residuals none of them “stands out”. This suggests no outliers in the model.

(PennState, 29.04.21)

qq-plot

The purpose of a qq-plot is to see if the residuals are normally distributed. The best outcome is that the residuals follow the straight line with little deviation, which indicates that they are normally distributed. If only the upper or the lower end of the qq-plot deviates from the line, the distribution is said to be skewed. If both ends of the qq-plot deviate from the line, the distribution is said to show kurtosis, that is, tails that differ from those of a normal distribution (STHDA, 12.04.21).
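The coordinates of a normal qq-plot can be computed with the Python standard library: the sorted, standardized residuals are paired with the corresponding quantiles of the standard normal distribution. The sketch is an illustration only. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, and the function name `qq_points` is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)

    # Sample quantiles: the standardized residuals in ascending order.
    sample = sorted((e - centre) / spread for e in residuals)

    # Theoretical quantiles of the standard normal distribution at the
    # plotting positions (i - 0.5) / n, an assumed convention.
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n)
                   for i in range(1, n + 1)]
    return list(zip(theoretical, sample))
```

If the residuals are normally distributed, the returned points lie close to the line through the origin with slope one; systematic departures at the ends correspond to the deviations described above.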

Scale-location plot

The purpose of the scale-location plot is to check the assumption of constant variance. It shows whether the residuals are equally spread along the range of the fitted values. As in the residuals vs fitted values plot, the data points should be spread equally around the horizontal line without forming a specific pattern (Bommae, 2015).

Histogram and cumulative distribution plot of residuals

The histogram of residuals shows, like the qq-plot, whether the residuals are normally distributed.

If the histogram is symmetric, bell-shaped and evenly distributed around zero, then the assumption of normality is met (Moore et al., 2009). The cumulative distribution plot (CDF) is used in the same way as the histogram to assess whether the residuals are normally distributed.

If the diagnostic plots show that the assumptions for the regression model are not met, adjustments can be made to improve the model. Possible solutions are to leave out outliers or to transform the dependent or independent variables.

3.5.2 Evaluation of model fit

Checking for significance from p-value

The p-value of a test is the probability that the test statistic would take a value as extreme as, or more extreme than, the value actually observed, assuming that the null hypothesis is true. Extreme means far from what is expected if the null hypothesis is true. The test statistic measures how well the data agree with the null hypothesis. Stronger evidence against the null hypothesis gives a smaller p-value. A large p-value indicates that the observed result could plausibly have occurred by chance. The sampling distribution of the test statistic is the key when calculating the p-value.

The purpose of testing for significance is to check if the results are usable. This is done by assessing the strength of the evidence against the null hypothesis, and the result is expressed as a probability that describes how well the data match the hypothesis. The null hypothesis is usually a statement of “no change” or “no effect”. It is tested against the alternative hypothesis, which is the hypothesis thought to be true and expresses the suspicions or hopes we have about the data. It must also be decided whether the significance test is one-sided or two-sided. The difference is whether the parameter is expected to differ from the null hypothesis in one specific direction or in either direction. If the direction is not known in advance, a two-sided test shall be used.

To conclude whether the data are statistically significant, the p-value needs to be equal to or smaller than a fixed value α. The value α is the decisive value that tells us whether the data are significant or not, and it indicates how much evidence against the null hypothesis is required to reject it. The value α is called the significance level. If α=0.05 (which is often the case), then the data give evidence against the null hypothesis this strong no more than 5% of the time when the null hypothesis is true (Moore et al., 2009).

There are four steps when assessing the significance:

1. Identify the hypothesis that is tested. (Null hypothesis and alternative hypothesis). The test will assess the strength of the evidence against the null hypothesis.

2. Calculate the test statistic from which the p-value is calculated. The test statistic measures how far the data are from the null hypothesis.

3. Find the p-value.

4. Choose a significance level α. If the p-value is equal to or less than the significance level, the null hypothesis is rejected in favour of the alternative hypothesis. If the p-value is greater than the significance level, then you conclude that there is not enough evidence to reject the null hypothesis.

(Moore et al., 2009).
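The four steps can be illustrated with a permutation test for a correlation coefficient. This is a different technique from the t-based and F-based tests reported by regression software and is used here only because it needs nothing beyond the standard library. The null hypothesis is "no association between x and y", the test statistic is the absolute Pearson correlation (a two-sided test), and the p-value is estimated as the share of random re-orderings of y that give a statistic at least as extreme as the observed one. The function names, the number of permutations and the seed are assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient (centred form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for H0: no association."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))          # step 2: test statistic
    shuffled = list(y)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)                # data as expected under H0
        if abs(pearson_r(x, shuffled)) >= observed:
            as_extreme += 1
    # Step 3: add one to numerator and denominator so p is never zero.
    return (as_extreme + 1) / (n_permutations + 1)


def is_significant(p_value, alpha=0.05):
    """Step 4: compare the p-value with the significance level."""
    return p_value <= alpha
```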

R-squared

R squared is used to measure the goodness of fit. The R squared is a percentage of explained variance of the dependent variable on a linear model. R squared represents the strength of the relationship between the dependent and independent variables in the model and it is represented as a percentage. If R² is 0% then the model does not explain any of the variance in the dependent variable, if R² is 100% then the model explains all of the variance in the dependent variable. R² is found by:

\[ R^2 = \frac{\text{Variance of predicted values } \hat{y}}{\text{Variance of observed values } y} \qquad (12) \]

R² is also sometimes called the coefficient of determination and evaluates the scatter of the data points around the fitted regression line. A smaller R² represents a larger difference between fitted and observed values and a higher R² represents a smaller difference between fitted and observed values.

Figure 5 presents a visual representation of how data points are distributed around a regression line.
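Equation (12) can be checked numerically with a small sketch. For a least squares model with an intercept, the ratio of the variance of the predicted values to the variance of the observed values equals one minus the ratio of the residual sum of squares to the total sum of squares, so both forms are computed below. The example uses a regression with one independent variable for brevity, which is an assumption, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept and slope for one independent variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def r_squared(observed, fitted):
    """R squared as in equation (12): Var(fitted) / Var(observed)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_fit = sum(fitted) / n
    var_fit = sum((f - mean_fit) ** 2 for f in fitted)
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    return var_fit / var_obs


def r_squared_from_residuals(observed, fitted):
    """Equivalent form for least squares fits: 1 - RSS / SST."""
    mean_obs = sum(observed) / len(observed)
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - rss / sst
```

The two forms agree only when the fitted values come from a least squares model that includes an intercept; for predictions from another source, such as a held-out fold, they can differ.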