Risk Prediction of Cardiovascular Disease with Statistical Learning Methods



June 2021

Master's thesis


Atle Wiig-Fisketjøn

NTNU Norwegian University of Science and Technology, Faculty of Information Technology and Electrical Engineering, Department of Mathematical Sciences


Master of Science in Applied Physics and Mathematics
Submission date: June 2021
Supervisor: Mette Langaas
Co-supervisor: Anja Bye

Norwegian University of Science and Technology
Department of Mathematical Sciences


Abstract

In this thesis we explore the potential of using statistical learning methods for predicting the risk of cardiovascular disease (CVD), using data from the Trøndelag Health Study, HUNT. The primary aim was to develop models for predicting the 10-year risk of CVD as defined for NORRISK 2, the currently leading risk prediction model in Norway. Secondary aims were to develop models for predicting the 10-year risk of CVD as defined for the Framingham model, the 5- and 10-year risk of atrial fibrillation, the 10-year risk of heart failure, and the risk of heart failure post-myocardial infarction.

After preprocessing the data, features were selected for men and women separately. The data was split 80/20 into a training set and a test set, separately for each sex. XGBoost was used to find the 20 most important features for men and for women, which were used for fitting three models for each sex. These consisted of an XGBoost model with at most 20 features, a logistic regression model with at most 15 features, and a logistic regression model with at most 8 features. XGBoost models were visualized with importance plots and accumulated local effects plots. For logistic regression models, the estimated coefficients and their corresponding p-values were reported.

The models were evaluated by the area under the Receiver-Operating-Characteristic curve and the Precision-Recall curve, both on the full test set and on the test set divided into age groups.

Thresholds for classification were suggested by maximizing different performance measures through 10-fold cross-validation on the training set, including the Youden Index and the Fβ score. We explored age-specific thresholds and age- and sex-specific thresholds. Estimated thresholds were reported with corresponding sensitivities, specificities and precisions computed on the test set. All results were compared with NORRISK 2, which was implemented and applied to the same test set for exact comparison.

The results verified that most features from NORRISK 2 are important for predicting CVD in both men and women, and we identified new features of high importance for women. The findings of this thesis emphasize the importance of letting predicted probabilities and thresholds for classification be both age- and sex-dependent. The thresholds found by maximizing a given performance measure resulted in large differences in sensitivity between the age groups. Hence, the trade-off between sensitivity and specificity for different thresholds on the test set should be inspected, and the final selection of thresholds should be guided by any requirements for these performance measures. This would provide much better predictions compared to NORRISK 2, resulting in the identification of a larger proportion of the patients at high risk of developing CVD.

Sammendrag (Summary)

In this thesis we look at the potential of using methods from statistical learning to estimate the risk of cardiovascular disease, using data from the Trøndelag Health Study, HUNT. The main aim of the thesis was to develop models for estimating the 10-year risk of cardiovascular disease as defined for NORRISK 2, the leading risk model in Norway. Secondary aims were to develop models for estimating the 10-year risk of cardiovascular disease as defined by Framingham, the 5- and 10-year risk of atrial fibrillation, the 10-year risk of heart failure, and the risk of heart failure after myocardial infarction.

After preprocessing the data, features were selected for men and women separately, based on relevant literature. The data were split 80/20 into training and test sets for each sex.

XGBoost was used to find the 20 most important features for men and for women, which were then used to fit three models for each sex. These consisted of an XGBoost model with at most 20 features, a logistic regression model with at most 15 features, and a logistic regression model with at most 8 features. XGBoost models were visualized with importance plots and accumulated local effects plots. For logistic regression models, the estimated coefficients and corresponding p-values were reported.

The models were evaluated by the area under the Receiver-Operating-Characteristic curve and the Precision-Recall curve, both on the full test set and on subsets of the test set defined by age group.

Thresholds for classification were suggested by maximizing different performance measures through 10-fold cross-validation on the training set, including the Youden Index and the Fβ score. We examined both thresholds based on age alone and thresholds based on both age and sex. Estimated thresholds were reported with the corresponding sensitivity, specificity and precision calculated on the test set.

All results were compared with NORRISK 2, which was implemented and evaluated on the same test set to ensure an exact comparison.

The results confirmed that most of the features in NORRISK 2 are important for predicting cardiovascular disease in both men and women, and we identified new, highly important features for women. The findings of this thesis underline the importance of letting predicted probabilities and classification thresholds depend on both age and sex. The thresholds found by maximizing a given performance measure resulted in varying sensitivity across the age groups. Therefore, the trade-off between sensitivity and specificity for different thresholds on the test set should be inspected, and the final choice of thresholds should be guided by any requirements for these performance measures. This would give far better predictions compared with NORRISK 2, and would result in the identification of a larger proportion of the patients at high risk of developing cardiovascular disease.

Preface

This thesis concludes my Master of Science (M.Sc.) in Applied Physics and Mathematics at the Norwegian University of Science and Technology (NTNU). It was written during the spring of 2021 at the Department of Mathematical Sciences and serves as a part of my academic specialization within the field of statistics.

The focus of this thesis is to explore the potential of using statistical learning methods for cardiovascular disease (CVD) risk assessment of healthy adults. The findings could potentially play an important role in a national strategy for preventing CVD in the adult population.

The Trøndelag Health Study (HUNT) is a collaboration between HUNT Research Centre, Faculty of Medicine and Health Sciences, NTNU, Trøndelag County Council, Central Norway Regional Health Authority, and the Norwegian Institute of Public Health. We want to thank clinicians and other employees at Nord-Trøndelag Hospital Trust for their support and for contributing to data collection in this research project.

I would like to express my sincere gratitude to Professor Mette Langaas (NTNU) for excellent guidance and supervision throughout this final year. I would also like to thank my co-supervisor Anja Bye (NTNU) for providing valuable feedback and assistance. I look forward to collaborating with you both on future publications. Thanks to my family for always supporting me throughout my education. Finally, a special thanks to Anna and to all my friends at NTNU for making the last five years the best and most memorable years of my life.

Trondheim, 17.06.21
Atle Wiig-Fisketjøn

Contents

Abstract
Sammendrag
Preface
List of Figures
List of Tables

1 Introduction
1.1 Primary aims
1.2 Secondary aims
1.3 Outline

2 Background
2.1 HUNT
2.2 NORRISK 2
2.3 Framingham
2.4 CHARGE-AF
2.5 ICD

3 Statistical methods
3.1 Training error vs test error
3.2 Missing data
3.3 Logistic regression
3.4 Single decision trees
3.5 XGBoost
3.5.1 Learning objective
3.5.2 Gradient tree boosting
3.5.3 Missing data
3.5.4 Hyperparameters
3.5.5 Importance plots
3.6 k-fold cross-validation
3.7 Akaike information criterion (AIC)
3.8 Partial dependence plots
3.9 Accumulated local effects plots
3.10 Performance measures
3.10.1 Threshold-dependent performance measures
3.10.2 Threshold-independent performance measures

4 Data
4.1 Available data
4.2 Initial feature selection
4.3 Distribution of positives
4.4 Data exploration

5 Analysis workflow
5.1 Data split
5.2 Feature selection and model training
5.3 Hyperparameter tuning for XGBoost
5.4 Performance evaluation
5.5 Model comparison and missing values

6 Results
6.1 XGBoost model
6.2 Full logistic regression model
6.3 Minimal logistic regression model
6.4 Comparison of models using ROC and PR
6.4.1 All ages
6.4.2 Age groups
6.5 Calibration
6.6 Age-specific thresholds
6.7 Age- and sex-specific thresholds
6.8 Sensitivity and specificity trade-off

7 Discussion
7.1 XGBoost vs logistic regression
7.1.1 Size of training data
7.1.2 Model design
7.1.3 Interpretation
7.1.4 Running time
7.2 Risk factors
7.2.1 Factors from NORRISK 2
7.2.2 ALP and CRP
7.2.3 Exercise
7.2.4 Chronic illness or injury
7.2.5 Height, weight, and BMI
7.3 NORRISK 2
7.3.1 Implementation
7.3.2 Limitations
7.4 Comparing model performance
7.5 Strengths and limitations
7.5.1 Age- and sex-specific thresholds
7.5.2 Data
7.5.3 Imbalance
7.5.4 Selective inference
7.6 Clinical applicability
7.6.1 Model interpretability
7.6.2 Thresholds for intervention

8 Highlights from other endpoints
8.1 10-year risk of CVD_{Framingham}
8.2 5- and 10-year risk of atrial fibrillation
8.3 10-year risk of heart failure
8.4 Risk of heart failure post-myocardial infarction

9 Conclusion and future work
9.1 Conclusion
9.2 Future work
9.2.1 Observed risk
9.2.2 ALE plots
9.2.3 Age-specific models

Bibliography

Appendix
A Feature explanation
B 10-year risk of CVD_{NORRISK 2}
C 10-year risk of CVD_{Framingham}
D 10-year risk of atrial fibrillation
E 10-year risk of heart failure
F Risk of heart failure post-myocardial infarction

List of Figures

3.1 Derivation of ALE plots.
4.1 Box- and bar plots of feature distributions for men.
4.2 Box- and bar plots of feature distributions for women.
5.1 Visualization of analysis workflow.
5.2 Illustrating example for selecting the logistic regression models.
6.1 Importance plots for the XGBoost model for predicting the 10-year risk of CVD_{NORRISK 2} in men.
6.2 Importance plots for the XGBoost model for predicting the 10-year risk of CVD_{NORRISK 2} in women.
6.3 ALE plots for the XGBoost model for predicting the 10-year risk of CVD_{NORRISK 2} in men.
6.4 ALE plots for the XGBoost model for predicting the 10-year risk of CVD_{NORRISK 2} in women.
6.5 ROC- and PR curves based on the full test set for men.
6.6 ROC- and PR curves based on the full test set for women.
6.7 PR curves for the models for predicting the 10-year risk of CVD_{NORRISK 2}, evaluated on women aged 55-64 years.
6.8 Calibration plots for the XGBoost models for predicting the 10-year risk of CVD_{NORRISK 2}, evaluated on the test set.
6.9 Calibration plots for the full logistic regression models for predicting the 10-year risk of CVD_{NORRISK 2}, evaluated on the test set.
6.10 Calibration plots for the XGBoost models for predicting the 10-year risk of CVD_{NORRISK 2}, evaluated on the training set.
6.11 Sum of sensitivity and specificity for varying thresholds for models predicting the 10-year risk of CVD_{NORRISK 2}.
6.12 Sum of sensitivity and specificity for varying thresholds for models predicting the 10-year risk of CVD_{NORRISK 2}, separately for men and women.
7.1 The bottom-left and top-right plots present the differences in distribution of age for those with and without chronic disease, whereas the top-left bar plot and the bottom-right density plot show the distribution of the chronic disease feature and age feature in the complete dataset.
7.2 Density plots, scatter plots and correlation plots for height, weight and BMI for women who did not develop CVD within 10 years.
7.3 Density plots, scatter plots and correlation plots for height, weight and BMI for women who developed CVD within 10 years.
8.1 Calibration plots for the Framingham model for predicting the 10-year risk of CVD_{Framingham} on the test set.
8.2 Calibration plots for the minimal logistic regression model for predicting the 10-year risk of CVD_{Framingham} on the test set.
8.3 Men to the left, women to the right. The sum of sensitivity and specificity for varying thresholds for models predicting the 10-year risk of CVD_{Framingham}, evaluated on the test set. The vertical dashed lines represent the estimated optimal thresholds from the 10-fold CV on the training set.
8.4 Men to the left, women to the right. The sum of sensitivity and specificity on the test set for XGBoost, the full and the minimal logistic regression model for predicting the 10-year risk of HF. The vertical dashed lines represent the estimated optimal thresholds from performing 10-fold CV on the training set.
B.1 Complete set of box- and bar plots of feature distributions for men, split by the value of the response feature.
B.2 Complete set of box- and bar plots of feature distributions for women, split by the value of the response feature.
B.3 Calibration plots for the minimal logistic regression models for predicting the 10-year risk of CVD_{NORRISK 2}, evaluated on the test set.
C.1 Importance plots for the XGBoost model for predicting the 10-year risk of CVD_{Framingham} in men.
C.2 Importance plots for the XGBoost model for predicting the 10-year risk of CVD_{Framingham} in women.
C.3 ALE plots for the XGBoost model for predicting the 10-year risk of CVD_{Framingham} in men.
C.4 ALE plots for the XGBoost model for predicting the 10-year risk of CVD_{Framingham} in women.
C.5 Calibration plots for the XGBoost models for predicting the 10-year risk of CVD_{Framingham}, evaluated on the test set.
C.6 Calibration plots for the full logistic regression models for predicting the 10-year risk of CVD_{Framingham}, evaluated on the test set.
D.1 Importance plots for the XGBoost model for predicting the 10-year risk of AF in men.
D.2 Importance plots for the XGBoost model for predicting the 10-year risk of AF in women.
D.3 ALE plots for the XGBoost model for predicting the 10-year risk of AF in men.
D.4 ALE plots for the XGBoost model for predicting the 10-year risk of AF in women.
E.1 Importance plots for the XGBoost model for predicting the 10-year risk of HF in men.
E.2 Importance plots for the XGBoost model for predicting the 10-year risk of HF in women.
E.3 ALE plots for the XGBoost model for predicting the 10-year risk of HF in men.
E.4 ALE plots for the XGBoost model for predicting the 10-year risk of HF in women.
F.1 Importance plots for the XGBoost model for predicting the risk of HF post-MI.
F.2 ALE plots for the XGBoost model for predicting the risk of HF post-MI.

List of Tables

1.1 Overview of the different endpoints explored in this thesis.
3.1 Explanation of hyperparameters in XGBoost.
3.2 Confusion matrix.
4.1 Classification of positives for each endpoint.
4.2 Size of data sets for each endpoint.
4.3 Incidence rates of CVD_{NORRISK 2} in men for different age groups.
4.4 Incidence rates of CVD_{NORRISK 2} in women for different age groups.
4.5 Summary statistics on continuous features for men.
4.6 Summary statistics on categorical features for men.
4.7 Summary statistics on continuous features for women.
4.8 Summary statistics on categorical features for women.
5.1 XGBoost hyperparameters tuned in the first grid search.
5.2 Additional XGBoost hyperparameters tuned in the second grid search.
6.1 Transformation of features for use in the logistic regression models.
6.2 Summary of the full logistic regression model for predicting the 10-year risk of CVD_{NORRISK 2} in men.
6.3 Summary of the full logistic regression model for predicting the 10-year risk of CVD_{NORRISK 2} in women.
6.4 Summary of the minimal logistic regression model for predicting the 10-year risk of CVD_{NORRISK 2} in men.
6.5 Summary of the minimal logistic regression model for predicting the 10-year risk of CVD_{NORRISK 2} in women.
6.6 AUC_{ROC} for models predicting the 10-year risk of CVD_{NORRISK 2} in men.
6.7 AUC_{ROC} for models predicting the 10-year risk of CVD_{NORRISK 2} in women.
6.8 AUC_{PR} for models predicting the 10-year risk of CVD_{NORRISK 2} in men.
6.9 AUC_{PR} for models predicting the 10-year risk of CVD_{NORRISK 2} in women.
6.10 Analysis of age-specific thresholds for the XGBoost models.
6.11 Analysis of age-specific thresholds for the full logistic regression models.
6.12 NORRISK 2 applied to the test sets for both men and women, using the age-specific thresholds suggested by the Norwegian guidelines.
6.13 Optimal thresholds for the XGBoost model based on the Youden Index.
6.14 Optimal thresholds for the full logistic regression model based on the Youden Index.
6.15 Performance measures corresponding to NORRISK 2 and its thresholds.
6.16 Sensitivity and specificity for varying thresholds for the XGBoost model for predicting the risk of CVD_{NORRISK 2} on the training set for men.
6.17 Sensitivity and specificity for varying thresholds for the XGBoost model for predicting the risk of CVD_{NORRISK 2} on the test set for men.
6.18 Sensitivity and specificity for varying thresholds for the XGBoost model for predicting the risk of CVD_{NORRISK 2} on the test set for women.
6.19 Sensitivity and specificity for varying thresholds for NORRISK 2 on the training set for men.
6.20 Sensitivity and specificity for varying thresholds for NORRISK 2 on the test set for women.
7.1 The number of observations used for training each model for men and women, respectively.
7.2 Modeling age as a linear function of presence of chronic disease and the occurrence of a CVD event within 10 years.
7.3 Modeling height for women as a linear function of age and occurrence of a CVD event within 10 years.
7.4 Modeling BMI for women as a linear function of age and occurrence of a CVD event within 10 years.
7.5 Risk predictions based on NORRISK 2 for one and two first-degree relatives with premature CHD, respectively.
7.6 Sensitivity and specificity for NORRISK 2 for varying thresholds on the test set for women, when only participants who reported not having any first-degree relatives with CVD are included.
8.1 Summary of the minimal logistic regression model for predicting the 10-year risk of CVD_{Framingham} in men.
8.2 Summary of the minimal logistic regression model for predicting the 10-year risk of CVD_{Framingham} in women.
8.3 Performance of models for predicting the 10-year risk of CVD_{Framingham} in men, evaluated by the AUC_{ROC} on the test set for different age groups.
8.4 Performance of models for predicting the 10-year risk of CVD_{Framingham} in women, evaluated by the AUC_{ROC} on the test set for different age groups.
8.5 Summary of the minimal logistic regression model for predicting the 5-year risk of AF in men.
8.6 Summary of the minimal logistic regression model for predicting the 5-year risk of AF in women.
8.7 Summary of the minimal logistic regression model for predicting the 10-year risk of AF in men.
8.8 Summary of the minimal logistic regression model for predicting the 10-year risk of AF in women.
8.9 Performance of models for predicting the 10-year risk of AF in men, evaluated by the AUC_{ROC} on the test set for different age groups.
8.10 Performance of models for predicting the 10-year risk of AF for women, evaluated by the AUC_{ROC} on the test set for different age groups.
8.11 Summary of the minimal logistic regression model for predicting the 10-year risk of HF in men.
8.12 Summary of the minimal logistic regression model for predicting the 10-year risk of HF in women.
8.13 Performance of models for predicting the 10-year risk of HF in men, evaluated by the AUC_{ROC} on the test set for different age groups.
8.14 Evaluating performance of models for predicting the 10-year risk of HF for women. Comparing the AUC_{ROC} on different age groups in the test set.
8.15 Summary of the minimal logistic regression model for predicting the risk of HF between 30 days and 3 years post-MI.
8.16 Performance of models for predicting the risk of HF post-MI, evaluated by the AUC_{ROC} on the test set for different age groups.
8.17 Optimal threshold for the minimal logistic regression model based on the Youden Index.
A.1 Details on continuous features.
A.2 Details on categorical features.
B.1 Sensitivity and specificity for varying thresholds, evaluated on the training set for the full logistic regression model for predicting the risk of CVD_{NORRISK 2} in men.
B.2 Sensitivity and specificity for varying thresholds, evaluated on the test set for the full logistic regression model for predicting the risk of CVD_{NORRISK 2} in men.
B.3 Sensitivity and specificity for varying thresholds, evaluated on the training set for the full logistic regression model for predicting the risk of CVD_{NORRISK 2} in women.
B.4 Sensitivity and specificity for varying thresholds, evaluated on the test set for the full logistic regression model for predicting the risk of CVD_{NORRISK 2} in women.
C.1 Incidence rates of CVD_{Framingham} in men for different age groups.
C.2 Incidence rates of CVD_{Framingham} in women for different age groups.
C.3 Summary of the full logistic regression model for predicting the 10-year risk of CVD_{Framingham} in men.
C.4 Summary of the full logistic regression model for predicting the 10-year risk of CVD_{Framingham} in women.
C.5 Comparing models for predicting the 10-year risk of CVD_{Framingham} in terms of AUC_{PR} on different age groups in the test set for men.
C.6 Comparing models for predicting the 10-year risk of CVD_{Framingham} in terms of AUC_{PR} on different age groups in the test set for women.
C.7 Optimal thresholds for the minimal logistic regression model based on the Youden Index.
D.1 Incidence rates of AF in men for different age groups.
D.2 Incidence rates of AF in women for different age groups.
D.3 Summary of the full logistic regression model for predicting the 10-year risk of AF in men.
D.4 Summary of the full logistic regression model for predicting the 10-year risk of AF in women.
D.5 Comparing models for predicting the 10-year risk of AF in terms of AUC_{PR} on different age groups in the test set for men.
D.6 Comparing models for predicting the 10-year risk of AF in terms of AUC_{PR} on different age groups in the test set for women.
E.1 Incidence rates of HF in men for different age groups.
E.2 Incidence rates of HF in women for different age groups.
E.3 Summary of the full logistic regression model for predicting the 10-year risk of HF in men.
E.4 Summary of the full logistic regression model for predicting the 10-year risk of HF in women.
E.5 Comparing models for predicting the 10-year risk of HF in terms of AUC_{PR} on different age groups in the test set for men.
E.6 Comparing models for predicting the 10-year risk of HF in terms of AUC_{PR} on different age groups in the test set for women.
F.1 Population, the number of positives and the observed risk of HF between 30 days and 3 years after MI within each age group, including men and women. The observed risk is calculated as the number of positives divided by the population.
F.2 Summary of the full logistic regression model for predicting the risk of HF post-MI.
F.3 Comparing models for predicting the risk of HF post-MI in terms of AUC_{PR} on different age groups in the test set.
F.4 Sensitivity and specificity for varying thresholds on the test set for the minimal logistic regression model for predicting the risk of HF post-MI.

Chapter 1

Introduction

Worldwide, more than 17 million people die every year from cardiovascular disease (CVD). Even though mortality rates have decreased significantly in many countries during the last decades, CVD remains a dominant cause of mortality in Norway and Europe. Statistics from the European Union (EU) show that CVD kills more Europeans than all types of cancer combined, and costs the EU €210 billion a year (Timmis et al., 2017). To address this major health problem, primary prevention should be the key focus, and preventing the occurrence of CVD is considered a major objective in a national strategy. Since risk prediction forms the basis of the national guidelines for individual primary prevention of CVD, it is extremely important to provide healthcare personnel with the most accurate risk prediction models, as these determine all further interventions for the individual patient. CVD can be prevented, delayed, or even controlled through lifestyle changes and pharmaceuticals when diagnosed at an early stage. Early identification of high-risk individuals may therefore reduce the burden of CVD, allowing more effective intervention and thus more disease-free years (Cooney et al., 2009).

Several different risk prediction models are available for determining the 10-year risk of CVD (D’Agostino et al., 2008; Conroy et al., 2003; Selmer et al., 2008). However, many of the currently available risk prediction models were developed more than a decade ago, and only explain a modest proportion of the incidents of CVD. Khot et al. (2003) estimated that 15-20% of myocardial infarction patients have none of the traditional risk factors, and would be classified as low risk by the prediction models existing at the time. Hence, there is a need for new models, to identify the individuals at risk with greater precision than today’s models.

1.1 Primary aims

The term cardiovascular disease covers a broad spectrum of different diseases. Current national guidelines from the Norwegian Directorate of Health on risk assessment and drug treatment in the prevention of CVD are based on NORRISK 2 (Selmer et al., 2017), which defined CVD as an event of myocardial infarction (MI) or acute cerebral stroke, or death from coronary heart disease (CHD). This Norwegian risk model is used for identifying high-risk patients in age groups from 40 to 79 years, and patients with predicted risk above certain age-specific thresholds are offered pharmaceutical treatment. The model is based on a regression method known as Fine-Gray. The primary aim of this thesis is to use statistical learning methods to develop precise models for risk prediction of CVD for healthy adults, in comparison with NORRISK 2.

The methods explored are XGBoost (Section 3.5) and logistic regression (Section 3.3). For a model to be clinically applicable, the number of variables needed for prediction must be reasonable. With a large amount of data, one could potentially develop models based on hundreds of features, yielding great predictive performance. However, clinical practitioners would most likely never accept a model that requires data on, e.g., 100 features from the patient. Thus, all models developed will use at most 20 features. XGBoost models will contain at most 20 features, and a full and a minimal logistic regression model will contain at most 15 and 8 features, respectively.

The performance of the models will be evaluated with the area under the ROC- and PR curves for the same age groups as defined by Selmer et al. (2008). For the models to be used clinically it is also important to suggest thresholds for intervention, with a reasonable trade-off between sensitivity and specificity. Another primary aim of this thesis is therefore to perform an analysis of several performance measures used for threshold selection, presented in Section 3.10.1.
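As an illustration of what threshold selection involves, the following minimal sketch picks the threshold that maximizes the Youden Index, J = sensitivity + specificity - 1, over a small grid of candidates. It is not the implementation used in this thesis: the function names, the toy data and the candidate grid are hypothetical, and the 10-fold cross-validation on the training set is omitted for brevity.

```python
import numpy as np

def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as positive when the predicted risk is >= threshold (assumed convention)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_prob, dtype=float) >= threshold
    tp = np.sum(y_pred & y_true)    # true positives
    fn = np.sum(~y_pred & y_true)   # false negatives
    tn = np.sum(~y_pred & ~y_true)  # true negatives
    fp = np.sum(y_pred & ~y_true)   # false positives
    return tp / (tp + fn), tn / (tn + fp)

def youden_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -np.inf
    for t in candidates:
        sens, spec = sensitivity_specificity(y_true, y_prob, t)
        j = sens + spec - 1.0
        if j > best_j:  # strict inequality keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical toy data: observed outcomes and predicted risks.
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.65, 0.70, 0.90]
best_t, best_j = youden_threshold(y_true, y_prob, candidates=[0.1, 0.3, 0.5, 0.7])
print(best_t, round(best_j, 3))
```

Because the Youden Index weights sensitivity and specificity equally, a threshold chosen this way may still give a sensitivity that is too low for clinical use in some age groups, which is why the full trade-off should also be inspected.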

1.2 Secondary aims

It is also of interest to explore the potential of using statistical learning for risk prediction of cardiovascular diseases in general, not only as defined by Selmer et al. (2008). The Framingham model (D’Agostino et al., 2008) defines CVD differently than NORRISK, and includes both atrial fibrillation (AF) and heart failure (HF) in addition to myocardial infarction and stroke. Hence, we chose to explore the possibility of developing risk prediction models in comparison with the Framingham model. This endpoint will hereafter be referred to as CVD_{Framingham}, corresponding to the definition which is explicitly stated in Table 4.1.

Atrial fibrillation, a common cardiac arrhythmia, has emerged as a major public health problem as a result of its wide prevalence and close relation to stroke events and mortality (Alonso et al., 2013). Most prediction models consider the 10-year risk, hence we wanted to develop models predicting the 10-year risk of AF. However, as one of the most acknowledged models, CHARGE-AF, predicts the 5-year risk of AF, we also wanted to develop models with this time horizon. Thus, two more secondary aims are to develop models for predicting the 5- and 10-year risk of AF, in comparison with the CHARGE-AF model.

Heart failure is a condition in which the heart does not pump sufficiently to maintain the necessary blood flow, and the prevalence of heart failure is expected to rise significantly (Echouffo-Tcheugui et al., 2015). The prognosis for HF patients is poor, with a projected 50% mortality rate within 5 years. However, risk prediction of incident heart failure remains at an early stage, hence there is an urgent need for risk prediction models. Thus, one of the secondary aims of this thesis was to develop models for predicting the 10-year risk of HF for adults aged 40-79 years with no prior history of heart disease of any kind.

The most common cause of heart failure worldwide is myocardial infarction (Cahill and Kharbanda, 2017). However, accurate risk prediction tools of heart failure post-MI are lacking, according to Pocock et al. (2020). Thus, another secondary aim of this thesis was to look into the possibility of developing a simple risk model for identifying patients at risk of developing heart failure after MI. Timely initiation of guideline-directed HF therapy can decrease the HF burden (Jenča et al., 2021). As most heart failures occurring due to myocardial infarction develop within a few years, we chose to predict the risk of heart failure between 30 days and 3 years from the date of the myocardial infarction.

In total, six different combinations of endpoints and time frames were considered for this thesis, as seen in Table 1.1.


Table 1.1: Overview of the different endpoints explored in this thesis.

| Aim | Endpoint | Description |
|---|---|---|
| Primary | CVDNORRISK 2 | 10-year risk of cardiovascular disease for adults aged 40-79 years old, using the definition of CVD as given by NORRISK 2. Including incident myocardial infarction and stroke, and death from stroke or coronary heart disease. |
| Secondary | CVDFramingham | 10-year risk of cardiovascular disease for adults aged 40-79 years old, using the definition of CVD as given by the Framingham model. Including coronary heart disease, stroke, atrial fibrillation, and heart failure. |
| Secondary | AF | 5- and 10-year risk of atrial fibrillation for adults aged 40-79 years old, in comparison with the CHARGE-AF model. |
| Secondary | HF | 10-year risk of heart failure for adults aged 40-79 years old. |
| Secondary | HF post-MI | Risk of developing heart failure between 30 days and 3 years after the event of myocardial infarction for adults aged 20-99 years old. |

1.3 Outline

The structure of this thesis follows the process of developing the models for predicting the risk of CVDNORRISK 2, in comparison with NORRISK 2. Starting from the exploration of the available data in Chapter 4, we walk through the general analysis workflow in Chapter 5 and present the results in Chapter 6. The discussion in Chapter 7 revolves around the results corresponding to the primary aim of this thesis. For each of the secondary endpoints in Table 1.1, i.e. CVDFramingham, AF, HF, and HF post-MI, this process is repeated. A selection of highlighted results from these endpoints is presented in Chapter 8. Note that the amount of time spent on exploring the other endpoints has been approximately the same as for CVDNORRISK 2, hence the results are included to demonstrate the total effort that has been made. They are not discussed in as much detail as the CVDNORRISK 2 endpoint, but the results will serve as a basis for future work on the topic. Finally, conclusions and discussion of future work are presented in Chapter 9.

Note that throughout the thesis, some terms are referred to with different names. One is feature, which is also referred to as predictor, risk factor, covariate and variable. Another one is response, which is also called target and outcome. Finally, note that the term dichotomizing refers to transforming a continuous feature into a binary one.

All code is written in R (R Core Team, 2020b), but due to the use of sensitive data the code is not included in this thesis. However, as the analysis workflow is explained in detail in Chapter 5, and implemented methods from the XGBoost (Chen et al., 2020) and glmulti (Calcagno, 2020) packages have been used, the reader should still be able to follow what has been done. The data used for this thesis was stored using NICE-1 (NTNU, 2021), which uses a virtual private network (VPN) to protect the file area. VPN is software that ensures that the data traffic between the computer and NTNU’s systems is encrypted so that it cannot be monitored or read by others.


Chapter 2

Background

This chapter begins with a short presentation of the HUNT studies, which provided the data used for this thesis. It then gives a short overview of the risk prediction models NORRISK 2, Framingham, and CHARGE-AF, which are used for comparison with the models developed and presented in this thesis. Finally, the system for international classification of diseases is presented, which will be used to identify the HUNT participants developing the relevant endpoint events. See Chapter 3 for explanations of statistical concepts mentioned in the following.

2.1 HUNT

The Trøndelag Health Study (HUNT, previously known as Nord-Trøndelag Health Study) is a large population-based cohort study used for medical research (Krokstad et al., 2012). So far four health surveys have been completed: HUNT1 (1984–86), HUNT2 (1995–97), HUNT3 (2006–08) and HUNT4 (2017–19). The HUNT studies were carried out in Nord-Trøndelag county of Norway and every citizen above 20 years of age was invited. More than 125 000 Norwegians have participated in the studies. The HUNT study includes data from surveys, interviews, clinical measurements, and biological samples.

The study was primarily set up to address arterial hypertension, diabetes, screening of tuberculosis, and quality of life. The scope of the study has expanded over time, and the surveys now contribute to important knowledge regarding health-related lifestyle, prevalence, and incidence of somatic and mental illness and disease, health determinants, and associations between disease phenotypes and genotypes.

The regional committee for medical and health research ethics approved the study. All participants gave informed written consent before participating.

2.2 NORRISK 2

The NORRISK 2 model was developed by Selmer et al. (2017) as a part of a revision of the Norwegian guidelines for the prevention of CVD. The risk is estimated using a Fine and Gray regression model (Fine and Gray, 1999), adjusting for competing risk, fitting separate models for men and women. The risk factors included in the model are sex, age, total cholesterol, high-density lipoprotein (HDL) cholesterol, daily smoking status, systolic blood pressure, the present use of antihypertensive drugs, and a family history of premature (before the age of 60) CHD.

The data used for model training was collected from CONOR surveys from 1994-1999, and validation was performed on CONOR surveys from 2000-2003. CONOR is a collection of data from various regional health surveys in Norway carried out between 1994 and 2003. Participants aged 40-79 years with no previous records of CVD events or hospitalizations were selected for both the training set and the test set. The participants in the test set only had 8 years of follow-up instead of 10, therefore the 8- and 10-year observed cumulative risks were computed.

There exist international and national guidelines on what thresholds should be considered as “high risk”, for when patients should be offered pharmacological treatment. Most cases of CVD in the population occur among those with moderate risk levels, and hence there is a trade-off between sensitivity and specificity when deciding thresholds for intervention. If sensitivity and specificity are considered to be equally important, one option is to set the threshold for which the sum of the two reaches its maximum. The results in Selmer et al. (2008) showed that the sum of specificity and sensitivity for NORRISK 2 was close to the maximum for thresholds at 5%, 10%, and 15% in age groups 45-54, 55-64, and 65-74 years for men and women combined. Current Norwegian guidelines for intervention are based on these thresholds.
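To make the threshold rule concrete, the following minimal Python sketch picks, among a set of candidate thresholds, the one that maximizes the sum of sensitivity and specificity. It is an illustration only and not the code used in this thesis (which was written in R); the function names and the toy data are hypothetical.

```python
def sensitivity_specificity(y_true, risk, threshold):
    """Return (sensitivity, specificity) when risk >= threshold is flagged as high risk."""
    tp = sum(1 for y, r in zip(y_true, risk) if y == 1 and r >= threshold)
    fn = sum(1 for y, r in zip(y_true, risk) if y == 1 and r < threshold)
    tn = sum(1 for y, r in zip(y_true, risk) if y == 0 and r < threshold)
    fp = sum(1 for y, r in zip(y_true, risk) if y == 0 and r >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(y_true, risk, candidates):
    """Pick the candidate threshold with the largest sensitivity + specificity."""
    return max(candidates, key=lambda c: sum(sensitivity_specificity(y_true, risk, c)))


# Toy data (made up): observed outcomes and predicted 10-year risks.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
risk = [0.02, 0.04, 0.06, 0.08, 0.09, 0.12, 0.16, 0.30]
chosen = best_threshold(y_true, risk, candidates=[0.05, 0.10, 0.15])
```

In practice the candidates would be a fine grid over the predicted risks, evaluated separately within each age group.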

The model was evaluated using the area under the ROC curves (AUC), using respectively the training and the test set. When evaluated on the training set, the AUC was found to be 0.79 in men and 0.84 in women. The test set, which was based on the 8-year risk, had a slightly lower AUC both for men and women in general.

In addition to the risk diagram presented in Selmer et al. (2017), a web-based scoring tool was developed. The model is intended to be used primarily by general practitioners (family doctors), but the scoring tool is open and accessible by anyone (Helsedirektoratet, 2017).

To perform an accurate comparison between the models developed in this thesis and NORRISK 2, we decided to implement the NORRISK 2 model using the coefficients from the literature. One problem arises due to NORRISK 2 including the factor “family history of premature CHD”, which is the number of first-degree relatives with CHD before the age of 60, with levels “None”, “1” and “2+”. However, the data from HUNT does not contain the necessary information to implement this accurately. Instead, this was approximated by using the question “Do you have any first-degree relatives with CVD before age 60?” from HUNT, with possible answers “Yes” and “No”. For the implementation of NORRISK 2, these answers were set to correspond to “1” and “None”, respectively. This might lead to NORRISK 2 performing worse than it would have done if we had information on the exact number of first-degree relatives with premature CHD.

2.3 Framingham

The Framingham heart study is a long-term cardiovascular cohort study of the population in the city of Framingham, Massachusetts in the US (D’Agostino et al., 2008). The study, which is a project of the National Heart, Lung and Blood Institute in collaboration with Boston University, began in 1948 with a small original cohort, and since then several offspring cohorts have been added. Careful monitoring of the population over the years has led to the identification of major CVD risk factors and their effects. The Framingham Risk Score is a sex-specific multivariate risk factor algorithm for estimating the 10-year risk of CVD and was published back in 2008.

The model uses Cox proportional-hazard regression to evaluate the risk of developing a first CVD event, based on 8491 participants aged 30-74 years and free of CVD. Sex-specific risk functions were derived, using the risk factors age, total and HDL cholesterol, systolic blood pressure, treatment for hypertension, smoking status and diabetes status. Framingham defined diabetes as having a fasting serum glucose ≥ 126 mg/dL (for the offspring cohort) or ≥ 140 mg/dL (for the original cohort), or using insulin or oral hypoglycemic medications. Fasting serum glucose was not measured in HUNT. To approximate this feature when implementing the model for comparison, we used the question “Have you had, or do you have any of the following diseases?: Diabetes” and the measurement of non-fasting serum glucose. Participants answering “Yes” to the question on diabetes, or having a non-fasting serum glucose ≥ 11.1 mmol/L (indicating diabetes according to MayoClinic (2020)), were considered diabetic.
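The approximation of diabetes status described above amounts to a simple rule, sketched here in Python. The function and argument names are hypothetical, and treating a missing glucose measurement (`None`) as "not above the threshold" is an assumption made for this illustration only.

```python
DIABETES_GLUCOSE_MMOL_L = 11.1  # non-fasting serum glucose threshold quoted in the text


def is_diabetic(self_reported_diabetes, non_fasting_glucose_mmol_l):
    """Proxy for diabetes status: self-report "Yes" OR high non-fasting glucose."""
    high_glucose = (
        non_fasting_glucose_mmol_l is not None  # assumption: missing value counts as not high
        and non_fasting_glucose_mmol_l >= DIABETES_GLUCOSE_MMOL_L
    )
    return self_reported_diabetes == "Yes" or high_glucose
```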


2.4 CHARGE-AF

The CHARGE-AF risk score was developed to predict the 5-year risk of incident atrial fibrillation, for identifying high-risk individuals more likely to benefit from preventive intervention. The data used for training the model consisted of three American cohorts, including the Framingham Heart Study, and the model was validated on data from two European cohorts (Alonso et al., 2013).

Among the 18 556 participants in the training set, 1186 experienced incident AF, corresponding to 6.3%. The model includes the risk factors age, race, height, weight, systolic and diastolic blood pressure, current smoking, use of antihypertensive medication, diabetes, and history of myocardial infarction and heart failure. In addition, the results showed that adding variables from the electrocardiogram did not improve the model in terms of area under the ROC curve (Section 3.10.2). When implementing CHARGE-AF for comparison with the developed models, the race was assumed to be white for all participants in HUNT3, as the data we had access to did not contain information on this. While the model is primarily intended for predicting the 5-year risk of AF, it was also used for comparison with the models developed for predicting the 10-year risk.

2.5 ICD

International Statistical Classification of Diseases and Related Health Problems (ICD) is a medical classification list developed by the World Health Organization. ICD-10 is the 10th revision of the list, and the Norwegian version of this has served as the official classification in Norway since 1999 (World Health Organization, 2021). The benefits of having a standardized classification of diseases are many, one of which is to give as much and as precise information as needed using only a small set of codes. For example, “E10” is the code for “Type 1 diabetes mellitus”. For any hospitalization of a patient in Norway, up to two main diagnoses and 19 secondary diagnoses are registered in the Norwegian Patient Registry using ICD-10. All deaths and the underlying causes are registered using the same system in the Norwegian Cause of Death Registry (DÅR). If the death occurred due to a sequence of causes, they will all be registered, but only one as the underlying cause of death. As an example, if a person falling down the stairs resulted in a hip fracture, which again led to pulmonary embolism (blockage of an artery in the lungs), the acute cause of death would be the embolism. However, since the embolism was a consequence of the fall, the underlying cause is set to falling down the stairs (Folkehelseinstituttet, 2021).

The ICD codes will be used to identify HUNT participants suffering from the endpoint events in Table 1.1, and to remove participants who died from other diseases within follow-up time for the corresponding endpoint of interest. See Section 4.1 for how ICD codes were used to define each of the endpoints of interest.


Chapter 3

Statistical methods

3.1 Training error vs test error

Given a dataset of observations to be used for fitting a model, it is common practice to divide the data into a training set and a test set. The training set is used to fit the model, whereas the test set is used to measure the error of the fitted model. Let the training set consist of observation pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ for features $x_i$ and responses $y_i$. The model is trained to fit the training set, measured by some loss function. For regression, the most common error measure is the mean squared error (MSE), given by

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2,$$

where $\hat{f}(x_i)$ is the prediction for observation $i$. For classification, one possible error measure is the misclassification rate, given by the proportion of mistakes made by the predictor:

$$E_{\text{classif}} = \frac{1}{n}\sum_{i=1}^{n} I\left(y_i \neq \hat{f}(x_i)\right),$$

where $I$ is the indicator function defined as

$$I(a \neq \hat{a}) = \begin{cases} 1 & \text{if } a \neq \hat{a}, \\ 0 & \text{else.} \end{cases}$$

The output from the error functions will be smaller the closer the predictions are to the true responses in the training set. However, one is more interested in how well the model performs on unseen data. The model could potentially fit the training observations perfectly, but it would be useless if it cannot make accurate predictions for new observations. Hence a regression model will be evaluated by its test MSE, the mean squared error calculated on the test set. A similar strategy is used for classification models. The ideal model would capture the underlying structure of the data while ignoring any noise patterns. A model not capturing the underlying structure is said to be underfitting the data, while capturing the noise instead is known as overfitting (James et al., 2014, Chapter 2).
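The two error measures can be computed directly from a vector of responses and a vector of predictions. The following Python sketch is an illustration with hypothetical function names and made-up numbers; applying the same functions to the test set gives the test error.

```python
def mse(y, y_hat):
    """Mean squared error: average of (y_i - f_hat(x_i))^2."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)


def misclassification_rate(y, y_hat):
    """Proportion of observations where the predicted class differs from the true class."""
    return sum(1 for yi, fi in zip(y, y_hat) if yi != fi) / len(y)


# Made-up responses and predictions.
regression_error = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
classification_error = misclassification_rate([0, 1, 1, 0], [0, 1, 0, 0])
```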


3.2 Missing data

It is rather common for observations in the dataset to have missing values for one or multiple features. In surveys like HUNT, it is almost inevitable that participants either refuse to answer certain questions or do not remember the answer, e.g. a certain date. Missing data problems can be classified into three categories based on the probability of data points being missing (van Buuren, 2018, Chapter 1).

Missing completely at random (MCAR): If the probability of being missing is the same for all values in the dataset, data is said to be missing completely at random. An example would be a weighing scale running out of power in the middle of an experiment, leading to missing data based on bad luck. In general, MCAR is a very strong assumption, and usually unrealistic for most real-world data.

Missing at random (MAR): If the probability of being missing depends on the observed data only, and not on its unobserved value, data is said to be missing at random. Continuing the weighing scale example: if the scale is placed on a soft surface, this might lead to more missing values than if it was placed on a harder surface. If we know the surface type and can assume MCAR within that specific surface, then the data is MAR.

Not missing at random (NMAR): If neither MAR nor MCAR holds, data is said to be not missing at random. If the weighing scale wears out over time, leading to more frequent missing values without us noticing, we obtain a distorted distribution of the weight values. Another example would be missing values for a certain measurement because the doctor felt the patient was too sick to take it.

Dealing with missing data: There are many different ways of dealing with missing values in the dataset.

• Discard observations with any missing values, known as a complete case analysis. If the relative amount of missing data is small this approach can be used, but otherwise, it should be avoided.

• Let the learning algorithm, like XGBoost, deal with the missing values when training.

• Impute (fill-in) all missing values before training the model. The simplest strategy is to use the mean or mode of the non-missing values for that specific feature.

Note that most simple strategies only work under the strong assumption of MCAR. Even for complete-case analysis, which is the default way of handling missing values in most statistical packages, violating this assumption could lead to poor performances. In general, if there is a systematic difference between cases with missing values and the completely observed cases, this could bias the complete-case analysis. However, if the probability to be missing does not depend on the outcome, complete-case analysis outperforms multiple imputation according to van Buuren (2018, Chapter 2.7).
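The two simplest strategies from the list above, complete-case analysis and mean imputation, can be sketched as follows. This is an illustrative Python sketch with hypothetical function names and made-up data; missing values are encoded as `None`, and each feature is assumed to have at least one observed value.

```python
def complete_cases(rows):
    """Keep only observations without any missing value (complete-case analysis)."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_impute(rows):
    """Replace each missing value by the mean of the observed values of that feature."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Made-up data with two features (e.g. weight and total cholesterol).
rows = [[60.0, 5.0], [70.0, None], [None, 6.0], [80.0, 7.0]]
cc = complete_cases(rows)   # drops the two incomplete observations
imputed = mean_impute(rows)  # fills the gaps with the column means
```

Both strategies implicitly rely on the MCAR assumption discussed above.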


3.3 Logistic regression

Assume we have a dataset $(x_i, y_i)$, $i = 1, \ldots, n$, consisting of $n$ independent observations with $p$ features and a binary response $y_i$, where each predictor vector is $x_i = (1, x_{i1}, x_{i2}, \ldots, x_{ip})^\top$. Assuming the response to be binomially distributed with one trial, $Y_i \sim \text{Bin}(1, \pi_i)$, we can use a type of generalized linear model known as logistic regression (LR) (Fahrmeir et al., 2013, Chapter 5.1).

The main goal is to model the probability

$$\pi_i = P(y_i = 1) = E(y_i).$$

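With parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ and linear predictor $\eta_i = x_i^\top \beta$, the logistic response function is given by

$$\pi_i = h(\eta_i) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}.$$

The inverse of the response function is known as the link function. The logit link function is given by

$$g(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}.$$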

One of the advantages of logistic regression is that it provides a linear model for the logarithmic odds $\log\left(\frac{\pi_i}{1 - \pi_i}\right)$. This can be transformed to

$$\frac{\pi_i}{1 - \pi_i} = \exp(\beta_0) \exp(\beta_1 x_{i1}) \cdots \exp(\beta_p x_{ip}),$$

which implies that the covariates affect the odds in an exponential-multiplicative form. Fitting a logistic regression model involves estimation of the parameter vector $\beta$, which is done by maximizing the log-likelihood

$$l(\beta) = \sum_{i=1}^{n} \left( y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right).$$

The maximum likelihood estimator $\hat{\beta}$ has no closed form, so the Fisher scoring algorithm described in Appendix B.4.2 of Fahrmeir et al. (2013) is used to compute the estimator numerically. The significance of the estimated coefficients can be tested by performing a Wald test. The hypothesis test for the significance of a single parameter $\beta$ is formulated as

$$H_0: \beta = \beta_0 \quad \text{versus} \quad H_1: \beta \neq \beta_0.$$

The Wald statistic is given as

$$W = \frac{\hat{\beta} - \beta_0}{\widehat{\mathrm{SE}}(\hat{\beta})}$$

and is asymptotically standard normally distributed, i.e. $W \overset{a}{\sim} N(0, 1)$. The Wald statistic measures the distance between the estimate $\hat{\beta}$ and the hypothetical value $\beta_0$ under the assumption that $H_0$ is true. This can be interpreted as follows: the larger the distance, the less likely is $H_0$. The $p$-value is informally given as the probability of observing results at least as extreme as the computed test statistic under $H_0$. Small $p$-values indicate that the obtained results are very unlikely given that $H_0$ is true, and if the $p$-value is smaller than a certain significance level $\alpha \in [0, 1]$, then $H_0$ is rejected and the alternative hypothesis is accepted. Typical significance levels are 0.01, 0.05 and 0.1. The output from fitting a logistic regression model is the estimated coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$. In addition, we will report the standard deviation and standardized coefficients, as well as the $p$-value for the estimated parameters in the model.
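As an illustration of the fitting and testing steps, the following Python sketch runs Fisher scoring for a logistic regression model with an intercept and a single feature, and then computes the Wald statistic and a two-sided $p$-value for the slope. It is a simplified sketch under stated assumptions, not the implementation used in the thesis: the restriction to one feature (so that the 2×2 information matrix can be inverted by hand), the fixed number of iterations, the function names and the toy data are all choices made for this example.

```python
import math


def score_and_information(x, y, b0, b1):
    """Score vector X^T (y - pi) and Fisher information X^T W X for logit(pi) = b0 + b1*x."""
    pi = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    s0 = sum(yi - p for yi, p in zip(y, pi))
    s1 = sum(xi * (yi - p) for xi, yi, p in zip(x, y, pi))
    w = [p * (1.0 - p) for p in pi]  # W = diag(pi_i * (1 - pi_i))
    f00 = sum(w)
    f01 = sum(wi * xi for wi, xi in zip(w, x))
    f11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    return (s0, s1), (f00, f01, f11)


def fit_logistic(x, y, n_iter=25):
    """Fisher scoring: beta <- beta + F^{-1} s, with the 2x2 inverse written out."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        (s0, s1), (f00, f01, f11) = score_and_information(x, y, b0, b1)
        det = f00 * f11 - f01 * f01
        b0 += (f11 * s0 - f01 * s1) / det
        b1 += (f00 * s1 - f01 * s0) / det
    # Standard error of the slope from the inverse information at the estimate.
    _, (f00, f01, f11) = score_and_information(x, y, b0, b1)
    se_b1 = math.sqrt(f00 / (f00 * f11 - f01 * f01))
    return b0, b1, se_b1


def wald_test(estimate, se, null_value=0.0):
    """Wald statistic and two-sided p-value from the standard normal distribution."""
    w = (estimate - null_value) / se
    p_value = math.erfc(abs(w) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|w|))
    return w, p_value


# Made-up, non-separable toy data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, se_b1 = fit_logistic(x, y)
w_stat, p_value = wald_test(b1, se_b1)
```

For the logit link, Fisher scoring coincides with the Newton-Raphson method, which is why the update uses the same matrix for both the step and the standard errors.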


3.4 Single decision trees

As in the previous section, we assume we have a dataset $(x_i, y_i)$, $i = 1, \ldots, n$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. Tree-based methods involve dividing the predictor space into non-overlapping regions $R_1, \ldots, R_J$ by using a set of splitting rules (James et al., 2014, Chapter 8). Every observation that falls into region $R_j$ is assigned the same prediction. For a regression tree the prediction is the mean of the responses for the training observations that fall into $R_j$. This project deals with classification problems only, hence regression trees will not be elaborated further before Section 3.5. For a classification tree with $K$ classes, one of the following is used for prediction:

• Majority vote, i.e. predict based on the most commonly occurring class of the training observations in $R_j$.

• Estimate the probability $p_{jk}$ that an observation in region $R_j$ belongs to class $k$, and then classify according to some threshold value $c \in [0, 1]$. The probability is estimated by the proportion of training observations from class $k$ in region $R_j$, given by $\hat{p}_{jk} = n_{jk} / N_j$, where $n_{jk}$ is the number of training observations from class $k$ in region $R_j$ and $N_j$ is the total number of observations in region $R_j$.

The predictor space is divided into regions by using a greedy approach known as recursive binary splitting. One begins at the top of the tree and divides the predictor space into two regions by finding a splitting point for one of the features. Splitting is based on the impurity of the child nodes compared to the parent node. Two commonly used measures of impurity are the Gini index

$$G = \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{p}_{jk} (1 - \hat{p}_{jk})$$

and the cross-entropy

$$D = -\sum_{j=1}^{J} \sum_{k=1}^{K} \hat{p}_{jk} \log \hat{p}_{jk}. \tag{3.1}$$

The feature and its splitting point are selected by maximizing the reduction in impurity. At the top of the tree, the predictor space is divided into the regions $R_1$ and $R_2$, by making a decision rule for one of the predictors $x_1, \ldots, x_p$. Defining the regions as $R_1(j, s) = \{x \mid x_j < s\}$ and $R_2(j, s) = \{x \mid x_j \geq s\}$, the decision rule involves finding the predictor $x_j$ and the splitting point $s$ that minimize the impurity. For the cross-entropy in Equation (3.1) this means minimizing

$$-\sum_{i: x_i \in R_1(j,s)} \sum_{k=1}^{K} \hat{p}_{1k} \log \hat{p}_{1k} \; - \sum_{i: x_i \in R_2(j,s)} \sum_{k=1}^{K} \hat{p}_{2k} \log \hat{p}_{2k}.$$

This way the first two branches of the tree are created. The splitting process is repeated iteratively until some stopping criterion is reached: it could be when the depth of the tree has reached some pre-specified limit, or the number of observations in a region is smaller than some pre-specified limit.
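A single step of recursive binary splitting can be sketched in Python as follows. The sketch searches one feature for the splitting point that minimizes the impurity of the two child regions, weighted by the number of observations in each region. It is an illustration only: the function names are hypothetical, the candidate splitting points are simply the observed feature values, and a full implementation would repeat the search over all $p$ features and then recurse into each region.

```python
import math


def entropy(labels):
    """Cross-entropy impurity -sum_k p_k log(p_k) of the class labels in one region."""
    n = len(labels)
    props = [labels.count(k) / n for k in set(labels)]
    return -sum(p * math.log(p) for p in props)


def gini(labels):
    """Gini index sum_k p_k (1 - p_k) of the class labels in one region."""
    n = len(labels)
    props = [labels.count(k) / n for k in set(labels)]
    return sum(p * (1 - p) for p in props)


def best_split(x, y, impurity=entropy):
    """Find s minimizing N_1 * impurity(R_1) + N_2 * impurity(R_2) for one feature."""
    best_s, best_cost = None, float("inf")
    for s in sorted(set(x))[1:]:  # skip the smallest value so both regions are non-empty
        left = [yi for xi, yi in zip(x, y) if xi < s]    # R_1 = {x | x < s}
        right = [yi for xi, yi in zip(x, y) if xi >= s]  # R_2 = {x | x >= s}
        cost = len(left) * impurity(left) + len(right) * impurity(right)
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s, best_cost


# Made-up data where the classes are perfectly separated at x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
split_point, cost = best_split(x, y)
```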


3.5 XGBoost

The XGBoost algorithm was first introduced by Chen and Guestrin (2016). For a dataset $\mathcal{D} = \{(x_i, y_i)\}$ with $n$ examples and $p$ features, $x_i = (x_{i1}, \ldots, x_{ip})$, a tree ensemble model uses $K$ additive functions (trees) to predict the output,

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),$$

where $f_k(x_i) = w_{k, q_k(x_i)}$ is the prediction from tree $k$, given by the leaf weights $w_k$. Let the instance set $I_{k,j} = \{i \mid q_k(x_i) = j\}$ be the set of observations that fall into leaf $j$ of tree $k$, where the tree structure $q_k$ maps a feature vector $x_i$ to the corresponding leaf index $j$. With $T_k$ being the total number of leaves in tree $k$, we can express this relationship mathematically as

$$f_k(x_i) = w_{k, q_k(x_i)}, \quad q_k: \mathbb{R}^p \to \{1, 2, \ldots, T_k\}, \quad w_k \in \mathbb{R}^{T_k} \quad \text{for } k = 1, 2, \ldots, K.$$

When applying such a tree ensemble model to a new observation, each $q_k$ is used to assign the observation to the correct leaf of tree $k$, before the corresponding leaf weights are summed.
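The prediction step can be illustrated with a small Python sketch in which each tree is represented as a pair of a tree structure `q` (a function mapping a feature vector to a leaf index) and a list of leaf weights `w`. The representation, the names and the two example trees are hypothetical, and leaf indices start at 0 here rather than 1.

```python
def predict(trees, x):
    """Sum the leaf weights w_k[q_k(x)] over all trees k of the ensemble."""
    return sum(w[q(x)] for q, w in trees)


# Two made-up trees with two leaves each, for a feature vector x = (age, systolic bp).
trees = [
    (lambda x: 0 if x[0] < 60 else 1, [-0.5, 0.25]),
    (lambda x: 0 if x[1] < 140 else 1, [-0.125, 0.5]),
]
y_hat = predict(trees, (65, 150))  # falls into the second leaf of both trees
```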

3.5.1 Learning objective

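For each tree of the model, the function $f_k$ is learned by minimizing the following general regularized learning objective:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \quad \text{where } \Omega(f_k) = \gamma T_k + \frac{1}{2} \lambda \lVert w_k \rVert^2. \tag{3.2}$$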

Here $l$ is a differentiable convex loss function, measuring the difference between the target $y_i$ and the prediction $\hat{y}_i$. For binary classification one uses Equation (3.1) with $K = 2$, which is known as the binary cross-entropy loss function. For an observation $i$ the loss is given as

$$l = -\left( y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right),$$

where $\pi_i$ is the probabilistic output defined in Section 3.3. Together with the logistic response function $\pi_i = h(\hat{y}_i)$, the loss function for classification in XGBoost can be written as

$$l(y_i, \hat{y}_i) = y_i \log\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \log\left(1 + e^{\hat{y}_i}\right).$$

The $\Omega$-term represents the complexity of the model. In XGBoost, the complexity of each tree is defined as the number of leaves $T_k$ times a factor $\gamma$, added to what is known as the L2 regularization term, $\frac{1}{2} \lambda \lVert w_k \rVert^2$.
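The equivalence of the two ways of writing the loss, and the complexity term of a single tree, can be checked numerically with the following Python sketch. The function names are hypothetical and the numbers are made up for illustration.

```python
import math


def sigmoid(y_hat):
    """Logistic response function h, mapping the raw score y_hat to a probability."""
    return 1.0 / (1.0 + math.exp(-y_hat))


def cross_entropy_loss(y, pi):
    """Binary cross-entropy written in terms of the probability pi."""
    return -(y * math.log(pi) + (1 - y) * math.log(1 - pi))


def logistic_loss(y, y_hat):
    """The same loss written directly in terms of the raw score y_hat."""
    return y * math.log(1 + math.exp(-y_hat)) + (1 - y) * math.log(1 + math.exp(y_hat))


def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for a tree with T leaves."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


# Both formulations give the same loss for any raw score.
difference = logistic_loss(1, 1.5) - cross_entropy_loss(1, sigmoid(1.5))
penalty = tree_complexity([0.5, -0.5], gamma=1.0, lam=2.0)
```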

3.5.2 Gradient tree boosting

Boosting is a tree ensemble method in which the trees are learned iteratively, and for each tree, the goal is to reduce the error from the previous ones. Let $\hat{y}_i^{(t)}$ be the prediction of training set observation $x_i$ at iteration $t$. One begins with a constant prediction, and in each iteration, a new tree is added to the model, as

$$\begin{aligned}
\hat{y}_i^{(0)} &= 0 \\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\
&\;\;\vdots \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i).
\end{aligned}$$

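At iteration $t$, the tree $f_t$ is added greedily to improve the model by minimizing the learning objective from Equation (3.2):

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t).$$

A second-order approximation is used to quickly optimize the learning objective:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ are the first- and second-order gradient statistics of the loss function. Since the term $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ does not depend on $f_t$, it is constant in the optimization and can be removed, which gives the simplified function

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t).$$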

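The instance set $I_{t,j}$ was in Section 3.5 defined as the set of training observations falling into leaf $j$ of the tree. Using this, together with the fact that $f_t(x_i) = w_{t, q_t(x_i)}$ with $w_t = (w_{t,1}, w_{t,2}, \ldots, w_{t,T_t})$, and writing out the penalizing term $\Omega(f_t)$ as defined in Equation (3.2), lets us rewrite the objective function as

$$\begin{aligned}
\tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T_t + \frac{1}{2} \lambda \sum_{j=1}^{T_t} w_{t,j}^2 \\
&= \sum_{j=1}^{T_t} \left[ \left( \sum_{i \in I_{t,j}} g_i \right) w_{t,j} + \frac{1}{2} \left( \sum_{i \in I_{t,j}} h_i + \lambda \right) w_{t,j}^2 \right] + \gamma T_t.
\end{aligned} \tag{3.3}$$

Observe that the part inside square brackets of Equation (3.3) is the second-order polynomial $a \cdot w_{t,j} + b \cdot w_{t,j}^2$, with $a = \sum_{i \in I_{t,j}} g_i$ and $b = \frac{1}{2} \left( \sum_{i \in I_{t,j}} h_i + \lambda \right)$. Hence, the optimal weight $w_{t,j}$ of leaf $j$ is found as follows:

$$\frac{\partial \tilde{\mathcal{L}}^{(t)}}{\partial w_{t,j}} = 0 \quad \Longrightarrow \quad a + 2b \cdot w_{t,j} = 0 \quad \Longrightarrow \quad w_{t,j} = -\frac{a}{2b}.$$

Inserting the values for $a$ and $b$ yields

$$w_{t,j} = -\frac{\sum_{i \in I_{t,j}} g_i}{\sum_{i \in I_{t,j}} h_i + \lambda}.$$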
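As a worked illustration of the optimal leaf weight, the following Python sketch computes the gradient statistics for the logistic loss and the resulting weight for a single leaf. The closed forms $g_i = \pi_i - y_i$ and $h_i = \pi_i(1 - \pi_i)$, with $\pi_i = h(\hat{y}_i^{(t-1)})$, follow from differentiating the logistic loss and are not stated explicitly in the text above; the function names and the numbers are hypothetical.

```python
import math


def grad_hess(y, y_hat_prev):
    """First and second derivative of the logistic loss w.r.t. the previous raw score."""
    pi = 1.0 / (1.0 + math.exp(-y_hat_prev))
    return pi - y, pi * (1.0 - pi)


def optimal_leaf_weight(g, h, lam):
    """w* = -sum(g_i) / (sum(h_i) + lambda) for the observations in one leaf."""
    return -sum(g) / (sum(h) + lam)


def leaf_objective(w, g, h, lam):
    """Contribution of one leaf to the objective: (sum g) w + 0.5 (sum h + lambda) w^2."""
    return sum(g) * w + 0.5 * (sum(h) + lam) * w * w


# One leaf containing three observations, all with previous raw score 0 (pi = 0.5).
y_leaf = [1, 1, 0]
stats = [grad_hess(yi, 0.0) for yi in y_leaf]
g = [s[0] for s in stats]  # [-0.5, -0.5, 0.5]
h = [s[1] for s in stats]  # [0.25, 0.25, 0.25]
w_star = optimal_leaf_weight(g, h, lam=1.0)
```

With two of the three observations being events, the leaf weight is positive, so the new tree moves the raw score of these observations upwards; a larger $\lambda$ shrinks the weight towards zero.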
