Checking proportionality for Cox’s regression model

by

Hui Hong Zhang

Thesis for the degree of Master of Science

(Master’s programme in Modeling and Data Analysis)

Department of Mathematics

Faculty of Mathematics and Natural Sciences
University of Oslo

May 2015



Abstract

Cox’s regression model is one of the most used methods in medical statistics, and the method also finds applications in other fields. The purpose of the model is to explore the relationship between covariates and the hazard rate of experiencing an event for each individual. By estimating the regression coefficients in the model, one can obtain the relative risk for each covariate.

One of the crucial assumptions in Cox regression is that the hazard rates of any two individuals have to be proportional, that is, their ratio is independent of time. The model is called a proportional hazards model when all covariates are fixed. A number of graphical methods and formal tests have been suggested for checking this assumption. An important class of methods is the tests based on the scaled Schoenfeld residuals (Grambsch and Therneau, 1994). Another approach to model checking is the tests based on the martingale residuals, which include tests based on the score process (Lin et al., 1993).

In this thesis we provide an updated and systematic review of the tests that have been proposed for checking the proportionality assumption in Cox’s regression model, and we study how they perform. Simulation studies are important when studying model checking: they make it possible to assess the performance and adequacy of the model, as well as the bias and efficiency of the estimated regression coefficients, under a variety of scenarios for non-proportional hazards. Consequently, a thorough comparison of the performance of the tests under different circumstances is carried out using both real and simulated data. The real data used for illustration are the German Breast Cancer Study Data, where we study the time to recurrence of breast cancer.


Preface

This thesis completes my degree of Master of Science under the program Modeling and Data Analysis with specialization in Statistics and Data Analysis at the University of Oslo. It corresponds to 30 credits and was written during the period January 2015 to May 2015.

Acknowledgements

I’d really like to take the opportunity to thank my great supervisor Ørnulf Borgan (professor in statistics at the Department of Mathematics, University of Oslo) for providing me with this interesting topic as well as splendid guidance throughout this semester. I appreciate that you always showed enthusiasm for the study of survival analysis, and I want to thank you for all those meaningful and useful conversations we had together. Without your help the work would never have gone as smoothly as it has.

I’d also like to thank my family and especially my lovely wife Sophia for endless support and understanding.

Oslo, Norway Hui Hong Zhang

May 2015


Contents

Abstract
Preface

1 Introduction

2 Survival analysis
  2.1 Basic concepts and notations
  2.2 The German Breast Cancer Study: An overview
  2.3 Survival data and censoring
  2.4 Nonparametric analysis
    2.4.1 Nelson-Aalen estimator
    2.4.2 Kaplan-Meier estimator
    2.4.3 Illustration: The German Breast Cancer Study Data
  2.5 Counting processes
  2.6 Cox regression
    2.6.1 The model
    2.6.2 Estimation of the regression coefficients
    2.6.3 Estimation of the cumulative baseline hazard
    2.6.4 Test of hypotheses for the regression coefficients
    2.6.5 Illustration: The German Breast Cancer Study Data

3 Methods for model checking
  3.1 Model assumptions
  3.2 Model with time-dependent terms
    3.2.1 Case I: Time-dependent on a known function
    3.2.2 Case II: Time-dependent on intervals
    3.2.3 Illustration: The German Breast Cancer Study Data
  3.3 Tests based on scaled Schoenfeld residuals
    3.3.1 Case I: When β is known
    3.3.2 Case II: When β is unknown
    3.3.3 Approximation of the score test when β is unknown
    3.3.4 Illustration: The German Breast Cancer Study Data
  3.4 Tests based on martingale residuals
    3.4.1 The martingale residual processes
    3.4.2 Tests based on the score process
    3.4.3 Illustration: The German Breast Cancer Study Data
    3.4.4 The χ2-test based on grouped martingale residual processes
    3.4.5 Illustration: The German Breast Cancer Study Data

4 Model checking by simulations
  4.1 General considerations of survival times
  4.2 When the model is correctly specified
    4.2.1 A general procedure for simulating a data set of survival times
    4.2.2 Check the proportionality of the correctly specified model
  4.3 When the model is incorrectly specified
    4.3.1 Model with time-varying coefficients
    4.3.2 Check the proportionality of the incorrectly specified model
    4.3.3 Another case of model with time-varying coefficients
    4.3.4 Check the proportionality of the incorrectly specified model

5 Concluding remarks

A Figures

Bibliography


1 Introduction

Survival analysis is widely applied in a variety of scientific studies, especially in medicine, demography, biology, sociology, and econometrics. It is a set of statistical methodologies for studying the occurrences of an event of interest over time for a number of subjects. The subjects under study may be humans, animals, components, etc. The event of interest in this context may be deaths, divorces, births, or failure of components, and we are most interested in the “survival time” or the failure time of the event. Survival data is a collection of survival times for the corresponding event from a study, and will often come with a mixture of complete and incomplete observations. Censoring of an incomplete observation is a common case in survival analysis, and the main reason for censoring is that the event of interest has not occurred at the closure of the study. The occurrence of an event for an individual is described by means of hazard rates and survival curves. The two well-known nonparametric methods applied to estimating the cumulative hazard rate and the survival function from censored survival data are, respectively, the Nelson-Aalen estimator and Kaplan-Meier estimator.

In survival analysis, the dependence on covariates is described by means of regression models. One of the most used and important statistical methods in medical research is Cox’s regression model, and the method also finds applications in other fields, such as demography, technical reliability, and insurance. According to a recent review (Van Noorden et al., 2014), Cox’s original paper (Cox, 1972) is the second most cited paper in the history of statistics. The purpose of the model is to explore the relationship between covariates and the hazard rate of experiencing an event for each individual.

By finding the regression coefficients in the model, one can obtain the relative risk for each covariate. For instance, in general insurance, this method is quite often applied for analyzing the relative risk of accidents caused by drivers with various skill levels.

One of the crucial assumptions in Cox regression is that the hazard rates of two individuals are proportional, that is, their ratio is independent of time. The model is called a proportional hazards model when all covariates are fixed. A number of graphical methods and formal tests have been suggested for checking this assumption. One example of a method is to extend the Cox regression model with one or more time-dependent terms of the form γjg(t)xj, where g(t) is a known function and xj is a covariate, and test the null hypothesis that γj = 0 by using the likelihood ratio, score or Wald test. Other methods that have been used for checking this assumption are the tests based on the scaled Schoenfeld residuals (Grambsch and Therneau, 1994) and the tests based on the martingale residuals,


where the latter include tests based on the score process (Lin et al., 1993).

A main aim of the thesis is to give an updated and comprehensive review of the tests that have been proposed in the literature for testing proportionality in Cox’s regression model, and to study how they perform. Simulation studies are important when studying model checking: they make it possible to assess the performance and adequacy of the model, as well as the bias and efficiency of the estimated regression coefficients, under a variety of scenarios for non-proportional hazards. Consequently, a thorough comparison of the performance of the tests under a variety of situations will be performed by using both real and simulated data. The real data used for illustration are the German Breast Cancer Study Data, where we study the time to recurrence of breast cancer.

The thesis is organized as follows: In Chapter 2 we introduce the basic concepts and notations in survival analysis that will be used throughout the thesis, including a brief overview of the German Breast Cancer Study Data. Counting processes and Cox regression will also be introduced and discussed. In Chapter 3 we give an updated and comprehensive review of the tests that have been proposed in the literature for testing proportionality in Cox’s regression model. How these tests are used in practice will be discussed and illustrated by using the German Breast Cancer Study Data and the statistical software R (R Development Core Team, 2014). In Chapter 4 we use simulation to generate our own data sets of survival times in different situations of proportional and non-proportional hazards. In order to gain insight into which tests are most reliable, we consider both correctly and incorrectly specified Cox models, and then perform a thorough comparison of the tests. In Chapter 5 we summarize and make some concluding remarks.


2 Survival analysis

The material in this chapter is based on Sections 1.1, 1.4, 3.1, 3.2 and 4.1 in the book by Aalen, Borgan and Gjessing (2008). The purpose is to introduce some basic concepts and ideas in survival analysis. In Section 2.1 we introduce some basic concepts and notations in survival analysis. In Section 2.2 we give a brief overview of the German Breast Cancer Study, which will be used to illustrate some results in later sections. Incomplete observation of survival times due to right-censoring, and possibly also left-truncation, is explained in Section 2.3. In Section 2.4 we introduce the two important nonparametric estimators for the cumulative hazard rate and the survival function. Further, in Section 2.5 we introduce the basic concepts of counting processes, which will be used in later chapters. Finally, in Section 2.6 we introduce the Cox regression model, how the regression coefficients are estimated, and how to test whether a specific null hypothesis is true, which is a common task in survival analysis.

2.1 Basic concepts and notations

Survival analysis is a set of statistical concepts, models and methods for studying the occurrences of an event of interest for a number of subjects. The subjects under study may be humans, animals, components of a machine, etc., while the event of interest may for instance be death, myocardial infarction, birth of a child, or failure of a component or a system. Survival analysis is widely applied in different fields, especially in medicine, demography, biology, sociology, econometrics and insurance. A survival time is the time elapsed from an initiating event to a well-defined endpoint where the event of interest occurs. Some more concrete examples of survival times are:

• Time to death of a patient after start of certain treatment.

• Time from entrance to discharge from a hospital.

• Time from the start of pregnancy to the birth of a child.

• Time to failure of a component or a system.

The most basic concepts we need in survival analysis are the survival function and the hazard rate. The survival function S(t) can be written in the following form

$$S(t) = P(T > t), \qquad (2.1)$$


which describes the probability that the event of interest has not happened by time t.

The random variable T denotes the survival time. Note that at time t = 0 the survival function is S(0) = 1, and it declines toward zero, or toward a positive value, as t increases. This is because over time more and more individuals will experience the event of interest; but for events that do not necessarily happen to all individuals, like divorce or getting cancer, the random variable T may be infinite.

The hazard rate α(t) is defined as

$$\alpha(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, P(t \le T < t + \Delta t \mid T \ge t), \qquad (2.2)$$

which is the instantaneous probability of the event per unit of time. That is, α(t)dt is the probability that the event will occur between time t and time t + dt, given that it has not occurred earlier (before time t). Since the survival curve is a function that starts at 1 and declines over time, the hazard rate can be essentially any nonnegative function. Further, by integrating the hazard rate, we get the cumulative hazard rate, which is defined as

$$A(t) = \int_0^t \alpha(s)\,ds. \qquad (2.3)$$

There is a relation between the survival function and the (cumulative) hazard rate, which we obtain by using (2.2) and (2.3):

$$A'(t) = \alpha(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, \frac{S(t) - S(t + \Delta t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt} \log\{S(t)\}.$$

Since S(0) = 1, one gets by integration that $-\log\{S(t)\} = \int_0^t \alpha(s)\,ds$, and therefore

$$S(t) = \exp\left\{-\int_0^t \alpha(s)\,ds\right\} = \exp\{-A(t)\}. \qquad (2.4)$$
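As a simple worked special case (not taken from the thesis, but a standard illustration of (2.3) and (2.4)): if the hazard rate is constant, α(t) = λ, then

$$A(t) = \int_0^t \lambda\,ds = \lambda t \qquad \text{and} \qquad S(t) = \exp\{-\lambda t\},$$

so the survival time is exponentially distributed. For example, with λ = 0.1 per year, the five-year survival probability is S(5) = exp(−0.5) ≈ 0.61.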

2.2 The German Breast Cancer Study: An overview

The following section is a brief overview of the German Breast Cancer Study data based on a paper by Sauerbrei and Royston (1999). The data will be used for illustration.

In the period from July 1984 to December 1989, 720 patients with primary node positive breast cancer were recruited into a breast cancer study. Only 686 of the patients have complete data and are included in the data set used here. Throughout the study period, patients were followed from the date of breast cancer diagnosis until recurrence of the disease, death, or censoring. At the end of the study, 299 of the 686 patients had had a recurrence of the disease, while 171 of them had died of breast cancer.

The German Breast Cancer Study Data contains the following eight variables which are divided into:

• Numeric coded variables: Age at diagnosis, tumor size, number of nodes involved, number of progesterone receptors, and number of estrogen receptors.

• Categorical coded variables: Menopausal status, hormone therapy, and tumor grade.

A summary of the variables for the 686 patients from the German Breast Cancer Study Data is given in Table 2.1.


Table 2.1  A summary of the variables for the 686 patients from the German Breast Cancer Study Data: how they are coded and some summary results.

Numeric variables:
Variable                    Codes/Values   Mean     Sd       Quartiles (25% / 50% / 75%)
Age at Diagnosis            Years          53.05    10.12    46 / 53 / 61
Tumor Size                  mm             29.33    14.30    20 / 25 / 35
Number of Nodes Involved    1-51           5.01     5.48     1 / 3 / 7
Number of Prog. Recep.      0-2380         110.00   202.33   7 / 33 / 132
Number of Estrg. Recep.     0-1144         96.25    152.08   8 / 36 / 114

Categorical variables:
Variable             Codes           Number
Menopausal Status    1=Yes, 2=No     Yes: 290; No: 396
Hormone Therapy      1=Yes, 2=No     Yes: 440; No: 246
Tumor Grade          1-3 (I-III)     I: 81; II: 444; III: 161

2.3 Survival data and censoring

Survival data is a collection of survival times for the corresponding event from a study.

Since not all individuals will experience the event of interest within the time frame of a study, the survival data will often come as a mixture of complete and incomplete observations. The incomplete observations are censored, which is common in survival analysis. For instance, we may use the survival data on twenty light bulbs which are turned on simultaneously to study whether they reach a defined lifetime. If the defined lifetime is one thousand hours and our study ends at that point, the data for the light bulbs that are still working when the study ends will be censored.

Right-censoring is the most common form of censoring. A common way of presenting right-censored data is as follows: n individuals are observed, with survival times $T_1, T_2, ..., T_n$. For each i, we observe a time $\tilde{T}_i$ which is either the true survival time $T_i$ or a censoring time $C_i$, in which case the true survival time is “to the right” of the censoring time $C_i$. The observation from an individual i is the pair $(\tilde{T}_i, D_i)$, where the censoring indicator $D_i$ is defined by

$$D_i = \begin{cases} 1, & \text{if } \tilde{T}_i = T_i \\ 0, & \text{if } \tilde{T}_i = C_i, \text{ in which case it is known that } T_i > \tilde{T}_i. \end{cases} \qquad (2.5)$$

In real-life studies, right-censored observations occur when an individual withdraws from the study, is lost to follow-up, or because of closure of the study. Take the German Breast Cancer Study as an example, where we are mainly interested in the recurrence-free survival time for the patients, that is, how long the patients live before they either have a recurrence of the disease or die from breast cancer. In total, 387 out of the 686 patients are censored. The main reason for censoring in this case is closure of the study.

In later sections, when we talk about a recurrence, it will also include death caused by cancer.

A concept related to right-censoring is that of left-truncation, which means that the individuals come under observation some time after the initiating event. For left-truncated survival data, if the time of the event that truncates individuals is y, only individuals with $T_i \ge y$ are observed. A common case of left-truncation occurs when individuals enter a study at random ages and are followed from this delayed entry time until the event of interest occurs or the observation is right-censored. For instance, we may want to study the time to death for seniors at a retirement community. Those who enter the study are of random ages, corresponding to the case of left-truncation. Another example is studying whether different types of diet (e.g. vegetarian) give a decreased or increased risk of disease. Participants in such studies are usually of random ages at entrance to the study, which is also a common case of left-truncation.

2.4 Nonparametric analysis

2.4.1 Nelson-Aalen estimator

The common nonparametric estimator applied to estimate the cumulative hazard rate from censored survival data is the Nelson-Aalen estimator. Assume that the hazard is the same for all individuals, and let $t_1 < t_2 < ... < t_d$ be the observed survival times for the events, that is, the $\tilde{T}_i$'s with censoring indicator $D_i = 1$ in ascending order. Then the Nelson-Aalen estimator is a sum over the observed survival times, which is given by

$$\hat{A}(t) = \sum_{t_j \le t} \frac{1}{Y(t_j)}, \qquad (2.6)$$

where Y(t) is the number at risk “just before” time t. Further, the variance of the Nelson-Aalen estimator may be estimated by

$$\hat{\sigma}^2(t) = \sum_{t_j \le t} \frac{1}{Y(t_j)^2}. \qquad (2.7)$$

It can be shown that the Nelson-Aalen estimator, evaluated at a given time t, is approximately normally distributed in large samples. Furthermore, a standard 100(1−α)% confidence interval for A(t) takes the form

$$\hat{A}(t) \pm z_{1-\alpha/2}\, \hat{\sigma}(t),$$

where $z_{1-\alpha/2}$ is the 1−α/2 fractile of the standard normal distribution.

2.4.2 Kaplan-Meier estimator

The Kaplan-Meier estimator is a nonparametric method to estimate the survival function from a sample of censored survival data. It can be written in the following form

$$\hat{S}(t) = \prod_{t_j \le t} \left\{ 1 - \frac{1}{Y(t_j)} \right\}. \qquad (2.8)$$

By using Greenwood's formula, the variance of the Kaplan-Meier estimator may be estimated by

$$\hat{\tau}^2(t) = \hat{S}(t)^2 \sum_{t_j \le t} \frac{1}{Y(t_j)\{Y(t_j) - 1\}}. \qquad (2.9)$$

In large samples, evaluated at a given time t, it can be shown that the Kaplan-Meier estimator is approximately normally distributed around S(t), with the corresponding variance estimated by (2.9). Thus a standard 100(1−α)% confidence interval for S(t) is given by

$$\hat{S}(t) \pm z_{1-\alpha/2}\, \hat{\tau}(t),$$

where $z_{1-\alpha/2}$ is the 1−α/2 fractile of the standard normal distribution.
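As a brief sketch of how the two estimators may be computed in R with the survival package (hypothetical data, not part of the thesis):

    library(survival)
    # Hypothetical right-censored data: observed times and censoring indicators D_i
    time   <- c(5, 8, 8, 12, 15, 21)
    status <- c(1, 0, 1, 1, 0, 1)
    fit <- survfit(Surv(time, status) ~ 1)   # Kaplan-Meier estimate of S(t)
    summary(fit)                             # estimates with Greenwood standard errors and CIs
    # In recent versions of the package, fit$cumhaz holds the Nelson-Aalen
    # estimate of the cumulative hazard A(t)
    cbind(time = fit$time, cumhaz = fit$cumhaz)

The same survfit call with a grouping variable on the right-hand side of the formula produces one curve per group, as in the plots discussed in the next subsection.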

2.4.3 Illustration: The German Breast Cancer Study Data

The Nelson-Aalen and Kaplan-Meier plots in Figure 2.1 show how the size of tumor and the number of positive lymph nodes may affect the time from breast cancer diagnosis to recurrence of the disease. To ease the interpretation for the Nelson-Aalen plot, we will use, for short, “the recurrence rate” to denote “the recurrence rate of breast cancer”.

Figure 2.1 Nelson-Aalen (upper panel) and Kaplan-Meier plots (lower panel) for the effect of tumor size and number of positive lymph nodes for the patients in the German Breast Cancer Study.

By considering the slopes of the Nelson-Aalen plot for tumor size in the upper left panel, we see that the levels of the recurrence rate differ little during the first year. After that, they are all fairly parallel. Moreover, the slopes of the plot indicate that patients with a large tumor have a somewhat larger recurrence rate than those with a small tumor (smaller than 20 mm). The plots are also fairly linear, which implies that the recurrence rate is approximately constant. The Kaplan-Meier plot in the lower left panel gives much the same conclusion as the Nelson-Aalen plot. We see that patients with tumor size smaller than 20 mm are more likely not to get a recurrence of the disease within the study period compared with patients with a large tumor, but the difference is not large. More specifically, the estimated recurrence-free survival probability three years after the breast cancer diagnosis is 0.739 and 0.640 for patients in the first and second category, respectively. The corresponding estimate for patients with tumor size larger than 30 mm is only 0.570.

How the number of positive lymph nodes affects the recurrence rate can be seen from the upper right panel. According to this plot, it is clear that patients with more than 9 positive nodes have a higher recurrence rate than the other two levels. For patients with no more than 3 nodes, the recurrence rate seems to be fairly low during the first year of the study, and it increases only a little toward the end of the study. Hence, the patients that experience a recurrence of breast cancer tend to be those with a larger number of positive lymph nodes. From the Kaplan-Meier plot in the lower right panel, the estimated recurrence-free survival probability at three years is 0.771 for patients with at most three nodes. The corresponding estimates for the second and third levels are 0.572 and 0.314, respectively.

2.5 Counting processes

The focus in survival analysis is on observing the occurrence of events over time. Counting the number of events as they come along yields a counting process. For instance, one may count the number of times an individual buys a new smartphone during a period of ten years. Or, one may count the number of deaths from a disease in a patient group in a study.

Let $\tilde{T}_1, \tilde{T}_2, ..., \tilde{T}_n$ be the observed event times for n individuals and denote by $\alpha_i(t)$ the hazard rate of individual i. For a given time t, let $N_i(t)$ be the counting process which counts the number of events that have occurred for individual i in the time interval [0, t]. The process is constant between events and jumps one unit at each event time. For survival data which contain censored event times, the counting process $N_i(t)$ for individual i is given by

$$N_i(t) = I(\tilde{T}_i \le t, D_i = 1), \quad i = 1, ..., n, \qquad (2.10)$$

where $D_i$ is a censoring indicator of observing the true event time. The occurrence of future events will typically depend on “the past” of a counting process. We may then (informally) define an intensity process $\lambda_i(t)$ for $N_i(t)$ by

$$\lambda_i(t)\,dt = P(dN_i(t) = 1 \mid \text{past}) = P(t \le \tilde{T}_i < t + dt, D_i = 1 \mid \text{past}), \qquad (2.11)$$

where $dN_i(t)$ is the number of jumps of the process in [t, t + dt) for individual i. Obviously $\lambda_i(t) = 0$ when $\tilde{T}_i < t$. Consequently, the intensity process $\lambda_i(t)$ for the counting process $N_i(t)$ is expressed as

$$\lambda_i(t) = \alpha_i(t) Y_i(t), \qquad (2.12)$$

where

$$Y_i(t) = I\{\tilde{T}_i \ge t\} \qquad (2.13)$$

is an at-risk indicator for individual i “just before” time t. One may obtain the aggregated counting process N(t) by adding together the individual counting processes $N_1(t), N_2(t), ..., N_n(t)$:

$$N(t) = \sum_{i=1}^n N_i(t) = \sum_{i=1}^n I(\tilde{T}_i \le t, D_i = 1). \qquad (2.14)$$

The corresponding aggregated intensity process takes the form

$$\lambda(t) = \sum_{i=1}^n \lambda_i(t) = \sum_{i=1}^n \alpha_i(t) Y_i(t). \qquad (2.15)$$

In the case of $\alpha_i(t) = \alpha(t)$ for all i, the intensity process is given as

$$\lambda(t) = \alpha(t) Y(t), \qquad (2.16)$$

where $Y(t) = \sum_{i=1}^n Y_i(t)$ is the number of individuals at risk “just before” time t.
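To make the counting process notation concrete, here is a small sketch in R (hypothetical data, not from the thesis) of the aggregated processes N(t) in (2.14) and Y(t):

    # Observed times and censoring indicators D_i for five hypothetical individuals
    time   <- c(5, 8, 8, 12, 15)
    status <- c(1, 0, 1, 1, 0)

    N <- function(t) sum(time <= t & status == 1)   # aggregated counting process (2.14)
    Y <- function(t) sum(time >= t)                 # number at risk "just before" time t

    sapply(c(4, 8, 12, 16), N)   # 0 2 3 3
    sapply(c(4, 8, 12, 16), Y)   # 5 4 2 0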

2.6 Cox regression

In survival analysis in general, a covariate in a regression model is an explanatory variable, either of numeric or categorical type, which influences the hazard rate of an individual. Usually, we deal with more than one covariate in a regression model. For instance, from the German Breast Cancer Study we are interested in the covariates Tumor Size, Hormone Therapy, Tumor Grade and Number of Positive Lymph Nodes. The numeric covariates Tumor Size and Number of Positive Lymph Nodes are coded as, respectively, size in millimeters and a count of nodes from 1 to 51. The categorical covariates Tumor Grade and Hormone Therapy are coded as, respectively, three different grades and 1=Yes/2=No. In the following sections, we consider Cox’s regression model, which is a widely used regression model in survival analysis for censored survival data.

2.6.1 The model

The Cox regression model is common in survival analysis. We assume that the vector of covariates $\mathbf{x}_i = (x_{i1}, ..., x_{ip})^T$ for an individual i influences the hazard rate $\alpha(t \mid \mathbf{x}_i)$, which is given by the form

$$\alpha(t \mid \mathbf{x}_i) = \alpha_0(t) \exp\{\beta^T \mathbf{x}_i\}, \qquad (2.17)$$

where $\alpha_0(t)$ is the baseline hazard rate, the exponential function $\exp\{\beta^T \mathbf{x}_i\}$ is the relative risk function, and $\beta = (\beta_1, ..., \beta_p)^T$ is a vector of regression coefficients. Note that when all covariates are equal to zero, the relative risk function is equal to 1, such that the hazard rate corresponds to the baseline hazard.

The hazard rate ratio between two individuals, denoted 1 and 2, with covariate vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively, is

$$\frac{\alpha(t \mid \mathbf{x}_2)}{\alpha(t \mid \mathbf{x}_1)} = \frac{\exp\{\beta^T \mathbf{x}_2\}}{\exp\{\beta^T \mathbf{x}_1\}}. \qquad (2.18)$$

This ratio is constant over time when all the covariates are fixed, such that the model (2.17) in this case is the so-called proportional hazards model.

If we assume that $\mathbf{x}_1$ and $\mathbf{x}_2$ are exactly the same, except for the kth component, for which $x_{2k} = x_{1k} + 1$, then (2.18) becomes

$$\frac{\alpha(t \mid \mathbf{x}_2)}{\alpha(t \mid \mathbf{x}_1)} = \exp\{\beta^T(\mathbf{x}_2 - \mathbf{x}_1)\} = e^{\beta_k}, \qquad (2.19)$$

where $e^{\beta_k}$ is the hazard rate ratio, or the relative risk, of the kth covariate. Hence, increasing the kth covariate by one unit, while the other covariates are kept unchanged, increases the hazard rate by a factor $e^{\beta_k}$.

2.6.2 Estimation of the regression coefficients

The Cox regression model (2.17) is semiparametric, since the baseline hazard $\alpha_0(t)$ is a nonparametric component and the relative risk function is a parametric component. To estimate the regression coefficients, ordinary likelihood methods cannot be used. Therefore, we have to look at the partial likelihood. We let $t_1 < t_2 < ... < t_d$ be the times when events are observed, and assume that there are no tied event times. If $i_j$ is the index of the individual who experiences an event at $t_j$, then Cox’s partial likelihood for β becomes

$$L(\beta) = \prod_{t_j} \frac{\exp\{\beta^T \mathbf{x}_{i_j}\}}{\sum_{l \in R_j} \exp\{\beta^T \mathbf{x}_l\}}, \qquad (2.20)$$

where $R_j = \{l \mid Y_l(t_j) = 1\}$ is the risk set at $t_j$ and $Y_l(t)$ is the at-risk indicator for individual l “just before” time t.

The log partial likelihood is $l(\beta) = \log L(\beta)$. The maximum partial likelihood estimate $\hat{\beta}$ of β is found by maximizing (2.20) or by solving the score equations

$$U_k(\beta) = \frac{\partial}{\partial \beta_k} l(\beta) = \sum_{t_j} \left\{ x_{i_j k} - \frac{\sum_{l \in R_j} x_{lk} \exp\{\beta^T \mathbf{x}_l\}}{\sum_{l \in R_j} \exp\{\beta^T \mathbf{x}_l\}} \right\} = 0, \qquad (2.21)$$

for k = 1, 2, ..., p. In large samples, it can be shown that $\hat{\beta}$ is approximately multivariate normally distributed around the true value of β with a covariance matrix that may be estimated by $I(\hat{\beta})^{-1}$ (whose diagonal elements are the estimated variances of the $\hat{\beta}$'s, while the off-diagonal elements are the estimated covariances between the different $\hat{\beta}$'s), where

$$I(\beta) = \left\{ -\frac{\partial^2}{\partial \beta_h \partial \beta_j} \log L(\beta) \right\} \qquad (2.22)$$

is the observed information matrix.

A 95% confidence interval for the relative risk (2.19) can be obtained by exponentiating the lower and upper limits of the standard 95% confidence interval for the regression coefficient, $\hat{\beta}_k \pm 1.96\, \mathrm{se}(\hat{\beta}_k)$, that is

$$\exp\{\hat{\beta}_k \pm 1.96\, \mathrm{se}(\hat{\beta}_k)\}. \qquad (2.23)$$
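In R, the maximization of the partial likelihood (2.20) and the confidence intervals (2.23) are carried out by the coxph function in the survival package. A minimal sketch, with a hypothetical data frame mydata and covariates x1, x2:

    library(survival)
    fit <- coxph(Surv(time, status) ~ x1 + x2, data = mydata)
    summary(fit)       # beta-hats, exp(beta-hats), standard errors, Wald z-values and CIs
    exp(confint(fit))  # intervals of the form (2.23) for the relative risks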


2.6.3 Estimation of the cumulative baseline hazard

To obtain an estimator for the cumulative baseline hazard $A_0(t) = \int_0^t \alpha_0(u)\,du$, we start out by introducing the aggregated counting process

$$N(t) = \sum_{l=1}^n N_l(t).$$

By using the aggregated intensity process (2.15) with the hazard rate (2.17), its intensity process takes the form

$$\lambda(t) = \left( \sum_{l=1}^n Y_l(t) \exp\{\beta^T \mathbf{x}_l\} \right) \alpha_0(t).$$

If β had been known, we could have estimated $A_0(t)$ by the Nelson-Aalen estimator

$$\hat{A}_0(t; \beta) = \int_0^t \frac{dN(u)}{\sum_{l=1}^n Y_l(u) \exp\{\beta^T \mathbf{x}_l\}}. \qquad (2.24)$$

Since β is unknown, we use the maximum partial likelihood estimator $\hat{\beta}$ to obtain an estimator for the cumulative baseline hazard

$$\hat{A}_0(t) = \int_0^t \frac{dN(u)}{\sum_{l=1}^n Y_l(u) \exp\{\hat{\beta}^T \mathbf{x}_l\}} = \sum_{t_j \le t} \frac{1}{\sum_{l \in R_j} \exp\{\hat{\beta}^T \mathbf{x}_l\}}. \qquad (2.25)$$

The estimator (2.25) is also denoted the Breslow estimator.
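In R, the Breslow estimate (2.25) can be extracted from a fitted coxph object; a sketch, assuming the hypothetical fit from the previous sketch:

    # Cumulative baseline hazard; centered = FALSE gives the baseline at x = 0
    H0 <- basehaz(fit, centered = FALSE)
    head(H0)   # columns 'hazard' (the Breslow estimate) and 'time'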

2.6.4 Test of hypotheses for the regression coefficients

Case I: When the null hypothesis is β = β_0

For Cox regression, we may want to test whether the null hypothesis $\beta = \beta_0$ is true, where usually $\beta_0 = \mathbf{0}$. Different test statistics could be used, but the following three types are the most widely adopted:

• The likelihood ratio test statistic:
$$\chi^2_{LR} = 2\{\log L(\hat{\beta}) - \log L(\beta_0)\} \qquad (2.26)$$

• The score test statistic:
$$\chi^2_{SC} = U(\beta_0)^T I(\beta_0)^{-1} U(\beta_0), \qquad (2.27)$$
where $U(\beta) = \frac{\partial}{\partial \beta} \log L(\beta)$ denotes the vector of score functions.

• The Wald test statistic:
$$\chi^2_W = (\hat{\beta} - \beta_0)^T I(\hat{\beta}) (\hat{\beta} - \beta_0) \qquad (2.28)$$

The test statistics (2.26), (2.27) and (2.28) are asymptotically equivalent, and under the null hypothesis they are all approximately χ2-distributed with p degrees of freedom.


Case II: When the null hypothesis is β_1 = β_10

Generally, one wants to test the hypothesis that q of the regression coefficients are zero (or equivalently, after a re-parameterization, that there are q linear restrictions among the coefficients). Then if $\beta = (\beta_1^T, \beta_2^T)^T$, the null hypothesis may be of the form $\beta_1 = \beta_{10}$, where usually $\beta_{10} = \mathbf{0}$. Here $\beta_1$ is a q×1 vector and $\beta_2$ is a (p−q)×1 vector. Let $\hat{\beta} = (\hat{\beta}_1^T, \hat{\beta}_2^T)^T$ be the maximum partial likelihood estimator of β. Consider the partitioned observed information matrix $I = I(\beta)$ based on (2.22), expressed as

$$I = \begin{bmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{bmatrix}, \qquad (2.29)$$

where $I_{11}$ ($I_{22}$) is the q×q [(p−q)×(p−q)] submatrix with second partial derivatives with respect to $\beta_1$ ($\beta_2$), while $I_{12}$ and $I_{21}$ are submatrices defined by mixed second partial derivatives. Further, the inverse of the partitioned observed information matrix is also a partitioned matrix

$$I^{-1} = \begin{bmatrix} I^{11} & I^{12} \\ I^{21} & I^{22} \end{bmatrix}. \qquad (2.30)$$

It can be shown that the inverse of the submatrix $I^{11}$ is given by

$$(I^{11})^{-1} = I_{11} - I_{12}(I_{22})^{-1} I_{21}. \qquad (2.31)$$

Let now $\tilde{\beta} = [\beta_{10}^T, \hat{\beta}_2(\beta_{10})^T]^T$ be the maximum partial likelihood estimator under the null hypothesis, where $\hat{\beta}_2(\beta_{10})$ is the maximum partial likelihood estimate of $\beta_2$ with $\beta_1$ fixed at the value $\beta_{10}$. Then the three test statistics take the form:

• The likelihood ratio test statistic:
$$\chi^2_{LR} = 2\{\log L(\hat{\beta}) - \log L(\tilde{\beta})\} \qquad (2.32)$$

• The score test statistic:
$$\chi^2_{SC} = U_1(\tilde{\beta})^T [I^{11}(\tilde{\beta})]\, U_1(\tilde{\beta}), \qquad (2.33)$$
where $U_1(\beta)$ is the q×1 vector of scores from $U(\beta) = [U_1(\beta)^T, U_2(\beta)^T]^T$, and $U(\beta)$ is the p×1 vector of score functions.

• The Wald test statistic:
$$\chi^2_W = (\hat{\beta}_1 - \beta_{10})^T [I^{11}(\hat{\beta})]^{-1} (\hat{\beta}_1 - \beta_{10}), \qquad (2.34)$$
where $I^{11}(\hat{\beta})$ is the upper q×q submatrix of (2.30), or one may use (2.31) directly in the formula.

We may find the score function and the corresponding observed information matrix as follows:

$$U(\beta) = \sum_{t_j} \left\{ \mathbf{x}_{i_j} - \frac{S^{(1)}(\beta, t_j)}{S^{(0)}(\beta, t_j)} \right\}, \qquad (2.35)$$

where

$$S^{(0)}(\beta, t_j) = \sum_{l \in R_j} \exp\{\beta^T \mathbf{x}_l\} \qquad (2.36)$$

and

$$S^{(1)}(\beta, t_j) = \sum_{l \in R_j} \mathbf{x}_l \exp\{\beta^T \mathbf{x}_l\}. \qquad (2.37)$$

The observed information matrix may be written as

$$I(\beta) = -\frac{\partial}{\partial \beta^T} U(\beta) = \sum_{t_j} V(\beta, t_j), \qquad (2.38)$$

where

$$V(\beta, t_j) = \frac{S^{(2)}(\beta, t_j)}{S^{(0)}(\beta, t_j)} - \left( \frac{S^{(1)}(\beta, t_j)}{S^{(0)}(\beta, t_j)} \right)^{\otimes 2}, \qquad (2.39)$$

while $\mathbf{u}^{\otimes 2} = \mathbf{u}\mathbf{u}^T$, and

$$S^{(2)}(\beta, t_j) = \sum_{l \in R_j} \mathbf{x}_l^{\otimes 2} \exp\{\beta^T \mathbf{x}_l\}. \qquad (2.40)$$

The test statistics (2.32), (2.33) and (2.34) are asymptotically equivalent, and under the null hypothesis they are all approximately χ2-distributed with q degrees of freedom.
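As a sketch of how the likelihood ratio test (2.32) for a subset of the coefficients may be computed in R (hypothetical data frame and covariate names, with x1 playing the role of β_1):

    library(survival)
    fit0 <- coxph(Surv(time, status) ~ x2, data = mydata)        # model under H0: beta_1 = 0
    fit1 <- coxph(Surv(time, status) ~ x1 + x2, data = mydata)   # full model
    anova(fit0, fit1)                        # likelihood ratio test with q degrees of freedom
    2 * (fit1$loglik[2] - fit0$loglik[2])    # the statistic (2.32) computed by hand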

2.6.5 Illustration: The German Breast Cancer Study Data

We start out by performing a univariate Cox regression analysis for one covariate at a time by using the German Breast Cancer Study Data. After that, we will perform a multivariate Cox regression analysis where the importance of the covariates is studied simultaneously.

To ease the interpretation, we will, for short, use “the recurrence rate” to denote “the recurrence rate of breast cancer”. Note that the recurrence rate for the ith patient with Tumor Grade as the only covariate is given by

$$\alpha(t \mid \mathbf{x}_i) = \alpha_0(t) \exp\{\beta_2 x_{i2} + \beta_3 x_{i3}\},$$

where the reference group is patients with Tumor Grade 1, while

$$x_{i2} = \begin{cases} 1, & \text{if patient } i \text{ has Tumor Grade 2} \\ 0, & \text{else,} \end{cases}$$

and

$$x_{i3} = \begin{cases} 1, & \text{if patient } i \text{ has Tumor Grade 3} \\ 0, & \text{else.} \end{cases}$$

According to Table 2.1 in Section 2.2, the distributions of the numeric covariates Tumor Size, Number of Nodes, Progesterone Receptors and Estrogen Receptors are very skewed, and it is reasonable to use a base-2 logarithm transform. Since some of the observations of the covariates Progesterone Receptors and Estrogen Receptors are equal to zero, we add one to all observations before taking base-2 logarithms. The results of a univariate Cox regression analysis for one covariate at a time are shown in Table 2.2.


Table 2.2  Results from a univariate Cox regression analysis for one covariate at a time. The column “Hazard ratio” gives the estimated relative risk $e^{\hat{\beta}}$.

Covariate                     Estimate   Hazard ratio   SE      z-value   Pr(>|z|)         95% CI for hazard ratio
Age at Diagnosis              -0.004     0.996          0.006   -0.762    0.446            [0.984, 1.007]
log2(Tumor Size)               0.384     1.468          0.089    4.323    1.5 x 10^-5      [1.233, 1.747]
log2(No. of Nodes)             0.376     1.457          0.044    8.574    < 2.0 x 10^-16   [1.337, 1.588]
log2(No. of Prog. Recep.)     -0.149     0.862          0.021   -7.231    4.8 x 10^-13     [0.828, 0.897]
log2(No. of Estrg. Recep.)    -0.095     0.909          0.021   -4.434    9.2 x 10^-6      [0.998, 1.000]
Menopausal 2                   0.063     1.065          0.118    0.530    0.596            [0.844, 1.342]
Hormone 2                     -0.364     0.695          0.125   -2.911    0.004            [0.544, 0.888]
Tumor Grade 2                  0.872     2.391          0.246    3.543    3.9 x 10^-4      [1.476, 3.873]
Tumor Grade 3                  1.154     3.170          0.262    4.411    1.0 x 10^-5      [1.899, 5.293]

Table 2.3  Results from a multivariate Cox regression analysis where all the estimated coefficients are significant.

Covariate                     Estimate   Hazard ratio   SE      z-value   Pr(>|z|)        95% CI for hazard ratio
log2(No. of Nodes)             0.352     1.422          0.044    8.082    6.7 x 10^-16    [1.306, 1.549]
log2(No. of Prog. Recep.)     -0.122     0.885          0.022   -5.599    2.2 x 10^-8     [0.849, 0.924]
Hormone 2                     -0.367     0.693          0.126   -2.916    0.004           [0.541, 0.887]
Tumor Grade 2                  0.524     1.688          0.252    2.081    0.037           [1.031, 2.764]
Tumor Grade 3                  0.570     1.768          0.275    2.074    0.038           [1.032, 3.029]

By focusing on the relative risks $e^{\hat{\beta}}$ (or the recurrence rate ratios) in Table 2.2, we find that the estimated relative risk for Tumor Size is $e^{0.384} \approx 1.468$. Thus the recurrence rate for breast cancer patients is 46.8% larger per doubling of the tumor size.

The 95% confidence interval for the relative risk does not include the value 1, which corresponds to a significant effect of the covariate Tumor Size. Hence, the tumor size has a positive effect on the recurrence rate for the patients. The other numeric covariates have similar interpretations.

For the categorical covariate Tumor Grade in Table 2.2, the relative risk for breast cancer patients with the second tumor grade is almost two and a half times that of the reference group, $e^{0.872} \approx 2.391$, while for the third tumor grade it is more than three times as large, $e^{1.154} \approx 3.170$. Both coefficients are significant, which implies that the risk of recurrence of breast cancer for each of these tumor grades differs significantly from that of the reference group. The other categorical covariates have similar interpretations.

Finally, we may fit a multivariate Cox regression model where all covariates in Table 2.2 are taken into account. At a first glance at the results from the full model fit (which is not presented here), we observe that the p-value for the estimated coefficient of the log-transformed Number of Estrogen Receptors is 0.428, so this covariate can be omitted from the model. Then, by fitting a “new” full model without the Number of Estrogen Receptors, we find that the estimated coefficient for Age at Diagnosis is not significant (p-value = 0.406) and can be omitted from the next fit. Continuing in this way, we end up with the model shown in Table 2.3, where all the estimated coefficients are significant.
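A sketch of this backward elimination in R (the data frame gbcs and the transformed variable names are hypothetical placeholders for the German Breast Cancer Study Data; categorical covariates are assumed to be coded as factors):

    library(survival)
    full <- coxph(Surv(time, status) ~ age + log2size + log2nodes + log2prog +
                    log2estrg + menopause + hormone + grade, data = gbcs)
    fit2 <- update(full, . ~ . - log2estrg)   # drop Estrogen Receptors (largest p-value)
    fit3 <- update(fit2, . ~ . - age)         # then Age at Diagnosis
    summary(fit3)                             # continue until all coefficients are significant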


From the estimated coefficients in Table 2.3, we can conclude that patients who have a large number of progesterone receptors and have not received hormone therapy are more likely not to have a recurrence of breast cancer. In comparison, patients who have a larger number of positive lymph nodes and have received hormone therapy will have an increased recurrence rate. Note that in further illustrations, only the covariates in Table 2.3 will be considered.


3 Methods for model checking

The following chapter is based on Section 4.1 in the book by Aalen, Borgan and Gjessing (2008), Sections 6.2 and 6.3 in the book by Therneau and Grambsch (2000), the paper by Grambsch and Therneau (1994), and the paper by Lin et al. (1993). In Section 3.1 we give a brief overview of the two crucial assumptions that have to be satisfied in Cox regression.

In Section 3.2 we introduce two similar cases of models with one or more time-dependent terms, and show how the proportionality may be checked by using the German Breast Cancer Study Data as an illustration. In Section 3.3 we introduce the scaled Schoenfeld residuals, which are widely used for checking the proportionality assumption. Finally, in Section 3.4 we introduce the two tests based on the martingale residual processes: the tests based on the score process (Lin et al., 1993) and the χ2-test based on grouped martingale residual processes (Aalen et al., 2008).

3.1 Model assumptions

Consider Cox’s regression model with fixed covariates:

$$\alpha(t \mid \mathbf{x}) = \alpha_0(t) \exp(\beta^T \mathbf{x}), \qquad (3.1)$$

where $\beta = (\beta_1, \beta_2, ..., \beta_p)^T$ and $\mathbf{x} = (x_1, x_2, ..., x_p)^T$. There are two assumptions that have to be satisfied for Cox’s regression model. The first one is log-linearity of the hazard rate, which is seen by taking the logarithm on both sides of (3.1):

$$\log\{\alpha(t \mid \mathbf{x})\} = \log\{\alpha_0(t)\} + \beta^T \mathbf{x}. \qquad (3.2)$$

Hence, $\log\{\alpha(t \mid \mathbf{x})\}$ has to be a linear function of the numeric covariates.

The second assumption is proportional hazards, which means that the hazard rates for any two individuals, denoted by 1 and 2, with covariate vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively, have to be proportional. Hence, by using (3.1) we get the hazard ratio

$$\frac{\alpha(t \mid \mathbf{x}_2)}{\alpha(t \mid \mathbf{x}_1)} = \exp\{\beta^T(\mathbf{x}_2 - \mathbf{x}_1)\}, \qquad (3.3)$$

which is independent of time. This is the main assumption that will be checked by using different methods and formal tests in this thesis.
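For later reference (Section 3.3), the scaled Schoenfeld residual test of Grambsch and Therneau (1994) is available in R through the cox.zph function; a minimal sketch, assuming a fitted coxph object fit:

    library(survival)
    cox.zph(fit)         # one test per covariate plus a global test of proportionality
    plot(cox.zph(fit))   # scaled Schoenfeld residuals plotted against time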


3.2 Model with time-dependent terms

A model is time-dependent if one or more covariates in (3.1) depend on a known function of time. In that case, the time-dependent model violates the assumption of proportional hazards. In the following sections, we show how such a model is defined and how the assumption of proportional hazards can be checked.

3.2.1 Case I: Time-dependent on a known function

One way to check for proportionality is to consider a model that extends (3.1) with one or more time-dependent terms of the form $\gamma_j g(t) x_j$. Such a model can be written as

$$\alpha(t \mid \mathbf{x}) = \alpha_0(t) \exp\left\{ \beta^T \mathbf{x} + \sum_{j=1}^q \gamma_j g(t) x_j \right\}, \qquad (3.4)$$

where the $\gamma_j$ are coefficients to be estimated, and g(t) is a known function (usually g(t) = log t), with time-dependent terms for the q first covariates. A test of the proportional hazards assumption for the q first covariates corresponds to testing the null hypothesis $\gamma = \mathbf{0}$, where $\gamma = (\gamma_1, \gamma_2, ..., \gamma_q)^T$, by using the likelihood ratio, score or Wald test defined in (2.32), (2.33) and (2.34), respectively.

Computation of the tests by using R

The likelihood ratio, score and Wald tests are implemented in the statistical software package R (R Development Core Team, 2014). How to use them to check the proportional hazards assumption may be explained as follows: At the beginning of the program, fit the null model (3.1) by using the coxph function. Then create a vector of coefficients of the form $(\hat{\beta}_1, \hat{\beta}_2, ..., \hat{\beta}_p, 0, 0, ..., 0)^T$, which includes the estimated coefficients under the null model and q zeros corresponding to the null hypothesis $\gamma = (\gamma_1, \gamma_2, ..., \gamma_q)^T = \mathbf{0}$. Now fit the model (3.4) by using the coxph function and including a number q of tt(covariate) arguments in the model formula. In addition, let the init argument in the coxph function be the vector of coefficients mentioned above. The tt argument is a list of time-transform functions, and giving the argument tt=function(x,t,...) x*log(t) (exactly in this form) in the coxph function sets the time-transform function to log t. Finally, the summary command may be used to produce a summary of the fit and to obtain the results from the likelihood ratio, score and Wald tests. Note, however, that the p-values from this output will be wrong in this case, since under the null hypothesis the tests should be χ2-distributed with q degrees of freedom.
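A sketch of this procedure (hypothetical data frame and covariate names; here q = 2 and g(t) = log t):

    library(survival)
    fit0 <- coxph(Surv(time, status) ~ x1 + x2, data = mydata)      # the null model (3.1)
    fit1 <- coxph(Surv(time, status) ~ x1 + x2 + tt(x1) + tt(x2),
                  data = mydata,
                  tt   = function(x, t, ...) x * log(t),            # g(t) = log t
                  init = c(coef(fit0), 0, 0))                       # (beta-hat, 0, ..., 0)
    summary(fit1)                                 # reports the three tests (with the wrong df)
    LR <- 2 * (fit1$loglik[2] - fit0$loglik[2])   # likelihood ratio statistic (2.32)
    pchisq(LR, df = 2, lower.tail = FALSE)        # p-value with q = 2 degrees of freedom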

3.2.2 Case II: Time-dependent on intervals

Another similar case, related to the model (3.4), is when the model is time-dependent on intervals. How the intervals and the model are defined can be explained as follows: Let $t_1 < t_2 < ... < t_d$ denote the times of the observed events of interest sorted in ascending order, and divide them into three intervals with an equal number of events in each (e.g. by using quantiles). The first interval then goes from 0 to $\tau_1$, the second interval goes from $\tau_1$ to $\tau_2$, while the last interval goes from $\tau_2$ to the end. The model can be written as

$$\alpha(t \mid \mathbf{x}) = \alpha_0(t) \exp\left\{ \beta^T \mathbf{x} + \sum_{j=1}^q x_j \left[ \rho_j I(\tau_1 \le t < \tau_2) + \kappa_j I(t \ge \tau_2) \right] \right\}, \qquad (3.5)$$

where the $\rho_j$ and $\kappa_j$ are coefficients to be estimated. The null hypothesis for the test of proportional hazards in this case corresponds to testing $\rho = \kappa = \mathbf{0}$, where $\rho = (\rho_1, \rho_2, ..., \rho_q)^T$ and $\kappa = (\kappa_1, \kappa_2, ..., \kappa_q)^T$, by using the likelihood ratio, score or Wald test defined in (2.32), (2.33) and (2.34), respectively.

Computation of the tests by using R

By using the statistical software package R (R Development Core Team, 2014), one may first divide the observed event times into three intervals with an equal number of events in each of them by using the quantile command. One needs to create two variables, where the first one records in which interval an event or censoring occurs (with index 1, 2, or 3), while the second one is a censoring indicator for the corresponding event or censoring (1 for occurrence and 0 for censoring). Then use a for-loop that runs over all individuals and, inside the loop, an if/else statement to detect in which interval the observed event or censoring occurs. The procedure may be explained as follows (a ready-made alternative is sketched after the three cases below):

If an event or censoring is observed in the 1st interval: Create one row that contains all the covariates for the corresponding individual, with start time 0 and stop time at the time of the event or censoring; that is, let the interval index be 1 and the censoring indicator be 1 or 0.

If an event or censoring is observed in the 2nd interval: Create two identical rows that contain all the covariates for the corresponding individual. For the first row, since the event or censoring is not observed in the first interval (between 0 and τ1), let the interval index be 1 and the censoring indicator be 0. The second row corresponds to the observed event or censoring in the second interval, with start time τ1 and stop time at the time of the event or censoring; that is, let the interval index be 2 and the censoring indicator be 1 or 0.

If an event or censoring is observed in the 3rd interval: Create three identical rows that contain all the covariates for the corresponding individual. Since the event or censoring is observed in neither the first interval (between 0 and τ1) nor the second interval (between τ1 and τ2), let the interval index be 1 and 2 for, respectively, the first and second row, and the censoring indicator be 0 for both. The third row corresponds to the observed event or censoring in the third interval, with start time τ2 and stop time at the time of the event or censoring; that is, let the interval index be 3 and the censoring indicator be 1 or 0.
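As the alternative referred to above, the survSplit function in the survival package performs the same episode splitting without writing the loop by hand; a sketch with hypothetical data frame and variable names:

    library(survival)
    # Cut points: tertiles of the observed event times, giving three intervals
    # with (roughly) equal numbers of events
    taus <- quantile(mydata$time[mydata$status == 1], probs = c(1/3, 2/3))
    long <- survSplit(Surv(time, status) ~ ., data = mydata,
                      cut = as.numeric(taus), episode = "interval")
    # 'long' has one row per individual per interval at risk, with columns
    # tstart, time, status and the interval index (1, 2 or 3)
    fit0 <- coxph(Surv(tstart, time, status) ~ x1 + x2, data = long)   # null model (3.1)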

Now one may use the observed (start and stop) event times with the corresponding censoring indicators for all intervals to estimate the coefficients of the null model (3.1) by using the coxph function. Then create a vector of coefficients of the form $(\hat{\beta}_1, \hat{\beta}_2, ..., \hat{\beta}_p, 0, 0, ..., 0)^T$, which includes the estimated coefficients under the null model and 2q zeros corresponding to the null hypothesis $\rho = \kappa = \mathbf{0}$, with ρ and κ defined as above.