A class of parametric survival models via Dirichlet processes

(1)

A class of parametric survival models via Dirichlet processes

Anne Catharina Stoltenberg

Master’s Thesis, Autumn 2021

(2)

Mathematics, University of Oslo. The scope of the thesis is 60 credits.

The front page depicts a section of the root system of the exceptional Lie groupE₈, projected into the plane. Lie groups were invented by the Norwegian mathematician Sophus Lie (1842–1899) to express symmetries in differential equations and today they play a central role in various parts of mathematics.

(3)

Abstract

In this thesis, we present a class of parametric models based on a first-hitting time framework with Dirichlet processes for analyzing right-censored survival data. This model class has many advantages. First, parametric models tend to offer more straightforward statistical inference than the nonparametric and semiparametric alternatives. The model class allows for covariates influencing the survival time distribution through different parts of the model. Moreover, for the Cox proportional hazard model, covariates can only influence the hazard directly through a log-linear combination of covariates, which is limiting. We show that the model class adapts to a wide range of shapes of the survival time distribution. Especially, we show that the presented model with a given base measure for the Dirichlet process can be seen as a 2-parameter extension of a parametric model with survival time assumed to be distributed by the given base measure. We show that the model class adapts to crossing hazards between groups of a given target population. The model class is therefore not dependent on the widely assumed but arguably never fulfilled assumption of proportional hazards of the Cox proportional hazard model. We will, however, need to assume specifications of the survival time distributions. A precise derivation of the model class is presented, and important properties are derived. A full likelihood given right-censored survival data under the presence of independent competing risks is obtained, leading to inferences about the unknown regression coefficients under the assumption of non-informative and independent censoring.

Some generalizations are outlined. To demonstrate the model class, applications are considered.

(4)

I am deeply grateful to my supervisors, Céline Cunen and Nils Lid Hjort. It is my understanding that it is rare to find supervisors who show such enthusiasm as they have demonstrated towards both the content of this master’s thesis and me as a student. It has been an honor to witness their profound love for the field of statistics and constant drive for new knowledge. Both as a team and individuals, they make up, from my perspective, the perfect mix of the exploratory and the established. Their immense amount of exploratory ideas are always based on a solid theoretical foundation. They inspire me to strive for this combination in my own work.

I would like to thank Ernst Hansen for teaching me the beautiful language of mathematics and measure theory, and Niels Richard Hansen for introducing me to the challenging but important task of trying to use this language to describe the world we observe. This encourages me every day.

I would like to thank Emil Stoltenberg and Johan Trovik for helpful feedback and discussion.

Last, I would like to thank Dina Heider Hov, Emir Hindic, Henriette Motzfeldt, Ingrid Schulerud, Jens Stoltenberg, and Kettil Myrstrand. Their love and support have had a substantial impact on me being able to finish this thesis in its present form.

(5)

CHAPTER 1 Introduction

We want to learn about the distribution of the time to the occurrence of an event of interest, referred to as thesurvival time. Regression models for survival time data differ in a couple of fundamental ways from linear and generalized linear regression models, primarily due to the following two issues. First, survival time distributions are skewed distributions on [0,∞), where the shape of the distribution rather than the location is of interest. Second, there is almost always a censoring mechanism, and certain aspects of the data are consequently missing (N. R. Hansen 2013). This poses certain methodological challenges, as for each individual, we potentially only observe the censoring time and not the actual survival time.

In this thesis, we present a class of parametric models based on a first hitting time (FHT) framework with Dirichlet processes for analyzing right-censored survival data. We will refer to this model class as the FHTD model class, or simply the FHTD model. In the following we give an informal presentation of the FHTD model framework, before we give a brief motivation of the model class for analyzing censored survival data.

We will imagine that there is an underlying stochastic processR = (R_t)_t≥0, tracking the development of unspecified underlying causes that lead to the event of interest. We will imagine that the event of interest occurs when the underlying processRexceeds a thresholdbfor the first time. We will refer to the time of the occurrence of the event of interest as the survival time. We want to learn about the distribution of the survival time.

Consider «breakup» as the event of interest. We will imagine that each individual of a given target population has an underlying process, describing the development of the underlying causes that lead up to the potential breakup.

The level of the underlying process at time t indicates the current level of severity towards breaking up at timet. We will imagine that the breakup will occur when, if at all, the underlying process exceeds the individual threshold for the first time.

For the FHTD model, the underlying processes are obtained from the Dirichlet measure. A wide range of different processes have been studied; see Lee and Whitmore (2006) for an accessible review. By properties of the Dirichlet measure,

(8)

the FHTD model is characterized by underlying processes with sample paths that are weakly increasing over time. For this reason, the FHTD model provides a suitable framework for analyzing the survival time of phenomena where we are willing to assume that the underlying process is weakly increasing towards the threshold over time, after which the event of interest will occur. This could be a plausible assumption when analyzing the survival time distribution of an incurable disease, where we believe that the course of the disease is constantly getting worse until the patient dies. For the breakup example, the FHTD model would implicitly assume that once the underlying process towards a breakup has started, it can potentially be slowed down but never reversed.

For some applications it is of less importance whether the above described framework is plausible or not, but rather that the FHTD model based on this framework proves to be statistically fruitful.

The FHTD model is a parametric regression model. One could argue that parametric models are more applicable for statistical inference, than the nonparametric and semi-parametric alternatives. For instance, when wanting to estimate the survival function (the probability of experiencing the event of interest after timet, as a function oft) for a particular individual, parametric models, as opposed to the nonparametric Kaplan-Meier estimator (Kaplan and Meier 1958) and the semi-parametric Cox proportional hazard model (Cox 1972), provide smooth estimates of the survival function. In an interview with Nancy Reid, Sir David Cox addressed this advantage when asked about his thoughts on the cottage industry that had grown up around his semi-parametric proportional hazard model. He responded, “. . . I think I would normally want to tackle problems parametrically, so I would take the underlying hazard to be a Weibull or something. I’m not keen on nonparametric formulations usually.”

Later in the interview, he elaborates; “If you want to do things like predict the outcome for a particular patient, it’s much more convenient to do that parametrically” (Reid 1994). We emphasize, however, that parametric models, as opposed to the nonparametric and semi-parametric survival time models, may involve stronger distributional assumptions than it is suitable to make.

The FHTD model class allows for covariates influencing the survival time distribution through different parts of the model. We show that this makes the FHTD model flexible and able to adapt to a wide ranges of shapes of survival time distributions, especially wide ranges of shapes of hazard functions.

Informally, the hazard at any given time is the probability of experiencing the event of interest in the next interval among individuals who had not yet experienced the event by the start of the interval.In Cox regression, the covariates can only influence the hazard directly. Estimates obtained from the FHTD model, however, may suggest that covariates influence the survival time distribution in several different ways.

Consider the breakup example again. Under the FHTD model framework, we can think of different covariates influencing the breakup time distribution through different parts of the model. «Patient» individuals may have a high threshold, while «restless» individuals may have a low threshold. «Always doubting»

individuals may have an underlying process that steadily increases towards the

(9)

threshold but possibly never exceeds it, while «unstable» individuals may have an underlying process that during bad periods increases drastically towards the threshold, while in good periods stays constant. «Cheating» could lead to an immediate shift in the survival time distribution, while «poor maintenance of the relationship» could lead to a gradual change in the underlying process that finally leads to the breakup.

We show that the FHTD model class adapts to a wide range of shapes of survival time distributions. Especially, we show that the FHTD model with a suitable probability measureµ₀ specified as the base measure for the Dirichlet process, can be seen as a 2-parameter flexible extension of a parametric model with survival time assumed to be distributed by the probability measureµ₀. The FHTD model will not depend on the widely assumed, but arguably never obtained, assumption of proportional hazards between groups of the Cox proportional hazard model (Stensrud and Hernán 2020). We will see that the FHTD model can adapt to data with crossing hazards.

In this thesis, we give a precise presentation of the FHTD model class. In Chapter 2, relevant preliminary measure theory and notation are presented.

Relevant survival time theory is presented in Chapter 3. In Chapter 4, the Dirichlet measure is presented, and relevant properties for later results are derived. In Chapter 5, we present the FHTD model class. In Chapter 6 we demonstrate and partly evaluate the previously introduced theory on right- censored survival data. Finally, in Chapter 7 we discuss the analysis and present our conclusions.

(10)

Preliminaries

In this section, we review particular notation and measure-theoretic preliminaries. This enables us to use the precise language of mathematics for our later results and presentation. In addition, this ensures that all readers of this thesis speak the same language of mathematics.

We begin in Section 2.1 by reviewing the measure-theoretic preliminaries for later results and presentation. The concepts presented here will enable us to present the Dirichlet measure and review relevant properties in Chapter 3.

These properties will enable us to present the FHTD model in Chapter 4. We assume a level of familiarity with basic real analysis and measure theory. In Section 2.2 we introduce some particular notation that will be used throughout this thesis.

2.1 Measure-theoretic preliminaries

This section is based on the writings of E. Hansen (2009) and Sokol and Rønn-Nielsen (2013).

Let X be some set. A σ-algebra E on X is a set of subsets of X with the following three properties: X ∈E, ifE∈EthenE^c∈Eas well, and if (En)n≥1

is a sequence of sets withEn∈Eforn≥1, thenS∞

n=1En∈Eas well. We refer to the elements ofEas events and say that a subsetA⊂ X isE-measurable if A∈E. We refer to the pair (X,E) as a measurable space.

A measure ν on (X,E) is a mapping ν : E→[0,∞) such that ν(∅) = 0 and whenever (E_n) is a sequence of disjoint sets inE, it holds thatP∞

n=1ν(E_n) is convergent andν(S∞

n=1E_n) = P∞

n=1ν(E_n). We refer to the latter property as theσ-additivity of the measureν. The measure satisfies that for any pair of events F, G ∈ E with E ⊆ G, ν(G/F) = ν(G)−ν(F). Also, if (Fn) is an increasing sequence in E, then ν(S∞

n=1Fn) = lim_n→∞ν(Fn), and if (Fn) is a decreasing sequence inE, thenν(T∞

n=1Fn) = lim_n→∞ν(Fn).These two properties are known as the upwards and downwards continuity of measures, respectively. We refer to the triple (X,E, ν) as a measure space.

A probability measure µon (X,E) is a measure with the propertyµ(X) = 1.

We refer to the triple (X,E, µ) as a probability space.

(11)

2.1. Measure-theoretic preliminaries

Next, assume given a measurable space (X,E), and letDbe a set of subsets of X. Theσ-algebraσ(D) generated byDonX, is defined as the intersection of allσ-algebras containingD.

Using this construction, we may define a particularσ-algebra on the Euclidian spaces: the Borelσ-algebraBd onR^d ford≥1 is theσ-algebra generated by the open sets. We denote the Borelσ-algebra onRbyB.

Throughout this thesis, we will assume given an unspecified probability space (Ω,F, P), referred to as the probability background space, or simply the background space. We will think of the background space as the space where fortune and destiny operate.

Assume given a measurable space (X,E). Given a mappingX : Ω→ X, we say thatX is F−Emeasurable if it holds for allE∈EthatX⁻¹(E)∈F, where we use the notation X⁻¹(E) = {ω ∈ Ω|X(ω)∈ E}. Letting the σ-algebras involved be implicit, we may simply say thatX is measurable. A measurable mapping X : Ω → X is referred to as an X-valued random variable. We may view a random variable as a translation mechanism of a world destiny ω ∈ Ω to the outcome X(ω) ∈ X. For convenience, we also write (X ∈E) instead of X⁻¹(E) when E ∈E. Measurability of X ensures that whenever E ∈ E, the subset (X ∈ E) of Ω is F-measurable, such that P(X ∈ E) is well-defined. Furthermore, the integralR

|X|dP is well-defined. In the case where|X|^p is integrable for somep >0, we say that X hasp’th moment and writeEX^p=R

X^pdP. If a random variableXattains only a countable number of distinct values we refer to it as a discrete random variable.

Given anX-valued random variableX, we define the image measureX(P) as the measure on (X,E) given byX(P)(E) =P(X⁻¹(E)) for allE∈E. X(P) is called the distribution of the random variableX. We will often denote the image measureX(P) byµX. Let (X,E, ν) be a measure space, (Y,K) a measurable space, and letf :X → Ybe a map. We say thatf is measurable iff⁻¹(K)∈E for allK∈K. Letf :X →[0,∞) be a measurable function. The set-function γ :E→[0,∞] defined byγ(A) =R

Af dν for A ∈Eis a measure on (X,E).

We will refer to the measureγ as having densityf with respect toν, denoted byγ=f·ν.

The Lebesgue measureλon (R,B) is the measure satisfying thatλ((a, b)) =b−a fora, b∈Rfora < b.

Let (X,E) be a measure space. The counting measure κ on (X,E) is the measure satisfying that

κ(A) =

(|A| if A is finite, +∞ if A infinite, forA∈E, where|A|denotes the cardinality of the setA.

A probability measure µ on (R,B) given by a density with respect to the Lebesgue measure λ is referred to as an absolutely continuous probability

(12)

measure.

Let ν be a probability measure on (R,B). The distribution function for ν is the function F : R → R given by F(x) = ν((−∞, x]) for x ∈ R. A probability measure ν on (R,B) is uniquely determined by its distribution function. IfX is a real-valued random variable, defined on a background space (Ω,F, P), we will frequently refer to the distribution function forX, when we mean the distribution function for the image measure µ_X. If necessary, we will denote the distribution function forX by F_X. The defining formula is F_X(x) =P(X ∈(−∞, x]) =P(X ≤x) forx∈R. The distribution function F for a probability measure ν on (R,B) has the following properties: F is weakly increasing,F is right-continuous,F(x)→1 forx→ ∞, andF(x)→0 forx→ −∞.Any functionF :R→R, satisfying these four conditions is the distribution function for a probability measure on (R,B). If X is a discrete random variable,FX will be a discontinuous step-function.

We will need the following two modes of convergence of random variables. Let (Xn) be a sequence of random variables, and let X be some other random variable. We define Xn to converge in probability to X if for all > 0, limn→∞P(|Xn −X| ≥) = 0. We define Xn to converge in distribution to X if limn→∞FXn(x) =F(x) for allxat whichF(x) is continuous. We write Xn

→P X andXn

→D X, respectively.

Assume given a background space (Ω,F, P) and a measurable space (X,E). A stochastic process indexed by some setT is a familyX = (Xt)t∈T ofX-valued random variables.

Let (X,E, µ) be a probability space. A subset A⊆ X is aµ-nullset if there exists an E-set E such that A ⊆ E and µ(E) = 0. The complement of a µ-nullset is known as a set that occursν-almost surely. Letting the probability measureµbe implicit, we may simply say that an event occurs almost surely.

2.2 General notation

If nothing else is specified, we will denote the cumulative distribution function of a random variableX byF_X(t). The density, if it exists, is denoted byf_X(x).

Letf :R^p→R^q for some p, q∈N. When necessary, we will use the notation f(x;θ), meaning that we are consideringxas a variable andθas a parameter.

LetS be a set and consider a subsetA∈ S. Theindicator functionofAis a function1A:S → {0,1} defined as

1_A(x) :=

(1 ifx∈A, 0 ifx /∈A.

When convenient, we will denote the indicator function of Aby1(x∈A).

(13)

2.2. General notation

TheBeta function is defined as the integral B(λ, µ) =

Z 1 0

x^λ−1(1−x)^µ−1 dx forλ, µ >0.

TheGamma functionis defined as the integral Γ(λ) =

Z ∞ 0

x^λ−1e^−x dx forλ >0.

It is well-known that

B(λ, µ) =Γ(λ)Γ(µ)

Γ(λ+µ) forλ, µ >0.

Theincomplete Beta functionis defined as the integral Bγ(λ, µ) =

Z γ 0

x^λ−1(1−x)^µ−1 dx forλ, µ >0, γ∈[0,1].

Thedigamma functionis defined as ψ(λ) =Γ⁰(λ)

Γ(λ) forλ >0.

(14)

Survival analysis

Survival analysis is a branch of statistics for analyzing data involving the duration between two events. In this chapter, we present the basic survival analysis theory necessary for this thesis. In Section 3.1 we present the survival analysis framework and the typical assumptions made. In Section 3.2 we consider the extension to competing risks. In Section 3.3 we obtain the full likelihood given right-censored survival data under the presence of competing risks. Finally, in Section 3.4, we present standard regression models. The following is based on the writings of N. R. Hansen (2013), Lindqvist (2014), and Tsiatis (1975).

3.1 Framework and assumptions

Consider a population of interest of units or individuals, referred to as the target population. When referring to a study sample, we refer to a sample of sizen, randomly sampled from the target population.

Consider the real random variablesT_i^∗: Ω→[0,∞) andCi : Ω→[0,∞) both corresponding to subject i in the study sample. We will refer to T_i^∗ as the survival time and Ci as the censoring time. We will interpret the survival time variable as the time to failure, where “failure” is used as a generic term and may in practice correspond to any event of interest depending on the application at hand. We will interpret the censoring time variable as the time after which we will not be able to observe the survival time. “Time” need not mean calendar time but can, in principle, correspond to any measurement which is non-decreasing with calendar time. Consider the real random variable Xij : Ω→R. We will refer toXij as thejthcovariate corresponding to subject i. The vector Xi = (Xi1, . . . , Xiq) represents the q observed covariates for subjecti.

Due to right-censoring, we do not observe realizations of theT_i^∗’s but rather realizations of the potentially censored variables

T_i= min{T_i^∗, C_i}.

We will refer toTi as the potentially censored survival time for subject i. In addition to observing a realization of the censored variableTi, we observe the

(15)

3.2. Competing risks framework and assumptions

censoring indicator

Di= 1(T_i^∗≤Ci),

indicating whether observationiis censored (Di = 0) or not (Di= 1).

3.1.1 Properties

We will now introduce some important quantities associated with survival time.

For simplicity of presentation, we will leave out the individual specific indexes in the following. Consider the survival time T^∗. The distribution of T^∗ is completely specified by the distribution function

FT^∗(t) =P(T^∗≤t) fort≥0.

The corresponding density function, when it exists, is fT^∗(t) =F_T⁰∗(t) fort≥0.

Thesurvival function ofT^∗is defined by

S_T^∗(t) =P(T^∗> t) fort≥0, (3.1) expressing the probability that the survival time will exceed a specified time pointt. Note that the survival function also characterises the distribution of T^∗. Thehazard function, if it exists, is

λT^∗(t) = lim

4t→0

P(T^∗∈(t, t+4t]|T^∗> t)

4t fort≥0.

Informally, the hazard at any given time is the probability of experiencing the event of interest in the next interval among individuals who have not yet experienced the event by the start of the interval. By standard results for conditional probability, we note that

λT^∗(t) = fT^∗(t)

S_T^∗(t)=−∂logST^∗(t)

∂t fort≥0. (3.2)

Define thecumulative hazard function by ΛT^∗(t) =

Z t 0

λ_T^∗(u) du fort≥0, (3.3) and note by (3.2) that

ST^∗(t) =e^−Λ(t) fort≥0. (3.4)

3.2 Competing risks framework and assumptions

Based on the writings of Robins and Hernán (2021), we will define a competing event as an event that prevents the outcome of interest from happening. A typical example of a competing event is death because, once an individual dies, no other outcome can occur.

(16)

Again, we consider a study sample of sizen, randomly sampled from the target population. We will assume that there existmdistinct types of competing events indexed by j ∈ {1, . . . , m}. We will assume that failure must be due to one (and only one) of themcauses. If two types of failure can occur simultaneously, we define the combination of the two as a new type of failure to maintain this assumption.

Consider the real random variablesT_i1^∗, . . . , T_im^∗ : Ω→ [0,∞) and C_i : Ω → [0,∞) all corresponding to subject i in the study sample. We will refer to T_i1^∗, . . . , T_im^∗ as thelatent survival timesfor subjecti, whereT_ij^∗ represents the hypothetical survival time due to the j-th cause if the other risks were not present for subjecti. Again,Ci denotes the censoring time.

Due to the competing events, we will potentially (if not censored) only observe realizations of the shortest of the latent survival times

T_i^∗= min{T_i1^∗, . . . , T_im^∗ }.

We will refer toT_i^∗ as the survival time for subjectiand let

Ji=











1 forT_i1^∗ ≤T_i^∗ ...

m forT_im^∗ ≤T_i^∗,

denote its index. We note thatJiis a well-defined random variable. Again, due to censoring, we dont observe realizations of theT_i^∗’s but rather realizations of the censored variables

Ti= min{T_i^∗, Ci}.

In addition to observing a realixation of the potentially censored variableTi, we observe the index indicator

Di=











0 forCi< T_i^∗

1 forT_i1^∗ ≤Ci and T_i1^∗ ≤T_i^∗ ...

m forT_im^∗ ≤Ci and T_im^∗ ≤T_i^∗,

indicating whether subjectiis censored (D_i= 0) or not (D_i6= 0) withD_i=j indicating failure due to cause j. We note thatD_i is a well-defined random variable.

We will now introduce some important quantities associated with survival times under the presence of competing risks. For simplicity of presentation, we will leave out the individual-specific indexes in the following. In the formulas given in the following, the range oft is [0,∞) and the range ofj is{1, . . . , m}, and will be mostly suppressed.

Consider the survival timeT^∗. Thesub-distribution function is defined by Fj(t) =P(T^∗≤t, J =j).

(17)

3.2. Competing risks framework and assumptions

If it exists, thesub-density function, is given by

f_j(t) =F_j⁰(t). (3.5)

By the law of total probability, we may express the cumulative distribution function ofT^∗ by

F_T^∗(t) =P(T^∗≤t) =

m

X

j=1

F_j(t).

Thesub-survival function is defined by

Sj(t) =P(T^∗> t, J =j).

By the law of total probability, we may express the survival function ofT^∗ by S_T^∗(t) =P(T^∗> t) =

m

X

j=1

S_j(t). (3.6)

Note that the assumption of failure being due to one (and only one) of them causes, ensures that the events (J=k) are mutually disjont fork= 1, . . . , m.

Hence, we may note that the joint distribution of (T^∗, J) is determined by the sub-distribution functions, as

P(T^∗≤t, J ≤j) =

j

X

k=1

F_k(t), or alternatively by the sub-survival functions, as

P(T^∗> t, J > j) =

m

X

k=j+1

S_k(t).

The hazard function of T^∗, when it exists, is defined, as usual, by λ_T^∗(t) = lim

→0+

P(T^∗≤t+|T^∗> t)

.

Thesub-hazard function, when it exists, is defined by λj(t) = lim

→0+

P(T^∗≤t+, J=j|T^∗> t)

. (3.7)

The sub-hazard function is the instantaneous rate of failure due to causej at timet. By standard results for conditional probability, it follows that

λj(t) = fj(t)

S_T^∗(t). (3.8)

By the law of total probability, it follows that λT^∗(t) =

m

X

j=1

λj(t). (3.9)

(18)

By (3.8) we note that we may express the sub-distribution function by Fj(t) =

Z t 0

λj(u)S(u) du. (3.10)

The cumulative hazard function ofT^∗is given by ΛT^∗(t) =

Z t 0

λT^∗(u) du.

Define thecumulative sub-hazard functionby Λj(t) =

Z t 0

λj(u) du.

Define

S_j^∗(t) =e^−Λ^j^(t).

Note that S_j^∗ will not, in general, be equal to the survival function of T_j^∗, denotedS_T^∗

j = P(T_j^∗ > t) if m >1. We shall see later, however, that S_j^∗ is equal toST_j^∗ for independent latent survival times. By (3.9), we obtain

ST^∗(t) =e^−Λ(t)=e⁻ Pm

j=1,Λ_j(t)

=

m

Y

j=1

S_j^∗(t). (3.11) The joint distribution of the latent survival times may be characterized by the joint survival functionofT₁^∗, . . . , T_m^∗

SM(t1, . . . , tm) =P(T₁^∗≥t1, . . . , T_m^∗ ≥tm).

We will assume that this function has continous partial derivatives with respect to all of its arguments. Note the following identity

ST^∗(t) =SM(t, . . . , t). (3.12) 3.2.2 The identifiability problem

The following is based on the writings of Tsiatis (1975). LetT₁^∗, . . . , T_m^∗ be dependent variables characterized by the joint distributionS_M(t₁, . . . , t_m) and with sub-distribution functionsF₁(t), . . . , F_m(t). By Tsiatis (1975) there exists a unique set of independent variables ˜T₁^∗, . . . ,T˜_m^∗ characterized by the joint distribution ˜S_M(t₁, . . . , t_m) and with sub-distribution functions ˜F₁, . . . ,F˜_msuch thatS_m(t₁, . . . , t_m)6= ˜S_m(t₁, . . . , t_m) butF_j(t) = ˜F_j(t) for all jand t≥0.

Consequently, without further assumptions, given a set of sub-distribution functions, the joint distribution of the latent survival times which give rise to these sub-distribution functions, are not uniquely given. From a statistical point of view, this means that the joint and marginal distributions ofT₁^∗, . . . , T_m^∗ are said to be non-identifiable from observations of (T, D).

There are ways of overcoming the identifiability problem under the latent variable representation of competing risks. Note, however, that for nonparametric models,

(19)

3.3. The likelihood

this can only be done by imposing additional restrictions in the model, which under observation of the right-censored survival data form competing risks, will always be non-testable.

The typical assumption made is the assumption of independent risks. It then follows from Tsiatis (1975) that we have identifiability of the distributions in question (under regularity assumptions). Note that the identifiability problem as discussed above has been in the nonparametric sense. For the parametric FHTD model, the identifiability problem is different since it now has to do with identifying a finite set of parameters.

3.3 The likelihood

In this thesis, we use likelihood methods for inference about the distribution of survival time. Methods based on likelihood functions adapt to censored data quite easily and can be used to obtain estimates from the distribution of survival time without direct consideration of the censoring mechanism. The objective of this section is to present a rule of how to construct a valid likelihood given right-censored survival data under the presence of competing risks.

The following is based on the framework and notation presented for competing risks in Section 3.2, with the standard framework presented in Section 3.1 as a special case withm= 1. We will refer to realizations of independent

(T₁, D₁) ⊥⊥ (T₂, D₂) ⊥⊥ . . . ⊥⊥ (T_n, D_n),

asright-censored survival data under the presence of competing risks, and as right-censored survival datafor m= 1. We will assume that the distribution of the survival timesT_i1^∗, . . . , T_im^∗ fori= 1, . . . , n are parameterized by some parameter of interestθ. Throughout this thesis, we will assumenon-informative censoring, meaning that the distribution of the censoring time Ci hold no information aboutθ. We will also assumeindependent censoring, meaning that

T_i^∗⊥⊥Ci ∀i∈ {1, . . . , n}.

The assumption of independent censoring is used to derive the full likelihood function given realizations of right-censored survival data. The assumption of non-informative censoringallows us to ignore the distribution of the censoring mechanism when maximizing the likelihood function with respect to the parameter of interestθ. For simplicity, the censoring timeC_i is assumed to be absolutely continuous with densityf_C(c). Letf_ij denote the sub-distribution function for causej associated with subjecti.

We want to derive the full likelihood given right-censored survival data under the presence of competing risks. First, we will need the following density result.

(20)

Theorem 3.3.1.LetT₁^∗, . . . , T_m^∗ be latent survival times and let C: Ω→[0,∞) be a censoring time. LetT_j^∗ have sub-density functionfj and letC have density fC and survival functionSC(c). Let(T₁^∗, . . . , T_m^∗)be independent fromC. Then the joint distribution of(T, D)has density

f_{(T ,D)}(t, d) =











fC(t)ST^∗(t) ifd= 0, SC(t)f1(t) ifd= 1, SC(t)f2(t) ifd= 2, ...

SC(t)fm(t) ifd=m,

(3.13)

wrt. the product measureλ⊗κ. ¹

Proof. LetA∈B[0,∞). We obtain that

P(T ∈A, D= 0) =P(C∈A, C < T^∗)

= Z

A

fC(c) Z ∞

c

fT^∗(t) dtdc

= Z

A

fC(c)ST^∗(c) dc.

Forj= 1, . . . , m we obtain

P(T ∈A, D=j) =P(T^∗∈A, T_j^∗< C, J=j)

= Z

A

Z ∞ t

fC(c) dc fj(t) dt

= Z

A

SC(t)fj(t) dt

By definition of a density wrt. the product measurem⊗κwe may conclude that the joint distribution of (T, D) has densityf_{(T ,D)} given by (3.13).

We are now ready to introduce the full likelihood. Given a realization of right- censored survival data under the presence of competing risks, we obtain the full likelihood function by Theorem 3.3.1, under the assumption of independent and non-informative censoring,

L=

n

Y

i=1

fC_i(ti)ST_i^∗(ti)^1(di=0) m

Y

j=1

[SC_i(ti)fij(ti) ]^1(dⁱ^=j) (3.14) The derivation of the full likelihood (3.14) makes it clear how the distribution of the censoring mechanism C enters, and why it can be ignored under the

1Hereλdenotes the Lebesgue measure onRandκdenotes the counting measure onN.

See Section 2.1 for further details.

(21)

3.3. The likelihood

assumption of non-informative censoring. We will mainly consider the likelihood L(θ) =

n

Y

i=1

ST_i^∗(ti;θ)^1(di=0) m

Y

j=1

[fij(t;θ) ]^1(dⁱ^=j), (3.15) where we have dropped the constants that do not depend upon the parameter θ.

For later use, we will, in the following two examples, present the likelihood under the assumption of independent risks and for the special case ofm= 1.

Example 3.3.2(Likelihood for independent risks).Assume T_i1^∗ ⊥⊥ T_i2^∗ ⊥⊥. . .⊥⊥ T_im^∗ ⊥⊥ C_i ∀i, and that

(Y₁, D₁) ⊥⊥ (Y₁, D₁) ⊥⊥. . .⊥⊥ (Y_n, D_n). (3.16) Let the distribution of the latent survival timesT_i1^∗, . . . , T_im^∗ be parameterized byθand assume non-informative censoring. By independence ofT_i1^∗, . . . , T_im^∗ we obtain

ST_i^∗(t) =P(min

j T_ij^∗ > t)

=P(T_i1^∗ > t, . . . , T_im^∗ > t)

=

m

Y

j=1

ST_ij^∗(t).

Hence, by Theorem 3.3.1, the likelihood for a realization of data (3.16) is

L(θ) =

n

Y

i=1





m

Y

j=1

ST_ij^∗(ti)





1(di=0) m

Y

j=1

[fij(ti)]^1(dⁱ^=j), (3.17)

where we have dropped the constants that do not depend upon the parameter θ.

Example 3.3.3(Likelihood without competing risks).Letm= 1 and assume that

T_i^∗ ⊥⊥ C_i ∀i, and that

(Y₁, D₁) ⊥⊥(Y₂, D₂) ⊥⊥. . .⊥⊥ (Y_n, D_n). (3.18) Let the distribution of the latent survival timesT_i1^∗, . . . , T_im^∗ be parameterized byθand assume non-informative censoring. By Theorem 3.3.1, the likelihood for a realization of (3.18) is

L(θ) =

n

Y

i=1

ST_i^∗(ti)^1(di=0)

fT_i^∗(t)^1(di=1)

(3.19)

(22)

where we have dropped the constants that do not depend onθ. Note that, by (3.2) and (3.4), we may also express the likelihood by

L(θ) =

n

Y

i=1

λT_i^∗(ti)^dⁱe^−Λ^Tⁱ^∗^(tⁱ⁾. (3.20)

◦

3.4 Standard methods in survival analysis

We now discuss the analysis of survival data without parametric assumptions about the form of the distribution. We present the semiparametric Cox proportional hazard model and the nonparametric Kaplan-Meier estimator.

When introducing the FHTD model in Chapter 5, we will use these standard methods for comparison.

3.4.1 The semiparametric proportional hazards model

To ease notation we introduceη =X^Tβ to denote thelinear predictor for a vector of covariatesX and a vector of parametersβ. Letλ0denote a baseline hazard function. The proportional hazard model, also referred to as theCox model, is given by the hazard function

λ(t) =λ0(t)e^η (3.21)

withη the linear predictor. By (3.3), it follows that the cumulative hazard function is

Λ(t) = Λ₀(t)e^η.

Sir David Cox discovered that for the proportional hazard model, the parameters that enter into the linear predictor could be estimated without specifying the baseline hazard function (Cox 1972). For this reason, the Cox proportional hazard model is categorized as a semiparametric model.

3.4.2 The nonparametric Kaplan-Meier estimator

The Kaplan-Meier estimator is a nonparametric estimator of the survival function (3.1). Introducethe individuals at risk

Y(t) =

n

X

i=1

1(t≤T_i), thecounting process of deaths(non-censored events)

N(t) =

n

X

i=1

1(Ti≤t, δi= 1), and the jums for the counting process given as

4N(t) =N(t)− lim

4t→0N(t− 4t),

at timet. Based on this notation, theKaplan-Meier estimator is defined as S(t) =ˆ Y

s≤t

1−4N(s) Y(s)

.

(23)

CHAPTER 4 The Dirichlet measure

In this section, we give a precise and cohesive presentation of the Dirichlet measure and define the Dirichlet distribution function process. The Dirichlet distribution function process will work as the underlying process in the FHTD model presented in Chapter 5. The notation, definitions, and properties presented here enable us to construct and derive important properties of the FHTD model.

The first precise mathematical definition of the stochastic process, known as the Dirichlet process, was made by Thomas S. Ferguson in 1973 (Ferguson 1973). Sethuraman (1994) gives a simple and constructive definition of the Dirichlet measure for arbitrary measurable spaces, referred to as the stick- breaking construction. This construction of the Dirichlet measure establishes the property of the Dirichlet measure being a distribution over probability measures. This is not guaranteed by the construction presented by Ferguson (1973). Further, the definition presented by Sethuraman (1994) simplifies the proof of existence as presented by Ferguson (1973) and Blackwell and MacQueen (1973) and avoids unnecessary regularity assumptions.

We begin in Section 4.1 by introducing the finite-dimensional Dirichlet measure.

This enables us to present the general Dirichlet measure in Section 4.2 and derive and present relevant properties. In Section 4.3 we present the Dirichlet distribution function process and present and derive relevant properties. The following is based on the writings of Ferguson (1973), Sethuraman and Tiwari (1982), Sethuraman and Tiwari (1982) and Blackwell and MacQueen (1973).

4.1 The finite dimensional Dirichlet measure

In this section we present the finite Dirichlet measure. The following is based on Ferguson (1973).

Denote the Gamma distribution with shape parameterβ ≥0 and scale parameter γ >0 by Gam(γ, β). Forβ = 0, this distribution is degenerate at zero; for β >0, this distribution has density with respect to the Lebesgue measureλon

(24)

the real line

f(x) = ( ₁

Γ(β)γ^βe^−x/γx^γ−1 ifx∈(0,∞),

0 otherwise.

Based on Ferguson (1973), we define thek-dimensional Dirichlet distribution slightly more general than what is normally done, by allowing some of the variables to be degenerate at zero.

LetZ1, . . . , Zk be independent random variables, whereZi ∼ Gam(γ, βi) for i= 1, . . . , k, where β_i ≥0 for all iandβ_i >0 for somei. Thek-dimensional Dirichlet distribution with parametersβ₁, . . . , β_k, denoted by Dir(β₁, . . . , β_k), is defined as the distribution ofY = (Y₁, . . . , Y_k), where

Yj = Zj

Pk i=1Zi

.

The distribution of Y will not depend on γ, and hence γ can be set to any positive, real value when deriving the Dirichlet distribution, and not necessarily 1 as in Ferguson (1973) and Sethuraman (1994). We denote the k-dimensional Dirichlet distribution by Dir(β1, . . . , βk) indicating that βj≥0 for allj, and βj >0 for somej. If anyβj = 0, the correspondingYj is degenerate at zero.

If k = 1, Y is degenerate at 1. However, if βj >0 for all j, andk ≥2, the k-dimensional distribution ofY is absolutely continuous with density

fY(y) =







Γ(Pk i=1βi)

Qk i=1Γ(βi)

Qk

i=1y_i^βⁱ⁻¹ ify∈ 4k

0 otherwise,

where4k denote the (k−1)-dimensional probability simplex inR^k, 4k ={y= (y1, . . . , yk)∈R^k :yi≥0∀i ,

k

X

i=1

yi= 1}.

Fork= 2, the Dirichlet distribution reduces to the Beta distribution with shape parametersβ₁ andβ₂.

In particular, the marginal distribution ofY_j is a Beta distribution with shape parametersβj andPk

i=1βi−βj for allj. Hence, we record for future use that Yj has mean and variance given by,

E(Y_j) = β_j Pk

i=1β_i, V ar(Y_j) =

βj

Pk i=1β_i

1−Pk^β^j i=1β_i

Pk

i=1β_i+ 1

. (4.1)

We may interpretPk

i=1βi as a strength parameter, controlling how tight the distribution is around the mean. This is illustrated in Figure 4.1.

(25)

4.2. The Dirichlet measure

Figure 4.1: 10000 realizations of a 3-dimensional Dirichlet distribution with parametersβj = 1/3 (top left), βj = 1 (top right), βj = 10/3 (bottom left), andβj= 20/3 (bottom right) forj= 1,2,3. Note thatα=P3

i=1βi controlls how tight the realizations are around the mean (¹₃,¹₃,¹₃) withα= 1 (top left), α= 3 (top right) andα= 10 (bottom left) andα= 20 (bottom right).

4.2 The Dirichlet measure

In this section, we present the Dirichlet measure. The Dirichlet measure may be interpreted as an infinite-dimensional generalization of the finite Dirichlet measure.

4.2.1 Definition and existence

We begin with the definition of the Dirichlet process as stated by Ferguson (1973).

Definition 4.2.1.Let (X,E, µ₀) be a probability space. A real stochastic process (M_A)_A∈_Edefined on a probability space (Ω,F, P) and indexed byEis a Dirichlet

process on (X,E) with base measureµ₀ and concentrationα >0 if (MA₁, . . . , MA_k)∼Dir(αµ0(A1), . . . , αµ0(Ak)), for every finite measurable partition (A1, . . . , Ak) ofX.

Ferguson (1973) proves, by using Kolmogorov’s consistency theorem (see Theorem II.30.1 of Rogers and Williams (1994)), that for any probability measure µ0 on a given measure space (X,E) and any α >0 there exists a Dirichlet process on (X,E) with base measureµ0 and concentrationα.

(26)

We will writeM ∼DP_(X,_E₎(α, µ0) for the phrase “M is a Dirichlet process on (X,E) with base measureµ0 and concentrationα” indicating that (X,E, µ0) is a probability space and α > 0. We will write M ∼ DP_(X,_E₎(α, µ0) and M ∼DP(α, µ0) interchangeably, understanding that the measurable space is implicit in the latter case.

Given the base measureµ₀, we denote its distribution function byF₀, and its survival function by S₀. If the base measure µ₀ is absolutely continuous, we denote its density byf₀. We denote

I0={t : 0< F0(t)<1},

and refer to I0 as the support of µ0. If the base measure is absolutely continuous we denote I0 = (τ0, τ1) with τ0 = inf{t : F0(t) > 0} and τ1= sup{t : F0(t)<1}, where we allow forτ1∈R∪ {∞}.

In the following, we state some relevant properties of the Dirichlet measure.

Let M ∼ DP_(X_,_E₎(α, µ0). Note that we may regard M as being a map M : Ω→R^E, where R^E is the space consisting of all set functions x:E→R. Let P denote the class of probability measures on (X,E). LetC denote the smallestσ-algebra onP generated by the sets{µ∈ P :µ(A)< r} forA∈E andr∈[0,1]. LetDdenote the class of discrete probability measures on (X,E).

Clearly,D ⊂ P ⊂R^E.

Sethuraman (1994) shows thatM is a measurable map from (Ω,F) into (P,C).

By the definition presented by Sethuraman (1994), we may conclude that the Dirichlet measure is a probability measure on (P,C). We refer to the distribution M(P) on (P,C) as the Dirichlet measure on (P,C) with base measureµ0and concentration α. We denote the Dirichlet measure by µ. By the definition presented by Sethuraman (1994), we may conclude that the Dirichlet measure is a probability measure on the space of all probability measures on (X,E).

Ferguson (1973) shows that the Dirichlet measure gives probability one to the set of discrete probability measures on (X,E),

µ(D) =M(P)(D) =P({ω∈Ω|M(ω)∈ D}) = 1. (4.2) That is, the realization of a Dirichlet process is a discrete probability measure almost surely.

In the rest of this section, we focus on the concentrationαand the base measure µ0. The following shows a close relationship between properties of the Dirichlet processM and properties of the base measureµ₀.

Lemma 4.2.2.Let M ∼DP_(X,_E₎(α, µ₀). LetA∈E. Then µ0(A) = 0 ⇒ P(MA= 0) = 1, and

µ0(A) = 1 ⇒ P(MA= 1) = 1.

A class of parametric survival models via Dirichlet processes