The Instability of Cross-Validated Lasso


by

Kine Veronica Lund

THESIS for the degree of Master of Science

(Master in Modelling and Data Analysis)

Faculty of Mathematics and Natural Sciences
University of Oslo

December 2013



Abstract

In a situation where the number of available covariates greatly exceeds the number of observations, the fitting of a regression model to explain the connection between the response and the explanatory variables can be a challenging task.

The problem can be compared to a system of equations with more unknowns than equations, and a regularisation method must be applied to obtain a useful solution. There are several such methods, with different properties.

This thesis focuses on one such method: the Lasso in combination with cross-validation (CV) to determine the level of regularisation. Specifically, we consider the method when applied to survival data where the covariates are thousands of gene expression levels.

The combination of Lasso and CV proves to be unstable in the sense that repeated application of the standard R implementation often gives varying results.

This study’s main focus is to investigate what the causes of this instability may be.

Data was simulated to map the factors that affect the stability. The simulated data sets’ properties are easy to control and the effects on the regularisation results are easily observed.

The tests show that the CV process causes marked instability (varying results) when the division into training and test sets involves test sets with size larger than one. Moreover, the stability of the regularisation depends on the properties of the data set.

A unique prediction result is preferable, as it makes it easy to choose a prognostic gene signature. However, the range of signatures from repeated regularisations can be utilised to indicate the accuracy of the suggested signature.

This thesis maps several factors that affect the stability of Lasso and CV, and will hopefully serve as a warning flag encouraging caution when utilising the Lasso method to find a prognostic model.


Acknowledgments

This thesis is the completion of more than five years of a great time at the University of Oslo. First, I would like to thank my supervisors Knut Liestøl and Ole Christian Lingjærde. Thank you for taking the time to give great advice while having busy schedules, and for the useful feedback given continuously from the very beginning of the project. The thesis of their previous master student, Hege Størvold, has been of great use throughout my work, especially during the first weeks when everything seemed a bit Greek to me.

The lively environment in the study room during my next-to-last semester prepared me well for the final semester. Thanks for the coffee, the race for being first there in the morning and the positive study influence.

There are many people who have contributed to the many happy happenings and memorable memories during my time at UiO. Without HRH Ursus Major and his Realistforeningen, these years would have been a drag. RF has also supplied me with two of my best friends, Ida and Helena. Without your support and our frequent tea breaks, I would have consumed less tea and been less sane.

Thanks to my family for always being there for me and for supporting my choices.

Last but not least, thank you, Øyvind. Your patience, motivation, pep talks and focus have been essential these last months and always are.


Contents

Abstract iii

Acknowledgments v

List of Figures xii

List of Abbreviations xiii

1 Introduction 1

1.1 Motivating example . . . 2

1.1.1 Rough summary of the method used in Sveen’s paper . . . 3

1.2 Chapter overview . . . 5

2 Background 7
2.1 Biological background . . . 7

2.1.1 Data structure . . . 8

2.2 Statistical background . . . 10

2.2.1 Introduction . . . 10

2.2.2 The Lasso method . . . 10

2.2.3 Survival analysis . . . 11

2.2.4 The Cox model . . . 13

2.2.5 Lasso and the Cox model . . . 14

2.2.6 Other regularisation methods . . . 15

2.2.7 Cross-validation . . . 16

2.2.8 Cross-validation and the Cox model . . . 20

2.3 Computations in R . . . 21

2.3.1 penalized . . . 22

2.3.2 glmnet . . . 25

2.3.3 Differences . . . 27

2.4 Recent work . . . 27

3 Methods 29
3.1 Causes . . . 29

3.1.1 K-fold cross-validation . . . 29


3.1.2 Programming parameters . . . 30

3.2 Simulations . . . 31

3.2.1 Data structure . . . 31

3.2.2 Simulation parameters . . . 34

3.3 Displaying the results . . . 34

3.3.1 Signature size bar diagram . . . 34

3.3.2 Signature size heat map . . . 35

3.3.3 Gene index heat map . . . 36

3.3.4 Gene index density plot . . . 36

4 Simulation Results 39
4.1 The importance of the fold parameter . . . 39

4.1.1 Variation within K-fold CV . . . 39

4.1.2 Leave-one-out or K-fold cross-validation . . . 41

4.1.3 Variation between K-folds . . . 42

4.1.4 Summary . . . 42

4.2 The effect of new data sets . . . 44

4.2.1 n=100 . . . 44

4.2.2 n=300 . . . 44

4.2.3 n=500 . . . 48

4.2.4 Summary . . . 48

4.3 The effect of simulation parameters . . . 48

4.3.1 Variations in correlation . . . 48

4.3.2 Variations in how fast the gene effect decreases . . . 52

4.3.3 Summary . . . 53

4.4 Reduction of small coefficients to zero . . . 53

4.5 Program systems . . . 58

4.5.1 penalized . . . 58

4.5.2 glmnet . . . 60

4.5.3 Do glmnet and penalized agree? . . . 64

4.5.4 Summary . . . 64

4.5.5 p-value and variance filter . . . 66

5 Estimation of Accuracy 69
5.1 The range of signatures . . . 69

5.2 Bootstrapping . . . 70

5.3 ROC curve . . . 72

6 Discussion 73
6.1 Summary and conclusions . . . 73

6.2 Relation to other recent work . . . 74

6.3 Discussion of results . . . 74

6.3.1 My contributions . . . 74


6.3.2 Weaknesses . . . 76
6.4 Further work . . . 76

A 83

A.1 Average signature sizes . . . 83
A.2 Larger illustrations . . . 85
A.3 More examples . . . 95


List of Figures

1.1 Flow of optL1. . . 4

1.2 Signature size bar diagram. . . 6

2.1 Biological illustration . . . 9

2.2 Flow of CV. . . 18

2.3 CV and PSE relation. . . 19

2.4 CVL plot from profL1 function . . . 23

2.5 Path plot from profL1 function . . . 24

2.6 glmnet CV plot. . . 26

3.1 Time spent by running LOOCV, 10-fold CV and 4-fold CV. . . . 30

3.2 Gene effect slope . . . 32

3.3 Example of signature size heat map and bar diagram . . . 35

3.4 Example of gene index heat map and coefficient density plot. . . . 37

4.1 One dataset, 10-fold CV, heat map. . . 40

4.2 One dataset, LOO and 10-fold CV, heat map. . . 41

4.3 One dataset, different folds, heat map. . . 43

4.4 100 observations, LOOCV, heat map and density pair. . . 45

4.5 300 observations, LOOCV, heat map and density pair. . . 46

4.6 One dataset, 10-fold CV, n=300, heat map. . . 47

4.7 500 observations, LOOCV, heat map and density pair. . . 49

4.8 Variation of correlation parameter, signature size heat map. . . . 50

4.9 Variation of correlation parameter, weak correlation, signature size heat map. . . 51

4.10 Variation of correlation parameter, gene index heat map. . . 52

4.11 Speed of the gene effect reduction, signature size heat map. . . 54

4.12 Speed of the gene effect reduction, gene index heat map. . . 55

4.13 ROC grid . . . 56

4.14 ROC of cutoff effect when λ has been decreased by 10. . . 57

4.15 Tests with profL1. . . 59

4.16 Visualisation of how well optL1 picks out the correct genes for varying λ . . . 61


4.17 Visualisation of how well optL1 picks out the correct genes for varying λ, n = 300. . . 62

4.18 glmnet plots, heat map and density pairs. . . 63

4.19 Comparison of glmnet and penalized . . . 65

4.20 p-value and variance filter. . . 67

5.1 10- and 4-fold CV applied on CRC data . . . 71

A.1 Close-up of Figure 4.3. . . 86

A.2 Close-up of Figure 4.3. . . 87

A.3 Close-up of Figure 4.3. . . 88

A.4 Larger version of Figure 4.4 . . . 89

A.5 Larger version of Figure 4.4 . . . 90

A.6 Larger version of Figure 4.5. . . 91

A.7 Larger version of Figure 4.5. . . 92

A.8 Larger version of Figure 4.7. . . 93

A.9 Larger version of Figure 4.7. . . 94

A.10 Larger version of Figure 4.18. . . 95

A.11 Larger version of Figure 4.18. . . 96

A.12 As in Figure 4.3. . . 97

A.13 As in Figure 4.3. . . 98

A.14 Another example of Figure 4.15. . . 99

A.15 Another example of Figure 4.15. . . 100

A.16 Zoom out of Figure 4.19 . . . 101


List of Abbreviations

λ penalty parameter, page 3
ρ correlation, page 31

k the k'th gene has gene effect 0.5, page 32
n number of observations, page 8
p number of genes, page 8

CRC colorectal cancer, page 2
CV cross-validation, page 3

CVL cross-validated likelihood, page 20

DNA deoxyribonucleic acid, genetic molecules, page 1
FPR false positive rate, page 56

GLM generalised linear model, page 27

Lasso least absolute shrinkage and selection operator, page 11
LOOCV leave-one-out cross-validation, page 17

mRNA messenger RNA, page 1

PSE predicted squared error, page 17

RNA ribonucleic acid, genetic molecules, page 1
ROC receiver operating characteristic, page 56
TPR true positive rate, page 56


Chapter 1 Introduction

Both scientists and non-scientists are fascinated by the blueprint of life, our DNA. The information stored in the DNA tells much about you beyond what is visible on the surface. There are for example numerous conditions that have been established as being completely or partially genetically determined, ranging from rare diseases such as sickle-cell anemia, to common and harmless properties such as baldness. Investigation of our genome, in order to reveal links between diseases and the DNA, is important to be able to predict future illness. Furthermore, the progress and response to a particular treatment may depend on the patient’s genotype (the complete set of genes).

Some diseases, including cancer, are not normally hereditary, even though they are genetically caused. A mutation may occur in a cell, e.g. because of external factors such as radiation. All cells deriving from that cell will inherit the mutation. Some mutations may be tumourigenic and increase the chance that the cells multiply rapidly and uncontrollably, resulting in a tumour.

If a disease is completely or partially genetically determined, one faulty gene or a larger group of genes may account for part of the cause. These genes may be linked to each other, or they may be independent. Since genes (or gene products) interact, a change in one gene may cause altered expression of several genes. Genes express themselves through mRNA (messenger RNA): the DNA string of the gene is the recipe for a protein, and the messenger that brings this recipe from the DNA string to the production of proteins is the mRNA. The quantity of mRNA (and therefore also of the protein product) for a gene may vary over time and is referred to as the gene's expression. (For more motivation and an introduction to biology, see Section 2.1.)

Changed expression levels may signal malignant changes. Humans have more than 20,000 genes, and there is usually a limited number of patients with the illness or property of interest. The great number of genes compared to the number of patients poses a challenge for every biologist trying to find features separating malignant cells from healthy ones. Statistical methods may reduce the number of candidate cancer genes until the few genes supposedly most relevant to the patient's condition are left.

Several methods exist for this purpose. One, the Lasso method, reduces the supposed significance of certain genes to zero such that a strict subset of genes is left. Other methods, like ridge regression, reduce the supposed significance to values close to zero, but not quite zero, such that it is harder to rule out which genes affect the disease. The focus in this thesis is on the Lasso alternative because of the useful variable selection property when coefficients are set to zero, and because it is the method of choice in a recent and well-acknowledged paper[1]. In that paper the Lasso is applied to colorectal cancer (CRC) data[1] to determine which genes make up a prognostic gene signature. The study concluded with a seemingly useful answer, a prognostic 7-gene signature, despite the fact that repeated application of the Lasso gave different answers. It also raised several questions:

• Why is the result so different when the input is identical every time? The instability is worrying.

• Is the method dependent on details of the data set?

• Should other, or modified, methods be applied to verify the result?

In this thesis, the reader is thought of as a person with a certain interest in and understanding of basic statistical topics such as linear regression and probability models. These topics are well explained in [2]. No prior knowledge about biology or genetics is required to understand and find interest in the study. There will be few biological details, but the necessary information will be explained.

1.1 Motivating example

Globally, about 1.2 million people are diagnosed with CRC and about 600,000 die of the disease every year. It is the fourth most common cause of death by cancer, more prevalent in developed countries than in less developed countries[3].1 Four out of five CRC patients are treated with surgery where the tumour is removed[5].

It is common to use cancer stage as a predictor of the success of treatment of CRC[6]. This is unreliable, however, and it would be useful if genomic data could be utilised to see if the patient is likely to respond well to treatment.

A paper published in 2012 by Sveen et al.[1] uses such data from 95 patients to find a connection between treatment success and gene expression. Specifically, a gene signature (i.e. a combination of genes) predicting the treatment result is sought. The authors' aim was to divide the patients into a high risk group and a low risk group. The paper uses the R package penalized to perform the cross-validation and Lasso regularisation. There is also another package called glmnet, which is applied to the same sort of problems as penalized and is authored by Friedman, Hastie and Tibshirani, the latter known for proposing the Lasso, see Section 2.2.2. glmnet has been used among others by [7][8]. For more information about the R packages, see Section 2.3.

1 The disease is strongly correlated with a western diet consisting of e.g. red meat, fat and alcohol[4].

When the two packages are applied to the same data and parameters, the results may differ, as described below and in Section 3.1.

1.1.1 Rough summary of the method used in Sveen’s paper

Filtering. The data set used in the CRC paper[1] initially contains information about the expression level of about 20,000 genes in 95 patients. Genes with p-values from Cox regression greater than 0.5 or variance less than 0.2 were considered non-interesting, see Figure 1.1. Such p-values suggest that the genes linked to them probably are irrelevant to any difference between the patients.

Note that the null hypothesis here is that each gene is not relevant to the survival of patients.

Similarly, low variance genes are unlikely to be relevant to the response. Consequently, genes that differ between the patients are sought.

The R package penalized. After filtering based on variance and p-value, Sveen et al.[1] were left with a little more than 3,000 genes. Starting with these genes, the challenge was to end up with a predictive model with just a few explanatory variables: the genes relevant to the response. In order to achieve this, they used the optL1 function in the R package penalized. optL1 is an implementation of the Lasso method, see Section 2.2.2, and can perform Cox regression as well as other types of regression.

Lasso. Lasso introduces a penalty parameter λ > 0 which decides how strict the penalty is for including more genes in the model. Varying λ will vary the number of genes included. The optL1 function basically works like this:

1. Divide the data into subsets, e.g. 10 groups of size 9-10 (95 patients in total).

2. For a fixed λ, use the Lasso on the data from all patients, except those in one of the subsets, to fit a (Cox) model.

3. Find the error on the excluded subset.

4. Repeat step 2 and 3 until all subsets have been removed once.

5. Sum the errors.

This process is called cross-validation (CV), see Section 2.2.7, and is repeated on a sequence of reasonable λ-values. The λ that gave the smallest error is noted and used to make a predictive model based on the complete data set. A rough sketch of this loop is given below.
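The following is only a sketch of the principle of the CV-over-λ loop described above, not the actual Cox/penalized pipeline used by Sveen et al.: a continuous response and the glmnet fitter are used purely for illustration, and all data are simulated.

library(glmnet)

set.seed(1)
n <- 95; p <- 200
X <- matrix(rnorm(n * p), n, p)          # simulated "gene expression" matrix
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)    # simulated continuous response

K <- 10
folds <- sample(rep(1:K, length.out = n))               # step 1: random division into K subsets
lambdas <- exp(seq(log(2), log(0.01), length.out = 30)) # a grid of reasonable lambda values

cv_error <- sapply(lambdas, function(lam) {
  fold_errors <- sapply(1:K, function(k) {
    train <- folds != k
    fit <- glmnet(X[train, ], y[train], lambda = lam)   # step 2: Lasso fit with fixed lambda
    pred <- predict(fit, X[!train, ])                   # step 3: error on the left-out subset
    mean((y[!train] - pred)^2)
  })
  sum(fold_errors)                                      # steps 4-5: all subsets used once, errors summed
})

best_lambda <- lambdas[which.min(cv_error)]
final_fit <- glmnet(X, y, lambda = best_lambda)         # final model fitted on the complete data set

Rerunning this sketch with a different random fold division can change best_lambda, which is exactly the kind of instability studied in this thesis.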


Figure 1.1: Flow of optL1. 1. The original gene set consisting of nearly 20,000 genes is filtered to include only those that vary the most amongst the individuals and with small Cox p-values. 2. These are then run through CV and the Lasso method 1000 times, taking note of the size of the supposedly relevant gene signature (group) each time. 3. The result is presented using a bar diagram.


Variation in results. Ideally, there would be just one solution, concluding with a suitable prediction model. However, for every new call to optL1 the result varied, similarly to what is shown in Figure 1.2.

To obtain a conclusive answer, Sveen et al. repeated the Lasso a thousand times and took note of the signature size (the number of coefficients different from zero) each time. Eight gene expression signatures of size zero to twelve showed up more than fifty times each. These signatures were then run on a first validation series; the five larger signatures were significant here. The 7-gene expression signature gave a prognosis closest to the actual survival outcome. Note that the smaller signatures were always contained in the larger ones. A third run on a second validation series verified that the 7-gene signature gave a reasonable prognosis.

Reproduction. In the attempt to reproduce the results, the same recipe was followed and nearly the same result was achieved. The seven genes that make up the prognostic gene expression signature in the paper are the same seven that were left in the 7-gene signature in my calculations. There were fewer hits on the 7-gene signature, however. It was not the one with the most hits in the paper either, but the next steps in [1] were used to justify the result.

Unstable Lasso results permeate my calculations, as well as those performed by the paper's authors. Thus, in this thesis, an attempt is made to unravel the reasons for the instability, and an answer to the question “Is a result based on such Lasso calculations reliable?” is sought.

1.2 Chapter overview

Chapter 2 gives a general introduction to the required biological and statistical concepts. Lasso and CV have already been mentioned in the introduction, but the motivation for choosing these methods and the theory behind them will be given more attention there. Also, there is an introduction to the R packages penalized and glmnet, which perform Lasso regularisation and CV, as well as other types of regularisation. Chapter 3 discusses the methods of investigation and how to display the results. Data is simulated to mimic real data so that the properties of the data set can be controlled and the effects of these can be measured. Simulated data sets also ease the process of testing CV parameters on different data sets. Chapter 4 presents the results from the investigations described in Chapter 3. Then a conclusion on the question “Is the algorithm described in Sveen et al.'s paper[1] reliable?” is approached. This is discussed in Chapter 5. In Chapter 6 the results are further discussed and suggestions for further work are presented.


Figure 1.2: Signature size bar diagram. The numbers along the x-axis are the sizes of the gene signatures (the number of genes included by Lasso). The height of each bar describes the number of times a signature with the given number of genes is found. The 0-, 1- and 2-signatures are the most frequent, but the 8-, 10-, 11- and 12-signatures are also interesting. Judging from this diagram, obtained by the author (and deviating slightly from the one obtained by Sveen et al.[1]), the signatures of size 3 to 7, 9 and larger than 12 are less likely to be of any interest.


Chapter 2 Background

Genes, gene expression, mRNA and other biological terms were mentioned in the introduction. In this chapter these terms are further explained. The statistical terms Lasso, Cox, cross-validation, etc. are also described.

2.1 Biological background

The genetic material in humans is stored as 46 chromosomes: 22 pairs of non-sex chromosomes (autosomes) and two sex chromosomes (XX for females and XY for males)[9]; see an example of a chromosome map in Figure 2.1. Each chromosome consists of a DNA molecule complexed with proteins and winds up to form chromatin. Most DNA, about 97% in humans, is non-coding1. The rest of the DNA is coding and forms genes. A gene is a subset of the chromosome DNA string and is the recipe of (a part of) a protein. The DNA string consists of nucleotides, where each nucleotide includes one of the nucleobases guanine (G), adenine (A), thymine (T) and cytosine (C). A DNA string is recognised by its order of nucleobases.

Through protein synthesis, the gene is read and instructs the creation of a protein. If a gene is missing, some proteins may not be made, or if the gene has been altered from the original, a protein may function differently. In cancer, there is a fault that promotes abnormal growth or reduces a malfunctioning cell's ability to carry out suicide (apoptosis). Usually, cells multiply and die in balance, at varying speeds in different parts of the body. In fact, during a period of a few years your whole body will consist of new cells, with some exceptions such as the brain.

The translation of DNA to proteins happens via mRNA, messenger RNA. mRNA is the complementary string of a DNA string (gene), i.e. the mRNA has nucleobases complementary to those of the DNA string. The mRNA is the gene expression. Some genes provide many copies of mRNA, whereas other genes generate few. This can vary between individuals and also over time in a cell. The number of mRNAs from a gene is called the gene expression level.

1 Non-coding DNA was earlier called "junk DNA" because it was unclear if it had any biological function. However, this has been disproved, and the term is no longer used.

A missing or incorrect gene is caused by a mutation. If the mutation causes the creation of a tumour, the mutation is tumourigenic.

2.1.1 Data structure

In a gene expression study involving p genes and n individuals, the gene expression levels may be expressed as an n × p matrix.

X = \begin{pmatrix}
      4400 & 340 & \cdots & 40 \\
      4450 & 2   & \cdots & 41 \\
      \vdots & \vdots &  & \vdots \\
      4349 & 358 & \cdots & 43
    \end{pmatrix}  \qquad (2.1)

where row i contains the expression levels of genes 1, . . . , p for patient i (rows: patients 1, 2, . . . , n; columns: genes 1, 2, . . . , p).

The survival times of all observations may be expressed in a column vector:

(y_1, y_2, \ldots, y_n)^T = (8, 3, \ldots, 10)^T

For more information about survival times and other survival analysis terms see Section 2.2.3.

Suppose the patients should be divided into two or more groups, for example a high risk group and a low risk group. In the above example, gene 1 is expressed at about the same level in patients 1, 2 and n, and say this is the case for patients 2, 3, . . . , n−1 as well. Then the gene most likely has nothing to do with the response of interest. (It may have something to do with the survival time, but it is the differences between the patients that are interesting. That way the observations can be split into different groups, which means that common factors can be ignored.)

The difference in gene 2 is interesting, however. The gene is hardly expressed in patient 2, relative to patients 1 and n. It is now interesting to consider:

• What is the normal gene expression level?

• Which patient(s) differ from this?

• Does it have anything to do with the response, or is the high/low gene expression level harmless? Say that patient 2 did not survive as long as the other patients. Can this be explained by the difference in gene 2?

Working with the CRC data[1], interesting gene expression levels are sought to see if they are linked to the response. Can genes that predict the survival time of the patient be found?
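As a minimal sketch of this data layout in R: the expression values below are the made-up numbers from (2.1), and the censoring indicators are purely hypothetical; the Surv object is the response format used by penalized and glmnet later in this thesis.

library(survival)

X <- matrix(c(4400, 340, 40,
              4450,   2, 41,
              4349, 358, 43),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("patient 1", "patient 2", "patient n"),
                            c("gene 1", "gene 2", "gene p")))
time  <- c(8, 3, 10)        # observed survival times, as in the column vector above
event <- c(1, 1, 0)         # 1 = event (death), 0 = censored (hypothetical values)
y <- Surv(time, event)      # survival response object combining time and censoring status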


Figure 2.1: Sketch of the protein synthesis. 1. Each chromosome contains a long DNA string. 2. Some parts of the chromosome are called genes; they code for proteins. 3. The gene (DNA substring) is transcribed to mRNA, a complementary string. 4. The mRNA is translated to a protein consisting of a chain of amino acids. Chromosome map from [10].


2.2 Statistical background

Survival analysis is the underlying theme of this study. In survival analysis the data set contains information about a set of patients: a number of covariates (in our case the expression level of one or more genes), an observation time and whether the observation time reflects time to a specific event (such as death) or time to censoring (e.g. due to the subject being alive at the end of the study period).

The challenge is to find out if there is a link between the covariates and survival, and if so to find that link. In order to perform the analysis, a presentation of the statistical tools used follows in this section.

2.2.1 Introduction

A regression model is a way of explaining the connection between the covariates and the response: in our case the gene expression levels and survival time. A linear regression model looks like this

y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \epsilon = \beta_0 + \beta^T x + \epsilon. \qquad (2.2)

The y represents the response, e.g. some clinical variable measured on a continuous scale. The β_i s are the weights of the explanatory variables, the x s, e.g. gene expression levels. Often, β_0 is included in β by adding an element, 1, to x, making y = β^T x + ε. ε is the error, or noise, term. When there is more than one observation available, the β_i s can be estimated using the least squares method.
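As a tiny illustration of (2.2) and the least squares fit in the ordinary n > p setting (all values simulated, unrelated to the thesis data):

set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(50)   # true beta_0 = 1, beta_1 = 2, beta_2 = -0.5
fit <- lm(y ~ x1 + x2)                    # least squares estimates of the betas
coef(fit)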

2.2.2 The Lasso method

When dealing with genetic data there will often be information about many more genes than observations, i.e. p >> n. The problem of fitting a regression model such as the Cox model may then be compared to solving a system of equations with more unknowns than equations, which makes it possible to choose the values of the unknowns in infinitely many different ways. If no precautions are taken, this process may result in a fit adapted to the noise instead of, or in addition to, the variables that are actually related to the response. The resulting model may be overfitted, and will be useless for prediction in the general case[11]. For further motivation, see Section 2.2.7.

In order to settle on a few well-chosen variables, the explanatory variables that are irrelevant to the response should be removed. There are several ways to do this; some include shrinking the regression coefficients until they approach zero, while others actually reduce them to zero. For more information, see Section 2.2.6. Lasso is one of the latter methods.


Lasso

The Lasso (Least Absolute Shrinkage and Selection Operator) estimate for ordinary continuous data is defined by[12][13]:

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2, \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t, \qquad (2.3)

where y_i is response i and x_{ij} is the jth covariate of observation i. The difference y_i − β_0 − \sum_{j=1}^{p} x_{ij}β_j is actually the difference between the observed response y_i and the predicted response ŷ_i = β_0 + \sum_{j=1}^{p} x_{ij}β_j. So the solution β̂ to the problem is the β that minimises the error, under the constraint \sum_{j=1}^{p} |β_j| ≤ t.

By choosing t very large, there will be no constraint at all and the fit will approach that of an ordinary linear regression model. But by letting t → 0, some of the coefficients β_j will move towards zero, which is the intent. In the extreme case of t = 0, the β_j s will all be forced to be zero, and the model has no variables with non-zero coefficients.

Notice that Lasso may be written in the Lagrangian form

\hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}. \qquad (2.4)

The solution to this is equivalent to (2.3) by application of the Lagrange multiplier theorem[13]. Decreasing t is the same as increasing λ; λ → ∞ ⇔ t = 0, because a large λ penalty will lead the minimisation problem to choose β_j = 0 for all j, making the term λ \sum_{j=1}^{p} |β_j| as small as possible, zero. Conversely, λ = 0 ⇔ t → ∞ represents having no penalty at all. The term λ \sum_{j=1}^{p} |β_j| is called the L1 penalty.

To find β̂, methods based on quadratic programming may be used[12]. They are iterative, and in every iteration a least squares problem is solved.
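As a hedged illustration of the variable selection property: glmnet's lambda plays the role of λ in the Lagrangian form (2.4), and the number of non-zero coefficients shrinks as the penalty grows. The data below are simulated; this is not the thesis pipeline, just the principle.

library(glmnet)

set.seed(1)
X <- matrix(rnorm(50 * 100), 50, 100)     # p >> n
y <- X[, 1] + 0.5 * X[, 2] + rnorm(50)
fit <- glmnet(X, y)                        # alpha = 1, i.e. the Lasso, is the default
# lambda path and the corresponding number of non-zero coefficients (fit$df);
# the largest lambdas give the smallest (or empty) models:
head(cbind(lambda = fit$lambda, nonzero = fit$df), 10)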

There exist variations of the Lasso, such as the adaptive Lasso. The adaptive Lasso method adds weights to the restriction \sum_{j=1}^{p} |\beta_j| \le t, so that it instead becomes \sum_{j=1}^{p} |\beta_j| \tau_j \le t, with \tau_j > 0 for all j. The choice of weights is very important, and Zhang and Lu[14] propose using \tau_j = 1/|\tilde{\beta}_j|, which is just one of many possible choices. They motivate the choice by the consistency of the \tilde{\beta}_j s; the \tilde{\beta}_j s reflect the importance of the covariates.2

2 |\beta_j|/|\tilde{\beta}_j| \to I(\beta_j \neq 0) = \{1 when \beta_j \neq 0; 0 when \beta_j = 0\} as n \to \infty.

2.2.3 Survival analysis

Survival models in cancer usually describe the risk of death or relapse; from here on, death will be used as an example, as in [1]. The survival function is the risk of death occurring later than a given time t:

S(t) = P(T > t), \qquad (2.5)

where T is a random variable representing the time of death[15]. Note that the survival function S is decreasing, approaching zero as no person can live forever. However, survival models may be applied in other areas, such as mechanical engineering, where it makes more sense to talk about eternal life. Also, S(0) is assumed to be 1, except if there is a chance of an immediate event.

The hazard rate reflects the death intensity in a small time interval [t, t + dt), given that the individual has survived up until time t. Here dt is approaching zero, so the hazard rate represents the instantaneous death rate. This function is often denoted λ:

\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \qquad (2.6)

which can be rewritten by following [13]:

P(t \le T < t + \Delta t \mid T \ge t) = 1 - P(T \ge t + \Delta t \mid T \ge t) = 1 - \frac{P(T \ge t + \Delta t)}{P(T \ge t)} = 1 - \frac{S(t + \Delta t)}{S(t)}

By putting both parts in the same fraction,

\lambda(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \cdot \frac{S(t) - S(t + \Delta t)}{S(t)} = -\frac{1}{S(t)} \lim_{\Delta t \to 0} \frac{S(t + \Delta t) - S(t)}{\Delta t} = -\frac{S'(t)}{S(t)}, \qquad (2.7)

where the definition of the derivative S'(t) = \lim_{\Delta t \to 0} (S(t + \Delta t) - S(t))/\Delta t was applied in the last transition.

The relation between the hazard rate λ(t) and the survival function S(t) is useful. By the chain rule, we know that \frac{d}{dt} \log S(t) = S'(t)/S(t), and

\lambda(t) = -\frac{S'(t)}{S(t)} = -\frac{d}{dt} \log S(t), \qquad (2.8)

which implies that

S(t) = \exp(-\Lambda(t)), \qquad \Lambda(t) = \int_0^t \lambda(s)\, ds. \qquad (2.9)

Λ(t) is called the cumulative hazard, and it follows from (2.8) and (2.9) that the relation between the cumulative hazard and the survival function is Λ(t) = −log S(t).


Another useful relation is that the lifetime distribution function F(t) is the complement of the survival function: F(t) = 1 − S(t). By this, the density function is f(t) = −S'(t). This is the foundation of an equivalent version of the hazard rate:

\lambda(t) = \frac{f(t)}{S(t)} \qquad (2.10)
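As a quick sanity check of (2.8)-(2.10), consider the special case of a constant hazard (the exponential model; not used in this thesis, but a standard textbook case):

\lambda(t) = \lambda \;\Rightarrow\; \Lambda(t) = \int_0^t \lambda\, ds = \lambda t, \qquad
S(t) = e^{-\lambda t}, \qquad f(t) = -S'(t) = \lambda e^{-\lambda t}, \qquad
\frac{f(t)}{S(t)} = \lambda.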

Survival models are based on survival data, i.e. information about a set of observations, usually patients. This includes a censor indicator which tells you if a patient has been withdrawn from the experiment for some reason. Death by other causes than the illness of interest, a patient failing to turn up for check-ups or voluntarily withdrawing from the experiment may all be causes for censoring.

Usually the events are denoted with ones and zeros: a one represents an event (death) and a zero represents censoring.

In addition to this information, there is usually information about gender, age, gene expression levels, as described in Section 2.1, and the time until withdrawal or event (death), called the survival time. The survival time is usually the time from the patient's entry into the study until event or censoring. A patient may be entered into a study at the time the tumour was discovered. Other time scales, such as age, may also be applied.

2.2.4 The Cox model

The most common regression model for survival data is the Cox model, because it makes no assumptions about the baseline hazard λ_0(t); it is semi-parametric, as described in this section. According to this model, the hazard function is

\lambda(t \mid x_1, \ldots, x_p) = \lambda_0(t) \exp(\beta_1 x_1 + \ldots + \beta_p x_p), \quad \text{or} \quad \lambda(t \mid \mathbf{x}) = \lambda_0(t) \exp(\beta^T \mathbf{x}). \qquad (2.11)

Here, x = (x_1, . . . , x_p) are the p covariate values for each individual, and λ_0(t) is the baseline hazard. Notice that if β = 0, then λ(t|x) = λ_0(t). Consider two individuals with covariate vectors x_a and x_b. Then

\frac{\lambda(t \mid x_a)}{\lambda(t \mid x_b)} = \frac{\lambda_0(t) \exp(\beta^T x_a)}{\lambda_0(t) \exp(\beta^T x_b)} = \exp\left(\beta^T (x_a - x_b)\right). \qquad (2.12)

Thus the ratio of the hazards of two individuals is independent of time. Suppose the ith covariate increases by 1 and all other covariates are identical in the two individuals, i.e. x_{a,i} = x_{b,i} + 1 and x_{a,k} = x_{b,k} for k ≠ i. Then, by inserting into (2.12):

\frac{\lambda(t \mid x_a)}{\lambda(t \mid x_b)} = \exp(\beta_i), \qquad (2.13)

which is called the hazard ratio for the ith covariate.


The partial likelihood

It is a complicated task to estimate the regression coefficients β from the hazard function (2.11), as it is a function of more than one parameter. Cox observed that, with the independence of time, it is possible to estimate β without modelling the hazard function, and instead maximise the partial likelihood[16]. This approach is called the Cox proportional hazards model[15].

The partial likelihood is a function of β:

L(\beta) = \prod_{i : C_i = 1} \frac{\theta_i}{\sum_{j : Y_j \ge Y_i} \theta_j}, \qquad \theta_k = \exp(\beta^T x_k), \qquad (2.14)

where x_1, . . . , x_n are the covariate vectors (column vectors) for the n observations, each consisting of p covariates. Y_k is the observed time for the kth individual, which is censored if C_k = 0. C_i = 1 means that individual i is non-censored, i.e. an event has taken place. The notation δ_i ∈ {0, 1} is also common for the censoring status.

The partial likelihood is, in words: for all non-censored individuals i, multiply together the hazard ratio for i divided by the sum of the hazard ratios of all individuals still alive at the time of event i (both censored and non-censored).

The log partial likelihood is given by

l(\beta) = \sum_{i : C_i = 1} \Big( \beta^T x_i - \log \sum_{j : Y_j \ge Y_i} \theta_j \Big) = \sum_{i : C_i = 1} \varphi_i(\beta), \qquad (2.15)

which we may maximise to find the maximiser of (2.14).
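A minimal sketch of fitting an unpenalised Cox model in R and reading off hazard ratios as in (2.13), using the survival package on simulated data (standard usage, not the thesis code):

library(survival)

set.seed(1)
n <- 60
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
time  <- rexp(n, rate = exp(0.7 * x1))     # hazard increases with x1
event <- rbinom(n, 1, 0.8)                 # some observations censored
fit <- coxph(Surv(time, event) ~ x1 + x2)  # maximises the partial likelihood (2.14)
exp(coef(fit))                             # hazard ratios exp(beta_i), cf. (2.13)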

2.2.5 Lasso and the Cox model

So far, the general linear regression case of Lasso has been described. In Section 2.2.4, the Cox model was introduced which may also be combined with Lasso.

Tibshirani[16] proposes to estimate β̂ by

\hat{\beta} = \arg\max_{\beta}\, l(\beta), \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t, \qquad (2.16)

where l(β) is the log partial likelihood given in (2.15). To find β̂, Tibshirani[16] proposes to first fix t and initialise β̂ = 0, then minimise (z − η)^T A (z − η), where z, η and A are based on our existing knowledge of X, β = β̂ and l, subject to \sum_{j=1}^{p} |\beta_j| \le t, through a quadratic programming procedure. The minimisation is repeated until β̂ converges.

Tibshirani was not the first person to consider quadratic programming as a solution to such a problem, or even to represent the problem with an L1 penalty (Lasso). The first use of linear programming was by Leonid Kantorovich in 1939, who used it to plan effective moves during World War II.

A basic linear programming problem can look like this:

\max f(x, y) = \max\,(2x + 3y), \quad \text{s.t.} \quad g(x, y) = 4x + 0.5y \le 10, \quad x, y \ge 0. \qquad (2.17)

This problem can be solved by the simplex algorithm.

Application of the Lagrange multiplier theorem, as mentioned in Section 2.2.2, can transform the problem into an equivalent problem:

\Lambda(x, y, \lambda) = f(x, y) + \lambda \cdot (g(x, y) - c) = 2x + 3y + \lambda(4x + 0.5y - 10) \qquad (2.18)

The minimisation points (x, y) of this problem give the solution to (2.17). Note that this is similar to the Lasso formulae in (2.3) and (2.4). The Lasso problem forms a quadratic programming problem instead, because the function in (2.16) is quadratic.

2.2.6 Other regularisation methods

The Lasso is not always the best choice. In a situation where variables are correlated within groups, Lasso tends to pick one of these and discard the rest. A proposed zero-coefficient therefore does not necessarily mean that the variable is irrelevant; it could just be correlated with another variable which has a non-zero coefficient.

In the case where p > n, i.e. the number of genes is greater than the number of observations, the Lasso includes a maximum of n (the number of observations) variables.

Zou and Hastie[17] argue that this is a limiting feature of the method.

There are other methods that avoid these issues, such as ridge regression[18] and the elastic net[17].

Ridge regression

Ridge regression maintains the number of variables in the model; it just shrinks the coefficients. All the variables are included in the final model. However, it is not necessarily easy to see which are more relevant to the survival response. The coefficient sizes may say something, but must be seen in relation to the covariate values. A large coefficient combined with a small covariate may make only a small contribution to the overall predicted effect, and vice versa.

The corresponding expression to (2.4) is:

\hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \qquad (2.19)

The term \lambda \sum_{j=1}^{p} \beta_j^2 is called the L2 penalty.


Elastic net

The elastic net is one attempt to combine the L1 and L2 penalties of Lasso and ridge regression. If some variables are correlated, more variables than in Lasso will tend to be included in the final suggested fit. Still, like Lasso, it reduces the number of variables in the model[17].

(Naïve) elastic net is defined by:

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2, \quad \text{s.t.} \quad (1-\alpha) \sum_{j=1}^{p} |\beta_j| + \alpha \sum_{j=1}^{p} \beta_j^2 \le t, \qquad (2.20)

where α is a constant defined by the restrictions laid upon the penalties, α = λ_2/(λ_1 + λ_2). λ_1 is the Lasso penalty parameter and λ_2 is the ridge penalty parameter. Leaving one of them at zero leads to the corresponding regularisation method:

• λ_2 = 0 ⇒ α = 0 ⇒ Lasso
• λ_1 = 0 ⇒ α = 1 ⇒ ridge regression.

Comparison to (2.3) shows the similarity to Lasso and ridge regression.

The naïve elastic net performs poorly when it differs greatly from either Lasso or ridge regression. To eliminate this disadvantage, a suggestion is to scale βˆ by a factor[17].
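As a hedged sketch of how the three penalties can be compared in practice, glmnet exposes a mixing parameter alpha; note that its convention is the reverse of the α in (2.20), so alpha = 1 gives the Lasso, alpha = 0 gives ridge regression and values in between give an elastic net. The data are simulated.

library(glmnet)

set.seed(1)
X <- matrix(rnorm(60 * 200), 60, 200)
y <- X[, 1] + X[, 2] + rnorm(60)           # only the first two covariates are relevant
for (a in c(1, 0.5, 0)) {                  # Lasso, elastic net, ridge (glmnet convention)
  cvfit <- cv.glmnet(X, y, alpha = a)
  nz <- sum(coef(cvfit, s = "lambda.min") != 0) - 1   # minus the intercept
  cat("alpha =", a, "-> non-zero coefficients:", nz, "\n")
}
# Ridge (alpha = 0) keeps all 200 coefficients non-zero; the Lasso keeps only a few.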

2.2.7 Cross-validation

When fitting a model on a training set, we wish to minimise the difference between the prediction and the actual observation, i.e. y_i − ŷ_i. One way to do this would be to minimise the mean squared error (MSE):

\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2, \qquad e_i = y_i - \hat{y}_i.

In a situation where p >> n, however, the MSE would be zero for several different values of β.

Such a procedure would be less suitable for any other data. Adding a penalty, such as the Lasso (see Section 2.2.2), will restrict the model. However, we can just choose λ = 0 in (2.4), and we are left with exactly the same problem as before.

Even if we did not have a possibly overfitted model, training the model on a specific data set may make it a poor model for the general case. Having a separate test set would be beneficial to test the model's general usefulness. That way, we could set the model parameter λ to a specific positive value, leading to f(λ, x_i), and try it out on a test set by estimating the MSE

MSE = \frac{1}{n} \sum_{j=1}^{n} \left( f(\lambda, \mathbf{x}_j) - Y_j \right)^2, \qquad (2.21)

where (x_j, Y_j) are the observation pairs in the test set. Set λ to a new value, make a new model, run this on the test set, and find the MSE. Iterate this until the λ that gives the minimum MSE has been identified.

The aim of this is to avoid testing the model on the same data that it was trained on. Unfortunately, the general solution is not yet obvious. The test set may not be representative of the general case; in fact, usually it is not. And just optimising the model on this one data set would make it suitable for the test set, but not in general. The dream scenario would be to have infinitely many different data sets to test the model on and to find the model that fits best. With that information, the predicted squared error (PSE) could be calculated; the PSE is the error of the model tested on future observations. Then we would have a model that performs well on any data set.

Having infinitely many data sets is impossible. Instead, we can use the one data set available to both train and test the model in several iterations, so-called cross-validation (CV).

The CV process

An illustration of the CV flow can be found in Figure 2.2. The general idea of CV is to separate m observations from the data set, make a model based on the remaining n − m and test it on the m observations outside the data set. The left-out group of size m is called the test set, while the remaining group of size n − m is called the training set. Repeat this such that all observations are used once as the test set.

Say m = 1, which means there are K = n groups of size 1. This is called leave-one-out cross-validation (LOOCV). (Later, K-fold CV will be introduced.) Say we have a penalised model that depends on a parameter λ:

f(\lambda, \mathbf{x}_i) = \beta(\lambda)^T \mathbf{x}_i, \qquad (2.22)

which is similar to the regression model presented in Section 2.2.1. CV can be used to find the λ that, inserted in f, minimises the CV error,

CV(\lambda; X, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f^{(-i)}(\lambda, \mathbf{x}_i) \right)^2. \qquad (2.23)

f^{(-i)} is the model trained on x_j for j = 1, . . . , n except for i. Fix λ, leave out observation i, train a model on the remaining data, and test on i. Do this for every i = 1, . . . , n, sum up the error terms and find the average error. Try a new value for λ and repeat.

Figure 2.2: Flow of CV. First, the data set is split into groups, then a λ is chosen. One group is marked as the test set, the other groups as the training set. A model based on the training set is used to predict the values in the test set and an error is recorded. Do this until all groups have been used as the test group and find the average error for this λ. Fix a new λ and repeat. Finally, choose the λ that gives the smallest error.

By calculating CV for several values of λ, we can choose the λ that gives the lowest value of CV, and fit the final model f(λ, X) based on data from all observations.

An important property of CV(λ) is that

E(CV(\lambda)) \approx PSE(\lambda), \qquad (2.24)

where PSE is the difference between the model and the unobservable real model.

In words, the expected shape of the curve of CV(λ) is approximately equal to the PSE, see Figure 2.3. CV(λ) is easy to calculate, even though PSE(λ) is not.

By performing CV, we find a good estimate of how the predicted error behaves.

Figure 2.3: Example of how CV and PSE may behave, according to (2.24). The expected shape of CV is approximately the shape of PSE.

K-fold cross-validation

Choosing K = n, i.e. LOOCV, can be time consuming for large n and if there are many λ_k values over which to optimise λ. Instead, it is common to choose e.g. K = 10, such that for every iteration we remove m ≈ n/K individuals instead of just one. Only K iterations are needed to calculate CV(λ; X, y), instead of n. This will reduce the run time, and lower the variance of the prediction errors. LOOCV will result in higher variance because the fits are trained on n nearly equal data sets[11]. For time measurements, see Section 3.1, especially Figure 3.1.

If the sub data set used as test data is very small, the risk of having a disproportional number of censored events present is greater, i.e. a test data set with nearly no events is a non-representative test set. How the censorings affect the model depends on what kind of CV is performed, see the list of alternatives in Section 2.2.8. This can cause the wide range of suggested signatures in K-fold CV where K is large.
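A small sketch of the random K-fold division itself, which is the only random ingredient when the data set is fixed, and of how unevenly a (hypothetical) censoring pattern can be spread over the test folds:

set.seed(1)                     # without a fixed seed, every run gives a new grouping
n <- 95; K <- 10
folds <- sample(rep(1:K, length.out = n))
table(folds)                    # fold sizes of roughly n/K, i.e. 9-10
event <- rbinom(n, 1, 0.6)      # hypothetical censoring pattern (1 = event)
table(folds, event)             # some test folds may contain very few events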

2.2.8 Cross-validation and the Cox model

So far, CV has been introduced in combination with the linear regression model. Now, we will look at Cox regression. Recall (2.15): the Cox partial log likelihood is defined by

l(\beta) = \sum_{i : C_i = 1} \Big( \beta^T x_i - \log \sum_{j : Y_j \ge Y_i} \theta_j \Big) = \sum_{i : C_i = 1} \varphi_i(\beta), \qquad (2.25)

where θ_j = exp(β^T x_j). With the Lasso penalty, it is

l_{pen}(\beta, \lambda) = \sum_{i : C_i = 1} \varphi_i(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (2.26)

For more information, see Section 2.2.2. The regression model f is now replaced with the partial likelihood l, with penalised likelihood lpen.

The purpose of using CV to find the best-fitting Cox model is to find the optimal model parameter λ, as for linear regression. Usually, as in the case of (2.22), the likelihood contribution of the ith individual is independent of the other individuals (when i ≠ j). So, when cross-validating, it is easy to form the expression f^{(-i)} where observation i is excluded, see (2.23). But, because the second part of φ_i(β), i.e. \log \sum_{j : Y_j \ge Y_i} \theta_j = \log \sum_{j : Y_j \ge Y_i} \exp(\beta^T x_j), depends on information about observations other than i itself (those in the still-at-risk group), we cannot use CV directly as described earlier, where each observation is simply removed.

Three variations of CV for the Cox model are described below. Every variation maintains some quality of general CV, at the cost of others. Common to all CV variations is that the λ that maximises the cross-validated likelihood (CVL) is the λ that gives the best model.

1. Kuk[19] presented a type which is based on changing the status of one observation i from non-censored to censored, which is done for each non-censored observation. Changing the event indicator is the action corresponding to removing observation i from the set as in ordinary CV. This means that individual i affects the risk of all individuals still at risk until the time of event i (which is no longer considered an event).

CVL = \sum_{i : C_i = 1} \varphi_i\!\left( \hat{\beta}(\lambda)^{(i)} \right), \qquad (2.27)

where β̂(λ)^{(i)} is estimated by maximising the penalised partial likelihood (2.26) for a given parameter λ, based on the data set where censoring indicator i has been changed from non-censored to censored[11]. This method is similar to ordinary CV for a general likelihood model. The fact that individual i was alive up until t_i is used, but not the fact that he/she dies.

2. Another option was presented by Verweij and Van Houwelingen[20]. This takes into account that the components of the partial likelihood affect each other. The effect of observation i is defined as l_i(β) = l(β) − l_{−i}(β), where l_{−i}(β) is the log-likelihood when observation i is left out. Linked to this, the β̂ that maximises l_{−i}(β) is denoted β̂^{(−i)}. The CVL is defined by \sum_{i=1}^{n} l_i(\hat{\beta}^{(-i)}), and considering the penalised likelihood:

CVL = \sum_{i=1}^{n} l_i\!\left( \hat{\beta}(\lambda)^{(-i)} \right), \qquad (2.28)

where β̂(λ)^{(−i)} is estimated by maximising the penalised partial likelihood (2.26) for a given parameter λ.

3. One version was presented in Hege L. Størvold’s master thesis[11]. This is based on the idea of general CV. We ignore that the contribution of one observation may affect the likelihood of another, so for every calculation of the likelihood in the loops of the CV, observation i is removed entirely, as if it did not exist at all.

CVL = \sum_{i : C_i = 1} \varphi_i\!\left( \hat{\beta}(\lambda)^{(i)} \right) \qquad (2.29)

This is a hybrid of the previous CV variations; the formula is similar to Kuk's proposal[19], but β̂(λ)^{(i)} is calculated as in Verweij and Van Houwelingen's proposal[20], where β̂(λ)^{(i)} is estimated by maximising the penalised partial likelihood (2.26) for a given parameter λ based on the data set where observation i is left out. Maximisation of the penalised log-likelihood is a common way to find the best λ[21].

In summary, CV for the Cox model may ignore that the observations may affect the likelihood of each other (3), a non-censored observation may be treated as censored, using only the fact that the individual was alive up until the event time (1), or we can take into account that the observations may affect the likelihood of each other (2).
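The computations in this thesis rely on the package implementations of these CVL variants rather than hand-written ones. As a hedged sketch, the penalized package exposes the cross-validated likelihood for a fixed penalty through its cvl() function (which follows the Verweij and Van Houwelingen approach), so a CVL curve over a small λ grid can be inspected roughly like this, on simulated data and with the same argument names as optL1/profL1:

library(penalized)
library(survival)

set.seed(1)
n <- 60; p <- 100
genes <- matrix(rnorm(n * p), n, p)
time  <- rexp(n, rate = exp(0.5 * genes[, 1]))
event <- rbinom(n, 1, 0.7)

lambdas <- c(2, 5, 10, 15, 20)
cvls <- sapply(lambdas, function(lam)
  cvl(Surv(time, event), penalized = genes, lambda1 = lam, fold = 10)$cvl)
cbind(lambdas, cvls)   # the lambda with the largest CVL gives the preferred model
# Note: each cvl() call draws a new random fold assignment, so rerunning this
# snippet can change which lambda wins - the instability studied in this thesis.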

2.3 Computations in R

There are several R packages that perform the statistics described in Section 2.2. The package penalized[22] is used in the CRC paper[1] and is the package mainly used in this study. In addition, the package glmnet has been tested to some degree.


2.3.1 penalized

The function optL1 from penalized is the one used by Sveen et al.[1]. It takes input including a Survival object (time and event) and covariates (X matrix, gene expression levels), and optionally an interval for the value of λ in (2.4) and K for K-fold CV. A typical call looks like this:

opt <- optL1(Surv(time, event), penalized = genes, fold = 10)

The genes matrix consists of elements where each row represents an observation and each column represents a gene, as in (2.1). The fold parameter, i.e. K, decides how many groups the data set should be split into in the CV process.

Recall the step-by-step algorithm from Section 1.1. If fold is set to n, the number of observations, it will result in LOOCV: in the CV, the data set is split into n groups of size 1.

Output

optL1 suggests a regression model and the coefficients for the covariates can be found by:

coeffs <- coefficients(opt$fullfit)

These coefficients are the β_i s from Sections 2.2.2 and 2.2.5 and describe the weights of the genes. Only the non-zero coefficients are included in the final model, and then length(coeffs) will obviously give the number of genes supposedly relevant to the response.

Challenges with optL1

As mentioned in the documentation[22], it is not certain that optL1 always manages to pick the λ that gives the global minimum error:

The optL1 [...] functions use Brent’s algorithm for minimization without derivatives [...]. There is a risk that these functions converge to a local instead of a global optimum. This is especially the case for optL1, as the cross-validated likelihood as a function of lambda1 quite often has local optima. It is recommended to use optL1 in combination with profL1 to check whether optL1 has converged to the right optimum.

We can take a look at the CVL path by applying the function profL1:

prof <- profL1(Surv(time, event), penalized = genes, step = 20, fold = opt$fold)


Figure 2.4: CVL plot from the profL1 function. The plot shows the CVL value for increasing values of λ. This should be as close to zero as possible, i.e. the maximum, which can be found at λ ≈ 15. The red line marks optL1's choice of λ.

The function takes nearly the same parameters as optL1, but to be able to see the previous object opt's path, the exact same folding must be used. Even if we call the function with the same number of folds, the grouping varies from time to time, so to avoid this we need to specify fold = opt$fold.

profL1 shows the steps of the optimisation. A greater value of the parameter step will show more detail. It is interesting to look at the plots of profL1, called by:

plot(prof$lambda, prof$cvl)
plotpath(prof$fullfit)

These calls give plots shown in Figures 2.4 and 2.5, respectively.

The maximum of the CVL plot in Figure 2.4 gives the optimal λ, which is about 15 in this particular case. By tracing the λ over to the plot to the right, and taking notice of how many lines are present at lambda1 = 15, we can see that many coefficients, more than ten, are non-zero.

The plot in Figure 2.5 is a nice way to illustrate the effect of the model parameter λ. As λ decreases, the number of non-zero coefficients increases. In the extreme case of λ = 0, there would be no constraint on the penalised expression \hat{\beta} = \arg\min_{\beta} \frac{1}{2} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| (see Section 2.2.2), making it unpenalised. By making λ greater, it affects the model more and more, and eventually the model would have no non-zero coefficients.

Figure 2.5: Path plot from the profL1 function. The λ that gave the maximum value from the plot in Figure 2.4, about λ = 15, represents a large gene signature. The red line marks optL1's choice of λ, which results in a gene signature of size 9.

Investigation of the optimisation path can detect errors where a local minimum error (the value marked with the red line in Figure 2.4) has given the suggested λ instead of the λ with the global minimum error. We may correct the call to optL1 by

# Find the lambda that gives the global minimum error
prlambda <- prof$lambda[match(max(prof$cvl), prof$cvl)]
opt <- optL1(Surv(time, event), penalized = genes, fold = 10,
             minlambda1 = (prlambda - 1), maxlambda1 = (prlambda + 1))

This will force the λ to stay in a close interval around the optimal λ found by profL1.

2.3.2 glmnet

glmnet uses an algorithm called coordinate descent to find the λ that minimises the error. This algorithm is faster than the one used by penalized, which uses the full gradient method in combination with Newton-Raphson.

The application of glmnet is similar to penalized[23].

cv <- cv.glmnet(genes, Surv(time, event), nfolds = 10, alpha = 1,
                family = "cox")                                      # Cross-validation
fit <- glmnet(genes, Surv(time, event), alpha = 1, family = "cox")   # Model

The CV finds an optimal λ:

> print(cv$lambda.min)
[1] 0.2221717

This value of λ can be traced to the corresponding value on the x-axis in Figure 2.6, marked by the vertical left line. As can be seen in the plot (at the top), the number of non-zero coefficients in the model is two. This is confirmed by:

> coeffs_vec <- as.matrix(coef(fit, s = cv$lambda.min))
> index <- which(coeffs_vec != 0)
> coeffs <- coeffs_vec[index]
> print(index)
[1] 112 199
> print(coeffs)
[1] -0.1114382 -0.2391665

The genes are represented in the 112th and 199th columns of the gene expression matrix genes, and the coefficient estimates are β_112 ≈ −0.11 and β_199 ≈ −0.24.


Figure 2.6: CV plot of the glmnet model. The left vertical line marks the minimum point, with x-value corresponding to the optimal λ. The right line marks the model which is one standard error away from the model that gives the minimum error. The number of non-zero coefficients in the model is shown at the top of the plot.


2.3.3 Differences

Both glmnet and penalized base their calculations on generalised linear models (GLM), and they both maximise the penalised log-likelihood described in Section 2.2.8, where λ is the penalty parameter corresponding to lambda in R.

The main difference between the two packages described in the documentation[22][24] is how they estimate the fits, i.e. β̂, given λ. penalized applies the full gradient algorithm and Newton-Raphson when the optimal value is approached.

glmnet applies coordinate descent, which is much faster. Also, the CV is different, but it was difficult to find documentation of this without thoroughly reading the source code.

The computations in this study show that the two R packages do not always give the same results, see Section 4.5.3, but it is difficult to understand why from the documentation. Even though λ is defined as the penalty parameter in the penalised log-likelihood in both packages, the values are very different from each other as the examples above have shown. This is also discussed in Section 4.5.3.

2.4 Recent work

Xu et al.[25] write that

A sparse algorithm cannot be stable and vice versa. [. . . ] In particular, our general result implies that L1-regularised regression (Lasso) cannot be stable, while L2-regularised regression [Ridge] is known to have strong stability properties and is therefore not sparse.[25]

Here, stability describes how the result of an algorithm varies from one data set to another data set which is nearly identical to the first. The instability discussed in this thesis instead describes how the fold assignment in CV affects the result of the Lasso, so-called model-selection variability. This topic has been discussed in [26][27].

Roberts[26] proposes to apply a method called the percentile-Lasso to overcome this instability, which he motivates with an illustration similar to Figure 4.17; this is more thoroughly discussed in Section 4.5.1.

Percentile-Lasso

This section is based entirely on [26]. CV's choice of λ seems to be somewhat small compared to the number of truly relevant covariates included in the prediction model. A larger λ would give fewer non-zero coefficients, and mostly it would be the coefficients of the irrelevant genes that would be reduced to zero first.

Percentile-Lasso can be a useful guide in the choice of λ. The algorithm is as follows:

1. Perform K-fold CV and Lasso. Take note of the choice of λ, denoted λ̂_1.


2. Repeat step 1 N times such that a vector Λ(N) = (λ̂_1, λ̂_2, . . . , λ̂_N) is obtained.

3. Find the θth largest value of Λ(N), denoted λ̂(θ).
4. Set λ = λ̂(θ).

An algorithm for how to choose θ is discussed in Roberts' paper, but a typical choice would be θ ≥ 0.75 · N, i.e. the 75th percentile or greater, reflecting how the normal Lasso tends to choose too small λs. A value as great as θ = 0.95 · N is proposed.
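A hedged sketch of the percentile-Lasso, here realised with cv.glmnet as the K-fold CV and Lasso step and with Cox regression on simulated data; Roberts' algorithm is not tied to any particular package, so this is just one possible implementation.

library(glmnet)
library(survival)

set.seed(1)
n <- 80; p <- 200
X <- matrix(rnorm(n * p), n, p)
time  <- rexp(n, rate = exp(0.6 * X[, 1]))
event <- rbinom(n, 1, 0.7)

N <- 50                                         # step 2: number of repetitions
lambda_hats <- replicate(N,                     # step 1 repeated: K-fold CV + Lasso
  cv.glmnet(X, Surv(time, event), family = "cox", nfolds = 10)$lambda.min)
theta <- 0.95
lambda_perc <- quantile(lambda_hats, theta)     # step 3: the 95th percentile of the chosen lambdas
fit <- glmnet(X, Surv(time, event), family = "cox", lambda = lambda_perc)  # step 4
sum(coef(fit) != 0)                             # size of the resulting gene signature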

Roberts discusses how the percentile-Lasso improves on the normal Lasso by decreasing the model-selection error and the variance of the size of the predictive models. It is also easy to apply on top of the normal Lasso. A downside is that the percentile-Lasso is more time-consuming than the normal Lasso in combination with CV. However, that should not be an issue, as modern computers are fast and, compared to the importance of a good prediction result, time is nearly irrelevant.


Chapter 3 Methods

There may be several reasons for the instability of the results from the Lasso in the paper about CRC[1], mentioned in the introduction. Some initial thoughts were discussed in Section 1.1. In this chapter there is first a description of some challenges when using K-fold CV and R; then follows a description of how simulations have been used to study properties of Lasso and K-fold CV and how properties of the data set affect the prediction result; and finally some graphical displays used later are presented.

3.1 Causes

3.1.1 K-fold cross-validation

The most worrying property of the results in [1] is that they are not unique. The same data set is sent into the R function optL1 every time, but the coefficients in the result vary. Some variation is expected because the grouping in the CV process is random. However, the level of variation seen in [1] is worrying, motivating a closer look at the CV process, including simulations.

In [1], K-fold CV, with K = 10, is repeated 1000 times. A threshold of fifty is chosen in Sveen's paper[1], and the eight signatures that appear more often than that are kept. Because there are validation series available, they are able to test the signatures on these. First, one of the validation series is used to pick out the signature that gives the smallest error in that data set. Then, the other validation series is used to verify that this signature is sensible.

In general, such validation series may be absent, and we would like to draw a conclusion based on our original data set alone. The purpose of the validation series is exactly that: to validate that the signatures behave satisfactorily on independent data.

Therefore, it is interesting to investigate how simulated data sets with predefined relevant covariates behave when run through optL1 with K-fold CV. Will there be instability in these cases as well? Should the unstable behaviour be expected? These questions are answered in Section 4.1, where knowledge of especially K-fold CV, as described in Section 2.2.7, and the simulations described in this chapter are applied.

           LOO                     10-fold          4-fold
n = 100    38.7 sec                4.8 sec          2.4 sec
n = 300    4.0 min                 8.8 sec          3.9 sec
n = 1000   3 hrs 23 min 29.2 sec   2 min 25.2 sec   60.5 sec

Figure 3.1: Average time spent by running LOOCV, 10-fold CV and 4-fold CV once on a simulated data set with information about 1000 genes.

One of the reasons for choosing something other than LOOCV is to save time. But how much time can actually be saved? Is the advantage of choosing K ≤ n/2 so great that it is worth downgrading predictive accuracy? Running a test in R, using optL1, with the same input every time (simulated survival data with 1000 genes from n individuals and their survival times), the elapsed time was measured, see Figure 3.1.1

It is not until there are more than 300 observations that LOOCV takes so much time that it may be sensible to choose K-fold CV (K ≤ n/2) and run it several times instead (to see which result stands out). The time difference between 10-fold and 4-fold CV is considerable, however, so as long as the properties of 4-fold CV are as good as those of 10-fold CV, 4-fold CV is preferable when considering time. However, the predictive accuracy is more important than the time gain of choosing K-fold CV in favour of LOOCV.

3.1.2 Programming parameters

As discussed in Section 2.3, it is possible to restrict λ to a specific interval when calling optL1, forcing it to stay within it. This may affect the prediction result. Also, does optL1 actually pick out the correct genes?

The resulting models from the optL1 function should preferably not vary as much as they do. Using a different R package may be the solution. One of the alternatives is glmnet. Still, the CV process comes first and then the regularisation, so the framework is the same. How does glmnet behave compared to optL1?

Does one function predict better than the other?

These questions are answered in Section 4.5. The computations are based on simulations described in this chapter.

1The test was run on a computer with an Intel(R) Core(TM) i7 CPU (870@2.93GHz, 3066 Mhz, 4 Core(s), 8 Logical Processor(s)) processor and 8 GB RAM.
