
Faculty of Science and Technology

MASTER’S THESIS

Study program/Specialization:

Mathematics and Physics/Mathematics

Spring semester, 2016

Open / Restricted access

Writer: Eze Sunday

………
(Writer's signature)

Faculty supervisor: Assoc. prof. Bjorn Henrik Auestad

External supervisor(s): Not applicable

Thesis title: Computational Statistics and Data Analysis: Robustness of the Linear Mixed Effects Model to Non-normality of Longitudinal Data.

Credits (ECTS): 60

Key words: fixed effects, random effects, covariates, response variable, growth curve model, covariance structure, random intercept, random intercept and random slope, robustness

Pages: 82

+ enclosure: 1 CD

Stavanger, 14/06/2016 Date/year


Master Thesis

Computational Statistics and Data Analysis: Robustness of the Linear Mixed Effects Model to Non-normality of Longitudinal Data

Author:

Eze Sunday

Supervisor:

Assoc. prof. B. Auestad

A thesis research work submitted for completion of Master of Science

in Mathematics and Physics

June 2016


I, Eze Sunday, declare that this thesis titled, ’Computational Statistics and Data Analysis: Robustness of the Linear Mixed Effects Model to Non-normality of Longitudinal Data’ and the work presented in it are mine. I attest that:

This work was done wholly or mainly while in candidature for a research degree at the University of Stavanger.

I clearly indicated where I have consulted the published work of other authors.

Where I have quoted from others, the source is always provided. With the exception of such quotations if any, this thesis is entirely my own work.

Signed:

Date:


L. Ron Hubbard


The linear mixed effects model rests on the assumption that the data are normally distributed: the errors and the random effects in the model are assumed to have normal distributions. In practice, however, longitudinal data are often slightly or strongly skewed and hence non-normally distributed. This thesis is focused on studying the robustness of the linear mixed effects model to non-normality of the error distribution.

The thesis is structured as follows: Chapter 1 presents a literature review of the linear mixed effects model. Chapter 2 explains the statistical tools used in checking the goodness of fit of a fitted model. Chapter 3 describes the research procedures employed in the implementations. Chapter 4 contains the simulations and results. Finally, conclusions are presented in Chapter 5.


I thank Assoc. prof. Bjorn Auestad for his guidance and support during this thesis research. He has been motivating and encouraging throughout, ensuring that the purpose of the work was met.

I thank all the professors who taught me during my studies in the Department of Mathematics and Natural Sciences, especially Prof. Jan Terje Kvaloy. Their dedication is the seed of this fruition.

Also, my family members contributed immensely to my success. Many thanks go to my mother for her incomparable love and care.


Declaration of Authorship i

Abstract iii

Acknowledgements iv

Contents v

List of Figures vii

1 Introduction 1

1.1 Longitudinal Data Analysis . . . 1

1.2 The Evolution of Longitudinal Data Analysis. . . 3

1.3 The Linear Mixed Effects Model . . . 6

1.4 Generality of the Linear Mixed Effects Model . . . 7

1.5 Covariance Structure . . . 11

1.6 The Distribution of the Response Variable, Yi . . . 16

1.7 The Maximum Likelihood Estimation . . . 17

1.8 Restricted maximum likelihood estimation . . . 19

1.9 Robust Inference . . . 20

1.10 The Likelihood Ratio Test . . . 21

1.11 The t-test . . . 22

1.11.1 The Confidence Interval . . . 22

1.12 The Information Criteria . . . 22

1.13 The assumptions of the LME Model . . . 23

2 Checking an LME Fitted Model 25

2.1 The Goodness of Fit of a Model . . . 25

2.2 Cholesky Residuals . . . 25

2.3 χ2 Distribution for a Model Check. . . 27

2.4 Shapiro-Wilk Test. . . 28

3 Research Procedure 29


3.1 The Simulation Setup. . . 29

3.2 Sampling Process . . . 31

3.2.1 Balanced Data . . . 31

3.2.2 Unbalanced Data . . . 32

3.3 Models for Data Generation . . . 33

3.4 RIRS Models . . . 34

3.5 RI Models . . . 38

4 Simulations and Results 39

4.1 Case 1: Model Validation . . . 39

4.1.1 Simulation Settings for Case 1 . . . 40

4.2 Case 2: The Corrupted Distribution of Error . . . 44

4.2.1 Simulation Settings for Case 2 . . . 44

4.3 Case 3: Asymmetric Distribution of Error . . . 48

4.3.1 Simulation Settings for Case 3 . . . 48

4.4 Case 4: Gamma Distribution of Error . . . 52

4.4.1 Simulation Settings for Case 4 . . . 52

4.5 Case 5: Exponential Distribution of Error . . . 56

4.5.1 Simulation Settings for Case 5 . . . 56

5 Conclusions 60

A Appendix 63

B Appendix 78

C Appendix 80

Bibliography 82


1.1 Example of the Graphical Representation of Longitudinal Data . . . 3

2.1 A QQ-Plot for Checking Departure from a Normal Distribution . . 26

2.2 A QQ-Plot for Checking Departure from a Normal Distribution . . 27

3.1 The Symmetric Corrupted Normal Distribution of Error . . . 35

3.2 The Asymmetric Corrupted Normal Distribution of Error . . . 36

3.3 The Skewed Distribution:Gamma . . . 37

3.4 The Very Skewed Distribution: Exponential . . . 38

4.1 Validation Model: qqnorm Plot for Residues . . . 40

4.2 Validation Model: qqnorm Plot for Random Effects . . . 41

4.3 Validation Model: Responses against Fitted Values . . . 41

4.4 Corrupted Distribution: qqnorm Plot for Residues . . . 45

4.5 Corrupted Distribution: qqnorm Plot for Random Effects . . . 45

4.6 Corrupted Distribution: Responses against Fitted Values . . . 46

4.7 Asymmetric Distribution: qqnorm Plot for Residues . . . 49

4.8 Asymmetric Distribution: qqnorm Plot for Random Effects . . . 49

4.9 Asymmetric Distribution: Responses against Fitted Values . . . 50

4.10 Gamma Distribution: qqnorm Plot for Residues . . . 53

4.11 Gamma Distribution: qqnorm Plot for Random Effects . . . 53

4.12 Gamma Distribution: Responses against Fitted Values . . . 54

4.13 Exponential Distribution: qqnorm Plot for Residues . . . 57

4.14 Exponential Distribution: qqnorm Plot for Random Effects . . . 57

4.15 Exponential Distribution: Responses against Fitted Values . . . 58


1 Introduction

1.1 Longitudinal Data Analysis

The word longitudinal is often used to describe outcomes of repeated measures carried out on objects or persons under a treatment over time, in order to study their responses to the treatment. Hence, it is applied more in biotechnology and medicine than in engineering.

Longitudinal data analysis plays an important role in the medical field because it is mainly used to study the degree to which a response variable responds to its covariates. The analysis involves follow-up of the units and measurements of their responses to the treatment over time.

Ideally, the measurements of the responses are taken so that each unit under the same treatment has the same number of measurements at the same time points. In practice, however, such balanced measurements are not always feasible, so repeated measures need not assume a balanced form: some units may not have the same number of measurements as others in a sample.

Definition: Longitudinal data are the repeated measures of changes in responses to a treatment carried out on N units over time, where N > 1 is the total number of units. For example, a study of the effects of a new treatment on a viral disease such as the Zika virus can be carried out by administering a medicine to ten patients and taking measures of the counts of


viral loads over time on each patient. At each individual level, such repeated measures, taken on each individual, yield correlated data that account for each patient’s unique responses to the medicine.

The measurements taken on each patient over time are typically statistically correlated, but measurements obtained from different patients are independent. The variation of the measurements taken on one patient is the within-individual variation, and the variation of the measurements taken on different individuals (units) is the between-individual variation [1]. The difference between the two variations is called a random effect, which represents the peculiarity of each individual in the population from which the sample is drawn.

For example, let i be the index for the units in a sample, β the between-individual variation, β_i the within-individual variation, and α_i the random effect for unit i, where i ≥ 1. The random effect is defined in equation 1.1:

$$\alpha_i = \beta_i - \beta \qquad (1.1)$$

with E[α_i] = E[β_i] − β = 0 and Var[α_i] = Var[β_i] = σ². It should be noted that α_i ∼ N(0, σ²) is based on a linear-mixed-effects assumption.

In addition, measurements taken at closely spaced time points are more correlated than measurements taken far apart, since the correlation decreases as the time interval between measurements widens. This feature of longitudinal data is due to the fact that the characteristic of the response on which the measures are based changes randomly in time. In other words, the responses of the units are random, as the units are random variables.

As shown in Figure 1.1, the difference between each measured response value and its corresponding value on a linear model is the error of the measurement, e_ij, where Y_ij is the response variable, μ_ij is its corresponding mean value on the regression line, i is the index for the patients (units) in the study and j is the index for the response data obtained from each patient over time; thus μ_ij is the expected value for unit i at time j.

$$e_{ij} = Y_{ij} - \mu_{ij} \qquad (1.2)$$


E[e_ij] = E[Y_ij] − μ_ij = 0 and Var[e_ij] = σ_e², so e_ij ∼ N(0, σ_e²). It should be noted that e_ij ∼ N(0, σ_e²) is based on the linear-mixed-effects (LME) assumption.

Figure 1.1: Example of the Graphical Representation of Longitudinal Data

1.2 The Evolution of Longitudinal Data Analysis

The concept of longitudinal data analysis is derived from traditional regression analysis, where the main goal is to model the relationship between a dependent variable and independent variables. The independent variables are often called predictors because they determine the response of the dependent variable, while the dependent variable itself is called the response variable.

$$Y_{ij} = \beta_0 + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \beta_3 X_{3ij} + \dots + \beta_p X_{pij} + e_{ij} \qquad (1.3)$$


The relationship between a dependent variable and independent variables is as shown in equation 1.3, where Y_ij is the response of unit i at time j; X_1ij, X_2ij, X_3ij, …, X_pij are the predictor variables; β_0, β_1, β_2, β_3, …, β_p are the fixed effects, where p is the total number of fixed effects in the model; and e_ij are the errors.

The essence of the fixed effects in the model is to quantify the impact of each predictor variable on a response variable.

As presented in the definition of longitudinal data analysis, suppose N individuals' counts of viral loads were measured at three different time points over time; hence Y_i = [Y_i1, Y_i2, Y_i3]^T, where i = 1, …, N and j = 1, …, n_i. The three repeated measures on each of the N units over time are written as:

$$Y_{i1} = [1, X_{1i1}, X_{2i1}, X_{3i1}, \dots, X_{pi1}]\,[\beta_0, \beta_1, \beta_2, \beta_3, \dots, \beta_p]^T + e_{i1} \qquad (1.4)$$

$$Y_{i2} = [1, X_{1i2}, X_{2i2}, X_{3i2}, \dots, X_{pi2}]\,[\beta_0, \beta_1, \beta_2, \beta_3, \dots, \beta_p]^T + e_{i2} \qquad (1.5)$$

$$Y_{i3} = [1, X_{1i3}, X_{2i3}, X_{3i3}, \dots, X_{pi3}]\,[\beta_0, \beta_1, \beta_2, \beta_3, \dots, \beta_p]^T + e_{i3} \qquad (1.6)$$

where Y_i1 is the response measurement of unit i at time point j = 1, Y_i2 is the response measurement on the same unit at the second successive time point, j = 2, and Y_i3 is the response measurement on the unit at the third successive time point, j = 3. In this case, the total number of measures for unit i is n_i = 3.

It should be noted that n_i, the number of measurements taken on unit i, depends on the unit; that is, n_i may differ across units.

$$Y_i = \begin{bmatrix} 1 & X_{1i1} & X_{2i1} & X_{3i1} & \dots & X_{pi1}\\ 1 & X_{1i2} & X_{2i2} & X_{3i2} & \dots & X_{pi2}\\ 1 & X_{1i3} & X_{2i3} & X_{3i3} & \dots & X_{pi3} \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \vdots\\ \beta_p \end{bmatrix} + \begin{bmatrix} e_{i1}\\ e_{i2}\\ e_{i3} \end{bmatrix} \qquad (1.7)$$

Each individual's responses to a treatment are written in matrix form as shown in equation 1.7.

$$Y_i = X_i\beta + e_i \qquad (1.8)$$

where

$$Y_i = \begin{bmatrix} Y_{i1}\\ Y_{i2}\\ Y_{i3}\end{bmatrix},\quad X_i = \begin{bmatrix} 1 & X_{1i1} & X_{2i1} & X_{3i1} & \dots & X_{pi1}\\ 1 & X_{1i2} & X_{2i2} & X_{3i2} & \dots & X_{pi2}\\ 1 & X_{1i3} & X_{2i3} & X_{3i3} & \dots & X_{pi3}\end{bmatrix},\quad \beta = \begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \vdots\\ \beta_p\end{bmatrix} \quad\text{and}\quad e_i = \begin{bmatrix} e_{i1}\\ e_{i2}\\ e_{i3}\end{bmatrix}.$$

Equation 1.7 can then be summarized as equation 1.8. This means that there may be a different design matrix X_i and a different vector of errors e_i for each unit i, while the vector of fixed effects β remains the same for every unit of a sample in a longitudinal data analysis.

Statistically, the measurements obtained from each unit are clustered, since they are generally correlated. Hence, there are N clusters of data [2]. The covariance matrix for Y_i is as shown in equation 1.9.


$$\operatorname{Var}(Y_i) = \begin{bmatrix} \operatorname{Var}(Y_{i1}) & \operatorname{Cov}(Y_{i1},Y_{i2}) & \operatorname{Cov}(Y_{i1},Y_{i3})\\ \operatorname{Cov}(Y_{i2},Y_{i1}) & \operatorname{Var}(Y_{i2}) & \operatorname{Cov}(Y_{i2},Y_{i3})\\ \operatorname{Cov}(Y_{i3},Y_{i1}) & \operatorname{Cov}(Y_{i3},Y_{i2}) & \operatorname{Var}(Y_{i3}) \end{bmatrix} \qquad (1.9)$$

1.3 The Linear Mixed Effects Model

In the linear mixed effects model, the fixed effects and random effects are combined in one model to capture the variability around the mean of the responses of the units of a sample drawn from a population. The regression model considered so far is a typical fixed-effects model because the variability of each unit is not considered. A typical example of an LME model is the simple growth curve model, shown in equation 1.10:

$$Y_{ij} = \beta_0 + \alpha_{0i} + \beta_1 t_{ij} + \alpha_{1i} t_{ij} + \beta_2 X_i + e_{ij} \qquad (1.10)$$

where α_0i ∼ N(0, σ_0²), α_1i ∼ N(0, σ_1²), e_ij ∼ N(0, σ_e²); i = 1, 2, …, N and j = 1, 2, …, n_i.

The growth curve model is an example of a random intercept and random slope (RIRS) model because it comprises two random effects: a random intercept (α_0i) and a random slope (α_1i). When the random slope is omitted, the model reduces to a random intercept (RI) model.

The conceptual idea of the linear mixed effects model is to incorporate the variability of the units' responses to a treatment in the model. As in the previous example, suppose that three measurements of the count of Zika viral load are taken over time on a patient who is subjected to follow-up. The three measures, plotted against time, give the regression model for that patient's responses to the treatment. This regression model does not include the variability of the unit's responses to the administered medicine.

The use of the linear mixed effects model becomes crucial when there is a need to model the variability of units' responses to a treatment.

$$Y_i = X_i\beta + Z_i b_i + e_i \qquad (1.11)$$


The model shown in equation 1.11 is a typical linear mixed effects model, where n_i is the number of measurements taken on unit i; Y_i is the n_i × 1 vector of responses; X_i is the n_i × p matrix of covariates; β is the p × 1 vector of fixed effects; Z_i is the n_i × q matrix of covariates associated with the random effects; b_i is the q × 1 vector of random effects, where q is the number of covariates associated with the random effects; and e_i is the n_i × 1 vector of errors of the n_i measurements.

Using equation 1.11 and three measures over time per unit, the vectors assume these forms:

$$X_i = \begin{bmatrix} 1 & t_{i1} & X_i\\ 1 & t_{i2} & X_i\\ 1 & t_{i3} & X_i\end{bmatrix},\quad \beta = \begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\end{bmatrix},\quad Z_i = \begin{bmatrix} 1 & t_{i1}\\ 1 & t_{i2}\\ 1 & t_{i3}\end{bmatrix},\quad b_i = \begin{bmatrix}\alpha_{0i}\\ \alpha_{1i}\end{bmatrix},\quad e_i = \begin{bmatrix} e_{i1}\\ e_{i2}\\ e_{i3}\end{bmatrix}.$$

1.4 Generality of the Linear Mixed Effects Model

The linear mixed effects model shown in equation 1.11 is a model for two-level repeated measures in which the fixed effects and random effects are crossed, as a level of one factor is measured across multiple levels of another factor. For instance, the responses measured across the units in the previous Zika virus example are two-level repeated measures with crossed fixed effects and random effects. For generality, Y_i, the vector of responses for unit i, can be extended to

$$Y_i = \begin{bmatrix} Y_{i1}\\ Y_{i2}\\ Y_{i3}\\ \vdots\\ Y_{in_i}\end{bmatrix} \sim MVN(X_i\beta,\, V_i),$$

As previously stated, the number of measures on a unit, n_i, depends on unit i. Hence, the total number of observations is $n = \sum_{i=1}^{N} n_i$, where N is the total number of units.

$$X_i = \begin{bmatrix} 1 & X_{1i1} & X_{2i1} & X_{3i1} & \dots & X_{pi1}\\ 1 & X_{1i2} & X_{2i2} & X_{3i2} & \dots & X_{pi2}\\ 1 & X_{1i3} & X_{2i3} & X_{3i3} & \dots & X_{pi3}\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & X_{1in_i} & X_{2in_i} & X_{3in_i} & \dots & X_{pin_i}\end{bmatrix},$$


The n_i × p matrix of covariates, X_i, contains the p covariates that are uniquely associated with the p fixed effects in an LME model. The first column of the matrix is usually a vector of ones, to recover the intercept fixed effect, β_0, in the model.

$$\beta = \begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \vdots\\ \beta_p\end{bmatrix},$$

The fixed-effects vector β contains the p fixed effects associated with the covariates arranged in the columns of the matrix of covariates, X_i.

$$Z_i = \begin{bmatrix} 1 & X_{1i1} & X_{2i1} & \dots & X_{qi1}\\ 1 & X_{1i2} & X_{2i2} & \dots & X_{qi2}\\ 1 & X_{1i3} & X_{2i3} & \dots & X_{qi3}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & X_{1in_i} & X_{2in_i} & \dots & X_{qin_i}\end{bmatrix},$$

The n_i × q matrix Z_i contains the covariates that are associated with the random effects in the model; hence q ≤ p in LME models.

The q × 1 column vector of random effects, b_i, contains the random effects for some of the covariates in the design matrix X_i. It is defined by

$$b_i = \begin{bmatrix}\alpha_{i1}\\ \alpha_{i2}\\ \alpha_{i3}\\ \vdots\\ \alpha_{iq}\end{bmatrix},$$

Under the LME assumption that the random effects are independent of each other, Var(b_i) = Σ reduces to a diagonal positive definite matrix, and b_i follows a multivariate normal distribution with zero mean vector and covariance Σ:


$$b_i \sim MVN(0, \Sigma).$$

The covariance of bi is defined by

$$\Sigma = \operatorname{Cov}(b_i) = \begin{bmatrix} \operatorname{Var}(\alpha_{i1}) & \operatorname{Cov}(\alpha_{i1},\alpha_{i2}) & \dots & \operatorname{Cov}(\alpha_{i1},\alpha_{iq})\\ \operatorname{Cov}(\alpha_{i2},\alpha_{i1}) & \operatorname{Var}(\alpha_{i2}) & \dots & \operatorname{Cov}(\alpha_{i2},\alpha_{iq})\\ \vdots & \vdots & \ddots & \vdots\\ \operatorname{Cov}(\alpha_{iq},\alpha_{i1}) & \operatorname{Cov}(\alpha_{iq},\alpha_{i2}) & \dots & \operatorname{Var}(\alpha_{iq})\end{bmatrix} \qquad (1.12)$$

For example, the matrix Z_i reduces to a column vector of ones for the random intercept (RI) model. For a random intercept and random slope (RIRS) model, with an intercept and one covariate respectively associated with a random intercept effect and a random effect for the covariate, the Z_i matrix reduces to:

$$Z_i = \begin{bmatrix} 1 & X_{2i1}\\ 1 & X_{2i2}\\ 1 & X_{2i3}\\ \vdots & \vdots\\ 1 & X_{2in_i}\end{bmatrix},$$

The column vector for the two random effects in the model reduces to:

$$b_i = \begin{bmatrix}\alpha_{i1}\\ \alpha_{i2}\end{bmatrix},$$

The covariance matrix of random effects, bi, reduces to:

$$\Sigma = \operatorname{Cov}(b_i) = \begin{bmatrix} \operatorname{Var}(\alpha_{i1}) & \operatorname{Cov}(\alpha_{i1},\alpha_{i2})\\ \operatorname{Cov}(\alpha_{i2},\alpha_{i1}) & \operatorname{Var}(\alpha_{i2})\end{bmatrix}$$

If the random effects are independent, the covariances become zero and the matrix further reduces, in number of parameters, to:

$$\Sigma = \operatorname{Cov}(b_i) = \begin{bmatrix} \operatorname{Var}(\alpha_{i1}) & 0\\ 0 & \operatorname{Var}(\alpha_{i2})\end{bmatrix}$$


The errors e_i, which determine the growth curve path of unit i around the mean μ, are defined by

$$e_i = \begin{bmatrix} e_{i1}\\ e_{i2}\\ e_{i3}\\ \vdots\\ e_{in_i}\end{bmatrix},$$

The errors e_i are subject to three LME assumptions: first, they are assumed to be uncorrelated; second, they are assumed to be homoscedastic, meaning they have the same variance; third, they are assumed to follow a multivariate normal distribution with zero mean vector and covariance matrix H_e:

$$e_i \sim MVN(0, H_e).$$ Generally,

$$H_e = \operatorname{Var}(e_i) = \begin{bmatrix} \operatorname{Var}(e_{i1}) & \operatorname{Cov}(e_{i1},e_{i2}) & \dots & \operatorname{Cov}(e_{i1},e_{in_i})\\ \operatorname{Cov}(e_{i2},e_{i1}) & \operatorname{Var}(e_{i2}) & \dots & \operatorname{Cov}(e_{i2},e_{in_i})\\ \vdots & \vdots & \ddots & \vdots\\ \operatorname{Cov}(e_{in_i},e_{i1}) & \operatorname{Cov}(e_{in_i},e_{i2}) & \dots & \operatorname{Var}(e_{in_i})\end{bmatrix} \qquad (1.13)$$

For the assumption of uncorrelated errors, the covariance matrix of errors, He, takes the form

$$H_e = \operatorname{Var}(e_i) = \begin{bmatrix} \operatorname{Var}(e_{i1}) & 0 & \dots & 0\\ 0 & \operatorname{Var}(e_{i2}) & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \operatorname{Var}(e_{in_i})\end{bmatrix}.$$

The vector of random effects, bi, and the vector of errors, ei, are assumed to be independent [3].


The vector of errors e_i depends on i: the number of errors in the vector corresponds to the number of measures carried out on the unit. Hence, the vector of errors for the three measurements in the example reduces to:

$$e_i = \begin{bmatrix} e_{i1}\\ e_{i2}\\ e_{i3}\end{bmatrix}.$$

1.5 Covariance Structure

A covariance structure is the structure that determines the number of covariance parameters in a model. The covariance parameters are stored in a vector called the vector of covariance parameters, θ.

The two most often used covariance structures for the positive definite symmetric matrix of the vector of random effects, Σ, shown in equation 1.12, are the unstructured and the structured covariance structure. Suppose that two random effects are used in a model, where the vector of random effects is defined by

$$b_i = \begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix}$$

The unstructured covariance structure for the two-random-effects model is defined by

$$\Sigma = \begin{bmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{21} & \sigma_2^2\end{bmatrix} \qquad (1.14)$$

The vector of covariance parameters, θ_Σ, for this covariance structure is defined by

$$\theta_\Sigma = \begin{bmatrix}\sigma_1^2\\ \sigma_{12}\\ \sigma_2^2\end{bmatrix} \qquad (1.15)$$

The vector of covariance parameters shown in equation 1.15 contains three elements because the two covariances between the random effects are equal (σ_12 = σ_21).


A structured covariance structure of the two-random-effects model is defined by

$$\Sigma = \begin{bmatrix}\sigma_1^2 & 0\\ 0 & \sigma_2^2\end{bmatrix} \qquad (1.16)$$

The structured covariance structure is simpler because of the independence of the two random effects. Such a structure, in which all elements of a positive definite matrix are zero except the variance elements, is called a diagonal matrix. The variance elements of the matrix are not necessarily equal.

The corresponding vector of covariance parameters, θ_Σ, is defined by

$$\theta_\Sigma = \begin{bmatrix}\sigma_1^2\\ \sigma_2^2\end{bmatrix} \qquad (1.17)$$

The vector contains only two elements in this case.
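In nlme, the choice between the unstructured covariance (1.14) and the diagonal covariance (1.16) can be expressed through pdMat classes. A hedged sketch, reusing the hypothetical zika example from Section 1.3:

```r
# Sketch: unstructured vs diagonal random-effects covariance in nlme;
# 'zika', 'viral', 'time', 'x' and 'patient' are hypothetical names.
library(nlme)

# Unstructured Sigma (1.14): parameters sigma_1^2, sigma_12, sigma_2^2
fit.un <- lme(viral ~ time + x,
              random = list(patient = pdSymm(~ time)), data = zika)

# Diagonal Sigma (1.16): independent random effects, two variance parameters
fit.di <- lme(viral ~ time + x,
              random = list(patient = pdDiag(~ time)), data = zika)

getVarCov(fit.un)   # estimated covariance matrix of the random effects
```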

If the variances are assumed equal, the diagonal matrix reduces to a multiple of an identity matrix, which is defined by

$$\Sigma = \sigma_1^2\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} = \sigma_2^2\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} \qquad (1.18)$$

The corresponding vector of covariance parameters, θ_Σ, is defined by

$$\theta_\Sigma = \begin{bmatrix}\sigma_1^2\end{bmatrix} \qquad (1.19)$$

Moreover, for a model in which random effects are nested within a random effect, a compound-symmetry or block-diagonal structure may represent the true covariance structure. Suppose that there are i units within which j subcomponents are nested, where j = 1, 2, 3; each unit has a random effect α_i, and each subcomponent nested within the unit has a random effect α_{i,j}, where α_i ∼ N(0, σ_1²) and α_{i,j} ∼ N(0, σ_2²). The vector of random effects for unit i could be defined by

$$b_i = \begin{bmatrix}\alpha_i + \alpha_{i,1}\\ \alpha_i + \alpha_{i,2}\\ \alpha_i + \alpha_{i,3}\end{bmatrix} \qquad (1.20)$$


Under the assumption of independence among the random effects, with the same variance σ_1² for the unit random effects and the same variance σ_2² for the random effects of the subcomponents, the covariance structure is defined by

$$\Sigma = \begin{bmatrix}\sigma_1^2+\sigma_2^2 & \sigma_1^2 & \sigma_1^2\\ \sigma_1^2 & \sigma_1^2+\sigma_2^2 & \sigma_1^2\\ \sigma_1^2 & \sigma_1^2 & \sigma_1^2+\sigma_2^2\end{bmatrix} \qquad (1.21)$$

The covariance structure (1.21) is called compound symmetry [4]. Its corresponding vector of covariance parameters, θ_Σ, is defined by

$$\theta_\Sigma = \begin{bmatrix}\sigma_1^2\\ \sigma_2^2\end{bmatrix} \qquad (1.22)$$

The vector of random effects, b_i, could also be defined by

$$b_i = \begin{bmatrix}\alpha_i\\ \alpha_{i,1}\\ \alpha_{i,2}\\ \alpha_{i,3}\end{bmatrix} \qquad (1.23)$$

The block-diagonal structure for the vector (1.23) is defined by

$$\Sigma = \begin{bmatrix}\sigma_1^2 & 0 & 0 & 0\\ 0 & \sigma_2^2 & 0 & 0\\ 0 & 0 & \sigma_2^2 & 0\\ 0 & 0 & 0 & \sigma_2^2\end{bmatrix} \qquad (1.24)$$

The covariance structure (1.24) is derived from these assumptions: the random effects are independent; the variances for the units are the same, σ_1²; and the variances for the subcomponents nested within the units are the same, σ_2². The covariance structure is represented with block diagonals such as σ_1² and σ_2² I, where the identity matrix I is j × j, with j = 3 in this case. Its vector of covariance parameters is the same as that of the compound-symmetry structure (1.22).

The block-diagonal covariance matrix may take other forms depending on a vector of random effects, bi. If a vector of random effects of a model is defined by


$$b_i = \begin{bmatrix}\alpha_i\\ \alpha_{i,j}\\ \alpha_{i,k}\\ \alpha_{i,l}\end{bmatrix} \qquad (1.25)$$

where j = 1, 2, 3, 4, k = 1, 2, 3, and l = 1, 2, 3; α_i ∼ N(0, σ_1²), α_{i,j} ∼ N(0, σ_2²), α_{i,k} ∼ N(0, σ_3²), and α_{i,l} ∼ N(0, σ_4²). The block-diagonal covariance structure then takes a form different from the first block-diagonal structure (1.24). The covariance structure for the vector of random effects (1.25) becomes

$$\Sigma = \begin{bmatrix}\sigma_1^2 & 0 & 0 & 0\\ 0 & \sigma_2^2 I & 0 & 0\\ 0 & 0 & \sigma_3^2 I & 0\\ 0 & 0 & 0 & \sigma_4^2 I\end{bmatrix} \qquad (1.26)$$

The identity matrices in the covariance structure (1.26) have different orders: σ_2² I has a j × j identity matrix, σ_3² I has a k × k identity matrix, and σ_4² I has an l × l identity matrix. Its vector of covariance parameters has four parameters, as defined by

$$\theta_\Sigma = \begin{bmatrix}\sigma_1^2\\ \sigma_2^2\\ \sigma_3^2\\ \sigma_4^2\end{bmatrix} \qquad (1.27)$$

Furthermore, the two most used covariance structures for the positive definite symmetric matrix H_e, shown in equation 1.13, are the diagonal structure and the compound symmetric structure.

A diagonal structure contains the same variance along the diagonal of the covariance matrix and zeros elsewhere. The diagonality is due to the assumption that the errors are uncorrelated, and the common variance to the assumption that they are homoscedastic. Hence, the diagonal structure assumes this form for a single unit i:


$$H_e = \operatorname{Var}(e_i) = \begin{bmatrix}\sigma_e^2 & 0 & \dots & 0\\ 0 & \sigma_e^2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \sigma_e^2\end{bmatrix} \qquad (1.28)$$

The vector of covariance parameters, θ_{H_e}, contains only one parameter:

$$\theta_{H_e} = \sigma_e^2. \qquad (1.29)$$

A compound symmetric structure is obtained by combining the diagonal structure, which is based on the assumptions that the errors are uncorrelated and homoscedastic, with a constant covariance structure, which is based on the assumption that the errors share a common covariance:

$$\begin{bmatrix}\sigma_e^2 & 0 & \dots & 0\\ 0 & \sigma_e^2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \sigma_e^2\end{bmatrix} + \begin{bmatrix}\sigma_1 & \sigma_1 & \dots & \sigma_1\\ \sigma_1 & \sigma_1 & \dots & \sigma_1\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_1 & \sigma_1 & \dots & \sigma_1\end{bmatrix} = \begin{bmatrix}\sigma_e^2+\sigma_1 & \sigma_1 & \dots & \sigma_1\\ \sigma_1 & \sigma_e^2+\sigma_1 & \dots & \sigma_1\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_1 & \sigma_1 & \dots & \sigma_e^2+\sigma_1\end{bmatrix}$$

Thus, the compound symmetric structure for the covariance matrix of a vector of errors is

$$H_e = \operatorname{Var}(e_i) = \begin{bmatrix}\sigma_e^2+\sigma_1 & \sigma_1 & \dots & \sigma_1\\ \sigma_1 & \sigma_e^2+\sigma_1 & \dots & \sigma_1\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_1 & \sigma_1 & \dots & \sigma_e^2+\sigma_1\end{bmatrix} \qquad (1.30)$$

The vector of covariance parameters, θHe, for the compound symmetric structure is

$$\theta_{H_e} = \begin{bmatrix}\sigma_e^2\\ \sigma_1\end{bmatrix} \qquad (1.31)$$
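In nlme these two error structures correspond, roughly, to the default independent errors and to a corCompSymm correlation structure. A sketch under the same hypothetical naming as before:

```r
# Sketch: diagonal (1.28) vs compound-symmetric (1.30) within-unit error
# covariance in nlme; the data frame and variable names are hypothetical.
library(nlme)

# Default: uncorrelated, homoscedastic errors -- the diagonal structure (1.28)
fit.diag <- lme(viral ~ time + x, random = ~ 1 | patient, data = zika)

# Compound symmetry (1.30): a common within-patient covariance sigma_1
fit.cs   <- lme(viral ~ time + x, random = ~ 1 | patient,
                correlation = corCompSymm(form = ~ 1 | patient), data = zika)
```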


The diagonal structure and the structured covariance structure are based on the homoscedasticity assumption. Under a heteroscedasticity assumption, the variances and covariances may differ, which imposes structures different from the ones presented here.

The vectors of covariance parameters, θ_Σ and θ_{H_e}, are stacked in a single vector θ; that is,

$$\theta = \begin{bmatrix}\theta_\Sigma\\ \theta_{H_e}\end{bmatrix} \qquad (1.32)$$

1.6 The Distribution of the Response Variable, Yi

The marginal distribution of the response vector Y_i can be derived from equation 1.11, which can be presented as

$$Y_i = X_i\beta + R_i \qquad (1.33)$$

where R_i = Z_i b_i + e_i.

Both b_i and e_i are assumed to follow multivariate normal distributions with zero mean vectors, b_i ∼ MVN(0, Σ) and e_i ∼ MVN(0, H_e), and they are assumed independent. The expectation of R_i is

$$E[R_i] = Z_i\,0 + 0 = 0,$$

and the variance of R_i, denoted V_i, is

$$V_i = \operatorname{Var}[R_i] = Z_i\operatorname{Cov}(b_i)Z_i^T + \operatorname{Cov}(e_i) = Z_i\Sigma Z_i^T + H_e \qquad (1.34)$$

Hence R_i ∼ MVN(0, V_i) and Y_i ∼ MVN(X_iβ, V_i).
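Equation 1.34 can be verified numerically. The following sketch builds V_i for one unit with n_i = 3 measurements under the RIRS specification of Section 1.3, with illustrative parameter values that are not those of the thesis:

```r
# Sketch: V_i = Z_i Sigma Z_i^T + H_e (1.34) for one unit with n_i = 3;
# all numerical values are illustrative only.
Zi    <- cbind(1, c(0, 1, 2))   # columns: intercept, time points t_i1..t_i3
Sigma <- diag(c(4, 0.25))       # diagonal Var(b_i): sigma_0^2 and sigma_1^2
He    <- diag(1, 3)             # uncorrelated errors with sigma_e^2 = 1

Vi <- Zi %*% Sigma %*% t(Zi) + He
Vi                              # marginal covariance of Y_i
```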


1.7 The Maximum Likelihood Estimation

The response vector for unit i, Y_i, given in equation 1.11, has the multivariate normal probability density function

$$f(y_i;\beta) = \left(\frac{1}{2\pi}\right)^{n_i/2}|V_i|^{-1/2}\,e^{-\frac{1}{2}(y_i - X_i\beta)^T V_i^{-1}(y_i - X_i\beta)} \qquad (1.35)$$

where V_i is given in equation 1.34.

The likelihood function, L_i(β; θ), for the observation y_i of the response vector Y_i for unit i becomes

$$L_i(\beta;\theta) = \left(\frac{1}{2\pi}\right)^{n_i/2}|V_i|^{-1/2}\,e^{-\frac{1}{2}(y_i - X_i\beta)^T V_i^{-1}(y_i - X_i\beta)}$$

As the units are independent, the joint likelihood function for all the units becomes

$$L(\beta;\theta) = \prod_{i=1}^{N} L_i(\beta;\theta) = \prod_{i=1}^{N}\left(\frac{1}{2\pi}\right)^{n_i/2}|V_i|^{-1/2}\,e^{-\frac{1}{2}(y_i - X_i\beta)^T V_i^{-1}(y_i - X_i\beta)}$$

The log likelihood function for the joint likelihood function becomes

$$l(\beta;\theta) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{N}\ln|V_i| - \frac{1}{2}\sum_{i=1}^{N}(y_i - X_i\beta)^T V_i^{-1}(y_i - X_i\beta) \qquad (1.36)$$

where the total number of observations is $n = \sum_{i=1}^{N} n_i$.

The score of the log likelihood function, U(β) = ∂l(β; θ)/∂β with θ held constant, is

$$U(\beta) = \sum_{i=1}^{N} X_i^T V_i^{-1}(y_i - X_i\beta).$$

The score function U(β) is zero at the value of β that maximizes the log likelihood. Hence, the estimator of the fixed-effects vector, β̃, is obtained by setting U(β) = 0:

$$\tilde{\beta} = \Big(\sum_{i=1}^{N} X_i^T V_i^{-1} X_i\Big)^{-1}\sum_{i=1}^{N} X_i^T V_i^{-1} y_i \qquad (1.37)$$


The estimator β̃ of the fixed effects is unbiased, since its expectation equals the true fixed effects, E[β̃] = β. This can be shown by replacing y_i with its expectation, E[y_i] = X_iβ, in equation 1.37.
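The estimator (1.37) is straightforward to compute once the V_i are known; the sketch below assumes hypothetical lists X, y and V holding X_i, y_i and V_i for each unit.

```r
# Sketch of the generalized-least-squares estimator (1.37); X, y and V are
# hypothetical lists with one element per unit i.
beta.gls <- function(X, y, V) {
  A <- 0; b <- 0
  for (i in seq_along(X)) {
    Vinv <- solve(V[[i]])
    A <- A + t(X[[i]]) %*% Vinv %*% X[[i]]   # sum_i X_i^T V_i^-1 X_i
    b <- b + t(X[[i]]) %*% Vinv %*% y[[i]]   # sum_i X_i^T V_i^-1 y_i
  }
  solve(A, b)                                # beta-tilde
}
```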

Next is the estimation of θ. The parameters in this vector are estimated numerically, as θ cannot be expressed in closed form. For this reason, a profiled log likelihood is constructed by replacing β in the log likelihood function with its estimator from equation 1.37; the profiled log likelihood then becomes a function of the vector of covariance parameters alone:

$$l(\theta) = -\frac{1}{2}\ln(2\pi)\sum_{i=1}^{N} n_i - \frac{1}{2}\sum_{i=1}^{N}\ln|V_i| - \frac{1}{2}\sum_{i=1}^{N} e_i^T V_i^{-1} e_i \qquad (1.38)$$

where

$$e_i = y_i - X_i\Big(\sum_{i=1}^{N} X_i^T V_i^{-1} X_i\Big)^{-1}\sum_{i=1}^{N} X_i^T V_i^{-1} y_i \qquad (1.39)$$

The estimation of the vector of covariance parameters θ is done by numerical optimization: the implementation in R by Pinheiro (2010) uses a combination of the expectation-maximization (EM) method and the Newton-Raphson method to find the estimate θ̃ of θ from the profiled log likelihood function. The EM method is iterative: the vector of covariance parameters θ is first calculated; the calculated vector is then inserted in the log likelihood function to find the expectation of that function.

The expectation of the log likelihood function is differentiated and the derived function is maximized to obtain θ_k. These steps are repeated until 25 iterations have elapsed, where 1 ≤ k ≤ 25.

The next step is to use the vector θ_k obtained after the twenty-five iterations as a starting value and to optimize the estimates of its parameters with the Newton-Raphson method [5]:

$$\theta_k^{m} = \theta_k^{m-1} - \frac{U(\theta_k^{m-1})}{U'(\theta_k^{m-1})} \qquad (1.40)$$


where U(θ_k) is the score vector and U′(θ_k) is the matrix of partial derivatives of the score vector, which is the same as the second derivative of the log likelihood function.

The iteration is repeated until the vector of covariance parameters converges; that is, θ^m is approximately equal to θ̃ at the iteration at which convergence occurs.
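A generic scalar version of the Newton-Raphson update (1.40) looks as follows; score and score.deriv are hypothetical functions returning U(θ) and U′(θ), and the stopping rule is a simple tolerance check.

```r
# Sketch: a scalar Newton-Raphson iteration as in (1.40); 'score' and
# 'score.deriv' are hypothetical functions for U(theta) and U'(theta).
newton <- function(theta, score, score.deriv, tol = 1e-8, max.iter = 100) {
  for (m in seq_len(max.iter)) {
    step  <- score(theta) / score.deriv(theta)
    theta <- theta - step          # theta^m = theta^(m-1) - U/U'
    if (abs(step) < tol) break     # convergence: theta^m close to theta-tilde
  }
  theta
}
```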

Moreover, the vector of covariance parameters is unpacked and its elements are arrayed correspondingly in H̃_e and Σ̃ to calculate Ṽ_i:

$$\tilde{V}_i = Z_i\tilde{\Sigma} Z_i^T + \tilde{H}_e \qquad (1.41)$$

Then the fixed effects can be estimated:

$$\tilde{\beta} = \Big(\sum_{i=1}^{N} X_i^T \tilde{V}_i^{-1} X_i\Big)^{-1}\sum_{i=1}^{N} X_i^T \tilde{V}_i^{-1} y_i \qquad (1.42)$$

The variance of the fixed-effects estimator, Var(β̃), can be estimated using equation 1.42. It is defined by

$$\operatorname{Var}(\tilde{\beta}) = Q_i\Big(\sum_{i=1}^{N} X_i^T\tilde{V}_i^{-1}\operatorname{Var}(y_i)\tilde{V}_i^{-1}X_i\Big)Q_i^T \qquad (1.43)$$

where the symmetric matrix $Q_i = \big(\sum_{i=1}^{N} X_i^T\tilde{V}_i^{-1}X_i\big)^{-1}$. Equation 1.43 can be reduced to:

If the covariance structure is correctly specified, Var(y_i) = Ṽ_i, so the middle factor reduces to Q_i^{-1} and Var(β̃) = Q_i Q_i^{-1} Q_i^T = Q_i^T = Q_i. Thus, the variance of the estimator (1.42) is defined by

$$\operatorname{Var}(\tilde{\beta}) = \Big(\sum_{i=1}^{N} X_i^T\tilde{V}_i^{-1}X_i\Big)^{-1} \qquad (1.44)$$

1.8 Restricted maximum likelihood estimation

The restricted maximum likelihood (REML) estimation method is used to estimate unbiased covariance parameters in the vector θ. The method achieves unbiasedness of the covariance parameters by compensating for the degrees of freedom lost in estimating the fixed effects β, adding an extra term to the log likelihood function. That is,

$$l(\theta) = -\frac{1}{2}(n-p)\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{N}\ln|V_i| - \frac{1}{2}\sum_{i=1}^{N} e_i^T V_i^{-1} e_i - \frac{1}{2}\ln\Big|\sum_{i=1}^{N} X_i^T V_i^{-1} X_i\Big| \qquad (1.45)$$

where the column vector of errors e_i is shown in equation 1.39. The expectation-maximization method and the Newton-Raphson method are applied to the restricted log likelihood to estimate the covariance parameters in θ. The vector is then unpacked and its parameters are arrayed correspondingly in H_e and Σ to calculate V_i (equation 1.41), and V_i is used to estimate β as in the maximum likelihood method. The estimate of β from this method is called the restricted maximum likelihood (REML) estimate of the fixed effects.
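In nlme, the choice between the two criteria is just the method argument of lme; REML is the default. A sketch with the hypothetical variables used earlier:

```r
# Sketch: REML (the lme default) versus ML estimation; names hypothetical.
library(nlme)

fit.reml <- lme(viral ~ time + x, random = ~ time | patient,
                data = zika, method = "REML")
fit.ml   <- lme(viral ~ time + x, random = ~ time | patient,
                data = zika, method = "ML")

VarCorr(fit.reml)   # REML variance components, less biased than the ML ones
```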

1.9 Robust Inference

The estimate of the variance of the fixed-effects vector β, shown in equation 1.44, depends on the covariance matrix V_i. Hence, correct specification of the covariance structure of V_i contributes to the unbiasedness of the estimator Var(β̃).

For robust inference, Liang and Zeger (1986) proposed the sandwich estimator. It is obtained by replacing Var(y_i) with e_i e_i^T in equation 1.43, where e_i = y_i − X_iβ̃ [6]:

$$\operatorname{Var}(\tilde{\beta}(\tilde{\theta})) = Q_i\Big(\sum_{i=1}^{N} X_i^T V_i^{-1}(\tilde{\theta})\,e_i e_i^T\,V_i^{-1}(\tilde{\theta})X_i\Big)Q_i^T \qquad (1.46)$$

where θ̃ is the REML estimate of the parameters in the vector θ.
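The sandwich estimator (1.46) can be assembled by hand from per-unit quantities; the sketch below assumes hypothetical lists X, V and e holding X_i, V_i(θ̃) and the residuals e_i.

```r
# Sketch of the sandwich estimator (1.46); X, V and e are hypothetical lists
# with one element per unit (V evaluated at the REML estimate theta-tilde).
sandwich.var <- function(X, V, e) {
  bread <- 0; meat <- 0
  for (i in seq_along(X)) {
    Vinv  <- solve(V[[i]])
    bread <- bread + t(X[[i]]) %*% Vinv %*% X[[i]]
    meat  <- meat  + t(X[[i]]) %*% Vinv %*% e[[i]] %*% t(e[[i]]) %*% Vinv %*% X[[i]]
  }
  Q <- solve(bread)        # Q_i = (sum_i X_i^T V_i^-1 X_i)^-1
  Q %*% meat %*% t(Q)      # Q ( sum_i X^T V^-1 e e^T V^-1 X ) Q^T
}
```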

(31)

1.10 The Likelihood Ratio Test

The likelihood ratio test is mainly used to reduce the number of parameters in a model. To do so, a vector of parameters defining the minimum number of parameters that can be in the model, Ω_min, and another vector defining the maximum number of parameters in the model, Ω_max, are evaluated. The two vectors of parameters are used to parametrize two log likelihood functions.

The likelihood ratio test is then used to choose between the two vectors of parameters by considering the difference between the values of the two parametrized log likelihood functions. Twice this difference is called the deviance, Λ:

$$\Lambda = 2\big(l(\Omega_{max}) - l(\Omega_{min})\big) \qquad (1.47)$$

where Ω denotes either the fixed-effects parameters (β_min, β_max) or the covariance parameters (θ_min, θ_max). The deviance Λ is χ²_df distributed, with degrees of freedom df equal to the difference between the numbers of parameters in the two vectors. The log likelihood function l is the same as in equation 1.36. Two hypotheses are then set up to assess the validity of the two models:

H_0: Ω = Ω_min versus H_1: Ω ≠ Ω_min

If the deviance Λ is sufficiently small, there is evidence for the null hypothesis, which means that the model with the vector that has the minimum number of parameters, Ω_min, fits the data best; when the deviance is sufficiently large, there is evidence against the null hypothesis, which indicates that the model with the vector that has the maximum number of parameters, Ω_max, fits the data best.

$$p = \frac{1}{2}\big(1 - F_{\chi^2_1}(\Lambda)\big) + \frac{1}{2}\big(1 - F_{\chi^2_2}(\Lambda)\big) \qquad (1.48)$$

where F_{χ²_df} denotes the χ²_df distribution function.

The p-value p is the probability with which to assess whether the null hypothesis should be accepted or rejected: at significance level α, the null hypothesis is rejected if p < α and accepted if p > α [5].
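In nlme, the deviance and its p-value are reported by anova applied to two nested fits; when the models differ in their fixed effects, both must be fitted by maximum likelihood. A sketch with the hypothetical example:

```r
# Sketch: likelihood ratio test between nested models; names hypothetical.
# method = "ML" because the two models differ in their fixed effects.
library(nlme)

fit.min <- lme(viral ~ time,     random = ~ time | patient,
               data = zika, method = "ML")
fit.max <- lme(viral ~ time + x, random = ~ time | patient,
               data = zika, method = "ML")

anova(fit.min, fit.max)   # reports the deviance (L.Ratio) and its p-value
```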


1.11 The t-test

The t-test is used to test fixed effects in the linear mixed effects model. It tests the two hypotheses

H_0: β = 0 versus H_1: β ≠ 0. The t-test statistic is defined by

$$t = \frac{\tilde{\beta}}{se(\tilde{\beta})} \qquad (1.49)$$

The test statistic follows a t distribution, and the degrees of freedom are evaluated by considering the grouping level at which the term (fixed effect) is estimated.

1.11.1 The Confidence Interval

In the linear mixed effects model implemented by J. C. Pinheiro, the confidence intervals (CI) are based on an approximate t distribution rather than the Z distribution used in the Wald confidence interval. An interval is created for each fixed effect in the model. Suppose the vector of fixed effects β comprises β_j, where j = 1, 2, 3, …, p. The interval for each fixed effect in the model is defined by

where j = 1,2,3. . . , p. The interval for each fixed effect in the model is defined by

CIj = ˜βj ±tdf,α2

q

V ar( ˜βj) (1.50) Where ˜βj is an estimate in the estimated vector of fixed-effects (1.42), ˜β ,αis the level which is usually assigned 0.05, Var( ˜βj) is a variance estimate for ˜βj in the matrix (1.44) or (1.46) and df is the degree of freedom.

1.12 The Information Criteria

The information criteria are used to choose the model best fitted to the data. The two most often used information criteria are the Akaike Information Criterion and the Bayesian Information Criterion.


The Akaike Information Criterion, AIC, is defined by

$$AIC = -2\,l(\tilde{\beta};\tilde{\theta}) + 2p \qquad (1.51)$$

where p is the number of parameters in the model, and l(β̃; θ̃) is either the log likelihood function shown in equation 1.38 or the restricted log likelihood function shown in equation 1.45.

The Bayesian Information Criterion is defined by

$$BIC = -2\,l(\tilde{\beta};\tilde{\theta}) + p\ln(n) \qquad (1.52)$$

where $n = \sum_{i=1}^{N} n_i$ is the total number of data points in the data frame. When AIC is used to choose the model best fitted to the data, the smaller AIC indicates the better model.

Likewise, when BIC is used for the same purpose, the smaller BIC indicates the better fitted model. According to Pinheiro (2000), the BIC for the restricted log likelihood can be calculated by using equation 1.45 and substituting n with (n − p_β), where p_β is the number of fixed-effects parameters [5].
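Both criteria are available through the usual R generics; a sketch comparing the two hypothetical nested fits from the likelihood ratio example:

```r
# Sketch: model comparison by information criteria; smaller values are better.
AIC(fit.min, fit.max)
BIC(fit.min, fit.max)
```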

1.13 The assumptions of the LME Model

The linear mixed effects (LME) models are built on five key assumptions concerning the normal distribution and the covariance structure. The key assumptions are listed in the following items:

That the random effects have normal distributions;

That the errors are normally distributed;

That the errors are independent, as responses are independent given the random effects, though not in general: the errors are sometimes correlated;

That the errors are independent of the random effects; and

That the errors are homoscedastic, though again not in general: they are sometimes heteroscedastic.


These assumptions have led to much research: it has been demonstrated that maximum likelihood inference is robust to non-normal distributions of random effects (Butler and Louis, 1992; Verbeke and Lesaffre, 1997; Zhang and Davidian, 2001) [7].

As shown above, the maximum likelihood estimate (MLE) of a model's parameters is usually obtained from the log likelihood function, shown below, under the Gaussian assumption. Where optimization is required, the Newton-Raphson method is often applied:

$$l(\tilde{\beta}) = -\frac{1}{2}\sum_{i=1}^{N}\Big\{n_i\ln(2\pi) + \ln|\tilde{V}_i| + (Y_i - X_i\tilde{\beta})^T\tilde{V}_i^{-1}(Y_i - X_i\tilde{\beta})\Big\}$$

$$\tilde{\beta}(\tilde{\theta}) = \big(X^T V(\tilde{\theta})^{-1}X\big)^{-1}X^T V(\tilde{\theta})^{-1}y$$

$$\tilde{\beta}_i = \tilde{\beta}_{i-1} - \frac{U(\tilde{\beta}_{i-1})}{U'(\tilde{\beta}_{i-1})}$$

Furthermore, Lange and Laird (1989) showed that the variance of the MLE of the fixed effects depends strongly on the covariance structure, but that the variance estimates of the fixed effects in a model with random intercept and random slope are unbiased even when the true covariance structure implies more random effects than specified.

So far, the robustness of the fixed effects of LME models to non-normal distributions of random effects and to misspecified covariance structures has been well researched, but the robustness of the fixed effects, under MLE inference, to non-normal distributions of errors does not seem to have received enough attention from researchers. This raises the questions: Is maximum likelihood inference from the linear mixed effects model robust to a non-normal distribution of errors? If it is, how robust is it? Answering these questions is the motivation for this thesis [7].
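A minimal version of the simulation idea pursued in this thesis is sketched below: RIRS data are generated with centred exponential (very skewed) errors and refitted with lme, and the fixed-effects estimates are compared with the true values. All parameter values are illustrative and are not those used in the thesis.

```r
# Sketch: RIRS data with centred exponential errors, refitted with lme;
# parameter values are illustrative, not the thesis's simulation settings.
library(nlme)
set.seed(1)

N <- 50; ni <- 5                          # units and measurements per unit
id   <- rep(1:N, each = ni)
time <- rep(0:(ni - 1), times = N)
a0   <- rnorm(N, 0, 2)[id]                # random intercepts alpha_0i
a1   <- rnorm(N, 0, 0.5)[id]              # random slopes alpha_1i
e    <- rexp(N * ni) - 1                  # skewed errors with mean zero
y    <- 10 + a0 + (2 + a1) * time + e     # true beta_0 = 10, beta_1 = 2

fit <- lme(y ~ time, random = ~ time | id, data = data.frame(y, time, id))
fixef(fit)                                # how close to (10, 2)?
```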
