Identifying expression quantitative trait loci in patients with inflammatory bowel disease

(1)

Identifying expression quantitative trait loci in patients with

inflammatory bowel disease

Statistical models relating gene expression and single nucleotide polymorphism data

June 2019

Master's thesis

Kristine Lund Mathisen

2019Kristine Lund Mathisen NTNU Norwegian University of Science and Technology Faculty of Information Technology and Electrical Engineering Department of Mathematical Sciences

(2)

(3)

Identifying expression quantitative trait loci in patients with inflammatory bowel disease

Statistical models relating gene expression and single nucleotide polymorphism data

Kristine Lund Mathisen

Master of Science in Physics and Mathematics Submission date: June 2019

Supervisor: Mette Langaas

Co-supervisor: Atle van Beelen Granlund

Norwegian University of Science and Technology Department of Mathematical Sciences

(4)

(5)

Summary

This thesis is a pilot study, conducted in order to investigate the possible relation- ships between genotyped SNPs and the gene expression of a gene. We will use the gene ITGAL as a prototype gene, and analyse the samples from individuals with and without chronic bowel diseases. The selected SNPs are located within the ITGAL gene. The gene expression of ITGAL is measured and preprocessed with two different technologies, microarray and RNASeq. From the microarray data we have 62 samples, and from the RNASeq data we have 75 samples. For all samples, we have the ITGAL gene expression, disease status and SNP status for the selected SNPs. For each sample from patients with chronic bowel disease, it is specified whether the gene expression of ITGAL is measured in inflamed or uninflamed tissue. The data set is compiled by the IBD research group at the De- partment of Clinical and Molecular Medicine, NTNU. The methods considered in this project are multiple linear models and generalized linear models, where the response is the gene expression of ITGAL, and the covariates are disease status and the interaction between disease status and SNP status. We are interested in how the gene expression of ITGAL is affected when the disease status and SNP status change. There are very few significant results for the microarray data when we correct for multiple testing. For the RNASeq data, no covariates with the interaction term is significant even if we do not correct for multiple testing. We have also compared the two groups of inflamed and uninflamed tissues, regardless of disease. In addition, we have looked at the genotyped SNPs within a distance from ITGAL, and the correlation between the SNPs with significant results. The next step is to expand the pilot study to other genes.

(6)

(7)

Preface

This project constitutes the course TMA4900 - Industrial Mathematics, Master’s Thesis for the Industrial Mathematics program at NTNU. The topic is investigation of the relationship between gene expressions and genotyped SNPs for persons with and without different bowel diseases. I would like to thank my supervisor Mette Langaas at the Department of Mathematical Sciences for excellent guidance in this process, and Atle van Beelen Granlund, researcher at the Department of Clinical and Molecular Medicine for making the data set available to me and for support in understanding the genetic part of the project. I would also like to thank Per Kristian Hove, Oddgeir Lingaas Holmen and Sandor Zeestraten for helping me setting up the connection to the HUNT Cloud and technical support, and my parents for proofreading the project.

(8)

(9)

Chapter 1 Introduction

In this project we perform statistical analysis of data from persons with inflammable bowel disease.

1.1 Inflammable bowel disease

Inflammable bowel disease (IBD) is a term to describe multiple chronic bowel diseases, where the two most common are Crohn’s disease (CD) and ulcerative colitis (UC). The difference between these diseases is that ulcerative colitis is just in the colon, while Crohn’s disease can occur in any part of the gastrointestinal tract. This is illustrated in Figure 1.1. In many cases, it may be difficult to tell the difference between these diseases (Aabakken, 2016).

Crohn’s disease may cause inflammations through the whole digestive system, but is most common in the small and large intestine. The inflammation can go straight through the intestinal walls and create false openings (fistulas), which may lead to infections, and narrow areas, which may cause twisting of the stomach (gastric volvulus) (Aabakken, 2018).

Ulcerative colitis usually starts in the lower part of the large intestine and rectum, but might spread through the colon and to the lower part of the small intestine.

The inflammation is usually limited to the mucosa. Areas with pus arises in the colon. These areas are called crypt abscesses, and when these breaks, wounds ap- pear where tissue fluid and blood seeps through (Aabakken and Halstensen, 2018).

We will in this thesis look at measurements from patients with Crohn’s disease in

(14)

Figure 1.1: This figure shows the difference between Crohn’s disease and ulcerative colitis. The red parts represent where the colon is inflamed.

Source:https://commons.wikimedia.org/wiki/File:Crohn%27s Disease vs Colitis ulcerosa.svg, licensed under CC BY-SA 3.0

inflamed tissue (CDA), patients with Crohn’s disease in uninflamed tissue (CDU), patients with ulcerative colitis in inflamed tissue (UCA) and patients with ulcerative colitis in uninflamed tissue (UCU).

1.2 SNP-data

DNA is a self-replicating material which is carrier of genetic information. It de- scribes how the organism will look and function, and these descriptions are inher- ited from one generation to the next (Martinsen, 2019). SNP (single-nucleotide polymorphism) is a position-based one-base-variation in the DNA. We will study bi-allelic SNPs, which means that there are two base variants. This is illustrated in Figure 1.2. To be classified as a SNP, the least frequent variant (the minor allele) must exist in at least one percent of the population (if not, it is called a SNV). In a genome-wide association study (GWAS) the objective is to search for associations between a phenotype (for example a disease) and a SNP. In this project we will focus on a selection of SNPs located inside and within a distance from the gene ITGAL. As a running example, we will use the SNP with identification number rs11150589, which is located in chromosome 16, position 30 482 494. According to Jostins et al. (2012) this SNP is associated with ulcerative colitis.

(15)

Figure 1.2: SNP is a position-based one-base-variation in the DNA. As illustrated here, the upper base-pair is C/G, while the lower base-pair is A/T.

Source: https://commons.wikimedia.org/wiki/File:Dna-SNP.svg, licensed under CC BY 4.0

1.3 Linkage disequilibrium and genotype correlation

Linkage disequilibrium (LD) is a non-random association of alleles, and is defined as ”the difference between the observed frequency of a particular combination of alleles at two loci and the frequency expected for random association” (Robinson, 2004, pp. 1586). It is a measure for correlation between the SNPs. A haplotype is a combination of alleles in nearby loci in a DNA molecule. Ideally each SNP contribute to two alleles as in Table 1.1.

SNP 1 SNP 2 Haplotype A Haplotype B

0 0 0-0 0-0

0 1 0-0 0-1

0 2 0-1 0-1

1 0 1-0 0-0

1 1 1-0 or 1-1 0-1 or 0-0

1 2 0-1 1-1

2 0 1-0 1-0

2 1 1-0 1-1

2 2 1-1 1-1

Table 1.1:Table showing connection between SNP status and haplotype.

As we do not have the haplotype data, we do not know if the SNP status 1 at SNP 1 and 1 at SNP 2 is a representation of haplotype 0-1 and 1-0 or 0-0 and 1-1.

(16)

Figure 1.3:Illustration of the transcription from DNA to RNA, and the translation of RNA to protein.

Source: https://commons.wikimedia.org/wiki/File:Genetic code.svg, licensed under CC BY-SA 3.0

Due to this, we prefer to calculate the genotype correlation coefficient, which is a measure of composite LD. The formula is Corr(Y1, Y2), whereY1 andY2 are the SNP statuses of two SNPs.

1.4 Gene expression

Gene expression is the process where the information in a gene’s DNA sequence is transcribed and translated to the structures and functions of the cell. Usually the end products are proteins (Alberts et al., 2014, pp. 228). Gene expression includes the transcription from DNA to RNA and (for coded genes) translation of RNA to protein. This is illustrated in Figure 1.3. In this project we will look at RNA measurements of ITGAL, or Integrin alpha L, which is a gene involved in a variety of immune phenomena. We will analyse gene expression data from two technologies: microarray data are from the Illumina human HT-12 expression BeadChips, and RNASeq data from the sequencing (75 cycles single end reads) Illumina HiSeq4000 instrument.

1.5 eQTL

The expression quantitative trait loci (eQTL) is according to Nica and Dermitzakis (2012) ”the discovery of genetic variants that explain variation in gene expression levels”. An eQTL is a locus which explains a fraction of the genetic variance of a gene expression phenotype. In our case, the phenotype is the gene expression of the ITGAL gene. In the statistical model for discovering eQTLs also disease status and SNP status will be included. Regulatory variants are calledcisortransacting,

(17)

and depends on the distance from the transcriptional start site (TSS). The TSS is the location where transcription starts. Thecisacting is within 1 megabase (Mb) from TSS, on both sides, while thetransacting is within 5 Mb. The term 1 megabase means that we are looking at the nearest 1 000 000 base pairs. However, there are other ways to use the definitionscisandtrans, where the terms describe whether the regulation works directly or through other eQTLs. In this thesis, we will use the termslocalanddistant cis-actingto describe the distance from TSS. As discussed in Nica and Dermitzakis (2012), it might be difficult to detect differences between causal and reactive expression changes. The aim of eQTL is to find the association between SNPs and gene expression. In Chapter 3 we will look at SNPs located inside and within a distance from the gene ITGAL and relate these to the gene expression of ITGAL.

1.6 HUNT Cloud

HUNT Cloud is a digital infrastructure where researches can store, access and analyse sensitive data in controlled environments. This includes research unrelated to HUNT (Helseundersøkelsen i Nord-Trøndelag), as in this project. The data set analysed in this thesis contains data on genotyped (and imputed) SNPs for the participants in the study, and this makes it possible to identify them. For their protection, it is necessary to perform the analysis in a safe environment. We have used command line tools in Linux, and run R and Rstudio using X2Go. To access the data, two factor authentication was necessary. More information on HUNT can be found athttps://www.ntnu.no/hunt.

1.7 Thesis outline

The structure of the thesis is as follows: In Chapter 2 we present the theory of the statistical models used in this project, in order to understand the results. We also present the structure of the data set. In Chapter 3 we look at the model set-up and the connections to genetic models. The results from a running example are shown.

In Chapter 4 we look at the process for performing the calculations. The data set is presented properly, the theory from Chapter 2 is applied on the data set and the results are presented. We also expand our data set to investigate how this affects the results. In Chapter 5 we discuss the results and suggest directions for further work.

(18)

(19)

Chapter 2 Linear and generalized linear models

In this chapter we will present the statistical methods used to perform eQTL analysis. We will use the gene expression of the gene ITGAL and the SNP with identification number rs11150589 located within the ITGAL gene as a running example. From the microarray data we have 62 observations for this gene, and from RNASeq we have 75 observations. For each observation we have the SNP status and the gene expression represented as either a preprocessed value from the microarray technology or count data from the RNASeq technology. The microarray data will be analysed with multiple linear regression and RNASeq with a variant of generalized linear model. We also have information on disease status to include in the model.

2.1 Multiple linear regression model

This presentation is based on Fahrmeir et al. (2013, pp. 73-168).

For our eQTL analysis, we will use the multiple linear regression (MLR) model.

We use the following notation:

Yis a(n×1)vector of responses (random variables) Xis a(n×p)design matrix with rowsx^T_i fori= 1, ..., n βis a(p×1)vector of regression parameters including intercept εis a(n×1)vector of random errors

The classical normal linear regression model isY=Xβ+ε. We define the linear

(20)

predictor asηi =x^T_i β. The following is assumed:

1. E (ε) =0

2. Cov (ε)=E εε^T

=σ²I

3. The design matrixXhas full rank, rank(X) =p 4. ε∼N_n 0, σ²I

It is assumed that the observation pairs are measured from sampling units x^T_i ,Y_i and that the observation pairs are independent from each other. We assume that our sample is representative of some population of interest.

In the running example, the response vectorYis the microarray gene expression of ITGAL. The design matrix is X = [Xe Xg], whereerepresents the covariate disease and g represents the interaction between disease status and SNP status.

This means that in our running example, the vectorYand design matrixX(with additive coding, this is explained in Section 3.1) will be

Y=





 7.25 7.28 6.55 6.64 6.80 ... 7.59







X=







0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 ... ... ... ... ... ... ... ... ... ... 0 0 0 1 0 0 0 0 2 0







The columns 1-5 in the design matrix X represent the disease status. Column 1 shows N (which means samples without any bowel disease), column 2 shows

”Crohns disease inflamed (CDA)”, column 3 shows ”Crohns disease uninflamed (CDU)”, column 4 shows ”ulcerative colitis inflamed (UCA)” and column 5 shows

”ulcerative colitis uninflamed (UCU)”. The columns 6-10 shows the interaction

(21)

between disease status and SNP status, so they show the SNP status multiplied by respectively column 1-5. The observant reader may notice that there is no intercept nor main effect of SNP in this design matrix. This is explained further in Section 3.1.2, and is used because it is easier to make contrasts with this parameterization.

2.1.1 Parameter estimation

In MLR the aim is to estimate the regression parametersβandσ². We assume a sample of independent random variablesY=Y₁, ..., Y_n. EachY_ihas an univariate normal distribution

f y_i;β, σ²

= 1

√2πσ ·exp

− 1

2σ² y_i−x^T_i β2 .

Assuming independent joint distribution for the sampleY f y;β, σ²

=

n

Y

i=1

f yi;β, σ² ,

which gives the likelihood function L β, σ²;y

=f y;β, σ² .

It is common to use the natural logarithm of the likelihood, which is called the log- likelihood functionl β, σ²;y

. This makes the calculations easier, and is allowed because the log-function is a monotone function so these likelihood functions have optimum at the same point. This gives

l β, σ²;y

= ln L β, σ²;y

=

n

X

i=1

lnf yi;β, σ² .

The parametersβandσ²are unknown, but we may estimate them by maximising the log-likelihood. The maximum likelihood estimatesβˆ andσˆ²are the values of βandσ² respectively so that

lnL

β,ˆ ˆσ²;y

≥lnL β, σ²;y

for allβ, σ².

To find the maximum likelihood estimates, we want to maximisel β, σ²;y . This is done by solving this set of equations with respect to theβ-part:

∂l β, σ²;y

∂β = 0

(22)

which leads to the normal equations (in matrix notation) X^TXβˆ =X^TY.

The estimator forβˆis

βˆ= X^TX⁻¹ X^TY, with E

βˆ

=βand Cov βˆ

=σ² X^TX−1

. Hereβˆis multivariate normally distributed, because it is a set of linear combinations of the random variables inY which we know are independent and normally distributed.

The maximum likelihood estimator forσ² is found by maximising the likelihood inserted the estimate forβ, and the formula isˆ

ˆ σ²= 1

n

Y−Xβˆ T

Y−Xβˆ

= SSE n .

However, this is a biased estimator ofσ². It is known that ˆ

σ²(n−p)

σ² ∼χ²_n−p

so to get unbiased estimator, which is preferred, we use restricted maximum likelihood (REML). This gives

s² = 1 n−pSSE.

In our running example, we get







βb_diseaseN βbdiseaseCDA

βb_diseaseCDU βbdiseaseU CA

βbdiseaseU CU

βbdiseaseN:snp

βb_diseaseCDU_:snp βbdiseaseU CA:snp

βbdiseaseU CU:snp







=







4.589380 4.705833 4.591303 5.032506 4.707411 0.075255 0.087544 0.181252 0.047407







and

s² = 0.2175²= 0.04731.

(23)

2.1.2 Hypothesis testing

In single hypothesis testing, we want to test a null hypothesis against an alternative hypothesis. In the MLR we study

H₀ :β_j = 0 H1 :β_j 6= 0.

In hypothesis testing there are two different types of errors that can be made. The

”type I error” is when the null hypothesisH0is rejected, even thoughH0 is true.

The other type of error, ”type II error”, is whenH₀is not rejected, even thoughH₀ is false. This can be illustrated in a table:

Not rejectH0 RejectH0

H₀ true Correct Type I error H0 false Type II error Correct

To choose whetherH₀ should be rejected or not, we can calculate a p-value. A p-value is defined informally as the probability of our result or a more extreme result, given thatH0is true. TheH0 is chosen to be rejected at some significance levelαif thep-value is smaller than the chosenα. Thep-value is based on a test statistic, and for the MLR the test statistic for testingH0is

T_0j =

βˆ_j−0 ˆ

σ q

X^TX−1 [j,j]

∼tn−p,

where X^TX−1

[j,j]is thej-th element of the diagonal of the X^TX−1

matrix. The formula for our two-sided test uses the test statisticT0jand the observed test statistic valuet0j. Thet-distribution is symmetric around zero, so thep-value can be calculated as

p-value=P(T0j > abs(t0j)) +P(T0j <−abs(t0j)) = 2·P(T0j > abs(t0j)).

In our running example, thep-values are shown in Table 2.1. We choose the significance level to beα = 0.05. This means that the diseaseN, diseaseCDA, dis- easeCDU, diseaseUCA, diseaseUCU and diseaseUCA:snp are significant, while diseaseN:snp, diseaseCDA:snp, diseaseCDU:snp and diseaseUCU:snp are not significant.

(24)

Estimate Std. Error t value Pr(>|t|) diseaseN 4.589380 0.101404 45.258166 4.90E-44 diseaseCDA 4.705833 0.217488 21.637173 5.67E-28 diseaseCDU 4.591303 0.126986 36.156030 4.99E-39 diseaseUCA 5.032506 0.117950 42.666573 1.03E-42 diseaseUCU 4.707411 0.081622 57.673367 1.66E-49 diseaseN:snp 0.075255 0.082796 0.908918 3.68E-01 diseaseCDU:snp 0.087544 0.108744 0.805049 4.24E-01 diseaseUCA:snp 0.181252 0.079748 2.272795 2.71E-02 diseaseUCU:snp 0.047407 0.063224 0.749829 4.57E-01 Table 2.1:Summary of the results from fitting a MLR to the running example. The number 2.16E-52 is the scientific notation of2.16·10⁻⁵².

2.1.3 Model assessment

There are different ways to measure how well a model fits the data. For the MLR, one way is to look at the total variability in the data, called the sums-of-squares total (SST). The SST can be decomposed into one part that is explained by the regression, sums-of-squares regression (SSR), and one part that is not explained by the regression, sums-of-squares error (SSE). UsingY¯ = _n¹Pn

i=1YiandYˆi=x^T_i β,ˆ then we have

SST=SSR+SSE

where

SST=

n

X

i=1

Yi−Y¯2

SSR=

n

X

i=1

Yˆ_i−Y¯2

SSE=

n

X

i=1

Yi−Yˆi

2

We can use this to define the coefficient of determination R², which is the ratio between SSR and SST. This gives

R²=SSR/SST= 1−SSE/SST.

(25)

(a)Normal QQ-plot. (b)Residuals vs fitted values.

Figure 2.1:Residual plots for the running example.

.

ThisR² is the squared correlation coefficient between the observed and predicted response. In our running example we haveR² = 0.9983. This means that more than 99 % of the variability in the data is explained.

Another way to determine whether a model fit is good or not, is to look at a normal QQ-plot and the plot showing residuals versus fitted values. The residuals e_i =Y_i−Yˆ_iare predicted values for the error termsε_i. The normal QQ-plot can be used to evaluate the assumption of normality of error terms. The residuals versus fitted values-plot shows if the residuals have non-linear patterns. This can be used to test the assumption of a linear relationship between the response and the covariates. For our running example, plots are shown in Figure 2.1. The normal QQ-plot looks good because the values follow the straight line, while the plot of the residual versus fitted values looks good because there are no clear trends.

The Anderson-Darling normality test can be used to test if a sample comes from a normal distribution, so we have

H0:The data follow a normal distribution H1:The data do not follow a normal distribution and the test statistic, as reviewed in Das and Imon (2016), is

A² =−

1 +0.75

n +2.25 n²

hP_n

i=1[(2i−1)log zˆ_(i) 1−zˆ_(n−i+1) ]

n +n

i ,

(26)

wherezˆ_(i) = Φ [y_(i)−µ]/ˆˆ σ

andΦ (·)is the cumulative distribution function of anN (0,1)random variable. To find the distribution ofAunderH₀, see Table 2 in Stephens (1974). In the R packagenortest(Gross and Ligges, 2015), a table from another publication of Stephens is used.

2.2 Generalized linear models

In a regression setting, whereY_iis the response andx_iare covariates, we describe the generalized linear model (GLM) consisting of three ingredients:

1) Distribution ofYi(random component) 2) Linear predictorη_i(systematic component)

3) Link functiong(link between E(Yi) and linear predictorηi) 2.2.1 Distribution ofY_i

For a generalized linear model, the distribution ofY_ican be written as a univariate exponential family

f(yi |θi) = exp

yiθi−b(θi)

φ +c(yi, φ)

whereθi is the canonical parameter,φis the nuisance parameter andbandcare known functions. It can be shown thatE (Yi)=b⁰(θi) =θandVar (Yi)=b⁰⁰(θi)·φ.

Further, V(µ_i) =b⁰⁰(θ_i)is called the variance function. The linear model, presented in Section 2.1, is a GLM withYi ∼N.

2.2.1.1 Normal distribution f y_i;µ_i, σ²

= 1

√2πσ ·exp

− 1

2σ²(y_i−µ_i)²

WhenYi ∼ N µi, σ²

, thenθi = µi,b(θi) = ¹₂θ_i² andφ = σ². ThusE (Yi) = b⁰(θ_i) =µ_i andVar (Y_i)=b⁰⁰(θ_i)·φ=σ².

2.2.1.2 Negative binomial distribution

A common parameterization for the probability mass function is

f(yi;µi, α) = Γ yi+_α¹ Γ _α¹

Γ (y_i+ 1) µi

µ_i+ ¹_α

!yi 1 α

µ_i+_α¹

!_α¹

, yi = 0,1,2...

(27)

According to Agresti (2015, pp. 247-250), α acts like ”a type of dispersion parameter”. The reason is that the negative binomial distribution can be seen as a mixture model for count data. Conditional on a parameterλ, letY be Poisson(λ).

The Poisson(λ) has mean and varianceλ. Further, letλhave a gamma distribution.

Then it can be shown that marginallyY has negative binomial distribution. If we look at Var(Yi)=µi+αµ²_i in the negative binomial distribution, whenα→0then Var(Yi)→µi and the negative binomial distribution can be proven to converge to the Poisson distribution. So in this senseαacts as a ”dispersion parameter”. When αis fixed, then according to de Jong and Heller (2008, pp. 39), this is an univariate exponential family with

θi= ln µi

αµ_i+ 1

, b(θi) =−1 αln

1−αe^θⁱ

, φ= 1

Thus,

E (Yi) =b⁰(θi) = e^θⁱ

1−αe^θⁱ = µi/(1 +αµi)

1−α·µ_i/(1 +αµ_i) =µi

Var (Y_i) =b⁰⁰(θ_i)·φ= e^θⁱ

(1−e^θⁱ)² ·1 = µi(1 +αµi)

(1 +αµ_i)²−2αµ_i(1 +αµ_i) +α²µ²_i

=µi(1 +αµi).

The variance function isb⁰⁰(θi) = Var (Yi).

2.2.2 Linear predictor

The linear predictor is the same as for the linear model, η_i =x^T_i β

where

xi is a(p×1)vector βis a(p×1)vector 2.2.3 Link function

The link function g is used to connect E(Yi) =µi to the linear predictorηi. We have ηi = g(µi), and assume that g is monotone and twice differentiable. The inverse function is called the response function:h(η_i) =µ_i. The canonical link is

η_i =θ_i ⇐⇒g(µ_i) =θ_i.

(28)

Figure 2.2:The parametersθi,µi,βand the linear predictorηiare connected.

2.2.3.1 Normal distribution

For the normal distribution the canonical link isg(µi) =µi. 2.2.3.2 Negative binomial distribution

For the negative binomial distribution, the canonical link isg(µi) =θi = ln µi

µi+α

, but we will useg(µ_i) = ln (µ_i).

2.2.4 Parameter estimation

In GLM, the aim is to estimateβandφ. We might as well estimate the parameters µi,ηiorθi, as they are all connected as shown in Figure 2.2.

The log-likelihood function is written on the form l(β) =

n

X

i=1

li(β) =

n

X

i=1

1

φ(yiθi−b(θi)) +

n

X

i=1

c(yi, φ).

To estimate the unknown parameters, we sets(β) = _∂β^∂l = 0. This is called the score function, ands(β)is a(p×1)vector. This gives

s(β) = ∂l

∂β =

n

X

i=1

∂l_i

∂β =

n

X

i=1

s_i(β)

where

li= yiθi−b(θi)

φ +c(yi, φ).

(29)

To find the score function, each component is calculated by using the chain rule:

si(β) = ∂li

∂β = ∂li

∂θi

· ∂θi

∂µi

·∂µi

∂ηi

·∂ηi

∂β = yi−b⁰(θi)

· 1

b⁰⁰(θi) ·h⁰(ηi)·xi

= (yi−µi)· 1

Var(Yi) ·h⁰(ηi)·xi. The score function is

s(β) =

n

X

i=1

(yi−µi)·xi·h⁰(ηi) Var (Yi) .

2.2.4.1 Normal distribution

For the normal distribution, the log-likelihood function is l(β) =√

2π−ⁿ

2 ·σ⁻ⁿ² ·exp −1 2

n

X

i=1

(yi−µi)² σ²

!

and the score function is

s(β) =

n

X

i=1

yi−x^T_i β

·xi

σ² .

This is easy to solve. Theβˆis as shown in Section 2.1.1, βˆ= X^TX−1

X^TY.

2.2.4.2 Negative binomial distribution

For the negative binomial distribution, the log-likelihood function is L(µ, α,) =

n

X

i=1

log Γ

y_i+ 1

α

−log Γ 1

α

−log Γ (y_i+ 1)

+

n

X

i=1

yilog

αµi

1 +αµi

− 1

α

log (1 +αµi)

.

Using the log-link,η_i= ln(µ_i)andη_i=x^T_i β, the score function is s(β) =

n

X

i=1

yi−exp x^T_i β

·xi·exp x^T_i β exp x^T_i β

1 +αexp x^T_iβ (2.1)

(30)

Newton-Raphson and Fisher scoring

For the negative binomial distribution the equation s(β) = 0 does not have a closed form solution, but can be solved using Newton’s method (Quarteroni et al., 2007, pp. 311):

β^(t+1)=β^(t)−H

β^(t)−1

s β^(t)

whereH(β)is the Hessian on the log-likelihood, also called the observed Fisher information matrix,

H(β) = ∂l(β)

∂β∂β^T.

In statistics we use the expected information matrix F⁻¹( ˆβ^(t)) instead of the observed Fisher information matrixH

βˆ^(t)⁻¹

, which gives the Fisher scoring method:

β^(t+1) =β^(t)+F⁻¹ β^(t)

s β^(t)

. (2.2)

The expected Fisher information matrix is in general given as F(β)_[h,l]=

n

X

i=1

x_ihx_il[h⁰(ηi)]² Var (Y_i) ,

for an exponential family GLM model. This can be rewritten into

F(β) =X^TWX (2.3)

whereW = diag

h⁰(ηi)² Var(Yi)

. In our case, the negative binomial distribution with the log-link, W = diag

exp(^x^Tiβ)²

exp(^x^Tiβ)(^1+α^exp(^x^Tiβ))

. To find β, insert (2.1) andˆ (2.3) into (2.2). This is possible as long asαis fixed.

2.2.5 Properties of parameter estimators It can be shown that asymptotically

βˆ ≈N_p β, F⁻¹(β) and even

βˆ ≈Np

β,Fˆ⁻¹

βˆ

whereFˆ⁻¹ βˆ

is as in (2.3), but insertingβˆintoWto getW.ˆ

(31)

2.2.6 Wald test

For the GLM we want to test the multivariate hypotheses H₀:Cβ =d

H1:Cβ 6=d

whereC is a(r×p)matrix. An estimator forCβisCβ, whereˆ E

Cβˆ

=Cβ

Cov

Cβˆ

=C·Cov βˆ

C^T =CF⁻¹(β)C^T The Wald test statistic is

w=

Cβˆ −dT

[CF⁻¹ βˆ

C^T]⁻¹

Cβˆ −d

which is asymptoticallyχ²-distributed withrdegrees of freedom (Fahrmeir et al., 2013, pp. 286).

2.3 Several SNPs and multiple testing

In this project we want to test multiple SNPs, and for this we may use multiple hypothesis testing. We assume that we have m hypothesis tests, which givesm p-values, and then we choose a cut-off on thep-values at a local significance level αlocto decide if we want to reject each null hypothesis. The null hypotheses with p-value lower than α_loc are rejected, and this gives R rejected null hypotheses.

The number of false null hypotheses that are rejected hypotheses is calledS, and the number of true null hypotheses that are rejected is calledV (type I error). The number of true null hypotheses that is not rejected is calledU, and the number of false null hypotheses that are not rejected is calledT (type II error). This gives the Table 2.2. The only quantities that are observed areRandm.

Not rejectH₀ RejectH₀ Total

H0true U V m0

H₀false T S m−m₀

Total m−R R m

Table 2.2:Multiple testing set-up (Benjamini and Hochberg, 1995).

(32)

2.3.1 Familywise error rate

The familywise error rate (FWER) is defined as the probability of one or more false positive findings

FWER=P(V >0)

where V is the number of type I errors in the mhypothesis tests. Here V is an unobservable random variable. To control the FWER we find a local significance level cut-offα_locto be used on each of themhypothesis tests.

2.3.2 Bonferroni method

The Bonferroni method is a method for controlling the FWER. We have FWER=P(R1∪ · · · ∪Rm)≤

m

X

j=1

P(Rj) =

m

X

j=1

α_loc=mα_loc

whereRj is the event where we reject the hypothesis numberj. Setting the FWER toα, we solve

mα_loc=α which gives the local significance level

α_loc= α m for the Bonferroni method.

(33)

Chapter 3 Analysing gene expression data

In this chapter, we will apply the theory from Chapter 2 on the dataset briefly presented in Chapter 1. We will start by looking at design matrices and their connections to genetic models, interactions between covariates and introduce the term contrasts. Then we present the two technologies and use them both to analyse one SNP.

3.1 Design matrices and contrasts

There are different ways to look at the effect of a SNP. In this thesis we will look at additive SNP effects and relation between additive SNP effect and disease. For completeness next we give a presentation of the additive, dominant, recessive and codominant genetic models.

3.1.1 Genetic models

Let the most common variant of the SNP be calleda, while the least common is calledA. The SNP statuses are represented by numbers, whereaa = 0,Aa= 1 and AA = 2. First, the additive model assumes that the mean increase in the response when comparing two different SNP statuses is linear dependent on the number ofas. The dominant model assumes that the change in the response depends on whether there is anyAin the SNP or not, regardless of how many. The recessive model assumes that the only change in response is when the SNP isAA.

The codominant model assumes that the response changes, but not linearly, with SNP status. This is illustrated in Figure 3.1.

(34)

Figure 3.1:This figure shows different genetic models. On the vertical axis we have gene expression on some scale.

The genetic models can be represented as codings in design matrices. We will now consider a hypothetical data set with three individuals with SNP statuses= [aa Aa AA]^T. The additive model is linear, where the SNP status is a linear factor:

X_add=



 0 1 2





The recessive model represents whether the SNP status isAAor not:

Xres=



 0 0 1





The dominant model represents whether the SNP status isaaor not:

X_dom=



 0 1 1





(35)

The codominant model can be regarded as one factor with three levels. The SNP statusaais the reference level. The first column represents whether the SNP status isAaor not, and the second column whether the status isAAor not. The model is then:

X_cod =



 0 0 1 0 0 1





As mentioned above, we will use the additive encoding in this thesis.

3.1.2 Interactions

We now examine two different covariates (SNP status and disease status), to study their effect in the linear predictor. These effects are not necessarily additive, and there might be an additional effect because the effect of the SNP on the response varies for different diseases. This means that we have to consider models with an additional interactive effect as well. Consider 15 individuals, where these 15 have all possible combinations of the 5 diseases statuses and the 3 SNP statuses.

For our running example with the additive model, theβand design matrix for the microarray data will be:

β=







β_diseaseN β_diseaseCDA βdiseaseCDU

βdiseaseU CA

βdiseaseU CU

βdiseaseN:snp

βdiseaseCDA:snp

β_diseaseCDU_:snp βdiseaseU CA:snp

βdiseaseU CU:snp







(36)

X_micro =







1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2







(3.1)

This means that in row 1 the sample has disease status N and SNP status 0, in row 5 the sample has disease status CDA and SNP status 1 and in row 15 the sample has disease status UCU and SNP status 2. The RNASeq data does not contain any samples with disease status N, so theβand the design matrix are

β=







βdiseaseCDA

β_diseaseCDU βdiseaseU CA

βdiseaseU CU

βdiseaseCDA:snp

β_diseaseCDU_:snp βdiseaseU CA:snp

βdiseaseU CU:snp







(37)

X_{RN ASeq} =







1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 2







(3.2)

This means that in row 1 the sample has disease status CDA and SNP status 0, in row 5 the sample has disease status CDU and SNP status 1 and in row 12 the sample has disease status UCU and SNP status 2.

We have chosen a parameterization with main effect of disease and interaction between SNP and disease. Instead, we could have chosen a parametrization with intercept, dummy variable coding of main effect of disease, linear (additive) main effect of SNP and dummy variable coding of the interaction between SNP and disease. In both of these two parameterizations we would be able to estimate the same number of parameters, 10 for the microarray data and 8 for the RNASeq data.

3.1.3 Contrasts of interest

Assume we are interested in looking at the difference between inflamed and uninflamed tissues, independent of the disease status. Then the data are grouped into two groups: CDA+UCA (inflamed) and CDU+UCU (uninflamed). To study this, we multiply theβvector with a vectorC. TheCvector can be written in different ways, depending on which elements ofβwe want the difference between. In our case, and for the RNASeq data, there are four interesting C vectors, calledC₀, C1,C2andC3. The first is

C0 =

1 −1 1 −1 0 0 0 0 and shows the difference

(β_diseaseCDA+βdiseaseU CA−(β_diseaseCDU +βdiseaseU CU)).

(38)

This is the difference between inflamed and uninflamed tissues when the SNP status equals 0. The secondC vector isC₁

C₁=

1 −1 1 −1 1 −1 1 −1 and shows the difference

β_diseaseCDA+βdiseaseCDA:snp+βdiseaseU CA+βdiseaseU CA:snp

−(β_diseaseCDU +β_diseaseCDU_:snp+βdiseaseU CU +βdiseaseU CU:snp).

This is the difference between inflamed and uninflamed tissues when the SNP status equals 1. The thirdC vector isC₂

C2=

1 −1 1 −1 2 −2 2 −2 and shows the difference

β_diseaseCDA+ 2βdiseaseCDA:snp+βdiseaseU CA+ 2βdiseaseU CA:snp

−(βdiseaseCDU + 2βdiseaseCDU:snp+βdiseaseU CU + 2βdiseaseU CU:snp).

This is the difference between inflamed and uninflamed tissues when the SNP status equals 2. The fourthCvector is

C3 =

0 0 0 0 1 −1 1 −1 and shows the difference

(βdiseaseCDA:snp+βdiseaseU CA:snp−(β_diseaseCDU_:snp+βdiseaseU CU:snp)). This is the difference between the effects of the change in SNP status (from 0 to 1 or from 1 to 2) for inflamed and uninflamed tissues. See above thatC₃ =C₁−C₀. Since we have additive coding of SNP,C3can also be interpreted asC2−C1. For the microarray data, theβincludes N, so we have

C0 =

0 1 −1 1 −1 0 0 0 0 0 and

C3 =

0 0 0 0 0 0 1 −1 1 −1

For both microarray data and RNASeq data, we do not have complete data, and the contrast vectors will be adjusted. There are only one observation for CDA in

(39)

our microarray data set, so we can not estimateβdiseaseCDA:snp. This means that the contrast vectors change to

C₀=

0 1 −1 1 −1 0 0 0 0

C3=

0 0 0 0 0 0 −1 1 −1 and similarly for other types of missing data.

3.2 Microarray and LM

For the microarray data, we will use LM. We have recieved data which are already preprocessed, by so-called quantile normalization and transformed to the log₂- scale. As we will see in Section 3.3, the RNASeq data will be analysed on natural log scale. Hence, the preprocessed microarray data was divided bylog₂(exp(1)) to more easily compare results across technologies. Using the methods presented in Chapter 2 with design matrix (3.1), the fitted model for our running example rs11150589 is presented in Figure 3.2 and Table 3.1.

Figure 3.2:Fitted model for rs11150589. The observed values are represented as dots.

(40)

Estimate Std. Error t value Pr(>|t|) diseaseN 4.589380 0.101404 45.258166 4.90E-44 diseaseCDA 4.705833 0.217488 21.637173 5.67E-28 diseaseCDU 4.591303 0.126986 36.156030 4.99E-39 diseaseUCA 5.032506 0.117950 42.666573 1.03E-42 diseaseUCU 4.707411 0.081622 57.673367 1.66E-49 diseaseN:snp 0.075255 0.082796 0.908918 3.68E-01 diseaseCDU:snp 0.087544 0.108744 0.805049 4.24E-01 diseaseUCA:snp 0.181252 0.079748 2.272795 2.71E-02 diseaseUCU:snp 0.047407 0.063224 0.749829 4.57E-01

Table 3.1:Summary of the fitted model for rs11150589.

There are only one observation of a sample with CDA, so there are no estimated effect of SNP for CDA. Out of the covariates representing interaction between SNP status and disease status,βdiseaseU CA:snpis the only significant one (at level 0.05).

To look at the effect of the difference between inflamed and uninflamed tissues, we use the contrast vectorC0which is presented in Section 3.1.3. We study

H0:C0β= 0 H₁:C₀β6= 0.

We have

C0βˆ = ˆβ_diseaseCDA+ ˆβdiseaseU CA−( ˆβ_diseaseCDU + ˆβdiseaseU CU)

= 4.705833 + 5.032506−(4.591303 + 4.707411) = 0.439625 and the Wald test gives ap-value of

W_disease= 0.01582864

This means that there is a significant effect of the difference between inflammable and un-inflammable tissues. When we look at the contrast vectorC3, we have

H₀ :C₃β= 0 H₁ :C₃β6= 0 where

C₃βˆ = ˆβdiseaseU CA:snp−( ˆβ_diseaseCDU_:snp+ ˆβdiseaseU CU:snp)

= 0.181252−(0.087544 + 0.047407) = 0.046301

(41)

Figure 3.3:Residual plots for rs11150589.

The Wald test gives ap-value of

W_{SN P i}= 0.3832292

which means that there are no significant effect of the difference between inflammable and un-inflammable tissues when we are looking at the interaction between disease status and SNP status. The normal QQ-plot is shown in Figure 3.3a and the plot for the residuals vs fitted values is shown in Figure 3.3b, and they look good. The Anderson-Darling normality test has ap-value = 0.001265. This means that even though the plots look good, we have to show caution when interpreting p-values, because the error terms are not necessarily normally distributed.

3.3 RNASeq and GLM

We want to compare the gene expressions for patients with different SNP statuses and disease statuses. When analysing RNASeq data, the gene expression is represented as count data. We used ESNG-number for ITGAL to find correct transcript. For preprocessing, the transcript expression values were generated by quasi alignment using Salmon and the Ensemble (GRCh38) human transcriptome.

Aggregation of transcript to gene expression was performed using tximport. Gene expression values with CPM (counts per million) below one in more than three samples were filtered out before differential expression analysis. The preprocessing was performed at the group of Atle van Beelen Granlund, and we received preprocessed count data.

(42)

According to Holmes and Huber (2018), there are challenges in using count data.

This is because their distribution is not symmetric, and the variance and distribution shape of the data in different parts of the dynamic range are very different.

To not deviate too much from the notation used in Love et al. (2014a), we use the index i for the gene and j for the sample in this section. To compare our observations, due to the challenges in using count data, the data from different samples have to be scaled with a factor. To calculate this factor, we use the data from all available genes. In our case we have 58 051 genes. This scaling factor is called a size factor, and is a different number for each sample. The size factorsjis calculated using the R functionestimateSizeFactorsfrom the R package DESeq2(Love et al., 2014b), which uses the ”median ratio method” as described in Love et al. (2014a). For our data set, the size factors are

s=







1.0467966 1.0031062 1.0885007 0.8742161 1.0313613

... 1.2620621







The distribution of the size factors for all samples in our data set are presented in Figure 3.4.

Let K_ij be the RNASeq count for gene i and sample j, and m the number of samples, then we have

ˆ

sj =mediani

K_ij (Π^m_v=1K_iv)^1/m

where the denominator gives the geometric mean. Further, we assume K_ij ∼NB(µ_ij, α_i)

where NB(µ_ij, α_i) denotes the negative binomial distribution with meanµ_ij and dispersion parameterαi. In Chapter 2, we saw thatαcan be regarded as a kind of dispersion parameter, and following Love et al. (2014a) we refer toαas a dispersion parameter here. We further assume that

µij =sjqij

log(qij) =x^T_jβ_i

(43)

Figure 3.4: Distribution of size factors for all samples in our data set. The range of the data is 0.46-1.35.

In this thesis we are only looking at one gene (ITGAL), so we use the negative binomial model for one gene, omitting theisubscript. The observed count of the ITGAL RNASeq gene expression isK_j, where

Kj ∼NB(µj, α) with

µj =sjqj

log(qj) =x^T_jβ

Observe thatµ_j is a product. Using a log-link this can be written as log(µj) = log(sj) + log(x^T_jβ)

where the term log(s_j) is called an offset and is considered a constant and not estimated in the GLM routine. In the GLM, we also assume that α is known. In GLM we have usedφ = 1 for negative binomial, but it can also be estimated in R. To estimate the dispersion factorα, the R functionestimateDispersions is used. This function calculates dispersion estimates for negative binomial distributed data, borrowing strength across all genes, see Love et al. (2014a) for de- tails on the estimation. The dispersion estimate for ITGAL is

ˆ

α_{IT GAL}= 0.2202738.

(44)

Figure 3.5:The distribution of dispersion factors for all genes in our data set.

The distribution of the dispersion factors for all genes in our data set are presented in Figure 3.5.

To perform hypothesis testing for many genes simultaneously, the R function nbinomWaldTest was considered, but it will not be used. This is because we have a large enough sample size to perform hypothesis testing without borrowing strength from other observations. We will instead use the R functionglmwith the negative binomial family from the R packageMASS(Venables and Ripley, 2002).

For comparison, we also show the result of running thenbinomWaldTest.

Using the methods presented in Chapter 2 with design matrix (3.2), the fitted model for our running example rs11150589 is presented in Figure 3.6 and Table 3.2. We want to plot the observed data. For plotting, it is recommended to use normalised data, transforming the data to get the same variance. There are many ways to do this. In Love et al. (2014a) it is recommended to use regularized logarithm transformation (rlog), which must be calculated based on all genes simultaneously. This operation is time-consuming when we have many genes. Another way to trans- form the data is to use a variance-stabilising transformation (vst), which stabilise the variance, but this transformation does not take the size factor into account.

This means thatrlogmight be better, but due to this being very time consuming and we are only interested in the ITGAL gene, we have chosen to usevstin this thesis because it is faster. According to Love et al. (2014a) the value of rlog(K_ij)

(45)

for large counts is approximately equal to log2

_K

ij

sj

. Thevst values are used for plotting the observed values from the RNASeq data in Figure 3.6, Chapter 4 and Appendix C. The data shown arevstvalues divided bylog(exp(1)), because we work on the natural log scale, while the DESeq2 uses log2 scale. Table 3.3 shows the results using the R functionnbinomWaldTest, which uses almost 4 minutes and perform the calculations for all genes, but for one SNP. As we can see, the results are similar to the results fromglmand we prefer usingglmdue to running time and because we are only interested in the results for ITGAL.

Figure 3.6:Fitted model for rs11150589.

Estimate Std. Error z value Pr(>|z|) diseaseCDA 7.464120 0.185182 40.306983 0.00E+00 diseaseCDU 6.209766 0.196842 31.546945 1.97E-218 diseaseUCA 7.209524 0.171076 42.142231 0.00E+00 diseaseUCU 6.369039 0.171791 37.074384 7.27E-301 diseaseCDA:snp 0.057709 0.157115 0.367307 7.13E-01 diseaseCDU:snp 0.330731 0.209419 1.579282 1.14E-01 diseaseUCA:snp 0.153356 0.142041 1.079659 2.80E-01 diseaseUCU:snp 0.177812 0.150583 1.180825 2.38E-01 Table 3.2:Summary of the fitted model for rs11150589 using the R functionglm.

None of the covariates representing interaction between SNP status and disease status are significant at level 0.05.

(46)

Estimate Std. Error z value Pr(>|z|) diseaseCDA 7.464142 0.182841 40.823126 0.00E+00 diseaseCDU 6.209770 0.194369 31.948423 5.68E-224 diseaseUCA 7.209495 0.168915 42.681150 0.00E+00 diseaseUCU 6.369016 0.169635 37.545339 0.00E+00 diseaseCDA.rs11150589 0.057687 0.155128 0.371867 7.10E-01 diseaseCDU.rs11150589 0.330730 0.206784 1.599404 1.10E-01 diseaseUCA.rs11150589 0.153386 0.140246 1.093696 2.74E-01 diseaseUCU.rs11150589 0.177833 0.148692 1.195985 2.32E-01 Table 3.3: Summary of the fitted model for rs11150589 using the R function nbinomWaldTest.

To look at the effect of the disease groups, we use the contrast vectorC₀which is presented in Section 3.1.3. We study

H0:C0β= 0 H1:C0β6= 0.

We have

C0βˆ = ˆβdiseaseCDA+ ˆβdiseaseU CA−( ˆβdiseaseCDU + ˆβdiseaseU CU)

= 7.464120 + 7.209524−(6.209766 + 6.369039) = 2.094839 and the Wald test gives ap-value of

W_disease = 1.522387E−08.

This means that there is an significant effect of the difference between inflammable and un-inflammable tissues. When we look at the contrast vectorC₃, we have

H0 :C3β= 0 H₁ :C₃β6= 0 where

C₃βˆ = ˆβdiseaseCDA:snp+ ˆβdiseaseU CA:snp−( ˆβ_diseaseCDU_:snp+ ˆβdiseaseU CU:snp)

= 0.057709 + 0.153356−(0.330731 + 0.177812) =−0.297478 The Wald test gives ap-value of

W_{SN P i}= 0.382014

(47)

Figure 3.7:Residual plots for rs11150589.

which means that there are no significant effect of the difference between inflammable and un-inflammable tissues when we are looking at the interaction between disease status and SNP status.

The normal QQ-plot is shown in Figure 3.7a and the plot for the residuals vs fitted values is shown in Figure 3.7b, and they look good. Even though we have fitted a GLM model with negative binomial distribution, it is not unusual to study residual plots for trends. We have plotted the deviance residuals, as explained in Dunn and Smyth (2018, pp. 297-300).

(48)

(49)

Chapter 4 eQTL analyses

4.1 Process overview

Figure 4.1: Flowchart explaining the operations performed on the gene expression and SNP data.

To perform the analysis presented in Chapter 3 and this chapter, we have executed a series of operations on the gene expression and SNP data. This is presented in the flowchart in Figure 4.1. We will now explain the process. The starting point is that the researcher want to perform an eQTL analysis for a specific gene. The

Identifying expression quantitative trait loci in patients with inflammatory bowel disease

Identifying expression quantitative trait loci in patients with

inflammatory bowel disease

Statistical models relating gene expression and single nucleotide polymorphism data

Master's thesis

Kristine Lund Mathisen

Identifying expression quantitative trait loci in patients with inflammatory bowel disease

Kristine Lund Mathisen

Summary

Preface

Table of Contents

Chapter 1

Introduction

1.1 Inflammable bowel disease

1.2 SNP-data

1.3 Linkage disequilibrium and genotype correlation

1.4 Gene expression

1.5 eQTL

1.6 HUNT Cloud

1.7 Thesis outline

Chapter 2

Linear and generalized linear models

2.1 Multiple linear regression model

2.2 Generalized linear models

2.3 Several SNPs and multiple testing

Chapter 3

Analysing gene expression data

3.1 Design matrices and contrasts

3.2 Microarray and LM

3.3 RNASeq and GLM

Chapter 4

eQTL analyses

4.1 Process overview