Interpretability of convolutional neural networks when applied to receptor sequence data

(1)

networks when applied to receptor sequence data

Aleksander Lunevski Husa

Thesis submitted for the degree of

Master in Computational Science: Bioinformatics 60 credits

Institutt for informatikk

Faculty of mathematics and natural sciences UNIVERSITY OF OSLO

Autumn 2020

(2)

(3)

Interpretability of convolutional neural networks when applied to receptor sequence data

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(4)

Machine learning has become more ubiquitous and complex in recent years, and it is therefore important that their inner workings are well understood. This holds especially true for the technicians and scientists who apply the models to real world problems.

The ability of machine learning models to capture patterns in complex data makes them ideal for many real world applications.

Our adaptive immune systems contain a magnitude of immune cells, some of which detects pathogens. The receptors of these immune cells can be matched with the epitopes of the antigens, and an immune response will be triggered if a match is found.

Understanding which receptors connects with which epitopes based on their amino acid compositions is an active area of research.

In this thesis we explore the interpretability of convolutional neural networks when applied to receptor data on an amino acid level. Convolutional neural networks are trained to distinguish between sequences that connect with an epitope, and sequences that does not connect to the epitope. The goal of this thesis is not to design a model that achieves a perfect performance, but to explore how the models fit to the datasets. The initial testing is done using a simple simulated dataset, which has a single strong signal implanted. Different hyperparameters and model architectures are tested to understand how they impact the model’s ability to capture the signal. The second part of this thesis involves using a more realistic dataset, and the results from the testing on the simulated dataset are used to inform the investigation into the new dataset. We find that a few of our previously held intuitions about how the networks learn the signals does not quite match reality.

(5)

I would like to express my gratitude to my supervisors Geir Kjetil Sandve and Milena Pavlovic for giving me the opportunity to work on this project, for giving me guidance throughout the process and for always being available on Slack to answer my numerous questions. I also wish to thank my friends and family for their continued support during my studies. Finally, I would to thank Helle for proofreading this thesis, and for always being the best support I could ask for.

(6)

1 Introduction 1

2 Theory 3

2.1 Machine learning . . . 3

2.1.1 One hot encoding . . . 4

2.1.2 Deep neural networks . . . 4

2.1.3 Regularization . . . 15

2.1.4 CNN . . . 15

2.1.5 Other machine learning methods . . . 22

2.2 Immune system . . . 23

2.2.1 Activation of the adaptive immune system . . . 23

2.2.2 The immune response . . . 24

2.2.3 The memory of the immune system . . . 24

2.2.4 The production of T-cells . . . 24

2.3 Levenshtein distance . . . 26

3 Methods 27 3.1 ImmuneML . . . 27

3.2 Dataset . . . 28

3.2.1 Uniformly distributed amino acid dataset . . . 28

3.2.2 Real-world dataset . . . 29

3.3 Models . . . 29

3.3.1 Initial model . . . 30

3.3.2 Model with max pooling . . . 30

3.3.3 Model with multiple kernels . . . 32

3.3.4 Deeper CNN . . . 32

3.3.5 PyTorch . . . 32

3.4 Kernel score . . . 33

4 Results 35 4.1 Exploration of interpretability on simulated dataset . . . 35

4.1.1 Max pooling . . . 40

4.1.2 Models with five kernels . . . 42

(7)

4.2 Impact on model interpretability when applied to real-world dataset . . 45

4.2.1 Models with different kernel widths and number of kernels . . . . 51

4.2.2 Impact of multiple kernels . . . 54

4.2.3 Exploring sequence similarities . . . 56

4.2.4 Comparing prediction performance to alternative machine learning models . . . 61

5 Discussion 62 5.1 Simulated dataset . . . 62

5.1.1 Max pooling . . . 65

5.1.2 Models with five kernels . . . 67

5.2 Real-world dataset . . . 68

5.2.1 Models with different kernel widths and number of kernels . . . . 69

5.2.2 Sequence similarities . . . 69

5.2.3 Alternative machine learning models . . . 70

5.3 Future work . . . 71

5.3.1 Kernel score . . . 71

5.3.2 Parameter restriction . . . 71

5.3.3 Other encodings and dataset . . . 72

6 Conclusion 73

Bibliography 73

(8)

Introduction

The use of, and the research into machine learning has grown immensely the last decade.

Machine learning methods are being used in an increasing amount of scientific fields, and the industries are trying to make these new advances into a source of economic growth.

It has therefore gotten more important to make the models (especially the deep neural networks) more transparent. It is crucial that we can discern how a machine learning model makes its prediction when the model is used for new scientific breakthroughs or by companies, which are accountable to their consumers.

Convolutional neural networks (CNN) are often used on images, and a lot of the explanations of the networks inner workings use image data as examples. It is often explained that the kernels in the first layer of the network capture basic shapes such as lines and edges and that the kernels deeper in the network capture progressively more complex patterns and shapes. These intuitions can often be modified and adapted to fit other types of data. Neural networks grow more complex, but there are many interesting results about how these models function that can be interpreted using relatively simple models. The field of machine learning is growing, and it is therefore even more crucial that the behaviour of the fundamental functions that contribute to the models are well understood by its users.

In this thesis we will focus on simple convolutional neural networks applied to immune receptor data. The receptors of the immune system, which are made up of chains of amino acids, are used to recognize pathogens in the human body. The receptors try to connect with the epitopes of the pathogens, and an immune response will be triggered if a match is found. How these matches are decided and which parts of the receptor connects with the epitopes is an active area of research. The entire sequence of amino acids is not used to determine a match. A subset, or a motif, of the sequence binds with the antigen. Machine learning models are well suited to find patterns in these types of data. If a model is created, such that these machetes are predicted, and if the method used by the model can be sufficiently explained, then we can get closer to understanding how this biological process works.

In this thesis we will first explain the theory behind the methods used in the exper- imentation. In following, the methods themselves are described in more detail, before

(9)

the results of the testing are presented. We make an in-depth discussion of the main findings, before making the concluding remarks.

(10)

Theory

2.1 Machine learning

Supervised-, unsupervised- and reinforcement learning are the three main types of machine learning problems. Supervised learning requires labelled data which let the networks know when a false prediction has been made. Unsupervised learning does not require labels, and the models cluster data based on how similar their features are.

Reinforcement learning trains an agent in an environment based on rewards and pun- ishments. This thesis will be focused around convolutional neural networks, which are trained using supervised learning. The other types of learning will therefore not be covered in this theory section.

There has been a substantial amount of development in the field of machine learning the in last decade, and it has transformed the way companies and fields of science operate. New ideas and improvements to old ideas have been published in a rapid pace during the 2010’s. However, machine learning is considered by many to have its roots in a paper published in the 1940’s where McCulloch and Pitts proposed the first mathematical model of an artificial neuron [McCulloch and Pitts, 1943]. It modelled a simple neuron which gave a binary output based on its input variables and a threshold.

The next leap in progress happened in 1957 when Frank Rosenblatt [Rosenblatt, 1957] made several improvements to the artificial neuron proposed by McCulloch and Pitts. This model had adjustable weights, and was called a perceptron. Rosenblatt applied a supervised learning algorithm to the single layer perceptron, which could update the weights and learn from the input data. However, this single layer model could only solve linear problems and therefore had limited real world usage. Adding more layers and non-linear activation functions would allow models to solve non-linear problems, but there were no methods for updating the weights for a multilayered perceptron (MLP).

Progress in machine learning was unsteady in the following decades, but the development of backpropagation, activation functions and optimisers eventually made our current models possible. The final pieces of the puzzle that made the 2010’s an extraordinary decade for machine learning progress were computation power and avail-

(11)

ability of data. Powerful computer hardware had become relatively cheap. This allowed larger networks to train longer using less time than previously. Large amounts of data were also collected and stored which meant that it was much easier to train and test models.

Today, there exists a variety of artificial neural networks such as convolutional neural networks, multi layer perceptrons and recurrent neural networks. However, as mentioned above, CNNs are the main focus of this thesis. This type of network gets its name from the convolutional layers that have convolving kernels with learnable parameters. Most CNNs also contain dense layers, which are layers of nodes. Convolutional neural networks will be described in more detail later in the theory section.

In the following, we will go through the different elements that contribute to the functionality of a convolutional neural network (CNN). We will start by explaining an MLP which is a feed forward network that only uses dense layers. This will cover the dense layers of the CNN, as well as general concepts in supervised learning models.

2.1.1 One hot encoding

A vital part when training a machine learning model is how to represent the input data.

Image data is fairly intuitive to represent, as it is already represented by integers or floats which can easily be fed into a machine learning model. Other types of data are more complicated. For instance, natural language does not have a simple numerical representation. Several methods have been developed to transform data into a representation that can be interpreted by machine learning models. We will mostly be using one hot encoding in this thesis. One hot encoding transforms each categorical feature to a vector where all elements are 0 except for one element, which has a value of 1.

The location of the element containing the1value indicates which category the feature belongs to. This is a simple, yet robust method for representing categorical data.

An intuitive way of representing categorical data might be to assign an integer to each category. This can be seen in the first two columns of Table 2.1. The left column shows the amino acid alphabet, where each letter represents one type of amino acid. The middle column shows a possible integer representation. A problem with this representation is that it assumes that A (0) is closer, and therefore more similar to C (1) than for example T (16). This is an assumption that will impact the machine learning model, and we do not know if this assumption is correct. The right column shows a one hot encoded amino acid alphabet. We can see that this representation is more sparse, but it does not assume that any amino acid is more closely related than others. Our model will therefore be able to start learning without any preconceived similarities between the categories based on the encoding of the data. As a result, we represent our amino acid sequences as 2D-matrices consisting of ones and zeros.

2.1.2 Deep neural networks

A deep neural network is a network which contains several layers of nodes. The goal of the network is to estimate a given output that is a function of a number of inputs,

(12)

Amino acid Integer category One hot encoded

A 0 [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

C 1 [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

D 2 [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

E 3 [0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

F 4 [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

G 5 [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

H 6 [0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]

I 7 [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0]

K 8 [0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]

L 9 [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]

M 10 [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0]

N 11 [0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]

P 12 [0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]

Q 13 [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0]

R 14 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]

S 15 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]

T 16 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]

V 17 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]

W 18 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]

Y 19 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]

when trained using a supervised learning method [Russell and Norvig, 2009]. The output that the model tries to estimate is a class label (y) for classification problems.

The inputs get fed into the network and the model makes a prediction based on these inputs. The predictions are then compared to the correct labels, before the model makes adjustments based on the results. The purpose of the trained model is to estimate an unknown function that maps from the input data to the class label. We will first explain the components of the networks, before giving a detailed explanation of the methods that make the neural networks function.

Deep neural networks have multiple layers of neurons/nodes. Nodes are the modern version of the perceptron proposed by Rosenblatt in 1957. Figure 2.1 shows a diagram of a single node. x are the inputs to the node which are multiplied by their corresponding weights (w). The sum of these products along with the bias (b) gives the output of the node (z). The weights (w) and biases (b) are the adjustable parameters that we want our network to learn. The output gets passed through an activation function, f(z), whose output (a) gets passed further into the network. We will explain different activation functions in detail in Section 2.1.2. The activation of one node with three inputs can be written as

a=f

3

X

i=1

xiwi+b

!

(2.1)

(13)

x

₁

x

₂

x

₃

w

₁

w

₂

w

₃

b

z a

Node Activation

Figure 2.1: Diagram of a single node where xi is an input, wi is the weight xi is multiplied by andb is the bias value for the node. z is the output of the node andais the output of the activation function.

The inputs can be defined as a vector of features,X, with lengthˆ n, wherenis the number of inputs. The weights can also be defined as a vector, Wˆ, which in this case will have the same length as the input vector. This means that we can rewrite Equation 2.1 as

a=f

XˆWˆ +b

(2.2) This would be the output of the model if the model only contained a single node, but deep neural networks can contain hundreds or thousands of these nodes. This means that the inputs get passed to several nodes in the hidden layers, which are layers between the input and the output layer. Deep neural networks contain several hidden layers, and each node has their own bias and weights. Therefore, the weight vector becomes a weight matrix of size n×m, where m is the number of nodes in the given layer. For multiple nodes, Equation 2.2 can be written as

Aˆ=f

XˆWˆ + ˆB

(2.3) where Aˆ and Bˆ are the vectors containing the activation and bias for each node.

These layers can be fed into new layers which create a deep neural network. Figure 2.2 shows a diagram of a neural network with one hidden layer and two outputs. The values after the activation function of the output layer are the estimated outputs (˜y) of the model.

(14)

x₁

x₂

x_n

a₁ a₂ a₃

a_m

˜ y₁

˜ y₂

Hidden layer Output layer

Figure 2.2: Diagram of a neural network with one hidden layer and two outputs. Each node also contains the activation function and bias as opposed to Figure 2.1. y˜represents the estimated output of the model.

Feed forward

The goal of the network, as discussed in the previous section, is to generate outputs based on the inputs that are fed into the network. The information from the inputs get passed to the first layer, which generates activations and feeds them further into the network. This is done for all the layers until an output is estimated. We can rewrite Equation 2.1 for the activation of any node in any layer, such that

a^l_j =f

M

X

i=1

w^l_ija^l−1_i +b^l_j

!

(2.4)

whereldenotes the current layer andjdenotes the node in the layer. a^l−1_i represents the inputx_i for the first layer in the network and a^l_j is the output of the network if lis the last layer. M is the number of nodes in the previous layer. This can also be written in vector form as

Aˆ^l=f^l

Aˆ^l−1Wˆ^l+ ˆB^l

(2.5) Further, the node output of a network with one hidden layer can be written as

(15)

˜ y_k=f²



 X

j=1

w²_jkf¹ X

i=1

w¹_ijxi+b¹_j

! +b²_k



 (2.6)

where y˜ is the estimated output of the model. Equation 2.6 can also be written using vector notation, such that

Yˆ =f² f¹

XˆWˆ¹+ ˆB¹

Wˆ²+ ˆB²

(2.7) where Yˆ is a vector containing the prediction for each sample. This is how the information from the inputs get passed through the network to produce an output, but a crucial element that is missing is how the weights and biases are decided. We train the model to learn the parameters that give the best prediction. There are three more elements that are needed to optimise the parameters: a cost function, an optimising method and backpropagation.

Activation functions

Before we can start exploring the new elements, we need to explain the activation functions. There exists a variety of activation functions, and different functions can be used in each layer. Different activation functions have different strengths and weaknesses, but they all need to be differentiable since the backpropagation passes gradients backwards through the network. Backpropagation will be described in detail in Section 2.1.2. The non-linear activation functions, when combined in more than one layer, is what makes the models able to learn non-linear problems. We use sigmoid activation in the output layer and ReLU activation in the hidden layers, and they will be described in the following.

The sigmoid function has been a popular activation function for many years. This is a function that reduces any number to a value between 0 and 1, and it is a non- linear and differentiable function. The sigmoid activation is an intuitive step when considering that the original artificial neuron give a binary output. It has almost the same characteristics, but with added benefits. The function itself can be written as

f(z) = 1

1 +e^−z (2.8)

Figure 2.3 shows the output of the function for input values x ∈ [−7,7]. This function has a simple derivative

df(z)

dz =f(z)(1−f(z)) (2.9)

A problem with the sigmoid function is that it might saturate. The sigmoid function will limit the activation tof = 1if weights get too large, and therefore generates a large output from a node. Then the gradient of the function will be close to 0, which will severely hinder the learning of the network. This problem becomes even more prevalent

(16)

Figure 2.3: Illustration of the sigmoid function forx∈[−7,7].

if a sigmoid function is used in several layers. This is called the vanishing gradients problem. The rectified linear unit (ReLU) fixes some of these problems. The function can be written mathematically as

f(z) =max(0, z) (2.10)

This means that the function is linear for values above 0, and it gives 0 as an output for values below0. Figure 2.4 shows a representation of the ReLU function, and Equation 2.11 shows the derivative of the function.

df(z) dz =

(0 if z <0

1 if z >0 (2.11)

This derivative is not defined for0, but it can in practice be set to either0or1. We can see how this function does not saturate, and therefore does not cause the gradients of nodes with large outputs to become small. This improves learning in deep neural networks. We can also see that a problem might occur when nodes output a value less than 0. This will cause the gradient to be 0 and effectively stopping the node from adjusting its weights, thus killing the node. However, this might be less of a problem for learning as long as the gradients can be passed through other nodes in the layer as hypothesised by Glorot et al. [2011]. Regularization in the network is often beneficial when using ReLU as it has no upper bound to its output, and it can therefore continue to grow.

Cost functions

We want to train our network by adjusting the weights and biases, and we have previously mentioned that this is done by comparing the predictions against the correct labels. Cost functions are how we measure this error. Loss and cost are sometimes used interchangeably, but an often used distinction is that loss are for one sample and the cost are over the entire training set. The value of the cost is large when the model is a

(17)

Figure 2.4: The ReLU function for x∈[−7,7].

poor predictor and low when the predictions are good. This means that we can try to minimize the cost with regards to the weights and biases of the network. The binary cross entropy cost function is a function for binary classifiers. It can be written as

C(yi,y˜i) =−(y_ilog( ˜yi) + (1−yi)log(1−y˜i)) (2.12) where y_i is the true label andy˜_i is the predicted label. Equation 2.12 can give an intuitive understanding of the function. yi can either be 0or 1for a binary prediction problem. We wanty˜i to be as close toyi as possible. We also want our loss to be high when the prediction is far form the correct label, and low if it is close to the correct label. y˜_i is also a value between 0and 1, which can be interpreted as a probability for the sample to be labelled as1. The closer the value is to 1, the larger the likelihood of 1being the correct label. If the correct label isy_i = 1, then we get

C(1,y˜i) =−log( ˜yi) (2.13) We can see from this that the closer y˜_i is to0, the larger the loss will be. Further, we have that

C(0,y˜_i) =−log(1−y˜_i) (2.14) if the correct label is0. This also shows that the loss gets higher ify˜i is closer to1.

We use the average when finding the cost of all the samples C( ˆY ,Y˜ˆ) =−1

N

X

i=1

y_ilog( ˜y_i) + (1−y_i)log(1−y˜_i) (2.15) whereY˜ˆ contains predictions for theN samples and Yˆ contains all the true labels.

For non binary, single label predictions we can use the cross entropy cost function, which can be written as

(18)

C( ˆY ,Y˜ˆ) =−1 N

N

X

i=1 O

X

o=1

yi,olog(˜yi,o) (2.16)

where O is the number of possible labels. The loss function also assumes that the values of y˜can be interpreted as the probability of the label being correct. It also assumes that all the predicted labels sum to1, and it is therefore used together with the softmax activation function. The softmax activation function is an activation function where the output of the nodes gets mapped to probabilities that sum to one.

Backpropagation

After the network has passed the input through the nodes and activation functions and the error in the predictions have been quantified using the cost function, the error has to be passed backwards to the correct nodes. The goal of the backpropagation is to spread the prediction error to the nodes that contributed to the error. This is done by passing the gradient of the cost function backwards and updating the weights based on how much the activation contributed to the error. We pass the gradient of the cost w.r.t. to weights because we want to minimize the loss by adjusting the weights in the network. Hence our goal is to find dC/dw_jk^l , wherew_jk^l is the k’th weight for the j’th node in layer l. If we apply the chain rule to this derivative we get

dC

dw^L_jk = dC da^L_j

da^L_j dz_j^L

dz_j^L

dw_jk^L (2.17)

where L represents the last layer of the network. This means that da^L_j = ˜y for a network with one output node. The first two terms on the right hand side are often called the error term, and can be written as

δ^L_j = dC da^L_j

da^L_j

dz^L_j (2.18)

In the following example, we will use binary cross entropy as our cost function, a sigmoid activation function for the last layer and the ReLU activation function for the other layers. The first derivative on the right hand side of Equation 2.17 tells us how the activation of a node affects the cost function. For one sample of our binary cross entropy function we get

(19)

dC d˜y = d

d˜y −(ylog(˜y) + (1−y)log(1−y))˜

=− y

˜

y −1−y 1−y˜

=−y

˜

y + 1−y 1−y˜

= y(1˜ −y)−y(1−y)˜

˜ y(1−y)˜

= y˜−y

˜ y(1−y)˜

(2.19)

The second term on the right hand side of Equation 2.17 tells us how the activation function is impacted by our node output. For our output layer, which uses a sigmoid activation, we get

d˜y

dz^L = ˜y(1−y)˜ (2.20)

using Equation 2.9. The final derivative on the right hand side of Equation 2.17 can be found using part of Equation 2.4. This gives us

dz^L

dw^L_k =a^L−1_k (2.21)

Combining these results with Equation 2.17 we get dC

dw^L_k = y˜−y

˜

y(1−y)˜ (˜y(1−y))a˜ ^L−1_k

= (˜y−y)a^L−1_k

(2.22)

where (˜y −y) = δ^L. This makes intuitively sense since we should use the error in the prediction and how strongly a node was activated to adjust the value of the weight. However, this only gives the gradients for the final layer weights, and a deep neural network has several layers. We therefore want to pass our error back through the network. The error of a node in the hidden layer will be the sum of the errors of the following layer multiplied by the connecting weights. The error of nodej in layerl can be written as

δ_j^l = dC z^l+1

dz^l+1 da^l_j

da^l_j

dz_j^l (2.23)

We can see the the first term on the right hand side is the error of the l+ 1 layer.

The second term on the right hand side of Equation 2.23 for nodek⁰ will be

(20)

z^l+1_k0

da^l_j0

= d da^l_j0





M

X

j=1

w_j,k^l+10a^l_j +b^l+1_k0





=w^l_j⁰_,k⁰

(2.24)

since the derivatives of the function where j 6=j⁰ will be 0. M will in this case be the number of nodes in thel’th layer. If we combine these equations and keep in mind that we need the errors for all the nodes in the l+ 1layer we get

δ^l_j =X

k

δ^l+1_k w^l+1_kj da^l

dz^l (2.25)

where k iterates over all the nodes in the l+ 1 layer and w_kj connects node k in layer l+ 1 to node j in layer l. da^l/dz^l will be either 0 or 1 for the layers with ReLU activation, as seen in Equation 2.11. Our gradient w.r.t. any node, using the error term from Equation 2.23 and derivative from Equation 2.21, will therefore be

dC

dw^l_jk =δ_j^la^l−1_k (2.26)

Optimisers

The final part needed in order to train the network is a method to use the backpropag- ated gradients to update the weights. We use two different methods in this thesis.

The first is the commonly used stochastic gradient decent (SGD) and the second is the increasingly popular Adam optimiser. Both methods will be described in the following.

The goal of the SGD method is to follow the negative gradient of the cost function to a theoretical minimum. We therefore update our weights using the calculated error/gradients. This can be written as

w^l_jk ←w^l_jk−ηδ_j^la^l−1_k (2.27) where η is the learning rate of the network. The learning rate is a parameter that can be adjusted to control how fast the network learns. This is done by managing how large the weight updates are, and this can impact the network in several ways.

A learning rate that is too low will make the network train for a long time. On the other hand, a learning rate that is too large could make the network unable to stay in optima. The update of the bias is almost the same as Equation 2.27, except that it is not dependent on the inputs of the node, which gives us

b^l_j ←b^l_j−ηδ_j^l (2.28)

Equations 2.27 and 2.28 show the update for one sample, which means that the weight is updated for each sample that is passed to the network. This is causes the model

(21)

to move in the direction that each sample indicates, which again leads to the network moving in several different directions in the search space when training. Another option is to train the network after all the training data has been passed through the network, which is often called batch gradient decent. This will cause the network to move in the direction indicated by all the data. However, this method is more prone to get stuck in local optima. Thus, it does not explore the search space as much as the previous method. The best result often comes from combining these methods, and it is called mini batching. It involves sending batches of data into the network before updating the weights. This means that our gradients for each weight are the average of all the gradients in the batch. Equations 2.27 and 2.28 then become

w_jk^l ←w^l_jk− η m⁰

m⁰

Xδ_j^la^l−1_k (2.29)

and

b^l_j ←b^l_j− η m⁰

m⁰

Xδ_j^l (2.30)

wherem⁰ is the number of samples in the batch. The SGD optimiser is often used and has several improvements that can be added to it. Momentum can be added to the SGD optimiser to improve its performance, and it works similarly to how it does in classical physics. It ensures that the network moves faster when several updates are in the same direction. Adam employs certain improvements to the SGD optimisers.

It has an adaptive learning rate and momentum. The adaptive learning rate allows the optimiser to increase or decrease the learning rate based on the sparsity of the parameter. An update of a weight with a single sample can be written as

w_jk^l ←w^l_jk− η

√ ˆ

v+mˆ (2.31)

where mˆ is the bias corrected first moment estimate, vˆ the bias corrected second raw moment estimate and avoids division by 0. mˆ and vˆ needs to be bias corrected since the moving averages are initialized as0, and this leads to the moment estimates being biased towards0. We can see that the learning rate is adjusted by using mˆ and ˆ

v, and they can be calculated using the following expressions

m←β1·m+ (1−β1)δ_j^la^l−1_k ˆ

m← m

1−β₁^t

(2.32)

and

v←β2·v+ (1−β2)(δ^l_ja^l−1_k )² ˆ

v← v

1−β₁^t

(2.33)

(22)

whereβ1andβ2 are parameters that can be adjusted when setting up the optimiser.

They are typically set to 0.9 and 0.99, respectively. β₁^t and β₂^t are β₁ and β₂ to the power of the time step. m is the first moment (mean) estimate andv is the second raw moment (uncentred variance) of the gradient. They are updated using the exponentially weighted moving averages of the gradient and squared gradient, respectively. Kingma and Ba [2017] showed that Adam performed better and converged faster than many popular optimisers.

2.1.3 Regularization L2-regularization

We use L2-regularization in our machine learning models. L2 is an often used form of regularization, and it is implemented by adding

λX

i

w²_i (2.34)

to our cost function. This results in a larger cost if the weights are large. λ is a parameter that can be used to adjust how strong the regularization is. The model will therefore be forced to reduce the size of its weights when minimizing the cost function. This will in turn reduce the amount of overfitting a model can do since weights cannot grow indefinitely, and therefore put too much emphasis on a relatively small subset of features in the dataset. We used regularization in our networks in order to test how different magnitudes of regularization impacted the accuracy and motif detection capabilities of the models. Many popular machine learning libraries use the term weight decay when describing L2-regularization. However, there are differing opinions regarding how the terms should be used as described by Loshchilov and Hutter [2019].

Dropout

The dropout method is another popular form of regularization. This does not involve directly adjusting the cost function or limiting the learned weight. The method ran- domly ignores nodes, with a probability p, during the training of the network, but no nodes are dropped during evaluation. This necessitates scaling of activations to adjust for the extra nodes during evaluation. The scaling can be done during the training process or during the evaluation. The dropout method hinders overfitting by preventing co-adaptations in the network. Co-adaptations could lead to feature detectors in the network being too reliant on other feature detectors when making predictions. Hinton et al. [2012] show that the dropout method is an effective regularization technique.

2.1.4 CNN

Almost all CNNs contain dense/fully connected layers as described in the previous section, but what sets them apart from other neural networks are the convolutional

(23)

1 2 3

4 3

1 1

1 1 12

1 1 1

1

Input data Kernel Kernel output

Figure 2.5: An illustration of a kernel convolving over an input matrix. The dot product between the kernel and the green field in the input data results in the first value of the output matrix. This is done three more times, while the kernel moves after each calculation of the dot product, to produce the output matrix.

layers, which consists of kernels. Filters and kernels are sometimes used interchangeably, but in this thesis we will use kernel when describing the matrices that perform the convolution. The weights in the kernels are adjustable, which get updated in a similar way to the ones in the dense layers, but there are certain differences when it comes to how the errors in the network are calculated. These differences will be described in the following.

CNNs are often used on image data, and a lot of the literature uses images as examples. One interpretation is that the kernels learn patterns in the images. Kernels in the first layer will learn simple features such as edges, which further combines into patterns, and layers deeper into the network combines the patterns into more complex shapes [LeCun et al., 2015]. These complex shapes can then be looked for in the image, and a classification can be done based on the combination of shapes found in the image.

Convolutional layers

We mentioned previously that the kernels are matrices. These matrices are passed, or convolved, over the input data. Figure 2.5 shows a2×2kernel passing over a3×3input and producing a2×2output matrix. The first step takes the dot product between the kernel and the four upper left elements in the input data. The kernel will take one step to the right and repeat the process. This is done for all positions in the input matrix.

It is possible to pad the input data with a given value to change the dimensions of the output, but this is not done in this thesis. There are also adjustments that can be made to the convolutions. The kernel can take longer steps between each calculation, but we have used a step size of1in our experiments. These parameters are often called padding and stride.

(24)

1 × 20 × 12 feature map

Convolution with 3

1 × 5 × 5 kernels Convolution with 5 3 × 3 × 3 kernels

3 × 16 × 8

feature map 5 × 14 × 6

feature map

Figure 2.6: An example of how the kernels impact the shape of the inputs. The figure shows a 1×20×12 feature map that passes through a convolutional layer with three 1×5×5 kernels, which produces a 3×16×8 feature map. This feature map is fed through a layer with five3×3×3kernels, which finally generates a 5×14×6feature map.

The inputs to a convolutional layer can sometimes have more than one channel.

Image data often have three channels which represent the strength of the reds, greens and blues in the image. A kernel must always match the number of input channels such that there is one convolution for each channel. However, this kernel will only produce a single channel for the output. Several channels in the output can be generated by having several different kernels convolve over the input. Figure 2.6 shows how the kernels change the sizes of the feature maps. This figure assumes no padding and a stride equal to1.

We also use max pooling in this thesis. Max pooling slides a window of a given size over the input and only outputs the largest value inside the window. We use global max pooling, which means that the size of the sliding window is equal to the size of the input. Thus, only one output will be generated from each matrix passed through the global pooling layer. We therefore get a single value to represent the activations of the entire kernel, and we can use this value to evaluate if the learned motif in the kernel was found in the sequence [Zeng et al., 2016]. Figure 2.7 shows an example of a global max pool layer.

The convolutional and pooling layers extract features from the inputs. These ex- tracted features are then used by the dense layers to make a class label prediction [Yamashita et al., 2018]. It is possible to choose layers and kernels such that a prediction can be made from using convolutional layers only. However, it is more common to feed the feature maps, after a number of convolutional layers, into dense layers.

(25)

7

1 3

2 4 7

5 1 2

5

Input data Max pool Max pool output

7

Figure 2.7: The max pooling layer evaluates all the values inside its window, which will encompass the entire kernel when using global max pooling. The largest value is passed further into the network.

Backpropagation through convolutional layers

The goal of backpropagation in convolutional layers are the same as for dense layers. We want to update our weights in a way that minimizes the cost function. The key difference between dense layers and convolutional layers is the fact that the layers convolve over the input. This is also how the gradients get passed backwards through the network.

We first need to look at the notation of the forward pass of the network which can be written as

z_k,g^l =X

i

X

j

w_i,j^l a^l−1_k+i,g+j +b^l (2.35)

where k and g are the indices of a single output of the convolution. i and j are the indices of the kernel. In this case we assume that a^l−1 have one channel (if it had more, we would need another summation). We can see that this expression is similar to Equation 2.4, but, as previously discussed, it has quite a different impact on the network. We also apply an activation function to the output which we write as

a^l_k,g =f(z^l_k,g) (2.36)

The activation function is applied element wise to the output. Our goal is still to update the weights in a way that minimizes the cost function. We can adjust Equation 2.17 and write the derivative as

dC

dw_i,j^l = dC da^l_k,g

da^l_k,g dz_k,g^l

dz_k,g^l

dw_i,j^l (2.37)

This equation tells us how this particular weight impacts the output, how the output impacts that activation and how the activation impacts the cost function. We know

(26)

from the previous backpropagation section that we need to sum the errors from the layer in front. We also know that the first two derivatives on the right hand side can be written asδ_k,g^l . We therefore rewrite Equation 2.37 as

dC dw_i,j^l =

K

X

k G

X

g

δ_k,g^l dz_k,g^l

dw^l_i,j (2.38)

where K and G are the size of the output of layer l. If we take the derivative of Equation 2.35 w.r.t. w_i^l0,j⁰ we get

dz_k,g^l dw^l_i0,j⁰

= d

dw^l_i0,j⁰



 X

i

X

j

w^l_i,ja^l−1_k+i,g+j+b^l





=a^l−1_k+i0,g+j⁰

(2.39)

We get this simple derivative since the derivatives of the sums of the weights in the kernel that does not have thei⁰, j⁰ indices are equal to0. Combining Equation 2.38 and 2.39 we get

dC dw_i^l0,j⁰

=

K

X

k G

X

g

δ_k,g^l a^l−1_k+i0,g+j⁰ (2.40) which we can interpret as taking the sum of the errors that the weight contributed to multiplied by the input of that weight. Next, we need to know howδ is calculated.

This is similar to the way it is done in dense layer. We still need to finddC/dz_k^l0,g⁰, and we calculate this by finding the sum of the errors in the previous layers (such as we did with dense layers) and multiplying this with the derivative of the output of the node w.r.t. the output of the previous node. Figure 2.8 shows which elements of the output an input element contributes to, given that there is no padding and the stride is equal to 1. We can see from this figure that δ^l+1_1,0, δ^l+1_1,1, δ^l+1_2,0 and δ_2,1^l+1 needs to be considered when finding the error term for z_2,1^l . They also need to be multiplied by the correct weights to determine how muchz^l_2,1 contributed to each δ.

We want the error to be the derivative of the cost function with respect to the output of the node before the activation function. The following notation used for the backpropagation was based on the notation used by Jefkine [2016]. We can write what is shown in the Figure 2.8 as a sum, such as

dC dz_k^l0,g⁰

=δ^l_k⁰_,g⁰

=

M−1

X

m=0 N−1

X

n=0

dC dz_k^l+10−m,g⁰−n

dz_k^l+10−m,g⁰−n

dz_k^l0,g⁰

=

M−1

X

m=0 N−1

X

n=0

δ_k^l+10−m,g⁰−n

dz^l+1_k0−m,g⁰−n

dz_k^l0,g⁰

(2.41)

(27)

Output of layerl Kernel with weightsw^l+1 Error for each element in layerl+ 1 z_2,1^l

w_0,0^l+1 w_0,1^l+1 w_1,1^l+1 w_1,0^l+1

δ_1,0^l+1 δ_2,0^l+1

δ_1,1^l+1 δ_2,1^l+1 4 × 4

2 × 2

3 × 3

z_2,0^l z_3,0^l z_3,1^l z_1,0^l z_0,0^l

z_2,2^l z_3,2^l z_1,2^l z_0,2^l

z_2,3^l z_3,3^l z_1,3^l z_0,3^l z_1,1^l

z_0,1^l

δ_0,0^l+1 δ_0,1^l+1 δ_1,2^l+1 δ_2,2^l+1 δ_0,2^l+1

Figure 2.8: The input elementz_2,1^l contributes the marked errors in the following layer.

It contributes to δ^l+1_1,0, δ_1,1^l+1, δ_2,0^l+1 and δ_2,1^l+1 when multiplied by w^l+1_1,1 , w_1,0^l+1, w_0,1^l+1 and w_0,0^l+1, respectively.

This gives us the error for one particular output in layer l whereM and N in this case are the size of the kernel between layerl and l+ 1. We can see how this equation will sum over the correctδs for the example in Figure 2.8. Finally, we compute

dz^l_k0,g⁰

= dz_k^l+10−m,g⁰−n

da^l_k0,g⁰

dz_k^l0,g⁰

(2.42) We can writedz_k^l+10−m,g⁰−n in Equation 2.41 as

z^l+1_k0−m,g⁰−n=

M−1

X

m⁰=0 N−1

X

n⁰=0

w^l+1_m0,n⁰a^l_k⁰_−m+m⁰_,g⁰_−n+n⁰+b^l+1 (2.43) Figure 2.9 helps to visualise this equation. We will use the values from the figure to explain the equations. We are still interested in getting the error term for elementz^l_2,1. This output contributes to four elements in z^l+1. We therefore need to match these errors toz_2,1^l multiplied by the correct weight. If we are interested to see which weights are multiplied with which elements formz^l, we can use Equation 2.43. For elementz_2,0^l+1 we get

z_2−0,1−1^l+1 =w^l+1_0,0a^l2−0+0,1−1+0

+w^l+1_0,1a^l2−0+0,1−1+1

+w^l+1_1,0a^l2−0+1,1−1+0

+w^l+1_1,1a^l2−0+1,1−1+1+b^l+1

(2.44)

which gives us

(28)

Activation of layerl Kernel with weightsw^l+1 Output of layerl+ 1 a_2,1^l

w_0,0^l+1 w_0,1^l+1 w_1,1^l+1 w_1,0^l+1

z_1,0^l+1 z_2,0^l+1

z_1,1^l+1 z_2,1^l+1 4 × 4

2 × 2

3 × 3

a_2,0^l a_3,0^l a_3,1^l a_1,0^l a_0,0^l

a_2,2^l a_3,2^l a_0,2^l

a_2,3^l a_3,3^l a_1,3^l a_0,3^l a_1,1^l

a_0,1^l

z_0,0^l+1 z_0,1^l+1 z_1,2^l+1 z_2,2^l+1 z_0,2^l+1 a_1,2^l

Figure 2.9: Elements a^l_2,0, a^l_2,1, a^l_3,0 and a^l_3,1 gets multiplied by w^l+1_0,0 , w^l+1_0,1, w_1,0^l+1 and w^l+1_1,1 . z_2,0^l+1 is the sum of these products.

z_2,0^l+1=w^l+1_0,0a^l_2,0 +w^l+1_0,1a^l_2,1 +w^l+1_1,0a^l_3,0 +w^l+1_1,1a^l_3,1+b^l+1

(2.45)

We can see that this matches the marked dot product in the Figure 2.9. This shows us that the activation of z^l_2,1 gets multiplied byw_0,1^l+1 when contributing to z^l+1_2,0. The next step is to take the derivative of Equation 2.43, which we write as

dz^l_k0,g⁰

= d

dz_k^l0,g⁰ M−1

X

m⁰=0 N−1

X

n⁰=0

w_m^l+10,n⁰

da^l_k0−m+m⁰,g⁰−n+n⁰

dz_k^l0−m+m⁰,g⁰−n+n⁰

+b^l+1

!

=w_m,n^l+1da^l_k0,g⁰

dz^l_k0,g⁰

(2.46)

The expression gets shortened since the derivatives become 0 when m 6= m⁰ and n6=n⁰. We place the derivative into Equation 2.41 and get

δ_k^l0,g⁰ =

M−1

X

m=0 N−1

X

n=0

δ^l+1_k0−m,g⁰−nw^l+1_m,nda^l_k0,g⁰

dz_k^l0,g⁰

(2.47) where the final derivative is the derivative of our activation function. We can see that using the indices for getting δ_2,1^l gives us matching values as in Figure 2.8. We now have the gradient for any given weight and the way to pass the errors through the

(29)

network. The final step will be to update the weight based on the chosen optimiser as described in Section 2.1.2.

We use global max pooling in our network, which impacts the backpropagation as well. The only change that needs to be done is that the error will only be passed to the element that had the highest value before the max pooling, since it was the only one that contributed to the error. The highest scoring node gets registered during the forward pass.

2.1.5 Other machine learning methods

Three other machine learning methods were used in this thesis. These methods were not analysed in detail, but their prediction result were compared with the results from the CNNs. These models were also tested with k-mer frequency encoding. A k-mer is a string of amino acids of length k. The sequences are then represented by the frequencies of the k-mers found in the sequences before being passed to the models.

Logistic regression

The logistic regression method is one of the simplest models which uses supervised learning. It can be though of as a single layer neural network where the inputs get fed directly to the output node. This means that this method is a linear classifier, which uses a weight for each input feature, a bias and a sigmoid activation function. The logistic regression function can be written as in Equation 2.2 where f is the sigmoid activation function.

Support vector machine

The support vector machine (SVM) is a machine learning method that aims to define a classification line based on support vectors. The support vectors are the samples in the dataset that are the most difficult to classify. These samples are used to create a decision boundary such that the distance between the line and the support vectors are maximized. An SVM only using this boundary to classify data is by itself a linear classifier, but different functions can be used to map the input data into a higher dimension, such that non linear classification can be done with a linear boundary. However, the SVM models used in this thesis all use linear classification functions.

K-nearest neighbours

K-nearest neighbours (KNN) classifies a sample based on its k-nearest neighbours. This method does not attempt to find an underlying generalizable function, but it simply stores the training data to use for classification. New data is then classified based on the majority class of its k-nearest neighbours. Different distance metrics can be used to define neighbour distances.

(30)

2.2 Immune system

The immune system is a vital part of keeping us healthy. It is responsible for removing dead cells and it protects us against pathogens. The immune system can be divided into three parts. The first part is the immune barriers which try to stop pathogens from entering the body. The second part is the innate immune system which tries to kill the microbes using inflammation, complement and non-specific cellular responses. This part of the immune system does not adapt specifically to different pathogens. Another role of the innate immune system is to stimulate the adaptive immune system. The adaptive part of the immune system is what will be described in this thesis.

The adaptive immune system uses B- and T-cells to detect antigens, and a response is triggered when a pathogen is detected. The adaptive immune system can be divided into humoral immunity and cell-imidiated immunity. B-cells use secreted antibodies to detect microbes and toxins that exist outside of cells. These antibodies can mark microbes for destruction which increases the effectiveness of phagocytes, which are cells that ingest harmful particles. The antibodies will not be effective if the microbes enter cells. However, that is when the cell-mediated immunity is effective. The T-cells detect the epitopes of the antigen in the form of peptides (short sequences of amino acids) that are displayed on the infected cell’s surface, and reacts accordingly. The antibodies can detect many types of chemical structures, but the receptors of the T-cells can only detect peptides displayed on major histocompatibility complex (MHC) molecules. We will mostly discuss the cell-mediated immunity in this thesis, but there are several similarities between the systems.

2.2.1 Activation of the adaptive immune system

The adaptive immune system requires an activation signal from the innate immune system (in the form of molecules) and that the naive B- and T-cells are stimulated by antigens to be initiated. Naive B- and T-cells are cells that have not yet encountered a matching antigen. Due to the immunologic memory, a faster response will be mounted if the antigen has been encountered before. The immunologic memory will be discussed in more detail in Section 2.2.3.

The T-cells require that APCs (antigen presenting cells) present a peptide from the antigen on an MHC molecule to get activated. B-cells can detect antigens presented by APCs, but they can also detect antigens in different forms. Dendritic cells, macrophages and B lymphocytes are all types of cells that can present the antigen peptides to T- cells. Macrophages phagocytose (ingest) microbes and then display the antigens on their cell surface using MHC. A T-cell can then detect the antigen and then stimulate the macrophage to kill the microbe. The B lymphocytes can display an antigen peptide to a helper T-cell. The helper T-cell can then stimulate the humoral immune response.

The dendritic cells are the most important APCs to stimulate naive T lymphocytes, and they can be found in different parts of the body. A dendritic cell detaches itself when they encounter and absorb a microbe before being transported to lymph nodes.

The microbe is processed during this migration such that the dendritic cell can present