Introducing an Efficient Approach for Expressing Uncertainty in Deep Learning with Bayesian Neural Networks


Stochastic Target Metropolis-Hastings

Edward F. Bull

Thesis submitted for the degree of Master in Data Science

60 credits

Department of Mathematics

The Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO


http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Abstract

Background: Markov chain Monte Carlo (MCMC) methods are not commonly used for deep learning because of the computationally heavy Metropolis-Hastings (MH) test. MCMC methods can give more information in their predictions than variational inference. Different approximations of the MH-test have produced algorithms that are tractable for deep learning but still inefficient.

Objective: This exploratory thesis will examine what makes different approximate MCMC methods efficient.

Results: This work introduces a method called stochastic target Metropolis-Hastings (STMH). Because it uses gradients to approximate the target in the MH algorithm, rather than approximating the MH-test, STMH is a faster method. When compared to stochastic gradient descent (SGD), STMH is not overconfident. It is also more robust against overfitting and retains goodness-of-fit over time. Lastly, STMH is both more stable and more versatile than SGD.

Discussion: The algorithm shows promising results for more efficient Bayesian deep learning and presents new opportunities for MCMC methods to be applied to complex problems. As the method uses some theoretical properties of optimal efficiency for the MH algorithm, it may be more efficient than the Stochastic Gradient Langevin Dynamics method, which completely omits the MH test.

Conclusion: Initial studies of STMH show promising results. Further advancements of STMH should be considered for future work.


Acknowledgements

I would first like to thank my supervisor, Professor Anne H. S. Solberg, whose expertise was essential for the focus and methodology of this thesis.

Your insightful feedback pushed me to sharpen my thinking and elevate my work. I am grateful for your encouragement. I would also like to thank my co-supervisor Geir Storvik for good discussions and co-supervisor Odd Kolbjørnsen for his insights.

I am grateful to my tutors, specifically Fritz Albregtsen and Kristine Hein, for their valuable guidance throughout my studies.

I would like to thank my peers for cheering me on throughout my university years. I would like to thank Erik Bolager for his motivating dedication to the field, and Marius Aasan for his inspiring projects.

I would also like to thank my friends and family. I could not have completed this thesis without the wise counsel and sympathetic ear of my parents. You are always there for me. Lastly, I thank my girlfriend for her encouragement and support throughout this thesis.


Introduction

Neural networks have changed a lot since their inception. Improvements made in the last decade have made neural networks applicable to a multitude of tasks (He et al., 2015; LeCun et al., 2015; LeCun et al., 1990; Wu et al., 2016). As the models improve they take on more complex tasks, which requires the models to express uncertainty. For real-world tasks, knowing the predictive uncertainty is essential (Kendall & Gal, 2017). This has motivated the use of Bayesian principles in deep learning (Gal, 2016). Markov chain Monte Carlo (MCMC) methods have been the gold standard for Bayesian methods; however, a deep learning model can have millions of weights (LeCun et al., 2015), which makes MCMC methods computationally intractable. Other methods such as variational inference (VI) and approximate MCMC methods have been developed to solve the computational problem (Kendall & Gal, 2017; Seita et al., 2016; Welling & Teh, 2011). However, VI suffers from too little exploration (Wilson & Izmailov, 2020), and MCMC methods for neural networks that have tried to approximate the Metropolis-Hastings (MH) test (Bardenet et al., 2017; Seita et al., 2016; Zhang et al., 2020) or even get rid of it (Welling & Teh, 2011) are still too inefficient (Gal, 2016).

The objective of this exploratory thesis is to look into what makes different approximate MCMC methods efficient. By combining components from different works and making assumptions, the thesis arrives at a new approach for approximate MCMC methods. A new method called stochastic target Metropolis-Hastings (STMH) is proposed in Chapter 4. STMH combines good aspects of similar work and introduces an approximation of the target distribution instead of the MH-test. This new method will be compared to existing methods such as stochastic gradient descent, stochastic gradient Langevin dynamics and a variant of the Metropolis-Hastings algorithm that approximates the MH-test. Implications and possible advancements of STMH will be discussed.


1.1 Outline

Chapter 1 - Introduction: Introduces this thesis.

Chapter 2 - Introduction to Neural Networks: Briefly goes through the composition of a basic neural network.

Chapter 3 - Bayesian Neural Networks: Introduces Bayesian methods and how they are used in neural networks.

Chapter 4 - Methodology: Introduces a new algorithm.

Chapter 5 - Experimental Setup: States the models and hyperparameters used for the two experiments.

Chapter 6 - Results and Discussion: Compares the proposed method and some variants against other methods.

Chapter 7 - Conclusion: A conclusion of the research.

1.2 Own Contributions

Algorithms that approximate the Metropolis-Hastings test have focused on using as little data as possible to get a reliable MH-test (Bardenet et al., 2017; Seita et al., 2016; Zhang et al., 2020). To the author's knowledge, approximating the target distribution in a Metropolis-Hastings algorithm is a new approach for approximating MCMC.


Chapter 2

Introduction to Neural Networks

The first idea of a neural network came in 1943, when McCulloch and Pitts (1943) borrowed ideas from psychology and neuroscience. They proposed a set of assumptions that constituted what is known as the first neural network. Even though it is inspired by the brain, the learning procedure of a neural network does not resemble how the brain learns (Rumelhart et al., 1986). In 1958, Rosenblatt showed that a perceptron could learn more than one thing simultaneously. This raised the problem of finding a set of weights that yields different outputs for different classes of input.

Rumelhart et al. (1986) introduced an algorithm where the neural network automatically updates its own weights from observing input data and desired output data, called learning. Further research has shown that a multi layer perceptron can approximate any function (Cybenko, 1989).

These papers have had a great impact and made neural networks gain a lot of traction. Research has produced a vast number of different variations of neural networks. Today, state-of-the-art neural networks have surpassed human accuracy for certain tasks (He et al., 2015; Wu et al., 2016).

2.1 Perceptron

The perceptron is a computational unit introduced by Rosenblatt (1958).

The perceptron is a weighted sum of inputs.

$$ y = \sum_j x_j w_j. \tag{2.1} $$

In addition to weighting all the inputs, a bias is added. For simplicity the bias is treated as the first weight w₀, which always receives the input x₀ = 1. The perceptron produces a sum based on the weights w_j, and the idea is that a set of weights will produce very different sums depending on the input x_j. For a linearly separable problem a threshold can be found that separates two classes of input (Rosenblatt, 1958). The sum of a perceptron is transformed to a 0-1 output with a threshold. This transformation is called an activation function.
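As an illustration of Equation (2.1) followed by a threshold, the sketch below implements a single perceptron in plain Python. The function name `perceptron`, the threshold argument `theta` and the example weights are assumptions made for this example; they are chosen so that the unit behaves like a logical AND and are not taken from Rosenblatt (1958).

```python
def perceptron(x, w, theta=0.0):
    """Weighted sum of the inputs (Equation (2.1)) followed by a 0-1 threshold.

    Both x and w include the bias term: x[0] is fixed to 1 and w[0] is the bias.
    """
    s = sum(x_j * w_j for x_j, w_j in zip(x, w))
    return 1 if s > theta else 0


# Hypothetical weights that separate the input (1, 1) from the other three inputs.
w = [-1.5, 1.0, 1.0]  # w[0] is the bias
outputs = [perceptron([1, a, b], w) for a in (0, 1) for b in (0, 1)]
print(outputs)
```

Only the input (1, 1) gives a sum above the threshold, so the two classes are linearly separable with these weights.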


2.2 The Neural Network Model

A neural network consists of several different components. The first component is an input layer. The last component is an output layer. The layers between the input and output layer are called hidden layers. All layers consist of multiple neurons, which are also called computational units. However, the units in the input layer as well as the biases are not regarded as computational units. Another component is the activation function, which is applied after a neuron has calculated a linear combination, Equation (2.1), of its own inputs. The activation is the output of a unit. The output is propagated forward to the units in the next layer, which also calculate a linear combination of the outputs from the previous layer and a set of weights for each unit. The weights and biases together, often called parameters, are responsible for transforming the input to the output of the final layer. The objective is to find a set of parameters that creates the best predicting model for a given architecture.

All these components combined form the foundation of a neural network.

2.2.1 Formal Definition

We are given a set of training data D = {(x₁,y₁),(x₂,y₂), ...,(x_n,y_n)}, where x_i ∈ ℝ^p and y_i ∈ Y. There are two common types of problem to solve: the regression problem, where the model produces a real number for a desired variable (e.g. housing price), and the classification problem, where the model outputs a class from a set of classes (e.g. type of animal). Hence, Y has two different definitions depending on the problem at hand. For a regression problem Y = ℝ. For a classification problem Y = {1, 2, ..., C}, where C is the number of classes. We thus define a neural network model as the mapping f^W : ℝ^p → Y, where W is the set of weights and biases for the model.

2.2.2 Multi-Layer Perceptron

The perceptron introduced by Rosenblatt (1958) was a single computational unit, also referred to as a node or a neuron. This computational unit is the fundamental building block of a multi-layer perceptron, also called a neural network. A neural network consists of multiple perceptrons connected sequentially, working in parallel, or both. Neurons that work in parallel are referred to as a layer.

2.3 Feed-Forward Neural Networks

The feed-forward neural network (FFNN) model is the most basic architectural type of neural network. Layers are fully connected, meaning that a node in one layer has a connection to all the nodes in the next layer, except the biases. Feed-forward means that all connections go forward, usually only to the next layer. There exist other architectural types, and they all build upon the idea of a feed-forward neural network.


Figure 2.1: A feed-forward neural network; the bias nodes are not shown. The figure shows a weather model taken from Cloud et al. (2019).

Input Layer

The first layer in a neural network is called the input layer. This layer has as many nodes as there are dimensions in the input data X. Images have to be flattened into a long vector in order to be used with a feed-forward network, and X is then the flattened vector. The nodes in the input layer are not computational nodes. They simply serve as the input to the nodes in the next layer. Each node x_i in the input layer is connected to all the nodes in the subsequent layer; the layers are fully connected. The input layer is often denoted x⁰_i, where a superscript of zero denotes the first layer.

Hidden Layer

A layer between the input and output layer is called a hidden layer. It can in turn be connected to another hidden layer, and it is common to chain multiple hidden layers with a decreasing number of nodes after each other, like a funnel.

A node i in the first hidden layer h¹ calculates the weighted sum of inputs

$$ h^1_i = \sum_j x^0_j w^1_{ji}. \tag{2.2} $$

Here x⁰_j is the value of the corresponding input node, and w¹_ji is the weight of the connection from node j in the input layer to node i in the first hidden layer. Again x⁰_0 = 1 and w¹_0i is the bias. Also i > 0, because the bias is not a computational node.

After the node in the hidden layer has computed its value, the value is transformed through an activation function, o¹_i = σ(h¹_i) (see Section 2.3.1).

The output o¹_i gets passed on to the nodes in the next layer. A generalisation of Equation (2.2) that uses a superscript to denote the layer is given by

$$ h^l_i = \sum_j o^{l-1}_j w^l_{ji}. \tag{2.3} $$

With this generalisation the input layer is denoted as output zero, x⁰ = o⁰.

Output Layer

The output layer is the final layer. For a regression problem it can consist of just one node. In a classification setting the output layer usually consists of the same number of nodes as classes. For classification the output nodes represent class probabilities. This is often achieved with a certain activation function, the softmax, because it transforms a vector of numbers into a relative probability vector.

$$ o^N_i = \sigma(h^N)_i = \frac{\exp(h^N_i)}{\sum_{c \in Y} \exp(h^N_c)} \tag{2.4} $$

Here o^N is the output of the final layer, o^N_i is the probability for class i, and h^N contains the values of the nodes in the final layer before the softmax transformation.
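The forward pass through Equations (2.3) and (2.4) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the helper names `layer`, `softmax` and `forward`, the choice of tanh as hidden activation and the weight values are all hypothetical. Subtracting the maximum inside `softmax` is a common numerical safeguard that leaves the result unchanged.

```python
import math


def layer(o_prev, W):
    """Equation (2.3): h_i = sum_j o_prev[j] * W[j][i]; row 0 of W holds the biases."""
    inputs = [1.0] + list(o_prev)  # prepend the constant bias input
    return [sum(inputs[j] * W[j][i] for j in range(len(inputs)))
            for i in range(len(W[0]))]


def softmax(h):
    """Equation (2.4), with the maximum subtracted for numerical stability."""
    m = max(h)
    e = [math.exp(v - m) for v in h]
    total = sum(e)
    return [v / total for v in e]


def forward(x, W1, W2):
    o1 = [math.tanh(h) for h in layer(x, W1)]  # hidden activations
    return softmax(layer(o1, W2))              # class probabilities


# Hypothetical weights: 2 inputs, 2 hidden units, 3 classes.
W1 = [[0.0, 0.1], [0.5, -0.3], [-0.2, 0.8]]
W2 = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.5], [0.3, 0.2, -0.7]]
probs = forward([1.0, 2.0], W1, W2)
print(probs, sum(probs))
```

The entries of `probs` are positive and sum to one, so they can be read as class probabilities.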

2.3.1 Activation Function

The activation function is an important part of the neural network. The idea of an activation function is borrowed from neuroscience, where a neuron would fire if the potential energy surpasses a threshold. Rosenblatt (1958) used an activation function that fired if the weighted sum was over a threshold θ:

$$ \sigma(h_i) = \begin{cases} 1 & \text{if } h_i > \theta \\ 0 & \text{otherwise.} \end{cases} \tag{2.5} $$

The activation function, denoted as σ(·), can be any function that has a bounded derivative (Rumelhart et al., 1986). The bounded derivative is important because back-propagation depends on the derivative of the activation function (Rumelhart et al., 1986). However, a simple activation function such as the identity transformation σ(x) = x is rarely used: it has been shown that a neural network can approximate any function if the activation function is not a polynomial (Leshno et al., 1993), and the identity is a polynomial. This result explains why non-polynomial activation functions are preferred.

Sigmoid functions like the logistic function

$$ \sigma(x) = \frac{1}{1 + \exp(-x)} \tag{2.6} $$

and tanh are common activation functions because of their non-linearity and differentiability. However, the derivative of these functions is close to zero when x is far from 0. When gradients are close to zero the model does not learn. This is known as the vanishing gradient problem, and different ways around it have been proposed (see Section 2.8.5).


In recent years rectified linear units (ReLU) (Nair & Hinton, 2010) and variations have become increasingly popular. ReLU is a simple activation function, defined as

$$ \sigma(x) = \max(x, 0), \tag{2.7} $$

that only 'fires' if the neuron has a positive sum. A difference from Rosenblatt (1958) is that instead of firing an output of 1 when the sum is over a threshold, it fires the sum itself. ReLU is easy to compute. Even though it is not differentiable at x = 0, Nair and Hinton (2010) ignore this and define the derivative as 0 when x ≤ 0 and 1 when x > 0. ReLU has become popular because it allows training of deep neural networks without unsupervised pretraining (LeCun et al., 2015) and improves performance and accuracy (Ramachandran et al., 2017). Different variations of ReLU have been proposed (Clevert et al., 2015; He et al., 2015; Hendrycks & Gimpel, 2020; Klambauer et al., 2017; Maas et al., 2013).
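The difference between the logistic function and ReLU is easiest to see in their derivatives. The sketch below is a small illustration; the function names are chosen for this example, and the ReLU derivative at x = 0 follows the convention described above.

```python
import math


def logistic(x):
    """Equation (2.6)."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_grad(x):
    """Derivative of the logistic function: sigma(x) * (1 - sigma(x))."""
    s = logistic(x)
    return s * (1.0 - s)


def relu(x):
    """Equation (2.7)."""
    return max(x, 0.0)


def relu_grad(x):
    """Convention from the text: 0 for x <= 0 and 1 for x > 0."""
    return 1.0 if x > 0 else 0.0


for x in (0.0, 5.0, 10.0):
    print(x, logistic_grad(x), relu_grad(x))
```

The logistic derivative peaks at 0.25 for x = 0 and shrinks quickly as x grows, whereas the ReLU derivative stays at 1 for every positive input.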

2.4 Convolutional Neural Networks

The architecture of a convolutional neural network (ConvNet) consists of convolution and pooling layers. ConvNets take into consideration the fact that local statistics of an image are invariant to location (LeCun et al., 2015). For example a dog could be anywhere in the picture and it would still be an image of a dog. ConvNets have gotten a lot of attention after Krizhevsky et al. (2012) made a leap in performance with a deep ConvNet architecture. ConvNets make it possible to process multidimensional data, and are especially suited for images (LeCun et al., 2015).

2.4.1 Convolutional Layers

The role of the convolutional layer is to detect local conjunctions of features from the previous layer (LeCun et al., 2015). An advantage of a convolution filter is that it uses the same weights for the whole input. If the filter has size k×k and the input image has size M×N, a CNN only has to learn k×k weights, because the same filter is convolved over the whole image. An FFNN, on the other hand, would have M×N connections into each of the nodes in the first hidden layer. This can become a problem for an FFNN in terms of computing power but, more importantly, the model would have too many parameters compared to the number of data points and would be prone to overfitting (LeCun et al., 1990).
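Weight sharing can be made concrete with a tiny example. The sketch below slides one k×k filter over an image without padding and compares the number of weights with that of a fully connected node. The function `conv2d`, the 4×4 image and the filter are hypothetical, and the operation is implemented as cross-correlation (no filter flipping), as is common in deep learning libraries.

```python
def conv2d(image, kernel):
    """Slide one square k x k filter over the image (no padding, stride 1)."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [[sum(image[r + a][c + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for c in range(cols - k + 1)]
            for r in range(rows - k + 1)]


image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]  # picks out the centre pixel of each 3x3 region

feature = conv2d(image, kernel)
n_conv_weights = len(kernel) * len(kernel[0])         # k * k, shared everywhere
n_ffnn_weights_per_node = len(image) * len(image[0])  # M * N for every node
print(feature, n_conv_weights, n_ffnn_weights_per_node)
```

Even for this 4×4 image the filter needs 9 weights in total, while every fully connected node needs 16; the gap grows with the image size because the filter size stays fixed.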

Units in a convolutional layer are organised in filter banks (LeCun et al., 2015). Filter banks contain different filters, and each filter outputs a convolved image that is passed through an activation function. This image is called a feature image, or feature map.

2.4.2 Pooling Layers

A pooling layer takes neighbouring pixels and transforms them into a single value. For example, a pooling layer may take a 3x3 region of a feature map and transform the 9 input values into one output. This down-scales the layer. The role of the pooling layer is to merge semantically similar features into one (LeCun et al., 2015). Common pooling operations are max pooling, which takes the maximum value of the neighbourhood, and average pooling, which takes the average.
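Both pooling operations can be sketched with one small function. The example assumes non-overlapping regions (the stride equals the region size) and uses 2×2 regions on a hypothetical 4×4 feature map to keep the numbers readable; the names `pool2d` and `mean` are chosen for this illustration.

```python
def pool2d(feature_map, size, op=max):
    """Non-overlapping pooling: every size x size region becomes a single value."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[op([feature_map[r + a][c + b]
                 for a in range(size) for b in range(size)])
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]


def mean(values):
    return sum(values) / len(values)


fm = [[1, 3, 2, 4],
      [5, 7, 6, 8],
      [9, 2, 1, 0],
      [3, 4, 5, 6]]

max_pooled = pool2d(fm, 2, max)
avg_pooled = pool2d(fm, 2, mean)
print(max_pooled, avg_pooled)
```

The 4×4 map is down-scaled to 2×2 in both cases; only the rule for summarising each region differs.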

2.5 Deep Learning

Deep learning is a term used for neural networks with millions of connections together with a great amount of data to train the network (LeCun et al., 2015). The term is often used for very powerful models that manage to approximate complex functions (LeCun et al., 2015; Srivastava et al., 2014). For ConvNets the conventional procedure was to hand-design good feature extractors until LeCun et al. (1989) showed that a ConvNet could achieve state-of-the-art results by letting the network construct its own filters by learning. The development of effective software and hardware has made deep learning available for regular practitioners (LeCun et al., 2015).

2.6 Supervised Learning

The objective of training a neural network is to minimise a loss function L^W(y_i, ŷ_i), where ŷ_i = f^W(x_i) is the prediction of the model for the input x_i ∈ X. The loss (or error) function measures how close the predictions of the model are to the true values y_i ∈ Y. By minimising the loss function with respect to W, the model's predictions become more accurate for the training data. Once we have minimised L^W(y_i, ŷ_i) with respect to W, we have obtained the set of weights that fits the training data best. This set of weights and biases,

$$ \hat{W} = \arg\min_W L^W(y_i, \hat{y}_i), \tag{2.8} $$

will be used for predictions. The key to supervised learning is to use back-propagation to find the gradients of the parameters, so the network can update its parameters to get more accurate predictions. This will be discussed further in Sections 2.7 and 2.8.

2.6.1 The Loss Function

There are many loss functions to choose from. The loss function affects the learning dynamics (Janocha & Czarnecki, 2017). Even though many loss functions exist they usually have very complex surfaces and it is optimistic to believe that a neural network will find the global minimum of the loss function. Despite the fact that the model almost never finds the global minimum, it rarely converges to a solution that is much worse (Rumelhart et al., 1986).


Loss for Regression

A common set of loss functions for regression, called L_p-loss, takes the form L(x,y) = ||x−y||_p. This distance metric is used in regression because y ∈ ℝ, and a simple way of evaluating the model is to find out how close the prediction ŷ = f^W(x) is to the true value y. L₁ and L₂ are the most common loss functions. They have different properties, most notably that L₂ has a continuous derivative.

Loss for Classification

For classification we need another set of loss functions. This comes from the fact that y is not a real number but a vector y ∈ ℝ^p, also referred to as a label.

In order for the network to produce a class probability vector a transformation has to be applied: the softmax. The softmax transformation is only applied to the last layer:

$$ p_i = \mathrm{softmax}(h^N)_i = \frac{\exp(h^N_i)}{\sum_{c \in Y} \exp(h^N_c)} \qquad \text{(Equation (2.4) revisited)} $$

The output is denoted as p_i to emphasise that it is a probability vector.

As previously mentioned, L₁ and L₂ are common losses for regression. One could also use a distance measure in classification, but this is not common practice. A common loss function for classification is the cross-entropy loss (Janocha & Czarnecki, 2017), also called log-loss, which can be calculated for a single data pair as

$$ L^W(y_i, p_i) = -\sum_c y_{i,c} \cdot \log(p_{i,c}). \tag{2.9} $$

The logarithm should have the same base as the number of classes.

In Equation (2.9), p_{i,c} is the probability that the model assigns to class c given the input x_i. Cross-entropy is very similar to entropy, −∑_c p_c · log(p_c). Entropy has its maximum when all p_c are equal (i.e. uniform probability).

In other words, a model that outputs uniform probabilities produces a high loss. The optimal model for the log-loss function is a model that produces a probability vector identical to the label vector.
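The cross-entropy for one data pair can be sketched as follows, using the sign convention in which a lower value means a better fit. The function name `cross_entropy` and its `base` argument are assumptions for this example; following the text, the logarithm defaults to base C, the number of classes, and classes with a zero label are skipped so that log(0) is never evaluated.

```python
import math


def cross_entropy(y, p, base=None):
    """Cross-entropy -sum_c y_c * log(p_c) for one data pair.

    y is a one-hot label vector and p a probability vector of the same length.
    """
    C = len(y)
    base = C if base is None else base
    return -sum(y_c * math.log(p_c, base)
                for y_c, p_c in zip(y, p) if y_c > 0)


label = [0, 1, 0]
uniform_loss = cross_entropy(label, [1 / 3, 1 / 3, 1 / 3])
confident_loss = cross_entropy(label, [0.05, 0.9, 0.05])
perfect_loss = cross_entropy(label, [0.0, 1.0, 0.0])
print(uniform_loss, confident_loss, perfect_loss)
```

With base C, uniform probabilities give a loss of one, a confident correct prediction gives a loss close to zero, and a prediction identical to the label gives exactly zero.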

For classification it is also possible to let the loss depend on the accuracy, which for an input-output pair gives a 0-1 loss

$$ L^W(y_i, p_i) = \begin{cases} 0 & \text{if } y_i = \arg\max_c p_{i,c} \\ 1 & \text{otherwise.} \end{cases} \tag{2.10} $$

For multiple data points this becomes the misclassification rate. In this case the optimal model does not need to produce probabilities identical to the true label, only probabilities that give perfect accuracy.
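A short sketch of the 0-1 loss and the resulting misclassification rate is given below. The function names and the data are hypothetical, and classes are numbered from 0 here (rather than from 1 as in Section 2.2.1) to match Python indexing.

```python
def zero_one_loss(y, p):
    """Equation (2.10): 0 if the most probable class equals the true class y, else 1."""
    predicted = max(range(len(p)), key=lambda c: p[c])
    return 0 if predicted == y else 1


def misclassification_rate(labels, probs):
    """Average 0-1 loss over several data pairs."""
    return sum(zero_one_loss(y, p) for y, p in zip(labels, probs)) / len(labels)


labels = [0, 2, 1, 1]
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.3, 0.5],
         [0.5, 0.4, 0.1],   # most probable class is 0, true class is 1
         [0.1, 0.8, 0.1]]
rate = misclassification_rate(labels, probs)
print(rate)
```

The second prediction assigns only probability 0.5 to the true class and still counts as correct, which illustrates that the 0-1 loss does not reward probabilities that match the label exactly.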


2.7 Backpropagation

Backpropagation is a backward pass that propagates the derivatives of each parameter so one can do gradient descent (Rumelhart et al., 1986). It is a crucial part of self-learning networks. To derive the equations (Rumelhart et al., 1986) one starts at the loss function, for example the common L₂:

$$ L(y, \hat{y}) = \frac{1}{2} \sum_i \left(f^W(x_i) - y_i\right)^2, \tag{2.11} $$

scaled by a half for convenience. The true value is y_i for the input x_i in the data pair (x_i, y_i), and ŷ_i = f^W(x_i) is the output of the network. The next step is to differentiate Equation (2.11) for the data pair (x_i, y_i) with respect to f^W(x_i), which produces

$$ \frac{\partial L}{\partial f^W(x_i)} = f^W(x_i) - y_i. \tag{2.12} $$

With further use of the chain rule we obtain (Rumelhart et al., 1986)

$$ \frac{\partial L}{\partial x^l_j} = \frac{\partial L}{\partial f^W(x_i)} \cdot \frac{\partial f^W(x_i)}{\partial x^l_j}. \tag{2.13} $$

For the weight in the connection from neuron i to j in layer l the derivative is

$$ \frac{\partial L}{\partial w^l_{ji}} = \frac{\partial L}{\partial f^W(x_i)} \cdot \frac{\partial f^W(x_i)}{\partial x^l_j} \cdot \frac{\partial x^l_j}{\partial w^l_{ji}}. \tag{2.14} $$

The next step from Rumelhart et al. (1986) is to use ∂L/∂w^l_ji to change the weights so that the loss function decreases. Rumelhart et al. (1986) propose a simple version that accumulates the gradients over all the data pairs and then updates each weight according to its scaled accumulated gradient ∂L/∂w:

$$ \Delta w = -\varepsilon \cdot \frac{\partial L}{\partial w}. \tag{2.15} $$

This is the core concept of training a neural network. The gradients of the weights tell us how the weights should be changed in order to decrease the loss. Updating the weights by a small scaled step ε then decreases the loss, at least in theory.
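The chain of Equations (2.11) to (2.15) can be traced on the simplest possible model, a single linear unit f^W(x) = ∑_j x_j w_j, for which the last factor of the chain rule is simply x_j. The sketch below is an illustration under that assumption and is not the general multi-layer algorithm; the data, the learning rate of 0.1 and the function names are hypothetical.

```python
def predict(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def loss(w, data):
    """Equation (2.11): half the sum of squared errors."""
    return 0.5 * sum((predict(w, x) - y) ** 2 for x, y in data)


def gradient(w, data):
    """Chain rule for a linear unit, accumulated over all data pairs."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y       # Equation (2.12)
        for j, x_j in enumerate(x):
            grad[j] += err * x_j      # df/dw_j = x_j for a linear unit
    return grad


def update(w, grad, eps):
    """Equation (2.15): a scaled step against the gradient."""
    return [w_j - eps * g_j for w_j, g_j in zip(w, grad)]


# Hypothetical data generated by y = 1 + 2x; x[0] = 1 is the bias input.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = [0.0, 0.0]
losses = [loss(w, data)]
for _ in range(200):
    w = update(w, gradient(w, data), eps=0.1)
    losses.append(loss(w, data))
print(w, losses[0], losses[-1])
```

With this step size the loss shrinks from its initial value towards zero, and the weights approach the bias 1 and slope 2 that generated the data.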

An undesirable trait in an optimisation objective is the existence of local minima. This is a recurrent problem in neural network learning. Even though gradient descent may not find the global minimum, Rumelhart et al. (1986) report that for many tasks the network usually finds a local minimum that is not significantly worse than the global minimum.


2.8 Optimisation

The objective of training is to minimise the loss (see Equation (2.8)). For this to happen the weights have to be optimised. The loss function describes how well the model fits the data. Backpropagation finds the gradients.

The gradient of the loss function with respect to a weight or bias, ∂L/∂w, from here on denoted as ∂w, tells us how each weight contributed to the resulting loss. The gradient points in the direction in which the loss function increases most steeply. We use this information to minimise the loss function by simply moving in the opposite direction. The simple update in Equation (2.15) of Rumelhart et al. (1986) would in theory decrease the loss function for a small enough ε. The problem with this optimisation is that a very small learning rate makes learning slow, and the optimisation could easily get stuck in a local minimum. In addition, saddle points with few directions of improvement are abundant (LeCun et al., 2015). Fortunately, recent theoretical and empirical results suggest that local minima and saddle points produce very similar values of the loss function (LeCun et al., 2015), which supports the findings of Rumelhart et al. (1986).

2.8.1 Gradient Descent

Gradient descent is a method used to minimise the loss function (Rumelhart et al., 1986). The partial derivative of the loss function with respect to a weight or bias, ∂w, points in the direction in which the function L^W increases the most for the current weight w. This information about the gradient is used to minimise the loss function. The loss function decreases if the weight is updated in the negative direction of its gradient. This is what Rumelhart et al. (1986) suggest with the simple update of Equation (2.15). The update of the weight can be regarded as a step taken towards a solution. The step is scaled by a hyperparameter ε ∈ (0, 1], known as the learning rate, which is usually set to be very low (e.g. 0.001). This is to prevent the update from overshooting a good solution (Zeiler, 2012).

With a suitable learning rate, or a more intelligent learning rate schedule, gradient descent will converge towards a better solution (Ioffe & Szegedy, 2015; Zeiler, 2012).

2.8.2 Mini-Batch Gradient Descent

Mini-batch gradient descent, also called stochastic gradient descent (SGD), is a more efficient way to minimise the loss for big data sets (Ruder, 2017). Instead of calculating the gradient based on the whole training set, SGD estimates the gradient from a subset of the training data (see Equation (2.16)), called a batch or mini-batch B ⊆ D. If B is the same as the training set D, then stochastic gradient descent is simply regular gradient descent, Equation (2.15). When the batch size is smaller than the training set, the gradient is estimated from the batch. With a batch size of |B| = 1, the estimate of the gradient depends on only one data point; this is called stochastic gradient descent or online learning (Ruder, 2017). However, it is common to choose |B| << |D| (Keskar et al., 2016), so mini-batch gradient descent is often referred to as SGD (Ruder, 2017). The batch size is regarded as a hyperparameter since it governs the trade-off between speed and reliable gradient estimates (Goyal et al., 2017). It has been proposed that SGD has good generalisation properties because it avoids "sharp minima" (Keskar et al., 2016).

$$ w_{\text{new}} = w - \varepsilon \cdot \hat{\partial} w, \quad \text{where } \hat{\partial} w = \frac{1}{|B|} \sum_{b \in B} \partial w_b. \tag{2.16} $$
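The update in Equation (2.16) can be sketched for a linear model with squared-error loss. Everything specific in this example is an assumption made for illustration: the function name `sgd`, the noise-free data generated by y = 1 + 2x, the batch size of 4, the learning rate of 0.1 and the fixed random seed.

```python
import random


def sgd(data, w, eps, batch_size, steps, seed=0):
    """Mini-batch SGD: each step averages the gradients of one random batch."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # B is a random subset of D
        grad = [0.0] * len(w)
        for x, y in batch:
            err = sum(w_j * x_j for w_j, x_j in zip(w, x)) - y
            for j, x_j in enumerate(x):
                grad[j] += err * x_j / batch_size
        w = [w_j - eps * g_j for w_j, g_j in zip(w, grad)]
    return w


# Hypothetical noise-free data; x[0] = 1 is the bias input.
data = [([1.0, k / 10.0], 1.0 + 2.0 * k / 10.0) for k in range(20)]
w_hat = sgd(data, [0.0, 0.0], eps=0.1, batch_size=4, steps=2000)
print(w_hat)
```

Each step touches only 4 of the 20 data points, yet the estimate still moves towards the bias 1 and slope 2 that generated the data. With noisy data the iterates would instead fluctuate around the solution unless the learning rate is decreased.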

2.8.3 Stochastic Gradient Descent With Momentum

SGD with momentum is a simple modification to SGD that has seen much success (Zeiler, 2012). Momentum has a different update scheme that takes into consideration how big the previous update was (Sutskever et al., 2013; Zeiler, 2012). A common version used in machine learning is Nesterov momentum¹ (Sutskever et al., 2013), given by

$$ v_{t+1} = m \cdot v_t - \varepsilon \cdot \partial w(w_t + m v_t), \qquad w_{t+1} = w_t + v_{t+1}. \tag{2.17} $$

Here, the momentum coefficient m ∈ [0, 1] is a parameter that decides how much the current update is affected by the previous update (Sutskever et al., 2013). The momentum coefficient is commonly set to m_t = 1 − 3/(5 + t) (Sutskever et al., 2013)¹. Sutskever et al. (2013) suggest a momentum coefficient schedule and achieve good results. Momentum improves upon SGD, especially for complex loss surfaces such as a "long valley" (Zeiler, 2012).

However, momentum has to be tuned well, otherwise performance can suffer (Sutskever et al., 2013).
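A minimal sketch of the Nesterov update in Equation (2.17) is shown below. The gradient term is subtracted so that the step goes downhill, in line with Equations (2.15) and (2.16). The quadratic "long valley" loss, the starting point, the learning rate of 0.03 and the fixed momentum coefficient of 0.9 are assumptions chosen for this illustration, not values from Sutskever et al. (2013).

```python
def grad(w):
    """Gradient of the hypothetical loss 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def nesterov_step(w, v, eps, m):
    """One update of Equation (2.17); the gradient is taken at the look-ahead point."""
    lookahead = [w_j + m * v_j for w_j, v_j in zip(w, v)]
    g = grad(lookahead)
    v_new = [m * v_j - eps * g_j for v_j, g_j in zip(v, g)]
    w_new = [w_j + v_j for w_j, v_j in zip(w, v_new)]
    return w_new, v_new


w, v = [10.0, 1.0], [0.0, 0.0]
for t in range(300):
    w, v = nesterov_step(w, v, eps=0.03, m=0.9)
print(w)
```

The loss is 25 times steeper along the second coordinate than along the first, so a learning rate that is safe for the steep direction makes plain gradient descent slow along the flat one. The velocity term accumulates progress along the flat direction, and both coordinates end up close to the minimum at the origin.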

¹ The Nesterov paper that is usually referenced (Nesterov, 1983) is in Russian. The statements about it here are taken from Sutskever et al. (2013).

2.8.4 Adam

Adam (Kingma & Ba, 2017) combines AdaGrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012) and is a very popular optimiser (Dozat, 2016) because of its efficiency and good empirical results (Kingma & Ba, 2017). Adam takes into consideration the first and second order moment of the gradient. The second order gives information about the curvature of the loss function. Together they are used to take a step size that corresponds to how certain one is about the current gradient estimate. Adam also takes into consideration sparse gradients that can suddenly lead to large parameter updates (Kingma & Ba, 2017). Furthermore, Adam corrects for the bias in the running estimates of the two moments (Kingma & Ba, 2017). Dozat (2016) incorporated Nesterov momentum, Equation (2.17), into Adam for a further improvement.
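The Adam update described by Kingma and Ba can be sketched as follows. The default values (learning rate 0.001, decay rates 0.9 and 0.999, and a small constant 1e-8) are those suggested in their paper; the function name `adam_step`, the starting weights and the gradient values are hypothetical.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w with gradient g at step t (t starts at 1)."""
    # Running estimates of the first and second moment of the gradient.
    m = [beta1 * m_j + (1 - beta1) * g_j for m_j, g_j in zip(m, g)]
    v = [beta2 * v_j + (1 - beta2) * g_j ** 2 for v_j, g_j in zip(v, g)]
    # Bias correction, needed because m and v are initialised at zero.
    m_hat = [m_j / (1 - beta1 ** t) for m_j in m]
    v_hat = [v_j / (1 - beta2 ** t) for v_j in v]
    w = [w_j - lr * mh / (math.sqrt(vh) + eps)
         for w_j, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


w, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
g = [4.0, -0.5]  # hypothetical gradient with very different magnitudes
w1, m, v = adam_step(w, g, m, v, t=1)
print(w1)
```

On the first step the bias-corrected moments equal the gradient and its square, so both weights move by roughly the learning rate even though the two gradient components differ by a factor of eight.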

2.8.5 Optimising Tool Box

The flexibility of a neural network comes from all the different components. Changes can be made to the architecture as well as to hyperparameters such as the learning rate, activation function and optimisation method. The activation function can be changed to get faster and better results (Ramachandran et al., 2017). The linear function can be changed to another function, under the condition that the new function can be differentiated (Rumelhart et al., 1986). Problems with learning are usually solved with some of the tools presented below (Ioffe & Szegedy, 2015).

Learning Rate

The learning rate, ε, is usually set as high as possible without compromising the training (Zeiler, 2012). It is problem dependent, but usually ε ∈ (0, 1] and it is common to choose a low learning rate (e.g. 0.001). If the learning rate is set too high it can cause the system to diverge, and if it is set too low the training will be slow (Zeiler, 2012). Choosing a good learning rate involves trial and error, and is more of an art than a science (Zeiler, 2012). AdaGrad is a version of a dynamic learning rate, but a learning rate that decays towards zero stops the learning; Zeiler (2012) addresses this problem. When the parameters are close to a minimum in the loss surface, the updates will oscillate back and forth. To prevent oscillations the learning rate has to be decreased. This is why learning rate decay is regularly used (Wu et al., 2016; Zeiler, 2012).

Batch Size

The batch size governs how many data points are sampled to estimate the gradient for the update step in Equation (2.16). A small batch size induces more stochastic gradient estimates. If the batch size is too large, learning takes a long time. Usually the batch size is |B| << N and typically |B| ∈ {32, 64, ..., 512} (Keskar et al., 2016; Krizhevsky et al., 2012; Smith et al., 2017), where N is the number of training observations.

A model trained with an increased batch size tends not to generalise as well and performs worse on unseen data, even though it performs similarly on training data (Goyal et al., 2017; Keskar et al., 2016). Smith et al. (2017) propose that batch size is proportional to the learning rate and demonstrate that a learning rate decay schedule can be converted to a batch size growth schedule. However, a larger batch size enables a larger learning rate, which facilitates faster learning (Goyal et al., 2017; Smith et al., 2017).

Initialisers

Weights have to be initialised. They cannot simply be set to zero, because then all the weights would be zero and only the biases would affect the output. The standard way to initialise weights is to sample from either a uniform or a normal distribution (Glorot & Bengio, 2010):

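$$ w \sim \mathcal{N}\left(0, \frac{1}{\sqrt{n_l}}\right), \qquad w \sim \mathrm{Uniform}\left(-\frac{1}{\sqrt{n_l}}, \frac{1}{\sqrt{n_l}}\right) \tag{2.18} $$

Here n_l represents the number of units in layer l.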

The combination of activation function and initialiser affects the learning of the model (Glorot & Bengio, 2010). Glorot and Bengio (2010) give an initialiser that yields gradients of similar magnitude at different layers, which avoids problems with the standard initialiser. It is often referred to as "Xavier", or as Glorot-Normal and Glorot-Uniform:

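$$ w \sim \mathcal{N}\left(0, \frac{2}{n_l + n_{l+1}}\right), \qquad w \sim \mathrm{Uniform}\left(-\frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}, \frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}\right) \tag{2.19} $$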

This initialisation improves learning for neural networks with sigmoidal activation functions (Glorot & Bengio, 2010). Other initialisations improve learning for networks with ReLU activation functions (He et al., 2015).
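The uniform variants of Equations (2.18) and (2.19) can be sketched in a few lines. This is an illustration only: the function names, the list-of-lists weight layout and the layer sizes are assumptions, and the normal variants are left out because the text does not say whether the second argument of N is a variance or a standard deviation.

```python
import math
import random

def standard_uniform(n_l, n_next, rng):
    """Sample an n_l x n_next weight matrix with the limits of Equation (2.18)."""
    limit = 1.0 / math.sqrt(n_l)
    return [[rng.uniform(-limit, limit) for _ in range(n_next)] for _ in range(n_l)]

def glorot_uniform(n_l, n_next, rng):
    """Sample an n_l x n_next weight matrix with the limits of Equation (2.19)."""
    limit = math.sqrt(6.0) / math.sqrt(n_l + n_next)
    return [[rng.uniform(-limit, limit) for _ in range(n_next)] for _ in range(n_l)]

rng = random.Random(0)
n_l, n_next = 200, 100
weights = [w for row in glorot_uniform(n_l, n_next, rng) for w in row]
sample_var = sum(w * w for w in weights) / len(weights)   # the mean is zero by symmetry
print(sample_var, 2.0 / (n_l + n_next))
```

Since a Uniform(−a, a) variable has variance a²/3, the Glorot-Uniform limits give a weight variance of 2/(n_l + n_{l+1}). The two forms in Equation (2.19) therefore agree if the second argument of the normal distribution is read as a variance.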

Dropout

Deep neural nets have an enormous number of parameters, but with limited training data the models are prone to overfit (Srivastava et al., 2014). When a model overfits it performs significantly better on training data than on unseen data. Many methods have been developed to avoid overfitting; dropout is an addition that can be applied to existing methods to improve them (Srivastava et al., 2014).

Dropout temporarily removes a unit, with all of its incoming and outgoing connections, from the neural network. This can be seen as sampling a thinned version of the network (a subnet); the subnet consists of all the units that survived dropout (Srivastava et al., 2014). The dropout rate is usually set to p = 0.5 (Srivastava et al., 2014).

Dropout prevents the network from depending on any single node, so that it generalises better (Srivastava et al., 2014). At test time, each weight is multiplied by p to simulate the expectation (Srivastava et al., 2014).
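A minimal sketch of the two modes of dropout is given below. It assumes that p is the probability that a unit is kept (for p = 0.5 this coincides with the probability of dropping it), and it scales activations rather than weights, which has the same effect on the input to the next layer. The function names and the activation values are made up for illustration.

```python
import random

def dropout_train(activations, p_keep, rng):
    """Training: keep each unit with probability p_keep, otherwise zero it."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]

def dropout_test(activations, p_keep):
    """Test time: keep every unit but scale by p_keep (the expected mask value)."""
    return [p_keep * a for a in activations]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
p_keep = 0.5
n_draws = 20_000

# Average many thinned networks; this should approach the scaled test-time output.
totals = [0.0] * len(activations)
for _ in range(n_draws):
    thinned = dropout_train(activations, p_keep, rng)
    totals = [s + t for s, t in zip(totals, thinned)]
mean_thinned = [s / n_draws for s in totals]

print(mean_thinned, dropout_test(activations, p_keep))
```

The average over sampled subnets approaches the scaled test-time output, which is the sense in which the multiplication by p simulates the expectation.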

Batch Normalisation

Batch normalisation takes the idea of normalising input data, and proposes to normalise layer inputs using batches (Ioffe & Szegedy, 2015). The motivation comes from the stochasticity that emerges when using batched updates. A new batch consists of a sample of training data B ⊆ X (see Section 2.8.5). The gradient updates depend on the parameters as well as on the batch B. Since the optimisation only focuses on updating the parameters, it is disadvantageous that B changes. Batch normalisation addresses this problem so that the stochasticity in the batch B does not affect the loss function in the same way, and faster convergence is observed (Ioffe & Szegedy, 2015).
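The normalisation step itself is short. The sketch below shows the training-time transform for a single feature, following the form given by Ioffe and Szegedy (2015): subtract the batch mean, divide by the batch standard deviation, then scale and shift with the learnable parameters gamma and beta. The function name, the constant `eps` and the example batches are assumptions, and the running statistics used at test time are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature over a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)   # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

batch_a = [10.0, 12.0, 14.0, 16.0]
batch_b = [100.0, 120.0, 140.0, 160.0]   # same pattern, different location and scale
print(batch_norm(batch_a))
print(batch_norm(batch_b))
```

Both batches are mapped to nearly the same normalised values, so the layer that follows sees inputs on a stable scale regardless of which batch was drawn.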

Regularisation

A neural network can have an enormous number of parameters (Srivastava et al., 2014). With limited training data, sampling noise will have an unwanted effect on optimisation, as this noise is only present in the training data (Srivastava et al., 2014). A simple way to remove excess weights that only contribute to noise is to penalise the loss function by adding an extra term (Nowlan & Hinton, 1992):

$$ \mathrm{cost} = L_W + \lambda \cdot \mathrm{complex}(W). \tag{2.20} $$

The network will have a new objective of minimising the cost function, and it has to find a trade-off between the usual loss and the model’s complexity (Nowlan & Hinton, 1992). A common way of estimating the model’s complexity is the sum of the squared weights $\sum_j w_j^2$ (Nowlan & Hinton, 1992), which yields:

$$ \mathrm{cost} = L_W + \lambda \|W\|_2^2, \tag{2.21} $$

where ||·||₂ is the L₂ norm and λ > 0 is the penalisation coefficient, which is usually found by cross-validation (Nowlan & Hinton, 1992; Park & Hastie, 2007). Another complexity function is the L₁ norm, which acts as weight selection (Park & Hastie, 2007). For a more "complex measure of network complexity" see Nowlan and Hinton (1992); the L_p norm with p < 1 has also been proposed (Louizos et al., 2017).
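As a small worked example of the penalised cost, the sketch below adds either the squared L₂ norm or the L₁ norm to a precomputed data loss. The function name `penalised_cost`, the weights and the value of λ are all made up for illustration.

```python
def penalised_cost(loss, weights, lam, norm="l2"):
    """Add a complexity penalty to a precomputed data loss, as in Equation (2.20)."""
    if norm == "l2":
        complexity = sum(w * w for w in weights)      # squared L2 norm, Equation (2.21)
    elif norm == "l1":
        complexity = sum(abs(w) for w in weights)     # L1 norm
    else:
        raise ValueError("norm must be 'l2' or 'l1'")
    return loss + lam * complexity

weights = [0.5, -2.0, 0.0]
print(penalised_cost(1.0, weights, lam=0.1))             # 1.0 + 0.1 * 4.25
print(penalised_cost(1.0, weights, lam=0.1, norm="l1"))  # 1.0 + 0.1 * 2.5
```

The large weight dominates the L₂ penalty (4.0 of the 4.25), whereas under the L₁ penalty each weight contributes in proportion to its magnitude.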

2.9 Common Metrics

Evaluation of a model is a crucial part of machine learning. When a neural network is trained with an optimisation algorithm, we need metrics to know if the method is working fine and the model is converging towards a good solution.

2.9.1 Loss

The loss function (see Section 2.6.1) measures the error of the model. One way of reporting the error is to measure the loss for the whole training set after a pass through the whole data set, called an epoch. It is also possible to report the loss of a batch.

Another crucial part of measuring how well the model generalises to new data is to set aside a part of the training data, called the validation data V, before training. The validation data is held aside, and the model shall never use it to tune its parameters. The loss calculated on the validation data illustrates how the model would perform on new data.

If data is limited, cross-validation is preferred over holding out a single part of the training data as validation data, which can give unreliable estimates. Cross-validation partitions the whole training set into k equally sized parts (folds). The model is first trained on all data that is not in the first fold, and the first fold, acting as a validation set, is used to evaluate the loss. The next step is to reset (or re-initialise) the weights, train the model on all data except the second fold, and use the second fold to evaluate. This is done for all folds, and produces k models with k evaluations. A common choice is then to keep the model that produced the lowest validation loss.
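The fold bookkeeping of cross-validation can be sketched as follows. The generator `k_fold_indices` is a hypothetical helper that only produces index lists; training and evaluating a model on each split is left out.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds of range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))                    # held-out fold
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n=10, k=5))
for train_idx, val_idx in folds:
    # A model would be re-initialised here, trained on train_idx
    # and evaluated on val_idx.
    print(val_idx, train_idx)
```

Each observation is used for validation exactly once and for training in the remaining k − 1 rounds.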

2.9.2 Accuracy

Accuracy is another metric that is commonly used in classification. The accuracy is the rate of correct classifications. Accuracy can be computed for an epoch, batch, or with cross-validation, in the same way as loss and validation loss can.

2.10 Bias-Variance Tradeoff

The main objective when training a neural network is to obtain a model that is good at recognising data it has not seen. Achieving this is a matter of the bias-variance tradeoff. Loosely speaking, the bias shows in how the model performs on training data, and the variance in how well the model performs on data that it has not seen. To measure the performance another set of data is used, the validation data set. A common practice is to report the loss for the training data as well as for the validation data. Since we want a good bias-variance tradeoff, we aim for a model that yields low loss for the training set as well as for the validation set. A sign of overfitting is when the loss for the validation data set starts to increase. This is why the loss for the training data set alone does not reveal how good the model actually is in general; one has to take the loss of the validation set into consideration as well. A model that performs well on training data and on unseen data has found a good compromise in the bias-variance tradeoff.


Chapter 3

Bayesian Neural Networks

A Bayesian neural network (BNN) aims at producing a predictive distribution rather than a predictive point estimate. Even though deep learning models have achieved impressive results and high predictive accuracy, they produce overconfident predictions (Lakshminarayanan et al., 2017). Regular NNs do not quantify predictive uncertainty in the same way a Bayesian neural network can (Lakshminarayanan et al., 2017; Lampinen & Vehtari, 2001). Standard NNs are used as non-parametric models that can represent complex transformations (Titterington, 2004). MacKay (1992) argues that Bayesian neural nets penalise overcomplex models, and therefore yield the best generalisation. In a practical setting, where incorrect predictions can have unwanted consequences, it is crucial for a model to express uncertainty (Kendall & Gal, 2017; Lakshminarayanan et al., 2017).

There are different definitions of a Bayesian neural network (Wilson & Izmailov, 2020); the most common is that it quantifies predictive uncertainty. This thesis focuses on a network that has a distribution over its weights and biases, rather than over all model parameters (learning rate, architecture etc.).

Bayes’ theorem states that the posterior distribution over the weights is

$$ P(W \mid D, M) = \frac{P(W \mid M)\, P(D \mid W, M)}{P(D \mid M)}. \tag{3.1} $$

In Equation (3.1), W is the weights and biases, D is the training data and M is all other model parameters. What separates Bayesian modeling from maximum likelihood is the prior P(W | M) (Lampinen & Vehtari, 2001). The term P(D | M) is often called the "evidence" (MacKay, 1992). MacKay (1992) suggests that the "evidence" incorporates "Occam’s razor" and thus gives the best model selection for generalisation. It is common to omit M from the equations, but Lampinen and Vehtari (2001) suspect this can lead to the misinterpretation that the denominator P(D) is the probability of obtaining the data before modeling. The posterior of the weights P(W | D, M) can be used to calculate the predictive distribution for new data (Neal, 1995), where M is omitted for clarity:

$$ P(y^{new} \mid x^{new}, D) = \int_W P(y^{new} \mid x^{new}, D, W)\, P(W \mid D)\, dW \tag{3.2} $$

From the predictive distribution the model can produce confidence intervals (Lampinen & Vehtari, 2001). Even with an approximation of Equation (3.1) the model would still produce a predictive distribution that is reliable if the model is well calibrated (Guo et al., 2017; Nixon et al., 2019).
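In practice the integral in Equation (3.2) is usually approximated by averaging the model's predictions over samples of the weights. The sketch below is an illustration under strong assumptions: the "network" is a one-weight logistic model, and the posterior samples are simply drawn from a Gaussian as a stand-in for samples that an MCMC method would provide.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predictive_probability(x_new, weight_samples):
    """Monte Carlo estimate of P(y=1 | x_new, D): average over weight samples."""
    return sum(sigmoid(w * x_new) for w in weight_samples) / len(weight_samples)

rng = random.Random(0)
# Hypothetical stand-in for samples W^(t) from the posterior P(W | D).
weight_samples = [rng.gauss(2.0, 1.5) for _ in range(5000)]

p_bayes = predictive_probability(1.0, weight_samples)
p_point = sigmoid(2.0 * 1.0)     # prediction from the single weight w = 2.0
print(p_bayes, p_point)
```

Averaging over the weight samples pulls the prediction towards 0.5 compared with the single point estimate, which is one way the predictive distribution expresses uncertainty about the weights.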

Uncertainty can be split into two main categories (see Section 3.5) and it could be used to further train the model on what it is uncertain about (Der Kiureghian & Ditlevsen, 2009; Kendall & Gal, 2017; Wilson & Izmailov, 2020).

3.1 Prior Knowledge

The cornerstone of a Bayesian neural network is the prior (Lampinen & Vehtari, 2001). A prior distribution is chosen so that it corresponds to prior knowledge. This can be done for all parameters: the model parameters M as well as W. Lampinen and Vehtari (2001), MacKay (1992) and Neal (1995) choose priors on both types of parameters, and this is considered a full Bayesian approach. However, they considered networks of small sizes (e.g. 100 parameters). Deep learning requires an enormous number of parameters (LeCun et al., 2015), and more focus has been directed towards approximate Bayesian methods that scale better to deep learning (Gal & Ghahramani, 2016; Kendall & Gal, 2017; Lakshminarayanan et al., 2017; Wilson & Izmailov, 2020).

Prior knowledge is incorporated into our choice of model parameters (Lampinen & Vehtari, 2001; Wilson & Izmailov, 2020). For example, working with images might lead us to choose a convolutional network. An optimisation technique might be chosen on the basis of convenience.

Lampinen and Vehtari (2001) write that in practice NNs suffer from too strict priors, which can lead to bad generalisation (MacKay, 1992). Since there could be millions of parameters (LeCun et al., 2015), and we do not have a good understanding of what a good prior is, the standard is to use a normal distribution as in Wilson and Izmailov (2020),

$$ P(W) = \mathcal{N}(0, \lambda I), \tag{3.3} $$

where λ can be found through cross-validation (Wilson & Izmailov, 2020).

3.2 Bayesian Learning

As mentioned in Section 3.1, early work on Bayesian neural networks used fewer parameters than what is common today. It required more expert work (Lampinen & Vehtari, 2001), in contrast to deep learning (LeCun et al., 2015), where the network learns autonomously based on big data sets. Exact Bayesian methods are intractable for NNs, which is why different approximations exist (Lakshminarayanan et al., 2017). This thesis will focus on approximate Markov chain Monte Carlo methods.


3.2.1 The Bayesian Loss Function

Gradient descent and SGD are important tools for training a neural network, and they still are when training BNNs. The new objective is to find a distribution over the parameters, instead of a point estimate from Equation (2.8). The first step towards a tractable solution is to find a distribution that is proportional to the posterior (omitting M):

$$ P(W \mid D) \propto P(W)\, P(y \mid x, W). \tag{3.4} $$

Since the loss function is equal to the negative log-likelihood function up to a constant (Gal, 2016; Neal, 1995; Tishby et al., 1989),

$$ \begin{aligned} -\log P(y \mid x, W) &\propto L_W + \mathrm{const} \\ P(y \mid x, W) &\propto \exp\left(-L_W + \mathrm{const}\right), \end{aligned} \tag{3.5} $$

and the prior is given from Equation (3.3),

$$ \begin{aligned} P(W) &= \mathcal{N}(0, \lambda I) \\ P(W) &\propto \exp\left(-\frac{\lambda}{2}\|W\|_2^2\right), \end{aligned} \tag{3.6} $$

we can rewrite Equation (3.4) as

$$ \begin{aligned} P(W \mid D) &\propto \exp\left(-\frac{\lambda}{2}\|W\|_2^2 + \left(-L_W + \mathrm{const}\right)\right) \\ &\propto \exp\left(-\frac{1}{2}\left(\lambda\|W\|_2^2 + L_W\right)\right). \end{aligned} \tag{3.7} $$

However, to approximate the full posterior is not the goal, but rather to estimate the predictive distribution from Equation (3.2) (Wilson & Izmailov, 2020). The natural logarithm transforms Equation (3.7) into the recognisable loss function

$$ L_{Bayes}(W) := \lambda\|W\|_2^2 + L_W. \tag{3.8} $$

L_Bayes can thus be interpreted as L2 regularisation for a Gaussian prior (Wilson & Izmailov, 2020). This makes Bayesian neural networks more robust against over-fitting, as they have regularisation built into the model (Lampinen & Vehtari, 2001). An interpretation of this is that the prior works as a regulariser. Minimisation of L_Bayes produces the most probable parameters W given the prior, while the most likely set of weights W_ML is obtained when L_W is minimised (MacKay, 1992).
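The regularising effect of the prior can be seen in a model with a single weight. The sketch below rests on assumptions that are not in the text: a squared-error data loss L_W = Σ(y_i − w·x_i)², a squared L₂ penalty λw², and made-up data. Under these assumptions both minimisers have a closed form.

```python
def w_ml(xs, ys):
    """Minimiser of the data loss sum((y - w*x)**2) alone."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def w_map(xs, ys, lam):
    """Minimiser of lam * w**2 + sum((y - w*x)**2), a one-weight L_Bayes."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
print(w_ml(xs, ys), w_map(xs, ys, lam=1.0), w_map(xs, ys, lam=100.0))
```

A larger λ, corresponding to a stricter prior, pulls the most probable weight further towards zero and away from the maximum likelihood solution.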

3.3 Markov Chain Monte Carlo in Bayesian Neural Networks

Markov chain Monte Carlo (MCMC) methods are used to get samples from the posterior distribution (Lampinen & Vehtari, 2001). The Markov chain has the posterior distribution as its stationary distribution (Lampinen & Vehtari, 2001). A Markov chain is a sequence where a value is only dependent on the previous value, $P(S^{(t)} \mid S^{(t-1)}, S^{(t-2)}, \ldots) = P(S^{(t)} \mid S^{(t-1)})$. In order to obtain samples from a distribution one can use the Metropolis-Hastings (MH) algorithm, shown in Algorithm 1.

Algorithm 1: Metropolis-Hastings for loss.

    W^(0) ∼ p(0)        Initialise network (see Section 2.8.5)
    for t = 1, ..., T do
        forall w ∈ W do
            Sample a candidate weight w* ∼ g(· | w^(t))
            Compute the Metropolis-Hastings ratio R(w^(t), w*)
            Take w^(t+1) = w*      with probability min{R(w^(t), w*), 1}
                           w^(t)   else
        end
    end

The MH algorithm produces T samples from the target distribution. It can traverse the weights in different ways. The computationally cheapest way is to sample a new candidate for all weights at once and then perform a single MH-test to either accept or reject the whole new set of weights. This requires only one MH-test (explained later), which is computationally heavy, for the whole network, but such model-wise sampling does not offer fine-tuning of each weight. Another way is to propose new weights for one layer and then perform an MH-test to decide if they should be accepted. Layer-wise updates have been tried for other optimisation methods with varying results (Belilovsky et al., 2019; Bengio et al., 2007). The number of MH-tests increases for layer-wise updates, but more fine-tuning is possible compared to model-wise updates. In the case of layer-wise updates, some inspiration can be taken from the Gibbs sampler (Geman & Geman, 1984), and new gradients can be calculated after each layer update. Finally, sampling and testing each weight individually would be the most flexible, but this is very computationally heavy.

The sampling distribution, $g(\cdot \mid w^{(t)})$ in Algorithm 1, should be easy to sample from. We use a sampling distribution because we cannot sample directly from the posterior. Usually one would use a uniform or normal distribution for sampling. A common configuration is to have the sampling distribution the same as the prior (Neal, 1995). A good sampling distribution makes the chain converge faster to the posterior and explore the posterior density.

The Metropolis-Hastings test consists of computing the MH-ratio to decide if the proposed sample belongs to the target distribution f. The ratio is

$$ R(w, w^*) = \frac{f(w^*)\, g(w \mid w^*)}{f(w)\, g(w^* \mid w)}, $$

where f is the target function. In our case the target is the posterior, so R becomes

$$ R(w^{(t)}, w^*) = \frac{P(w^* \mid D)\, g(w \mid w^*)}{P(w \mid D)\, g(w^* \mid w)}. $$

This is the ratio of posterior likelihood, and it is referred to as a probability even though it can be greater than 1. To account for this we have the min operator in Algorithm 1. If R > 1 the proposed weight is always accepted. This property is desirable because the algorithm will then always accept a sample that is more likely under the posterior. Only when the proposed weight is worse is R interpreted as the probability of accepting the proposed weight.
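The accept/reject mechanism can be sketched for a single scalar weight. The example below is a toy under stated assumptions: the target f is an unnormalised standard normal density standing in for the posterior, the proposal g is a Gaussian random walk (which is symmetric, so the g terms cancel in R), and all names are hypothetical.

```python
import math
import random

def log_target(w):
    """Unnormalised log-density of a standard normal, a toy stand-in for log f."""
    return -0.5 * w * w

def metropolis_hastings(log_f, w0, n_steps, proposal_sd, rng):
    """Random-walk Metropolis-Hastings for a single scalar weight."""
    w, samples, n_accepted = w0, [], 0
    for _ in range(n_steps):
        w_star = rng.gauss(w, proposal_sd)              # symmetric proposal g(. | w)
        log_r = log_f(w_star) - log_f(w)                # log R; the g terms cancel
        if rng.random() < math.exp(min(log_r, 0.0)):    # accept with prob. min{R, 1}
            w = w_star
            n_accepted += 1
        samples.append(w)                               # a rejected step repeats w
    return samples, n_accepted / n_steps

rng = random.Random(0)
samples, acceptance_rate = metropolis_hastings(log_target, 0.0, 20_000, 2.0, rng)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var, acceptance_rate)
```

The sample mean and variance should come out close to 0 and 1, the moments of the toy target. Working with log R instead of R is a design choice that avoids numerical under- and overflow when densities are very small or very large.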

The ratio cancels out the "evidence" and we end up with the ratio

$$ R(w^{(t)}, w^*) = \frac{P(w^*)\, P(y \mid x, w^*)\, g(w \mid w^*) \,/\, P(x, y)}{P(w)\, P(y \mid x, w)\, g(w^* \mid w) \,/\, P(x, y)} = \frac{P(w^*)\, P(y \mid x, w^*)\, g(w \mid w^*)}{P(w)\, P(y \mid x, w)\, g(w^* \mid w)}. \tag{3.9} $$

The ratio is simplified when g is symmetric, g(w | w*) = g(w* | w):

$$ R(w^{(t)}, w^*) = \frac{P(w^*)\, P(y \mid x, w^*)}{P(w)\, P(y \mid x, w)}, \tag{3.10} $$

which is a ratio of posterior likelihood of different weights. Looking back at Equation (3.8) we can rewrite the ratio in terms of loss:

$$ R(w^{(t)}, w^*) = \frac{\exp\left(-L_{Bayes}(w^*)\right)}{\exp\left(-L_{Bayes}(w)\right)} = \exp\left(-L_{Bayes}(w^*) + L_{Bayes}(w)\right). \tag{3.11} $$

The Metropolis-Hastings test is inefficient and not scalable because it requires the whole data set to calculate (Zhang et al., 2020). A naive way is to use a mini-batch, but this gives a biased estimate of the MH ratio (Bardenet et al., 2017). There is some research (Seita et al., 2016; Zhang et al., 2020) on approximations for the mini-batch Metropolis-Hastings test, with considerable computational improvements. However, there are still computational limitations that make Bayesian methods intractable for deep learning without crude approximations.
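Equation (3.11) means that, for a symmetric proposal, the acceptance probability can be computed directly from two loss values. A minimal sketch is given below; `acceptance_probability` is a hypothetical helper and the loss values are made up.

```python
import math

def acceptance_probability(loss_current, loss_proposed):
    """min{R, 1} with R = exp(L_Bayes(w) - L_Bayes(w*)) for a symmetric proposal."""
    log_r = loss_current - loss_proposed
    return math.exp(min(log_r, 0.0))     # capping log R at 0 avoids overflow in exp

print(acceptance_probability(2.0, 1.5))     # lower proposed loss: always accepted
print(acceptance_probability(1.5, 2.0))     # higher proposed loss: exp(-0.5)
print(acceptance_probability(5000.0, 1.0))  # exp(4999) would overflow if uncapped
```

Both loss values have to be evaluated on the same data for the ratio to be exact, which is why the full data set is needed and why a mini-batch version is only an approximation.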

3.4 Bayesian Deep Learning

Bayesian deep learning is a branch of BNNs that has been developed as an approximation of exact Bayesian inference (Lakshminarayanan et al., 2017). There are many different ways to approximate the predictive distribution in Equation (3.2); the main focus of this thesis is on stochastic gradient MCMC variants such as that of Welling and Teh (2011). However, since variational methods (Gal & Ghahramani, 2016; Kendall & Gal, 2017) are more popular, they will be mentioned as well. The quality of Bayesian deep learning depends on the degree of approximation and on whether the prior is ’correct’ (Lakshminarayanan et al., 2017). A vague prior is preferred over a too specific one (Wilson & Izmailov, 2020). Wilson and Izmailov (2020) suggest that as long as the predictive distribution is good, it does not matter much if the estimate of the posterior is bad.


3.4.1 Stochastic Gradient Langevin Dynamics

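Stochastic gradient Langevin dynamics (SGLD) is an MCMC method that captures the posterior by just adding noise to the stochastic gradient descent update step (Welling & Teh, 2011):

$$ \Delta W_t = \frac{\varepsilon_t}{2}\left(\nabla \log p(W_t) + \frac{N}{M}\sum_{i \in S} \nabla \log p(y_i \mid x_i, W_t)\right) + \delta, \qquad \delta \sim \mathcal{N}(0, \varepsilon_t). \tag{3.12} $$

Here S is a mini-batch of M of the N training observations; in this equation M denotes the mini-batch size, not the model parameters of Equation (3.1).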

The most important feature is that it does not require the Metropolis-Hastings test (Welling & Teh, 2011). The ratio $R(w^{(t)}, w^*)$ goes towards 1 when ε_t → 0, and on this premise Welling and Teh (2011) can omit the MH test. Welling and Teh (2011) argue that at the start the stochastic gradient produces more noise than the added δ, and the update works as regular SGD. However, over time, as the parameters get close to an optimum, the gradients decrease and do not produce the same stochasticity. When the total noise is dominated by δ the model samples from the posterior (Welling & Teh, 2011). This "automatic switch" between optimising and sampling from the posterior is one of the good features of this method. It is important that the step size ε_t stops decreasing when the model is in the "sampling mode", as the samples will then explore the posterior and not just collapse into a point estimate like SGD with learning rate decay. Welling and Teh (2011) provide an estimate for when the model switches to the "sampling mode", and they recognise that the random walk behaviour is slow. Gal (2016) suggests that in practice SGLD often only explores one mode, because ε decreases too rapidly, even though in theory it should explore the full posterior (Welling & Teh, 2011). Jospin et al. (2020) state that if the dataset has to be split into mini-batches, SGLD offers better theoretical guarantees than other MCMC methods.
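The SGLD update can be tried on a problem where the exact posterior is known. The sketch below makes several assumptions that are not in the text: the model has one weight w (the mean of unit-variance Gaussian data), the prior is N(0, 10), the step size ε is held constant, and the data are simulated. With a constant step size the samples only approximately follow the posterior.

```python
import math
import random

rng = random.Random(0)
N, M = 100, 10                      # data set size and mini-batch size
data = [rng.gauss(1.0, 1.0) for _ in range(N)]
prior_var = 10.0

def sgld_step(w, batch, eps, rng):
    """One SGLD update: half-step along the stochastic gradient plus N(0, eps) noise."""
    grad_log_prior = -w / prior_var
    grad_log_lik = sum(y - w for y in batch)       # unit-variance Gaussian likelihood
    drift = 0.5 * eps * (grad_log_prior + (N / len(batch)) * grad_log_lik)
    noise = rng.gauss(0.0, math.sqrt(eps))         # standard deviation sqrt(eps)
    return w + drift + noise

w, samples = 0.0, []
for t in range(10_000):
    w = sgld_step(w, rng.sample(data, M), 1e-3, rng)
    if t >= 1000:                                  # discard the initial transient
        samples.append(w)

sgld_mean = sum(samples) / len(samples)
posterior_mean = sum(data) / (N + 1.0 / prior_var)  # exact for this conjugate model
print(sgld_mean, posterior_mean)
```

No accept/reject step appears anywhere in the loop, which is the practical appeal of the method. The spread of the samples is somewhat wider than the exact posterior here because the mini-batch gradient noise has not vanished at this step size.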

3.4.2 Variational Inference

Variational inference (VI) approaches the problem of approximating Bayesian methods for BNNs from a different angle: instead of simulation, VI turns the problem into optimisation, which makes the problem tractable (Kendall & Gal, 2017). A simple distribution q_θ is chosen to represent the very complex posterior P(W | D). VI aims to optimise over the parameters θ of the simpler distribution by minimising the Kullback-Leibler (KL) divergence between the two distributions (Kendall & Gal, 2017):

$$ \mathrm{KL}(q_\theta \,\|\, p) = \int_W q_\theta(W) \log \frac{q_\theta(W)}{P(W \mid D)}\, dW. \tag{3.13} $$
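To make the KL divergence concrete, the sketch below evaluates it for two one-dimensional Gaussians, once by Monte Carlo following the integral in Equation (3.13) and once with the known closed form for Gaussians. Both distributions and all names are made up for illustration; in a real BNN the posterior is not available in closed form, which is why VI is needed in the first place.

```python
import math
import random

def log_normal_pdf(w, mean, sd):
    """Log-density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (w - mean) ** 2 / (2.0 * sd * sd)

def kl_monte_carlo(q, p, n_samples, rng):
    """Estimate KL(q || p) = E_q[log q(W) - log p(W)] by sampling W from q."""
    (q_mean, q_sd), (p_mean, p_sd) = q, p
    total = 0.0
    for _ in range(n_samples):
        w = rng.gauss(q_mean, q_sd)
        total += log_normal_pdf(w, q_mean, q_sd) - log_normal_pdf(w, p_mean, p_sd)
    return total / n_samples

def kl_closed_form(q, p):
    """Exact KL divergence between two univariate normal distributions."""
    (q_mean, q_sd), (p_mean, p_sd) = q, p
    return (math.log(p_sd / q_sd)
            + (q_sd ** 2 + (q_mean - p_mean) ** 2) / (2.0 * p_sd ** 2) - 0.5)

q = (0.0, 1.0)      # (mean, standard deviation) of the simple distribution
p = (1.0, 2.0)      # hypothetical Gaussian stand-in for the posterior
kl_mc = kl_monte_carlo(q, p, 20_000, random.Random(0))
print(kl_mc, kl_closed_form(q, p), kl_closed_form(p, q))
```

The last two printed values differ, which shows that the KL divergence is not symmetric: the choice of KL(q_θ || p) rather than KL(p || q_θ) is part of the method.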

Gal (2016) provides the typical loss function in a variational inference setting as

$$ L_{VI} = \int q_\theta(W) \log P(Y \mid X, W)\, dW - \mathrm{KL}(q_\theta \,\|\, p(W)). \tag{3.14} $$

Maximisation of L_VI, which as written in Equation (3.14) is a lower bound on the log evidence, will yield an approximate distribution q_θ that explains the data well (first term) while being close to the prior (second term) (Gal, 2016). The resulting optimum $q^*_\theta$ can be used to calculate an approximate predictive distribution from Equation (3.2),

$$ P(y^{new} \mid x^{new}, D) \approx \int P(y^{new} \mid x^{new}, W)\, q^*_\theta(W)\, dW, \tag{3.15} $$

and other metrics (Gal, 2016, sec. 2.1.1 and 3.3). Since the simple distribution q_θ usually is a normal distribution, VI has its shortcomings for multimodal distributions (Korattikara et al., 2015; Wilson & Izmailov, 2020). An advantage is that VI methods are computationally more feasible, which is why they are more popular (Jospin et al., 2020). VI is also more accessible, especially after Gal and Ghahramani (2016) interpreted dropout as VI (called MC dropout) and achieved good results. MC dropout could then be applied to already existing models with dropout connections. Fast computations and accessibility have made VI popular.

3.5 Uncertainty in a Neural Network

Uncertainty in neural networks can be categorised into two groups: aleatoric and epistemic (Der Kiureghian & Ditlevsen, 2009; Gal, 2016; Kendall & Gal, 2017; Lakshminarayanan et al., 2017). Epistemic uncertainty can be reduced with more data and model refinements (Der Kiureghian & Ditlevsen, 2009; Gal, 2016; Kendall & Gal, 2017). Aleatoric uncertainty cannot be reduced (Der Kiureghian & Ditlevsen, 2009; Kendall & Gal, 2017).

The reason uncertainty is split into two categories is to better understand which uncertainties have the potential to be reduced (Der Kiureghian & Ditlevsen, 2009). Further, Der Kiureghian and Ditlevsen (2009) suggest that epistemic uncertainties may reveal dependencies between random events that would otherwise be overlooked. Der Kiureghian and Ditlevsen (2009) consider many different cases of uncertainty and assign them to either the epistemic or the aleatoric category. The essence is that all uncertainty is epistemic; however, due to constraints the model builder has to consider some uncertainties as aleatoric (Der Kiureghian & Ditlevsen, 2009). Kuleshov et al. (2018) suggest that the uncertainty estimates are often inaccurate, which can be the case if uncertainties have not been classified correctly (Der Kiureghian & Ditlevsen, 2009).

3.5.1 Aleatoric Uncertainty

Aleatoric uncertainty is assumed to be the randomness of a problem (Der Kiureghian & Ditlevsen, 2009). Aleatoric uncertainties can be modeled in different ways: heteroscedastic and homoscedastic (Kendall & Gal, 2017). For example, with feed-forward neural networks that can only achieve a certain test accuracy, we have to regard the uncertainty as aleatoric.

However, with improvement of models, some aleatoric uncertainty turns
