


2.2.6 Optimisation of a neural network: Optimiser, loss function and backpropagation

Optimising a neural network for a particular task involves a combination of several components: an optimisation method, a loss function, a network configuration and training data. An abstract overview of the training procedure of a general neural network is presented in Algorithm 1. In this section, a closer look at the inner loop of the training procedure (lines 3-6) will be presented. Note that we will in general only consider the scalar case, so as not to clutter the notation. The principles are analogous in multidimensional space.

Forward pass and loss function

The first overall step of a training procedure is to produce a prediction, be it a continuous variable (regression) or an integer (classification) representing some class that is to be predicted. This is called the forward pass (line 3 in Algorithm 1).

Algorithm 1: An abstraction of a neural network training procedure

Require: stopping criterion ε > 0, loss function L, optimiser O, training data D, neural network f_θ where θ are the parameters

1: for epoch < ε do
2:     for (input, label) ∈ D do
3:         Calculate prediction f_θ(input)                            ▷ Forward pass
4:         Calculate loss L(prediction, label)
5:         Calculate gradients of loss w.r.t. the parameters, ∇_θ L   ▷ Backward pass
6:         Update parameters θ ← O(θ, ∇_θ L)
7:     end for
8:     epoch ← epoch + 1
9: end for
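To make the abstraction concrete, below is a minimal, runnable Python sketch of Algorithm 1 for a toy one-parameter model. The model f_θ(x) = θx, the data, the squared-error loss and the learning rate are illustrative assumptions, not details from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data D: inputs with labels generated by y = 2x + noise.
D = [(x, 2.0 * x + 0.01 * rng.standard_normal()) for x in rng.uniform(-1, 1, 50)]

theta = 0.0     # network parameters (a single scalar here)
alpha = 0.1     # learning rate used by the optimiser O
epsilon = 20    # stopping criterion: maximum number of epochs

epoch = 0
while epoch < epsilon:                         # line 1: for epoch < eps do
    for x, label in D:                         # line 2: for (input, label) in D do
        prediction = theta * x                 # line 3: forward pass
        loss = (prediction - label) ** 2       # line 4: loss L(prediction, label)
        grad = 2.0 * (prediction - label) * x  # line 5: backward pass, dL/dtheta
        theta = theta - alpha * grad           # line 6: optimiser update O
    epoch += 1                                 # line 8: epoch <- epoch + 1

print(f"learned theta = {theta:.3f} (target 2.0)")
```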

Consider a neural network with L layers, f(x; θ). Given the network parameters θ and input data x, a single scalar forward pass through the network is given in eqs. (2.23)-(2.24),

a^{(0)} = x, \quad l = 0 \qquad (2.23)

a^{(l)} = \sigma\left(\theta^{(l)} \cdot a^{(l-1)}\right) = \sigma\left(z^{(l)}\right), \quad l = 1, \ldots, L \qquad (2.24)

where the superscript denotes the layer in focus, σ represents a nonlinear activation function (see Section 2.2.2) and a^{(l-1)} represents the input to layer l. The output of layer l is thus a^{(l)}, which is used as input to the next layer, l+1, in which the same procedure as in eq. (2.24) is performed with the parameters from layer l+1 and a^{(l)} as input. At the last layer, a prediction ŷ = f(x; θ) is produced.

Note that the z-notation in the latter expression of eq. (2.24) is widely used, partly because it eases the notation in the backpropagation algorithm, which is introduced later in this section. It is also common to denote the neural network parameters as a weight matrix, W, and a bias vector, b. This is not strictly necessary due to the bias trick, in which the bias vector is appended to the weight matrix (as an additional column) and the input vector is extended with an additional element, 1. One may then denote this new matrix that combines the weight and bias parameters as θ for easier notation. See Figure 2.12 for a conceptual representation of the bias trick for a 3×3 matrix.

Figure 2.12: The bias trick. Multiplying a matrix W ∈ R^{n×m} with a vector x ∈ R^m and adding a bias vector b ∈ R^n is equivalent to augmenting the weight matrix with b as an additional n×1 column, augmenting the vector x with an additional element, 1, and then multiplying the resulting pair.
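As a quick numerical check of the bias trick, the following sketch (with arbitrary example shapes) verifies that Wx + b equals the product of the augmented matrix and the augmented vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.standard_normal((n, m))  # weight matrix, R^{n x m}
x = rng.standard_normal(m)       # input vector, R^m
b = rng.standard_normal(n)       # bias vector, R^n

plain = W @ x + b                    # ordinary affine transformation

W_aug = np.hstack([W, b[:, None]])   # append b as an extra column of W
x_aug = np.append(x, 1.0)            # append a constant 1 to the input

assert np.allclose(plain, W_aug @ x_aug)  # the two forms agree
```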

In order to evaluate how close (or far off) the network prediction is, some measurement of the error is needed. Loss functions serve this purpose (line 4 in Algorithm 1). A frequently used loss function for regression tasks is the mean square error (MSE) [40]. Given a sample of n data points x_i and target values y_i, i = 1, 2, ..., n, the MSE is given in eq. (2.25),

\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \qquad (2.25)

where ŷ_i denotes the network prediction for data point x_i.

One may observe from eq. (2.25) that as ŷ_i approaches y_i, the loss decreases towards 0, which is attained when the two are equal. The MSE possesses some convenient properties, such as differentiability and being induced by an inner product. The first ensures that it is differentiable everywhere. The second yields that the MSE of y and ŷ is proportional to the squared Euclidean distance between the two. This is related to the intuition above: the error increases when the Euclidean distance between the target and the prediction increases.
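A minimal implementation of eq. (2.25) might look as follows; the sample values are purely illustrative.

```python
import numpy as np

def mse(y, y_hat):
    """Mean square error as in eq. (2.25)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])      # targets
y_hat = np.array([1.1, 1.9, 3.2])  # predictions
print(mse(y, y_hat))               # 0.02
print(mse(y, y))                   # 0.0: the loss vanishes when y_hat = y
```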

The backpropagation algorithm

One of the essential steps of neural network training is finding the gradients of the loss function with respect to the parameters. These gradients are then used by the optimiser in minimising the loss function and, as such, enable learning. The backpropagation algorithm⁴ is a widely used algorithm for this purpose. In a multi-layered neural network, the backpropagation algorithm propagates the information from the loss function backwards through the network in order to compute the gradients layer by layer (line 5 in Algorithm 1).

Consider an L-layer deep neural network with parameters θ, the loss function L(a^{(L)}, y), where a^{(L)} is defined in eqs. (2.23)-(2.24) and z^{(l)} = θ^{(l)} a^{(l-1)}. The gradients of a specific layer l are found by utilising the chain rule, as shown in eq. (2.26) [6, Chapter 2],

\frac{\partial \mathcal{L}}{\partial \theta^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \cdots \frac{\partial a^{(l)}}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial \theta^{(l)}} \qquad (2.26)

where a^{(L)} represents the activation of layer L.
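As a scalar sketch of the chain rule in eq. (2.26), consider a hypothetical two-layer sigmoid network with squared-error loss; the values and network size are illustrative assumptions. The finite-difference check at the end verifies the analytic gradient.

```python
import numpy as np

def sigma(z):
    """Sigmoid activation; sigma'(z) = sigma(z) * (1 - sigma(z))."""
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 0.8             # input and target
theta1, theta2 = 1.2, -0.7  # layer parameters

# Forward pass, eqs. (2.23)-(2.24), storing z and a per layer.
z1 = theta1 * x;  a1 = sigma(z1)
z2 = theta2 * a1; a2 = sigma(z2)
loss = (a2 - y) ** 2

# Backward pass, eq. (2.26): apply the chain rule layer by layer,
# reusing the upstream factor dL/dz as it propagates backwards.
dL_da2 = 2.0 * (a2 - y)
dL_dz2 = dL_da2 * a2 * (1 - a2)
dL_dtheta2 = dL_dz2 * a1
dL_dz1 = dL_dz2 * theta2 * a1 * (1 - a1)
dL_dtheta1 = dL_dz1 * x

# Finite-difference check of dL/dtheta1.
eps = 1e-6
loss_eps = (sigma(theta2 * sigma((theta1 + eps) * x)) - y) ** 2
print(dL_dtheta1, (loss_eps - loss) / eps)  # the two agree closely
```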

Backpropagation through time

Backpropagation remains an essential ingredient in producing useful neural network models. The backpropagation rules from eq. (2.26) are general purpose. A special case of the backpropagation algorithm has been developed for training RNNs. The application of the general-purpose backpropagation algorithm to RNNs is called backpropagation through time (BPTT) [42]. It works by first unrolling the computational graph of the RNN (see Section 2.2.4) in order to expose the sequence dependencies. The general-purpose backpropagation is then applied to compute the gradients.

⁴The backpropagation algorithm is associated with automatic differentiation. Automatic differentiation is similar to backpropagation, albeit being a more general family of techniques [41].

Assume we are to train the RNN given in eq. (2.14). We have sequential data x ∈ R^S, where each element x_1, ..., x_S in the vector x corresponds to an element in the sequence set S = {1, ..., S}. Given an appropriate loss function L(x, y, θ), it is derived in [43] that the gradients of the loss with respect to the hidden-state parameters θ_h for an arbitrary input element index k ∈ S are as shown in eq. (2.27),

\frac{\partial \mathcal{L}_k}{\partial W} = \frac{\partial \mathcal{L}_k}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial h_k} \frac{\partial h_k}{\partial W} \qquad (2.27)

where h_k is the hidden state at step k and ŷ_k the corresponding output.

The expression for U is equivalent, only with U inserted for W. The term ∂h_k/∂W is in [43] shown to equal the term in eq. (2.28),

\frac{\partial h_k}{\partial W} = \sum_{i=1}^{k} \left( \prod_{j=i+1}^{k} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_i}{\partial W} \qquad (2.28)

We observe from the two terms in eqs. (2.27) and (2.28) that the gradients of the loss with respect to the parameters are indeed dependent on the previous sequence elements. Note that the expression for ∂h_k/∂U is equivalent to the one in eq. (2.28), only with U inserted for W. If the sequence is very long, we see from eq. (2.28) that the computed gradient may vanish or explode, depending on initialisation and optimisation method, since the product of factors ∂h_j/∂h_{j-1} shrinks or grows exponentially with the distance k − i. Different strategies for dealing with the problem of vanishing or exploding gradients are discussed in [43], among them several time-step truncation techniques; we refer the interested reader to [43].
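The exponential behaviour of the product in eq. (2.28) can be illustrated with a hypothetical linear RNN h_j = w·h_{j-1} + u·x_j, in which every factor ∂h_j/∂h_{j-1} equals the scalar weight w; the weights below are illustrative.

```python
# In a linear scalar RNN, the Jacobian product in eq. (2.28) over the
# steps between i and k collapses to w**(k - i).
for w in (0.9, 1.1):
    product = w ** 100  # product of 100 identical factors dh_j/dh_{j-1}
    print(f"w = {w}: product over 100 steps = {product:.3e}")
# w = 0.9 -> ~2.7e-05 (vanishing); w = 1.1 -> ~1.4e+04 (exploding)
```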

Optimisation methods: gradient-descent-based optimisation

The optimiser is responsible for updating the parameters of the neural network (line 6 in Algorithm 1), based on the information (gradients) provided by the backpropagation algorithm. Stochastic gradient descent (SGD), the Adam optimiser and variants of them are typically used in deep learning. They are based on the principle of gradient descent: a minimum of a proxy function J: R^n → R, for instance a loss function, is iteratively sought by moving in the direction of the negative gradient, −∇_θ J(θ). The parameter update rule for the k'th gradient descent iterate is given in eq. (2.29) [44],

\theta_{k+1} = \theta_k - \alpha \nabla_\theta J(\theta_k) \qquad (2.29)

where α is the step size, typically called the learning rate in deep learning.
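As a minimal illustration of the update rule in eq. (2.29), consider gradient descent on the convex example function J(θ) = (θ − 3)², whose minimum θ* = 3 is known in closed form; the function and step size are illustrative choices.

```python
theta = 0.0  # initial parameter theta_0
alpha = 0.1  # step size / learning rate

for k in range(100):
    grad = 2.0 * (theta - 3.0)    # gradient of J at theta_k
    theta = theta - alpha * grad  # eq. (2.29): theta_{k+1} = theta_k - alpha * grad

print(theta)  # converges towards the minimum at 3.0
```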

For loss functions such as the MSE in eq. (2.25), basic gradient descent (batch gradient descent) calculates the gradient with respect to the whole dataset, as in eq. (2.30),

\nabla_\theta J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \mathcal{L}_i(x_i; \theta) \qquad (2.30)

where L_i denotes the loss of the i'th sample.

The number of operations for a single gradient step is O(nd) [45], where n is the number of samples and d is the input dimensionality. When the dataset size n and/or the number of features d is large, the number of operations nd becomes large, which makes this approach infeasible for optimisation in deep learning.

SGD [46] is an approximation of batch gradient descent aimed at reducing the computational load. The gradient is approximated as shown in eq. (2.31),

\frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \mathcal{L}_i(x_i; \theta) \approx \nabla_\theta \mathcal{L}_j(x_j; \theta) \qquad (2.31)

where j ∈ {1, 2, 3, ..., n} is chosen by uniform sampling.
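The sketch below contrasts the full-batch gradient of eq. (2.30), which costs O(nd), with the single-sample estimate of eq. (2.31), which costs O(d); the linear model and squared-error loss are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 5
X = rng.standard_normal((n, d))               # n samples, d features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # targets from a known model
theta = np.zeros(d)

def grad_i(i):
    """Gradient of the per-sample squared error L_i(x_i; theta)."""
    return 2.0 * (X[i] @ theta - y[i]) * X[i]

full_grad = np.mean([grad_i(i) for i in range(n)], axis=0)  # eq. (2.30), O(nd)
j = rng.integers(n)                                         # uniform sample j
sgd_grad = grad_i(j)                                        # eq. (2.31), O(d)

print(full_grad)
print(sgd_grad)  # an unbiased but noisy estimate of the full gradient
```

The single-sample estimate is noisy, but its expectation over j equals the full-batch gradient, which is what makes it a valid and cheap substitute.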

SGD is among the most used optimisers. Several variants of SGD exist, for example SGD with momentum, which seeks to speed up the learning process; it is described in more detail in [10, Chapter 8]. Adaptive learning-rate optimisers, such as Adam [47], have become increasingly popular because they usually work well with less hyperparameter-tuning effort and converge fairly quickly to decent solutions thanks to their adaptive learning rate and momentum. A caveat to these optimisers is that adaptive methods have been observed to converge to solutions that generalise worse for DNNs [48, 49].
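As an illustration of the momentum variant mentioned above, the following minimal sketch reuses the example function J(θ) = (θ − 3)²; the momentum coefficient and other constants are illustrative assumptions.

```python
theta, velocity = 0.0, 0.0
alpha, beta = 0.1, 0.9  # learning rate and momentum coefficient

for k in range(200):
    grad = 2.0 * (theta - 3.0)                 # gradient of J at theta_k
    velocity = beta * velocity - alpha * grad  # accumulate past gradients
    theta = theta + velocity                   # momentum update step

print(theta)  # converges towards the minimum at 3.0
```

The velocity term smooths the update direction over iterations, which is what is meant above by momentum speeding up learning on ill-conditioned or noisy objectives.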