
3.1 Neural networks

3.1.1 Backpropagation

In the vernacular of the machine learning literature, the aim of the optimization procedure is to train the model to perform optimally on the regression, reconstruction or classification task at hand. Training the model requires the computation of the total derivative in equation 3.13. This is also where the biological metaphor breaks down, as the brain is probably not employing gradient descent.

Backpropagation of errors by automatic differentiation, first described by Linnainmaa [28], is a method of computing the partial derivatives required to go from the gradient of the loss to individual parameter derivatives. Conceptually we wish to describe the slope of the error in terms of our model parameters, but having multiple layers complicates this somewhat.

The backpropagation algorithm begins with computing the total loss, here exemplified with the squared error function,

$$
E = C(\hat{y}, y) = \frac{1}{2n}\sum_{n}\sum_{j}\left(\hat{y}_{nj} - y_{nj}\right)^2. \tag{3.14}
$$

The factor of one half is included for practical reasons, to cancel the exponent under differentiation. As the gradient is multiplied by the learning rate $\eta$, this has no effect on the training itself.

The sums over $n$ and $j$ enumerate the samples and the output dimensions, respectively. Finding the update for the parameters then starts with taking the derivative of equation 3.14 w.r.t. the model output $y_j = a_j^{[l]}$,

$$
\frac{\partial E}{\partial y_j} = \hat{y}_j - a_j^{[l]}. \tag{3.15}
$$

We have dropped the data index, as the differentiation is independent of the choice of data sample. In practice the derivative of each sample in the batch is averaged for the gradient update of each parameter.
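To make equations 3.14 and 3.15 concrete, the sketch below evaluates the cost and its batch-averaged derivative with NumPy; the array names and the convention that y_hat is the model output and y the target are assumptions made here, not notation from the thesis.

import numpy as np

def squared_error_cost(y_hat, y):
    # Equation 3.14: squared error over the batch with the conventional factor 1/2.
    n = y.shape[0]                      # number of samples in the batch
    return np.sum((y_hat - y) ** 2) / (2 * n)

def cost_derivative(y_hat, y):
    # The (y_hat - y) factor entering the gradient, averaged over the batch
    # as described in the text around equation 3.15.
    return (y_hat - y) / y.shape[0]

# toy batch: 4 samples, 3 output dimensions
y = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
y_hat = y + 0.1 * np.random.randn(*y.shape)    # model outputs close to the targets
print(squared_error_cost(y_hat, y), cost_derivative(y_hat, y).shape)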

The activation function, $f$, has classically been the logistic sigmoid function, but during the last decade the machine learning community has largely shifted to using the rectified linear unit (ReLU). This shift was especially apparent after the success of Krizhevsky et al. [29]. In this section we therefore exemplify the backpropagation algorithm with a network with ReLU activations.

The ReLU function is defined in such a way that it is zero for all negative inputs and the identity otherwise, i.e.

$$
\mathrm{ReLU}(x) = f(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{3.16}
$$

The ReLU is obviously monotonic, and its derivative can be approximated with the Heaviside step function, which we denote by $H(x)$ and express as

$$
H(x) = f'(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{3.17}
$$


As is common for most neural network activations, the computation of the derivative is very lightweight. In the case of the ReLU function the derivative reuses the mask needed to compute the activation itself, requiring no extra computing resources.
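A minimal sketch of this mask reuse is given below, assuming NumPy arrays; the function names are chosen here for illustration and are not taken from the thesis.

import numpy as np

def relu_forward(z):
    # Equation 3.16: keep positive pre-activations, zero out the rest.
    # The boolean mask is returned so the backward pass can reuse it.
    mask = z > 0
    return z * mask, mask

def relu_backward(upstream_grad, mask):
    # Equation 3.17: the derivative is the Heaviside step of z,
    # which is exactly the mask computed in the forward pass.
    return upstream_grad * mask

z = np.array([-2.0, -0.5, 0.3, 1.7])
a, mask = relu_forward(z)                      # a = [0., 0., 0.3, 1.7]
grad_z = relu_backward(np.ones_like(z), mask)  # grad_z = [0., 0., 1., 1.]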

It is important to note that the cost and activation functions introduced in equations 3.14, 3.16 and 3.17 are not a be-all and end-all solution; they are chosen here for their ubiquity in modern machine learning.

Returning to the optimization problem, we start to unravel the backpropagation algorithm. We use equation 3.15 to find the derivatives in terms of the parameters of the last layer, i.e. $W_{ij}^{[n]}$ and $b_j^{[n]}$,

$$
\begin{aligned}
\frac{\partial E}{\partial W_{ij}^{[n]}} &= \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial z_j^{[n]}}\,\frac{\partial z_j^{[n]}}{\partial W_{ij}^{[n]}}, && \text{(3.18)}\\
&= \frac{\partial E}{\partial y_j}\, o'(z_j^{[n]})\,\frac{\partial}{\partial W_{ij}^{[n]}}\left(a_i^{[n-1]} W_{ij}^{[n]} + b_j^{[n]}\right), && \text{(3.19)}\\
&= (\hat{y}_j - y_j)\, o'(z_j^{[n]})\, a_i^{[n-1]}. && \text{(3.20)}
\end{aligned}
$$

The differentiation of the error w.r.t. $b_j^{[n]}$ can be similarly derived to be

$$
\frac{\partial E}{\partial b_j^{[n]}} = (\hat{y}_j - y_j)\, o'(z_j^{[n]}). \tag{3.21}
$$

Repeating this procedure layer by layer is the process that defines the backpropagation algorithm. From equations 3.20 and 3.21 we discern a recursive pattern in the derivatives when moving to the next layer. Before writing out the full backpropagation term we introduce some more notation that makes bridging the gap to an implementation considerably simpler. From the repeating structure in the aforementioned equations we define the first operation needed for backpropagation,

$$
\delta_j^n = (\hat{y}_j - y_j)\, o'(z_j^{[n]}). \tag{3.22}
$$

Note that this is an element-wise Hadamard product and not an implicit summation over the subscript index in $\delta_j^n$. The element-wise product of two matrices or vectors is denoted as

$$
a \circ b. \tag{3.23}
$$

This shorthand lets us write equations 3.20 and 3.21 in a more compact way,


$$
\begin{aligned}
\frac{\partial E}{\partial w_{ij}^{[n]}} &= \delta_j^n\, a_i^{[n-1]}, && \text{(3.24)}\\
\frac{\partial E}{\partial b_j^{[n]}} &= \delta_j^n. && \text{(3.25)}
\end{aligned}
$$
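As a concrete illustration of equations 3.22, 3.24 and 3.25, the sketch below computes the output-layer delta and the corresponding weight and bias gradients for a single sample. The variable names and the choice of a linear output activation, so that $o'(z) = 1$, are assumptions made here for brevity.

import numpy as np

def output_layer_gradients(y_hat, y, a_prev, o_prime_z):
    # delta^n   = (y_hat - y) * o'(z^[n])    (eq. 3.22)
    # dE/dW^[n] = outer(a^[n-1], delta^n)    (eq. 3.24)
    # dE/db^[n] = delta^n                    (eq. 3.25)
    delta = (y_hat - y) * o_prime_z       # element-wise (Hadamard) product
    dW = np.outer(a_prev, delta)          # dW[i, j] = a_prev[i] * delta[j]
    db = delta
    return delta, dW, db

# single sample with a linear output activation, o'(z) = 1
y = np.array([1.0, 0.0])                  # target
y_hat = np.array([0.8, 0.3])              # model output
a_prev = np.array([0.5, -1.2, 0.7])       # activations of layer n-1
delta, dW, db = output_layer_gradients(y_hat, y, a_prev, np.ones_like(y_hat))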

From the iterative nature of how we construct the forward pass we see that the last derivatives in the chain for each layer, i.e. those with respect to the weights and biases, have the same form

$$
\begin{aligned}
\frac{\partial z_j^{[l]}}{\partial w_{ij}^{[l]}} &= a_i^{[l-1]}, && \text{(3.26)}\\
\frac{\partial z_j^{[l]}}{\partial b_j^{[l]}} &= 1. && \text{(3.27)}
\end{aligned}
$$

These derivatives, together with a general expression for the recurrent term $\delta_j^l$, are the pieces we need to compute the parameter update rules. By summing over the nodes $k$ of layer $l+1$ that connect to the layer $l$ of interest, $\delta_j^l$ can be expressed as

$$
\begin{aligned}
\delta_j^l &= \sum_k \frac{\partial E}{\partial a_k^{[l+1]}}\,\frac{\partial a_k^{[l+1]}}{\partial z_k^{[l+1]}}\,\frac{\partial z_k^{[l+1]}}{\partial a_j^{[l]}}\,\frac{\partial a_j^{[l]}}{\partial z_j^{[l]}}, && \text{(3.28)}\\
\delta_j^l &= \sum_k \delta_k^{l+1}\,\frac{\partial z_k^{[l+1]}}{\partial a_j^{[l]}}\,\frac{\partial a_j^{[l]}}{\partial z_j^{[l]}}. && \text{(3.29)}
\end{aligned}
$$

From the definitions of the $z_j^{[l]}$ and $a_j^{[l]}$ terms we can then compute the last derivatives. These are then inserted back into equation 3.29, giving a final expression for $\delta_j^l$,

$$
\delta_j^l = \sum_k \delta_k^{l+1}\, w_{jk}^{[l+1]}\, f'(z_j^{[l]}). \tag{3.30}
$$

Finally, the weight and bias update rules can then be written as


$$
\begin{aligned}
\frac{\partial E}{\partial w_{jm}^{[l]}} &= \delta_j^l\, a_m^{[l-1]}, && \text{(3.31)}\\
\frac{\partial E}{\partial b_j^{[l]}} &= \delta_j^l. && \text{(3.32)}
\end{aligned}
$$
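Equations 3.30 to 3.32 translate almost directly into array operations. The following sketch propagates $\delta$ one layer backwards for a ReLU network and forms the parameter gradients; the shapes and variable names are choices made here for illustration.

import numpy as np

def backprop_step(delta_next, W_next, z, a_prev):
    # delta_next: delta of layer l+1, shape (units_{l+1},)
    # W_next:     weights of layer l+1, shape (units_l, units_{l+1})
    # z:          pre-activations of layer l
    # a_prev:     activations of layer l-1
    relu_grad = (z > 0).astype(z.dtype)        # f'(z^[l]), eq. 3.17
    delta = (W_next @ delta_next) * relu_grad  # eq. 3.30: sum_k delta_k^{l+1} w_{jk}^{[l+1]} f'(z_j^{[l]})
    dW = np.outer(a_prev, delta)               # eq. 3.31
    db = delta                                 # eq. 3.32
    return delta, dW, db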

To finalize the discussion on the algorithm we illustrate how backpropagation might be implemented in algorithm 2.

Algorithm 2: Backpropagation of errors in a fully connected neural network for a single sample $x$.

Data: iterables $a^{[l]}$, $z^{[l]}$, $W^{[l]}$, $b^{[l]}$ for all $l \in [1, 2, \dots, n]$
Input: $\partial E / \partial y$, $o'(z^{[n]})$, $f'(\cdot)$
Result: two iterables of the derivatives $\partial E / \partial w_{ij}^{[l]}$ and $\partial E / \partial b_j^{[l]}$

Initialization:
    $\delta_j^n \leftarrow \partial E / \partial y \circ o'(z^{[n]})$
Compute derivatives:
    for $l \in [n-1, \dots, 1]$ do
        $\partial E / \partial w_{jm}^{[l+1]} \leftarrow \delta_j^{l+1} a_m^{[l]}$
        $\partial E / \partial b_j^{[l+1]} \leftarrow \delta_j^{l+1}$
        $\delta_j^l \leftarrow \sum_k \delta_k^{l+1} w_{jk}^{[l+1]} f'(z_j^{[l]})$
    end for
return the iterables $\partial E / \partial w_{ij}^{[l]}$ and $\partial E / \partial b_j^{[l]}$
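For completeness, a compact NumPy sketch of algorithm 2 for a small fully connected ReLU network with a linear output layer is given below. It follows the equations of this section, but all function and variable names, and the choice of output activation, are made here for illustration and are not taken from the thesis.

import numpy as np

def forward(x, weights, biases):
    # Forward pass, storing pre-activations z and activations a for backpropagation.
    a, zs, activations = x, [], [x]
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        zs.append(z)
        # ReLU on hidden layers (eq. 3.16), linear activation on the output layer.
        a = np.maximum(z, 0.0) if l < len(weights) - 1 else z
        activations.append(a)
    return zs, activations

def backward(y, zs, activations, weights):
    # Backpropagation of errors (algorithm 2) for a single sample.
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    delta = activations[-1] - y                       # (y_hat - y) * o'(z^[n]), with o'(z) = 1
    grads_W[-1] = np.outer(activations[-2], delta)    # eq. 3.24
    grads_b[-1] = delta                               # eq. 3.25
    for l in range(len(weights) - 2, -1, -1):
        relu_grad = (zs[l] > 0).astype(float)         # f'(z^[l]), eq. 3.17
        delta = (weights[l + 1] @ delta) * relu_grad  # eq. 3.30
        grads_W[l] = np.outer(activations[l], delta)  # eq. 3.31
        grads_b[l] = delta                            # eq. 3.32
    return grads_W, grads_b

# toy network with layer sizes 3 -> 4 -> 2, updated on a single sample
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
biases = [np.zeros(4), np.zeros(2)]
x, y = rng.normal(size=3), np.array([1.0, 0.0])
zs, activations = forward(x, weights, biases)
grads_W, grads_b = backward(y, zs, activations, weights)
eta = 0.01  # learning rate
weights = [W - eta * dW for W, dW in zip(weights, grads_W)]
biases = [b - eta * db for b, db in zip(biases, grads_b)]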

The backward propagation framework is highly generalizable to variations of activation functions and network architectures. The two major advancements in the theory of ANNs are both predicated on being fully trainable by the backpropagation of errors. Before we consider these improvements, made by the introduction of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), we remark again on the strength of this algorithm.

Not only are we free to choose the activation function from a wide selection, the backpropagation algorithm also makes no assumptions about the transformation that constructs $z_j$. As long as it is once differentiable, we are free to choose a different approach. This flexibility of the framework is part of the reason for the resurgent success of deep learning in the last decade.
