
3.2 Machine learning

3.2.1 Artificial Neural Networks

The artificial neuron proposed by McCulloch and Pitts [45] was characterized by an "on" or "off" behavior and is commonly referred to as a perceptron. Neural networks made up of perceptrons are difficult to train efficiently, as we will see later. Perceptrons do, however, conveniently introduce the basic concepts of artificial neurons and neural networks.

Figure 3.2 shows a perceptron with two input signals, x1 and x2, both of which can have a signal intensity of either zero or one. The signals are associated with weights ω1 and ω2, which indicate the importance of the signals. Further, θ is the activation threshold of the neuron, and y is the output, which can be either one or zero according to:

Figure 3.2: A perceptron neuron with two input signals, x1 ∈ {0,1} and x2 ∈ {0,1}, with corresponding weights ω1 and ω2. The neuron has an activation threshold θ and an output y ∈ {0,1}.

$$
y =
\begin{cases}
1 & \text{if } \sum_i \omega_i x_i \geq \theta\,, \\
0 & \text{if } \sum_i \omega_i x_i < \theta\,.
\end{cases}
\tag{3.9}
$$

Hence, if the sum of the signal intensities multiplied by the weights of the signals exceeds the activation threshold, the neuron is activated and "fires".

This simple one-neuron, two-input network represents the basic concepts of artificial neural networks. Moreover, this network can also represent an "and"-function (e.g. ω1 = ω2 = 1, θ = 1.5) and an "or"-function (e.g. ω1 = ω2 = 1.5, θ = 1). The reader may verify this.
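The verification suggested above can be done in a few lines of code. The following sketch implements Eq. (3.9) directly and evaluates the two suggested parameter choices over all four input combinations:

```python
# A perceptron as defined in Eq. (3.9): fires (outputs 1) when the
# weighted sum of its inputs reaches the activation threshold theta.
def perceptron(x1, x2, w1, w2, theta):
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

# "and"-function: weights 1, threshold 1.5 -> fires only when both inputs are 1.
AND = [perceptron(x1, x2, 1.0, 1.0, 1.5) for x1 in (0, 1) for x2 in (0, 1)]

# "or"-function: weights 1.5, threshold 1 -> fires when at least one input is 1.
OR = [perceptron(x1, x2, 1.5, 1.5, 1.0) for x1 in (0, 1) for x2 in (0, 1)]

print(AND)  # [0, 0, 0, 1]
print(OR)   # [0, 1, 1, 1]
```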

Normally, the activation threshold is represented as a bias b = −θ, such that Eq. (3.9) is rewritten:

$$
y =
\begin{cases}
1 & \text{if } \sum_i \omega_i x_i + b \geq 0\,, \\
0 & \text{if } \sum_i \omega_i x_i + b < 0\,.
\end{cases}
\tag{3.10}
$$

This convention is of little conceptual importance, but it has some mathematical benefits, and will thus be used from here on.

Fully connected feed forward neural networks

The previous example shows a simple construction of a neural network. The power of neural networks, however, is greatly improved when neurons are connected in layers. This section looks at fully connected feed forward neural networks, with multiple layers, and shows that they can be represented in terms of a series of matrix multiplications. In this context:

• feed forward means that information is propagated in one direction only (from input to output)

• fully connected means that all neurons in a layer l are connected to all neurons in the previous (l − 1) and next (l + 1) layers.

Figure 3.3: A fully connected feed forward neural network with one input and output neuron and two hidden layers with three and two neurons respectively.

Also of note is that in fully connected feed forward neural networks there is no direct passage of information other than via neighbouring layers.

Figure 3.3 shows a neural network with one input neuron, two hidden layers with three and two neurons respectively, and one output layer with one neuron. The figure depicts the input and output (activations), all weights connecting the neurons, and the biases and activations of each neuron. The following naming convention is used:

• $\omega^l_{j,k}$ is the weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer

• $b^l_j$ is the bias of the $j$th neuron in the $l$th layer

• $a^l_j$ is the activation of the $j$th neuron in the $l$th layer

With this naming convention, the activation $a^l_j$ of a neuron is given by:

$$
a^l_j = \sigma\!\left( \sum_k \omega^l_{j,k} a^{l-1}_k + b^l_j \right),
\tag{3.11}
$$

where σ is an activation function. In the case of using perceptron neurons, the output $a^l_j$ would be:

$$
\text{output} =
\begin{cases}
1 & \text{if } \sum_k \omega^l_{j,k} a^{l-1}_k + b^l_j \geq 0\,, \\
0 & \text{if } \sum_k \omega^l_{j,k} a^{l-1}_k + b^l_j < 0\,,
\end{cases}
\tag{3.12}
$$

but we note that σ can be any function. Further, the activation of a layer l, $a^l$, can be represented in a vectorized form:

$$
a^l = \sigma\!\left( \omega^l a^{l-1} + b^l \right).
\tag{3.13}
$$

With this, the activation of the different layers in the example above can be computed by repeated application of Eq. (3.13).
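The layer-by-layer computation of Eq. (3.13) can be sketched for the network of Figure 3.3 (layer sizes 1, 3, 2 and 1). The weights and biases below are arbitrary illustrative values, not those of the figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from Figure 3.3: one input neuron, hidden layers of three and
# two neurons, one output neuron. Weights/biases are random placeholders.
rng = np.random.default_rng(0)
sizes = [1, 3, 2, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def forward(x):
    """Apply Eq. (3.13), a^l = sigma(w^l a^{l-1} + b^l), layer by layer."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

y = forward(np.array([[0.5]]))
print(y.shape)  # (1, 1); the output lies in (0, 1) since sigma is the sigmoid
```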

Training of neural networks Neural networks are universal function approximators, meaning that they can in theory describe any functional relation from input to output, provided the correct network architecture. In practice, however, this is seldom achieved, since applications of neural networks (other than for educational purposes) involve real data, from measurements and observations that come with a level of noise. Hence, the problem involves finding the weights and biases of the network such that the output y predicts the true (labeled) quantity ŷ as well as possible. This task may be formalized as the minimization of a cost function, C, for example the mean squared error between the predicted and true quantities:

$$
C = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2,
$$

where N is the number of observations. A reasonable strategy to achieve this is to define small changes in the weights, Δω, and biases, Δb, that ensure a small negative change in the cost function, ΔC < 0. It turns out that this is achieved by choosing:

$$
\Delta\omega = -\lambda \frac{\partial C}{\partial \omega}\,, \qquad
\Delta b = -\lambda \frac{\partial C}{\partial b}\,,
\tag{3.18}
$$

provided that Δω and Δb are sufficiently small. The latter may be controlled by adjusting the learning rate, λ. Eq. (3.18) is known as the gradient descent update
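The effect of this update can be illustrated on a toy one-parameter cost (the cost function, its minimum at w = 2, and the learning rate below are all assumed for illustration):

```python
# Gradient descent on a toy one-parameter cost C(w) = (w - 2)^2,
# using the update w <- w - lambda * dC/dw.
def grad_C(w):
    return 2.0 * (w - 2.0)  # analytic derivative of (w - 2)^2

w, lam = 0.0, 0.1  # initial weight and learning rate (assumed values)
for _ in range(200):
    w -= lam * grad_C(w)

print(round(w, 4))  # converges towards the minimum at w = 2
```

Each step moves w against the gradient; with a sufficiently small λ the cost decreases monotonically, which is exactly the ΔC < 0 requirement above.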


Figure 3.4: The original perceptron activation function (left), the sigmoid (middle) and the rectified linear unit, ReLU (right).

rule, and involves the calculation of the partial derivatives of the cost function with respect to (all!) the weights and biases in the network. This also reveals the limitation of the perceptron neuron, since the derivative of its step activation is zero everywhere except at x = 0, where it is not defined. The gradient descent algorithm (and its variations) requires continuously differentiable activation functions. The perceptron activation, along with common activation functions, the sigmoid and the rectified linear unit (ReLU), is shown in Figure 3.4.
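The three activations of Figure 3.4, and the differentiability argument above, can be made concrete. The sketch below evaluates each function and the sigmoid's derivative at a few points:

```python
import numpy as np

# The three activations of Figure 3.4. The step (perceptron) activation has
# zero derivative almost everywhere, which is why gradient descent cannot
# use it; the sigmoid's derivative is nonzero everywhere.
def step(x):
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # strictly positive for every finite x

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.5, 0.0, 2.5])
print(step(x))           # [0. 1. 1.]
print(relu(x))           # [0.  0.  2.5]
print(sigmoid_prime(x))  # strictly positive at every point
```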

The gradient descent algorithm provides a method to iteratively update weights and biases such that the network output, y, approaches the true quantity, ŷ (i.e. it provides a method for training the neural network). However, the algorithm requires the calculation of partial derivatives with respect to all the weights and biases in the network. This process can be very time consuming in large networks if performed weight by weight or layer by layer. Instead, this is achieved by application of the backpropagation algorithm, where errors are propagated from the output layer, throughout the network, all the way back to the first layer of the network.

These errors are used to estimate the partial derivatives with respect to the weights and biases of a layer, and calculations of errors and gradients are reused when estimating the gradients of the next layer.
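A minimal sketch of this reuse of errors, for a single training pair in a small 1-3-1 network with sigmoid activations and a squared-error cost (the network size and the data are assumptions for illustration):

```python
import numpy as np

# Backpropagation sketch: the output error delta2 is computed first, then
# propagated backwards through w2 to give the hidden-layer error delta1;
# each layer's gradients are assembled from its own delta.
rng = np.random.default_rng(1)
w1, b1 = rng.standard_normal((3, 1)), np.zeros((3, 1))
w2, b2 = rng.standard_normal((1, 3)), np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

x, y_true = np.array([[0.5]]), np.array([[0.8]])

# Forward pass, keeping the weighted inputs z for reuse in the backward pass.
z1 = w1 @ x + b1;  a1 = sigmoid(z1)
z2 = w2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass.
delta2 = (a2 - y_true) * sigmoid_prime(z2)    # output error (squared-error cost)
delta1 = (w2.T @ delta2) * sigmoid_prime(z1)  # reuses delta2 from the next layer
grad_w2, grad_b2 = delta2 @ a1.T, delta2
grad_w1, grad_b1 = delta1 @ x.T, delta1
```

Note that delta1 is built from delta2 rather than from fresh derivatives, which is what makes backpropagation far cheaper than differentiating weight by weight.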

Generalization of neural networks Neural networks are extremely powerful, given their universal ability to find relations from input to output. However, this ability carries a danger: the network may fit noisy relationships in the training data that do not generalize to unseen data (i.e. input-output relations that were not used during training). This is known as over-fitting and typically occurs when complex networks are trained on sparse data.

Overfitting can be avoided by reducing the complexity of the network (fewer layers, and fewer neurons in each layer); however, this may lead to under-fitting, which is characterized by a network that is unable to find adequate relationships from input to output. Other methods that can improve generalization include the application of a validation set and regularization:

Validation set A validation set is a fraction of the training set that is not used to estimate gradients and update weights and biases, but is used to test the network after each epoch. If the loss on the actual training data continues towards zero while the validation loss increases (a sign of over-fitting), the training is stopped (early stopping).
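The early-stopping criterion can be sketched as follows; the loss curves and the patience of two epochs are illustrative assumptions:

```python
# Early-stopping sketch: training stops once the validation loss has not
# improved for `patience` consecutive epochs. Loss values are illustrative:
# the training loss keeps falling while the validation loss turns upward.
train_loss = [0.9, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1, 0.08, 0.06, 0.05]
val_loss   = [1.0, 0.7, 0.5, 0.45, 0.44, 0.46, 0.49, 0.53, 0.58, 0.64]

def early_stop_epoch(val_losses, patience=2):
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch  # epoch whose weights should be kept

print(early_stop_epoch(val_loss))  # 4: the validation loss bottoms out there
```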

Regularization Large weights are associated with high sensitivity to certain signals and are often indicative of over-fitting. Regularization of neural networks is performed by adding a term to the cost function that penalizes large weights, such as the L1 or L2 norm of all the weights of the network.
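An L2-regularized version of the mean squared error cost can be sketched as below; the weight values and the strength parameter (here called alpha) are assumptions for illustration:

```python
import numpy as np

# L2-regularized cost: the mean squared error plus a penalty on the squared
# size of all weights. `alpha` controls the regularization strength.
def cost(y_pred, y_true, weights, alpha=0.01):
    mse = np.mean((y_pred - y_true) ** 2)
    l2 = sum(np.sum(w ** 2) for w in weights)
    return mse + alpha * l2

weights = [np.array([[1.0, -2.0]]), np.array([[0.5]])]
y_pred, y_true = np.array([0.9]), np.array([1.0])
print(round(cost(y_pred, y_true, weights), 4))  # 0.0625 = 0.01 + 0.01 * 5.25
```

Because the penalty grows with the squared weights, gradient descent on this cost pulls all weights towards zero unless the data justify keeping them large.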

Even with the application of these and other procedures, over-fitting may occur. Hence, it is considered mandatory to always set part of the data aside (a test set), which is used only to test the final network.
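Splitting the data into training, validation and test sets can be sketched as follows; the fractions (70/15/15) and the dataset size are assumptions for illustration:

```python
import numpy as np

# Holding out a test set: shuffle the observation indices once, then split.
rng = np.random.default_rng(42)
n = 100
indices = rng.permutation(n)
train_idx = indices[:70]   # used for gradient updates
val_idx = indices[70:85]   # used for early stopping after each epoch
test_idx = indices[85:]    # touched only once, to test the final network

print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```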