
2.1.2 Artificial Neural Networks (ANN)

The origin of Artificial Neural Networks (ANN) is associated with the idea of designing an algorithm that mimics the nervous system of biological organisms² (see Figure 5). They are mainly composed of computation units connected among themselves through weights, similarly to what happens with neurons and synaptic connections. In this sense, an ANN can be seen as a computational graph of basic computation units whose predictive power comes from the way those units are connected. With this architecture, an ANN computes a function of the inputs by propagating the calculated values from the input neurons to the output neurons, and learning consists in varying the weights so as to minimize a cost function of the output values and the true labels of the inputs [Agg18].

² This biological comparison is often criticized for the oversimplified picture of brain functioning it conveys. Nevertheless, neuroscience research has provided fresh and useful ideas for designing new neural network architectures.

Figure 5. Artificial Neural Network vs Human Brain Processing [Dik19].

Under this scheme, ANN have the capacity to approximate any continuous nonlinear function.

Basic computation unit

Consider a given neuron (or node) with input values x ∈ R^n. The scalar product of the input values and the weight vector w ∈ R^n is computed, this quantity is added to a bias value b ∈ R, and finally an activation function f is applied. Hence, the output value (or activation) a ∈ R is given by

a = f(w · x + b). (2.11)
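As a minimal sketch of this computation unit (the function name, and the use of NumPy, are our illustrative choices):

    import numpy as np

    def neuron_output(x, w, b, f):
        # Eq. (2.11): scalar product of inputs and weights, plus the bias,
        # passed through the activation function f
        return f(np.dot(w, x) + b)

    # e.g. a single neuron with a ReLU activation
    a = neuron_output(np.array([1.0, -2.0]), np.array([0.5, 0.5]), 0.1,
                      lambda z: max(0.0, z))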

Activation functions

Much of the power of ANN resides in the non-linearity of the activation functions. In particular, we will employ two types of activation functions:

Sigmoid. The sigmoid activation function is useful for binary classification, as it outputs a value in (0, 1) that can be read as a probability, and is given by

f(z) = 1 / (1 + e^{−z}). (2.12)

ReLU. The ReLU activation function has replaced the sigmoid activation in the hidden layers of deep neural networks because of the speed gain it brings to training these architectures, and its expression is

f(z) = max(0, z). (2.13)
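Both activations take a single line each; a NumPy sketch (vectorization over arrays is our implementation choice):

    import numpy as np

    def sigmoid(z):
        # Eq. (2.12): maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        # Eq. (2.13): keeps positive values, zeroes out the rest
        return np.maximum(0.0, z)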

Multi-layer neural network

A multi-layer neural network consists in connecting a set of computation units with a given layout. In Figure 6 we can observe an illustrative example of a 2-layer ANN in which the forward propagation of the input values is shown. More hidden layers can be added to the architecture, yielding ever more complex features; however, the forward propagation of the input values follows the same scheme.

Notice that, to obtain a binary classification output ({0, 1}), it is convenient to place a sigmoid activation function in the output layer of the ANN, whose output can then be thresholded.


Figure 6. 1-layer Artificial Neural Network diagram [Dik19].
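This forward propagation can be sketched for a network with one hidden layer; the layer sizes, the ReLU hidden activation and every name below are our illustrative assumptions:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # hidden layer: affine map followed by ReLU, Eq. (2.13)
        a1 = np.maximum(0.0, W1 @ x + b1)
        # output layer: sigmoid squashes the result into (0, 1), Eq. (2.12)
        return 1.0 / (1.0 + np.exp(-(W2 @ a1 + b2)))

    # e.g. 3 inputs -> 4 hidden units -> 1 output probability
    rng = np.random.default_rng(0)
    y_hat = forward(rng.normal(size=3),
                    rng.normal(size=(4, 3)), np.zeros(4),
                    rng.normal(size=(1, 4)), np.zeros(1))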

Loss function and backward propagation

Up to this point we know how to obtain the binary classification prediction from our ANN; now it is time to evaluate that prediction. For this purpose, we define a cost function over the training set and update the weights of the different layers by minimizing it. This update process, based on the derivatives of the cost function, is widely known as backpropagation. Many different cost functions can be chosen depending on the nature of the problem and the kind of results we are aiming for; the most commonly used cost function in binary classification problems is the binary cross-entropy. This metric gives a fine-grained view of classifier performance, and we will use it as the cost function of every neural network we train. Given the output of the model, ŷ_i ∈ (0, 1), over sample i and the truth label, y_i ∈ {0, 1}, for that sample, the binary cross-entropy, L, is given by

L(y, ŷ) = −(1/m) Σ_{i=1}^{m} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ], (2.14)

where m represents the size of the considered dataset. Remember that ŷ_i represents the probability of sample i being positive; in the case of the ANN, ŷ_i corresponds to the output value of the output layer. Note that the binary cross-entropy has no upper bound and lives in the range [0, ∞), with values close to 0 indicating accurate predictions.
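Equation (2.14) translates directly into code; the eps clamp below is our addition, guarding against the evaluation of log(0):

    import numpy as np

    def binary_cross_entropy(y, y_hat, eps=1e-12):
        # Eq. (2.14): mean binary cross-entropy over the m samples;
        # predictions are clipped away from 0 and 1 so log never diverges
        y_hat = np.clip(y_hat, eps, 1.0 - eps)
        return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))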

Regularization

Throughout the training process our neural network will get better and better at predicting over the training set, while it may start to make worse predictions over the test set. This phenomenon is known as overfitting, and there are several ways to prevent it.

L2 Regularization. This method reduces the magnitude of the neural network weights in order to obtain a simpler hypothesis, which may be less prone to overfit the training set. To achieve this, a new term is introduced into the cost function (2.14),

L_reg(y, ŷ) = L(y, ŷ) + (λ / 2m) Σ_l ||W^[l]||_F², 

where λ is the regularization parameter, the sum extends over all the neural network layers, and ||·||_F is the Frobenius norm.
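A sketch of the penalized cost, assuming the layer weight matrices are kept in a Python list (our convention):

    import numpy as np

    def l2_penalized_cost(cost, weights, lam, m):
        # add (lam / 2m) times the sum of squared Frobenius norms
        # of the layer weight matrices to an already-computed cost
        penalty = sum(np.sum(W ** 2) for W in weights)
        return cost + lam / (2.0 * m) * penalty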

Dropout regularization. The idea of the dropout technique is to eliminate, with probability p^[l], each node of the neural network layer l, independently for each training example. This procedure causes the neural network weights to shrink. At test time, dropout is not applied.
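A common implementation, which we assume here, is the so-called inverted dropout, where the surviving activations are rescaled by 1/(1 − p) so that their expected value is unchanged:

    import numpy as np

    def dropout(a, p, training=True):
        # during training, drop each activation with probability p
        # and rescale the survivors; at test time pass a through untouched
        if not training:
            return a
        mask = np.random.rand(*a.shape) >= p
        return a * mask / (1.0 - p)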

There are some concerns with the use of artificial neural networks (ANN), such as deciding how many neurons a given task needs and which architecture to use. In addition, the problem may not have a unique solution, as there may be many linear classifiers (hyperplanes) that classify the data accurately. These difficulties are the main advantages of SVM over ANN. On the other hand, an ANN will outperform a SVM when a large training set is available. In any case, it should be borne in mind that no single model is best over the full range of problems.