
Deep learning is a class of computational models that learn representations of data with increasing levels of abstraction by using multiple processing layers [17].

These machine learning methods have significantly improved visual object detection, speech recognition and generation, as well as other domains that rely on large amounts of data. Deep learning can process large amounts of data and learn to recognise specific patterns and intricate data structures by using the backpropagation algorithm, which indicates to the network which parameters should be changed, and how, in order to provide a more accurate answer. Deep convolutional neural networks (CNNs) are typically used to process images, videos, speech and audio. Recurrent neural networks (RNNs) are used for the analysis and forecasting of sequential and time-series data, such as speech and text.

2.6.1 Convolutional Neural Networks

Given that the work described in this thesis mainly processed visual information, convolutional neural networks provided the best performance on the collected data. CNNs are suitable when the data to be learned is provided in the form of arrays, which can have one or multiple dimensions. Images are supplied as 2D arrays if grayscale, and as 3-channel 2D arrays for colour images.

CNNs typically consist of many interconnected layers placed in sequence, mainly convolutional and pooling layers. The network generally finishes with one or two fully connected layers, which eventually connect to output values to provide an answer to the given problem. An example of a CNN structure can be seen in Figure 2.7. However, the structure of a CNN can be application-dependent.

Figure 2.7. Example structure of the Convolutional Neural Network (CNN). Image source [39]

Deep neural networks exploit the idea of compositional hierarchies, where higher-level features are composed of lower-level features, which is the reason for their sequential structure. Similar approaches can be found in other signal processing applications.

Units in a convolutional layer are organised in feature maps, within which each one is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this locally-weighted sum is then passed through a non-linearity such as a ReLU. The same filter bank is shared among all units in a feature map. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array.
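The weight-sharing idea above can be sketched in a few lines of plain Python. This is an illustrative example, not code from the thesis: one shared 3×3 filter (a hypothetical vertical-edge detector) is slid over a small image, and each locally-weighted sum is passed through a ReLU.

```python
def conv2d_relu(image, kernel):
    """Slide one shared kernel over the image (valid padding), then apply ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            # Locally-weighted sum: the SAME weights are used at every location.
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(max(0.0, s))  # ReLU non-linearity
        out.append(row)
    return out

# A vertical edge sits in the middle of this toy image; the shared filter
# responds to the same motif wherever it appears in the array.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
feature_map = conv2d_relu(image, kernel)
```

Because the filter bank is shared, the edge produces the same response at every row of the output; a second feature map would simply use a different kernel over the same input.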

Between the convolutional layers, pooling layers can be used. First, a pooling layer reduces the dimensionality of the layers, and thus the number of parameters that have to be adjusted. Furthermore, it adds robustness, especially regarding the relative position of the detected features, by selecting the most distinct ones in each local area.
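As a minimal sketch of this operation (an illustrative example, not from the thesis), 2×2 max pooling with stride 2 halves each spatial dimension and keeps only the strongest activation in each local area:

```python
def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size block."""
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out

fm = [[1, 3, 0, 2],
      [4, 2, 1, 0],
      [0, 1, 5, 6],
      [2, 0, 7, 1]]
pooled = max_pool(fm)  # 4x4 feature map -> 2x2
```

Small shifts of a feature within a 2×2 block leave the pooled output unchanged, which is where the robustness to relative position comes from.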

Training is done by providing images together with the correct classification labels, correctly marked masks or other ground-truth information. If the input images are too large to fit into the memory of the GPU, they are sub-divided into mini-batches and fed to the network for training one mini-batch at a time. At each epoch, the result using the current network parameters is estimated and an error is calculated using the defined loss function. Then, using this information, the backpropagation algorithm adjusts the parameters of the neural network to reduce the error. The process is repeated until no more improvements are seen or the desired accuracy is achieved.
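The mini-batch subdivision described above amounts to slicing the dataset into fixed-size chunks; a sketch under the assumption of a simple in-memory list of samples (the batch size of 4 is an arbitrary example):

```python
def make_minibatches(samples, batch_size):
    """Split a dataset into consecutive mini-batches; the last one may be smaller."""
    return [samples[i:i + batch_size]
            for i in range(0, len(samples), batch_size)]

dataset = list(range(10))            # stand-in for 10 training images
batches = make_minibatches(dataset, 4)
```

During training, one forward pass, loss evaluation and backpropagation step is then performed per mini-batch, and a full pass over all batches constitutes one epoch.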

The loss function allows the CNN to evaluate the quality of the learning process by penalising the network for deviations between the predicted output and the actual desired output. The loss is defined explicitly for the type of output and its range of possible values, so it is often application-specific.
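As one concrete instance of such a penalty (an illustrative example; real networks pick the loss per output type, e.g. cross-entropy for classification), a mean-squared-error loss grows with the deviation between prediction and target:

```python
def mse_loss(predicted, target):
    """Mean squared error: the penalty for deviating from the desired output."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# A prediction far from the target is penalised more than a close one.
loss_far  = mse_loss([0.9, 0.1], [0.0, 1.0])
loss_near = mse_loss([0.1, 0.9], [0.0, 1.0])
```

Backpropagation then uses the gradient of this scalar with respect to every network parameter to decide how each one should change.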

2.6.2 Multi-Objective Convolutional Neural Networks

Normally a CNN optimises for one type of objective, such as classification or regression. However, there are occasions when having outputs of multiple types would be beneficial. For example, consider a face recognition task. The primary objective would be to identify the person in an input image. On top of that, face localisation can say where exactly in the picture the face is located, while emotion recognition and even gaze direction estimation can provide further useful data. To achieve this, a CNN has to be able to estimate and optimise for multiple heterogeneous objectives.

To solve this issue, a method called "multi-objective convolutional neural networks" was developed. It uses a structure similar to a standard CNN; however, it branches off to multiple outputs. Branching can be done either at the final fully-connected layer, or in the middle of the CNN with some additional objective-specific layers before each output layer. An example structure of a multi-objective CNN can be seen in Figure 2.8.

Figure 2.8. Example structure of the Multi-Objective Convolutional Neural Network (CNN).

The loss function for a multi-objective CNN has to comprise a combination of losses, one for each of the objectives. One approach is a weighted sum of the losses, where each weight represents the importance, or impact, of the corresponding output on the whole system. The CNN is trained simultaneously for all of the objectives.
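The weighted-sum combination can be sketched as follows. The objective names and weight values are hypothetical, chosen only to mirror the face recognition example above (identity weighted highest):

```python
def combined_loss(losses, weights):
    """Weighted sum of per-objective losses; weights encode objective importance."""
    assert len(losses) == len(weights)
    return sum(w * l for w, l in zip(weights, losses))

per_objective = {"identity": 0.4, "localisation": 0.2, "emotion": 0.8}
importance    = {"identity": 1.0, "localisation": 0.5, "emotion": 0.25}

total = combined_loss([per_objective[k] for k in per_objective],
                      [importance[k] for k in per_objective])
```

Since one scalar is produced, a single backpropagation pass updates the shared layers for all objectives at once, which is what allows the branches to be trained simultaneously.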

2.6.3 Transfer Learning

One problem with training CNNs is that large amounts of well-labelled and diverse training data are required to successfully train the network. Training is also rather long even on powerful GPUs, mainly because of the large number of parameters that have to be tuned at each epoch.

However, the compositional hierarchy of a CNN, together with the observation that the first few layers commonly learn general visual features, such as edges, corners and contrasts, suggests another approach. It has been noticed that, given one trained CNN, it is possible to re-train it for a different domain by adjusting just some of the layers of the network instead of all of them. Typically, the layers closest to the output layer are adjusted. This approach is called transfer learning [22], [40].

The benefit of the transfer learning technique is that the parameters contained in the so-called frozen layers are copied from the previously-trained network, while only some of the layers are trained during the process. This speeds up training and requires smaller training datasets compared to full CNN training, while still being capable of reaching accuracy equivalent to that of fully trained CNNs.
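The freezing mechanism reduces to skipping the gradient update for the copied layers. A minimal sketch, with layers modelled as plain dicts and a hypothetical single gradient step (not the thesis's implementation):

```python
def sgd_step(layers, grads, lr=0.1):
    """Apply one gradient-descent step, skipping layers marked as frozen."""
    for layer, grad in zip(layers, grads):
        if layer["frozen"]:
            continue  # weights stay as copied from the pre-trained network
        layer["w"] = [w - lr * g for w, g in zip(layer["w"], grad)]

layers = [
    {"name": "conv1",  "w": [0.5, -0.2], "frozen": True},   # general edges/corners
    {"name": "fc_out", "w": [0.1, 0.3],  "frozen": False},  # re-trained for new domain
]
grads = [[1.0, 1.0], [1.0, 1.0]]
sgd_step(layers, grads)
```

Only the unfrozen output layer moves; the early layers keep their pre-trained weights, which is why fewer parameters and less data are needed.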

Chapter 3

Research Summary

This chapter gives an overview of the PhD research progress and of the achievements presented in the publications.


Figure 3.1. Publication overview graph with indications of research questions addressed by each paper or paper group.