Deep learning is a computational approach that learns representations of data with increasing levels of abstraction by composing multiple processing layers [17].
These machine learning methods have significantly improved visual object detection, speech recognition and generation, as well as other domains where large amounts of data are available. Deep learning can process large amounts of data and learn to recognise specific patterns and intricate structure by using the backpropagation algorithm, which indicates which parameters of the network should be changed, and how, to provide a more accurate answer. Deep convolutional neural networks (CNNs) are typically used to process images, videos, speech and audio. Recurrent neural networks (RNNs) are used for the analysis and forecasting of sequential and time series data, such as speech and text.
2.6.1 Convolutional Neural Networks
Given that mainly visual information was processed in the work described in this thesis, convolutional neural networks provided the best performance on the collected data. CNNs are suitable when the data to be learned is provided as arrays with one or more dimensions. Grayscale images are supplied as 2D arrays and colour images as three-channel 2D arrays.
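As a concrete illustration of these input formats, the following minimal sketch uses plain Python lists (a real pipeline would use a numerical array library; the sizes are illustrative):

```python
# A 4x4 grayscale image: a single 2D array of intensity values.
gray = [[0.0 for _ in range(4)] for _ in range(4)]

# A 4x4 colour image: three stacked 2D arrays (R, G and B channels).
colour = [[[0.0 for _ in range(4)] for _ in range(4)] for _ in range(3)]

print(len(gray), len(gray[0]))                          # height x width
print(len(colour), len(colour[0]), len(colour[0][0]))   # channels x height x width
```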
CNNs typically consist of many interconnected layers placed in sequence, mainly convolutional and pooling layers. The network generally finishes with one or two fully connected layers, which connect to the output values that provide an answer to the given problem. An example of a CNN structure can be seen in Figure 2.7. However, the structure of a CNN can be application-dependent.
Figure 2.7. Example structure of the Convolutional Neural Network (CNN). Image source [39]
Deep neural networks exploit the idea of compositional hierarchies, where higher level features are composed of lower level features, which is the reason for a sequential structure. Similar approaches can be found in other signal processing applications.
Units in a convolutional layer are organised in feature maps, within which
each one is connected to local patches in the feature maps of the previous layer
through a set of weights called a filter bank. The result of this locally-weighted
sum is then passed through a non-linearity such as a ReLU. The same filter bank
is shared among all units in a feature map. Different feature maps in a layer use
different filter banks. The reason for this architecture is twofold. First, in array
data such as images, local groups of values are often highly correlated, forming
distinctive local motifs that are easily detected. Second, the local statistics of
images and other signals are invariant to location. In other words, if a motif can
appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array.
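This weight sharing can be sketched in plain Python (an optimised library would be used in practice; the function name, filter and image values are illustrative). One 3x3 filter slides over the whole image, so the same weights detect the same motif, here a vertical edge, wherever it appears:

```python
def conv2d_valid(image, kernel):
    """Slide one shared filter over a 2D image ('valid' padding),
    producing one feature map. The same weights (the filter bank of
    this map) are reused at every location: weight sharing."""
    kh, kw = len(kernel), len(kernel[0])
    h = len(image) - kh + 1
    w = len(image[0]) - kw + 1
    fmap = []
    for i in range(h):
        row = []
        for j in range(w):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))  # ReLU non-linearity
        fmap.append(row)
    return fmap

# A vertical-edge filter responds wherever that motif appears.
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
print(conv2d_valid(image, edge_filter))  # edge detected at both positions
```

The edge between the dark and bright columns produces an equally strong response at every unit, even though all units share one set of weights.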
Pooling layers can be used between the convolutional layers. First, they reduce the dimensionality of the layers, and thus the number of parameters that have to be adjusted. Furthermore, they add robustness, especially to the relative position of the detected features, by selecting the most distinct responses in each local area.
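A minimal max-pooling sketch in plain Python (the function name and feature-map values are illustrative):

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: keep only the strongest response
    in each local window, halving each spatial dimension for size=2."""
    h, w = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(w)]
            for i in range(h)]

fmap = [[1, 5, 2, 0],
        [3, 2, 1, 1],
        [0, 0, 4, 2],
        [1, 0, 3, 9]]
print(max_pool2d(fmap))  # [[5, 2], [1, 9]]
```

Note that shifting the strongest value within a window would leave the output unchanged, which is the positional robustness described above.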
Training is done by providing images together with the correct classification labels, correctly marked masks or other information. If the input images are too large to fit into the memory of the GPU, they are sub-divided into mini-batches and fed to the network one mini-batch at a time. At each epoch, the result using the current network parameters is estimated and an error is calculated using the defined loss function. Then, using this information, the backpropagation algorithm adjusts the parameters of the neural network to reduce the error. The process is repeated until no further improvement is seen or the desired accuracy is achieved.
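The loop has the same shape regardless of model size. The following toy sketch uses a single-parameter model with a hand-written gradient in place of backpropagation (all values are illustrative): the data is split into mini-batches, a loss gradient is computed per batch, and the parameter is updated each step:

```python
# Fit y = w*x to data generated with w = 2, using mini-batch
# gradient descent on a mean squared error loss.
data = [(x, 2.0 * x) for x in range(1, 9)]  # targets follow y = 2x
w = 0.0            # single trainable parameter
lr = 0.01          # learning rate
batch_size = 4

for epoch in range(200):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error over this mini-batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad  # each step reduces the error
```

After enough epochs the parameter converges towards the true value of 2.0, the point at which no further improvement is seen.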
The loss function allows the CNN to evaluate the quality of the learning process by penalising the network for deviations between the predicted output and the desired output. The loss is explicitly defined for the type of output and its possible values, so it is often application-specific.
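Two common examples, sketched in plain Python, show how the choice depends on the output type: mean squared error suits continuous regression outputs, while cross-entropy suits class probabilities (the example values are illustrative):

```python
import math

def mse(pred, target):
    """Typical regression loss: mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(probs, true_class):
    """Typical classification loss: penalises low probability
    assigned to the correct class."""
    return -math.log(probs[true_class])

print(mse([1.0, 2.0], [1.5, 2.0]))        # 0.125
print(cross_entropy([0.7, 0.2, 0.1], 0))  # ~0.357
```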
2.6.2 Multi-Objective Convolutional Neural Networks
Normally a CNN optimises for one type of objective, such as classification or regression. However, there are occasions when having outputs of multiple types would be beneficial. Consider, for example, a face recognition task. The primary objective would be to identify the person in an input image. On top of that, face localisation can say where exactly in the picture the face is located, and emotion recognition and even gaze direction estimation can provide useful data. To achieve this, the CNN has to be able to estimate and optimise for multiple heterogeneous objectives.
To address this, a method called "multi-objective convolutional neural networks" was developed. It uses a structure similar to a standard CNN, but branches off to multiple outputs. Branching off can be done either at the final fully-connected layer or in the middle of the CNN, with some additional objective-specific layers before each output layer. An example structure of a multi-objective CNN can be seen in Figure 2.8.
Figure 2.8. Example structure of the Multi-Objective Convolutional Neural Network (CNN).
The loss function for a multi-objective CNN has to comprise a combination of losses, one for each objective. One approach is a weighted sum of the losses, where each weight represents the importance, or impact, of the corresponding output on the whole system. The CNN is trained simultaneously for all of the objectives.
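The weighted-sum combination can be sketched as follows (the branch names, loss values and weights are purely illustrative):

```python
def combined_loss(losses, weights):
    """Weighted sum of per-branch losses: the weights encode how much
    each objective matters to the whole system."""
    return sum(w * l for w, l in zip(weights, losses))

# e.g. identity classification, face localisation, emotion recognition
branch_losses = [0.8, 0.3, 0.5]
branch_weights = [1.0, 0.5, 0.25]
print(combined_loss(branch_losses, branch_weights))  # 0.8 + 0.15 + 0.125 = 1.075
```

Because all branches contribute to one scalar loss, a single backpropagation pass trains the shared layers and every branch simultaneously.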
2.6.3 Transfer Learning
One problem with training CNNs is that large amounts of well-labelled and diverse training data are required to train the network successfully. Training is also rather long even on powerful GPUs, mainly because large numbers of parameters have to be tuned at each epoch.
However, the compositional hierarchy of the CNN means that the first few layers commonly learn general visual features, such as edges, corners and contrasts. It has been observed that, given one trained CNN, it is possible to re-train it for a different domain by adjusting just some of the layers of the network instead of all of them. Typically, the layers closest to the output layer are adjusted. This approach is called transfer learning [22, 40].
The benefit of the transfer learning technique is that the parameters contained in so-called frozen layers are copied from the previously-trained network, and only some of the layers are trained during the process. This speeds up training and requires smaller training datasets compared to full CNN training, while still being capable of reaching accuracy equivalent to fully trained CNNs.
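The frozen-layer idea can be sketched as follows (the layer names, parameter values and gradients are illustrative; a deep learning framework would handle this via per-parameter flags):

```python
# Layers copied from a pre-trained network are marked frozen;
# the update step skips them, so only the layers closest to the
# output are re-trained for the new domain.
layers = {
    "conv1": {"weights": [0.5], "frozen": True},   # general edge/corner features
    "conv2": {"weights": [0.3], "frozen": True},
    "fc":    {"weights": [0.1], "frozen": False},  # re-trained for the new task
}

def update(layers, grads, lr=0.1):
    for name, layer in layers.items():
        if layer["frozen"]:
            continue  # keep pre-trained parameters unchanged
        layer["weights"] = [w - lr * g
                            for w, g in zip(layer["weights"], grads[name])]

update(layers, {"conv1": [1.0], "conv2": [1.0], "fc": [1.0]})
print(layers["conv1"]["weights"], layers["fc"]["weights"])
```

Only the unfrozen layer moves during training, which is why far fewer parameters, and hence less data and time, are needed than for full training.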
Chapter 3
Research Summary
This chapter gives an overview of the PhD research progress and achievements through the publications.
Figure 3.1. Publication overview graph with indications of research questions addressed by each paper or paper group.