
The different hyper-parameters can and should be tweaked to attain optimal results.

2.2.1 Convolutional Neural Networks

The convolutional network was initially conceptualized by Fukushima (1988)[28], but only the architecture was proposed, without any learning algorithm to go with it. Later, LeCun et al. (1998)[45] applied the back-propagation learning algorithm[63] in their classification network, laying the basis of the convolutional neural network (CNN) used in newer methods.

For a fully connected network it does not matter how the input data is ordered, as long as all the data is consistently ordered in the same fixed way. This makes such networks unable to preserve local contextual connections in the data. The convolutional structure seen in fig. 2.4 makes better use of the local information through the convolution operation. Convolution is a mathematical operation used for filtering images, typically for blurring, sharpening, smoothing, edge detection, and more. This is done by convolving a kernel matrix, or receptive field, with an image. The definition of a discrete 2-dimensional convolution is shown in eq. (2.9).

$$ (g * f)[x, y] = \sum_{i} \sum_{j} g[i, j]\, f[x - i,\, y - j] \tag{2.9} $$

Compared to a fully connected layer, which applies a weighted sum going from one layer to the next, this is simply a somewhat more sophisticated operation, and the learned weights can now be found as the entries in the kernels of the different filters in each layer.
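The following is a minimal NumPy sketch of the discrete 2-D convolution in eq. (2.9). The function name, the zero-padding, and the assumption of an odd-sized kernel are illustrative choices, not part of the original text.

```python
import numpy as np

def conv2d(f: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Discrete 2-D convolution of image f with kernel g, as in eq. (2.9).

    Assumes an odd-sized kernel and zero-pads f so the output keeps its shape.
    """
    kh, kw = g.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(f, ((ph, ph), (pw, pw)))      # zero-pad the image
    flipped = g[::-1, ::-1]                       # convolution flips the kernel
    out = np.zeros(f.shape, dtype=float)
    for x in range(f.shape[0]):
        for y in range(f.shape[1]):
            window = padded[x:x + kh, y:y + kw]
            out[x, y] = np.sum(flipped * window)  # sum_{i,j} g[i, j] f[x - i, y - j]
    return out

# Example: an edge-detecting (Laplacian-style) kernel applied to a random image.
image = np.random.rand(8, 8)
kernel = np.array([[0, -1, 0],
                   [-1, 4, -1],
                   [0, -1, 0]], dtype=float)
filtered = conv2d(image, kernel)
```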

The strength of the convolutional architecture is the ability to learn low-level concepts early in the network and higher-level concepts and specialized feature maps later in the network. This is done by aggregating the low-level features by pooling them together, one of the fundamental steps in a classification method. Pooling in a CNN is usually done by representing an area in the feature map by either its average or its maximum value, named average pooling and max pooling respectively. As a network grows deeper it usually also grows wider, adding more specialized feature maps to maintain expressiveness.

4Image from http://what-when-how.com/wp-content/uploads/2012/07/tmp725d63_thumb.png


Figure 2.4: An example of a convolutional network structure with an increasing number of filters or feature maps per layer and pooling in between to reduce dimensionality.4

The pooling reduces computational complexity, while the added feature maps increase it.
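A minimal sketch of the pooling step described above, assuming non-overlapping windows and a feature map whose sides are divisible by the window size (function and argument names are illustrative):

```python
import numpy as np

def pool2d(feature_map: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    """Downsample a 2-D feature map with non-overlapping size x size windows.

    mode="max" gives max pooling, mode="avg" gives average pooling.
    Assumes both feature-map dimensions are divisible by `size`.
    """
    h, w = feature_map.shape
    windows = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, mode="max"))  # each 2x2 block replaced by its maximum
print(pool2d(fmap, mode="avg"))  # each 2x2 block replaced by its average
```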

2.2.2 The evolution of commonly used architectures

Multi-layer feed-forward neural networks are very flexible and can be constructed in a virtually unlimited number of ways. The performance of an architecture or method applied to a specific problem will depend on aspects including the number of filters per layer, kernel sizes, types of pooling, optimization, regularization techniques, and activation functions. Having this many properties to change is what makes these types of networks notorious for being described as "black box" systems. This is also the reason why many of the breakthroughs in deep learning with ANNs have come iteratively, with new architectures introducing new techniques or a beneficial combination of already known techniques. This section presents some of the most influential architectures. These methods have made such significant contributions to the field that they have become widely accepted standards, and their architectures serve as base building blocks for new methods. They were all introduced as winners of the annual ImageNet Large Scale Visual Recognition Challenge[41] (ILSVRC)5, mainly a classification challenge.

5http://www.image-net.org/challenges/LSVRC/

The architectures in question are AlexNet[44], VGG-Net[67], GoogLeNet[70] and ResNet[37], as seen in fig. 2.5.

Figure 2.5: Winners of the annual ILSVRC by year.6 The graph shows the top-5 classification error-rates achieved by the winning method each year. As a comparison, the human error-rate on the same data, achieved by an expert annotator, was measured to be as low as 5.1% by Russakovsky et al. (2015)[64].

• AlexNet, proposed by Krizhevsky et al. (2012)[44], was the first deep architecture to win the ILSVRC challenge in 2012. It achieved a top-5 test accuracy of 84.6%, compared to the second-best entry, which used traditional feature engineering methods and reached an accuracy of 73.8%. This was a huge improvement, solidifying the potential of CNNs in the field. The architecture, shown in fig. 2.6, consists of 5 convolutional layers with max-pooling and the ReLU activation function, followed by 3 fully connected layers. The convolutional layers produce the downscaled feature vector, which is classified by the fully connected layers. It also features dropout to combat overfitting; a schematic sketch of such a layer stack is given after fig. 2.6 below.

6Image from https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5


Figure 2.6: Topology of AlexNet, the first CNN to win the ILSVRC. Figure from [44].
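To make the layer ordering concrete, the following is a schematic PyTorch sketch of an AlexNet-style stack as described above: 5 convolutional layers with max-pooling and ReLU activations, followed by 3 fully connected layers with dropout. The channel counts, kernel sizes, and input resolution are illustrative assumptions and not the exact configuration from [44].

```python
import torch
from torch import nn

# Schematic AlexNet-style stack; hyper-parameters are illustrative only.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                           # 1000 ImageNet classes
)

logits = alexnet_like(torch.randn(1, 3, 224, 224))   # dummy 224x224 RGB image
print(logits.shape)                                  # torch.Size([1, 1000])
```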

• VGG-Net, proposed by Simonyan and Zisserman (2014)[67], was a set of models with slightly different numbers of layers and configurations. The submitted configuration achieved a top-5 test accuracy of 92.7% in the ILSVRC-14 challenge. This configuration is often referred to as VGG-16, as it had 16 weight layers: 13 convolutional layers and 3 fully connected ones. The main contributions and changes from the previous architectures were more layers, making the network deeper, and the use of smaller receptive fields.

Where AlexNet used a large 11×11 receptive field in its first convolutional layer followed by pooling, VGG had three consecutive convolutional layers with 3×3 receptive fields before pooling. A stack of three 3×3 layers has the same effective receptive field as a single 7×7 layer, but the three sequential layers with ReLU activations between them add more non-linearity and almost halve the number of parameters for an equal number of filters, as the short calculation below illustrates. The increased non-linearity through more activations makes the objective function more discriminative, making the network easier to train. The reduction in parameters can be seen as a regularization imposed on the effective 7×7 receptive field.
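As a back-of-the-envelope check of the parameter comparison above (C denotes the number of channels, assumed equal for input and output; biases are ignored):

```python
# Weights in a convolutional layer with C input and C output channels, no biases.
def conv_params(kernel_size: int, channels: int, layers: int = 1) -> int:
    return layers * kernel_size * kernel_size * channels * channels

C = 256
single_7x7 = conv_params(7, C)           # 49 * C^2 = 3,211,264
stack_3x3 = conv_params(3, C, layers=3)  # 27 * C^2 = 1,769,472

print(stack_3x3 / single_7x7)            # ~0.55, i.e. roughly a 45% reduction
```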

• GoogLeNet, proposed by Szegedy et al. (2015)[70], won the ILSVRC-14 challenge and followed the trend of being deeper than the previous winners. It consisted of 22 layers but showed greater complexity than simply stacking layers sequentially. Even though it did not deliver as significant a leap in performance as in previous years, it achieved a top-5 test accuracy of 93.3%. Its structure was
