
Background Theory

2.3 Object Detection

2.3.1 Convolutional Neural Networks

CNNs are deep learning models widely used in image classification. The field began its advance in 1989 with "Backpropagation applied to handwritten zip code recognition" [27] by Yann LeCun et al., with further developments in "Gradient-based learning applied to document recognition" in 1998 [28]. In the same paper, LeCun et al. introduce the MNIST dataset [29], which demonstrates the potential of CNNs for classifying handwritten digits. This potential was further emphasised and popularized in 2012 by Krizhevsky et al., who won the ImageNet 2012 classification benchmark with their convolutional neural network, described in [24]. With the development of datasets, algorithms, and better hardware, several methods based on CNNs have made their breakthrough. For instance, R-CNN [17], Fast R-CNN [44] and Faster R-CNN [43] for detection in images, or SECOND [59] and PointRCNN [53] for classification in lidar data.

The basis of a CNN is a neural network, which is built up of neurons that are connected to each other in layers. This will be explained in more detail in the next section, based on theory adapted from "Pattern Recognition and Machine Learning" by C. M. Bishop [9]. Next, some of the key characteristics of CNNs will be explained, based on theory from Goodfellow et al. [18].

Artificial Neural Network

The Artificial Neural Network (ANN) is a network that connects artificial neurons together in a system. The neurons take an input and produce an output that is fed forward in the network.

They are arranged in layers, where the layers between the input and output layers are referred to as hidden layers. This is illustrated in figure 2.3a.

(a) Basic neural network with two hidden layers. (b) Example of a simple neural network.

Figure 2.3: Illustrations of neural networks

To explain the flow of information in a neural network, an example of a simple neural network will be used. It consists of an input layer with N variables denoted x_1, ..., x_N, a hidden layer with M neurons denoted z_1, ..., z_M, and an output layer y_1, ..., y_K with K outputs. This is illustrated in figure 2.3b. The neurons in the hidden layer can be described with the activations a_m, using M linear combinations of the input variables on the form

a_m = \sum_{n=1}^{N} w_{mn}^{(1)} x_n + b_m^{(1)}, (2.7)

where the superscript (1) indicates that the weights w_{mn}^{(1)} and biases b_m^{(1)} are in the first layer. Each activation is transformed with an activation function h(·) to form the output

z_m = h(a_m). (2.8)

A Rectified Linear Unit (ReLU) is a popular choice for the activation function in the hidden layers, and is given by

z_m = \max(0, a_m). (2.9)
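As a minimal sketch, the hidden-layer forward pass described above can be written in a few lines of NumPy. The sizes and weight values below are illustrative, not taken from the text.

```python
import numpy as np

# Hidden-layer forward pass for the example network: N inputs, M neurons.
# Weights W1 (M x N) and biases b1 (M,) are illustrative random values.
rng = np.random.default_rng(0)
N, M = 4, 3
x = rng.standard_normal(N)        # input variables x_1, ..., x_N
W1 = rng.standard_normal((M, N))  # first-layer weights w_mn
b1 = rng.standard_normal(M)       # first-layer biases

a = W1 @ x + b1                   # activations a_m (equation 2.7)
z = np.maximum(0.0, a)            # ReLU: z_m = max(0, a_m) (equation 2.9)
print(z)
```

Note that the M linear combinations are computed in one matrix-vector product, which is how the layer is implemented in practice.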

The outputs in the second and final layer can be described by K linear combinations of the hidden-layer outputs on the form

a_k = \sum_{m=1}^{M} w_{km}^{(2)} z_m + b_k^{(2)}, (2.10)

where k = 1, ..., K. For the output layer, a typical activation function in classification tasks is the softmax function, which is given as

y_k = \frac{e^{a_k}}{\sum_{j=1}^{K} e^{a_j}}. (2.11)
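A short sketch of the softmax function in equation 2.11, with illustrative activation values:

```python
import numpy as np

def softmax(a):
    """Softmax over the output activations a_k (equation 2.11).

    Subtracting max(a) before exponentiating is a standard numerical-
    stability trick; it does not change the result.
    """
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])  # illustrative output activations
y = softmax(a)
print(y)        # each component lies in (0, 1)
print(y.sum())  # the components sum to 1
```

Because the outputs are positive and sum to one, y_k can be interpreted as the probability of class k.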

When equation 2.7 and equation 2.10 are combined, the result is a model with weights and biases, which is trained using inputs x_n and outputs y_k. The process of training the network consists of finding the weights and biases that lead to the inputs being classified with the correct output. This is achieved by training the network on inputs where the class is known.

The goal is to minimize the loss function, also referred to as the error function. This is done by using an optimizer such as gradient descent, which adjusts the weights and biases using gradients computed with backpropagation.
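The training loop above can be sketched with a single sigmoid neuron trained by gradient descent on a toy binary task. The data, learning rate, and iteration count are illustrative assumptions, not from the text.

```python
import numpy as np

# Toy training data: class is known (t = 1 when x0 + x1 > 0, else 0).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights to be learned
b = 0.0           # bias to be learned
lr = 0.5          # learning rate (illustrative)

for _ in range(200):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass (sigmoid output)
    grad_a = (y - t) / len(t)               # gradient of cross-entropy loss
    w -= lr * (X.T @ grad_a)                # gradient-descent weight update
    b -= lr * grad_a.sum()                  # gradient-descent bias update

accuracy = ((y > 0.5) == t).mean()
print(accuracy)
```

Each iteration computes the loss gradient with respect to the weights and biases (backpropagation through this one-layer model is just the chain rule) and steps against it, which is exactly the minimization described above.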

CNN

A CNN is a type of neural network that specializes in data with a grid-like topology. Examples include time series, which represent a one-dimensional grid, or a grayscale image, which represents a two-dimensional grid. For an image with n × m pixels, the input would be an array of size n × m × 1 for a grayscale image, or n × m × 3 for an RGB image. Like a conventional ANN, a CNN consists of neurons structured in interconnected layers. What sets them apart is that a CNN contains one or more convolutional layers, and that pooling is performed to downsample the number of neurons in the network.

In traditional neural networks each input unit interacts with each output unit, which can be both unnecessary and expensive. CNNs, on the other hand, use sparse interactions. This allows the network to detect meaningful features in subsections of the image, which reduces memory requirements and improves efficiency. It is accomplished by convolving the image with a square matrix of given weights, referred to as a kernel. A convolutional layer is made up of a set of kernels that convolve the image and return the dot product of the weights in the kernel and the pixels in the image. An activation function layer is then applied, which performs an element-wise activation function such as ReLU. This creates an activation map that represents the features detected with the kernel. Throughout the network, different kernels are applied to extract information about features in the image, for instance edges and round shapes. Combined, these activation maps represent the characteristics of an image. The result of convolution with a 2 × 2 diagonal kernel is illustrated in figure 2.4. The input is a 4 × 4 matrix, while the output is a 3 × 3 matrix. To maintain the size of the input, zero padding can be applied; this process adds zero elements around the input.

Figure 2.4: The result of a 2 × 2 kernel convolution performed on a 4 × 4 matrix
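The convolution described above (and the effect of zero padding) can be sketched as follows; the input values are illustrative, and the kernel is the 2 × 2 diagonal kernel from figure 2.4.

```python
import numpy as np

def conv2d(image, kernel):
    """Convolve a 2-D image with a small kernel: at each position, the
    dot product of the kernel weights and the covered pixels
    (no padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # illustrative 4x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, 1.0]])                   # 2x2 diagonal kernel

out = conv2d(image, kernel)
print(out.shape)  # (3, 3): the 4x4 input shrinks to 3x3

# Zero padding (here one extra row and column of zeros) keeps the 4x4 size:
padded = np.pad(image, ((0, 1), (0, 1)))
print(conv2d(padded, kernel).shape)  # (4, 4)
```

This also makes the size relation explicit: without padding, a k × k kernel shrinks each spatial dimension by k − 1.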

Pooling layers are applied to downsample the spatial dimensions of the input. This is accomplished by applying filters that summarize the statistics of nearby values. The filters are usually smaller than the input and are applied with a stride, which defines how the filter moves across the input. A popular pooling function is the max pooling operation, which outputs the maximum value within a defined neighborhood. This is illustrated in figure 2.5, where a 2 × 2 filter is used to transform a 4 × 4 input into a 2 × 2 output using max pooling with a stride of 2. Pooling helps make the representation more invariant to small translations of the input, meaning that if the input is slightly translated, max pooling still outputs the same value. This is because pooling is concerned with the value itself, not its placement within a neighbourhood. Pooling also provides an abstracted form of the representation, which helps against overfitting. Thirdly, it reduces the size of the input, which results in lower computational cost.

Figure 2.5: 2 × 2 max pooling performed on a 4 × 4 grid
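A minimal sketch of the max pooling operation in figure 2.5, with an illustrative 4 × 4 input:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: output the maximum value in each size x size
    neighborhood, moving the filter by `stride` each step."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [7., 2., 9., 8.],
              [1., 0., 3., 4.]])
print(max_pool(x))  # [[6. 5.] [7. 9.]]
```

Shifting a maximum by one pixel within its 2 × 2 neighborhood leaves the output unchanged, which illustrates the translation invariance discussed above.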

Figure 2.6 illustrates an example of a typical CNN architecture. The network consists of an input, for example an RGB image, two convolutional layers and two pooling layers. The last layer is a fully connected layer, which is similar to the layers in a traditional neural network.

Here, each neuron is connected to all the neurons in the previous layer. There can be several fully connected layers in a CNN. The last fully connected layer uses an activation function to output the class score, for instance the softmax function.

Figure 2.6: Example of a CNN with two convolutional (Conv) layers, two pooling layers and fully connected layers (FC, Output)
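As a sketch of how the spatial dimensions evolve through such an architecture, the standard output-size formula can be applied layer by layer. The 32 × 32 input size, kernel sizes, and padding below are hypothetical choices, not taken from figure 2.6.

```python
def conv_out(n, kernel, pad=0, stride=1):
    """Spatial output size along one axis of a convolution or pooling
    layer: (n + 2*pad - kernel) // stride + 1."""
    return (n + 2 * pad - kernel) // stride + 1

n = 32                                # hypothetical 32x32 RGB input
n = conv_out(n, kernel=3, pad=1)      # conv layer 1 (padding keeps 32)
n = conv_out(n, kernel=2, stride=2)   # max pool 1: 16
n = conv_out(n, kernel=3, pad=1)      # conv layer 2: 16
n = conv_out(n, kernel=2, stride=2)   # max pool 2: 8
print(n)  # 8

# The final 8x8 feature maps are flattened (8 * 8 * channels values)
# and fed into the fully connected layers, which output the class score.
```

This bookkeeping shows why pooling dominates the downsampling: each 2 × 2 pool with stride 2 halves the spatial size, while the padded convolutions preserve it.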