

The procedure for generating an image with intensity is similar. The difference is that the average intensity of the points relating to a cell will be the value in the given cell. The intensity data is given in the range (0, 255), and the result is therefore an n×n greyscale image of a cone candidate.

Figure 4.11: Illustration of 2D projection
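As an illustration of the procedure, a minimal sketch of the intensity-image generation is given below. The function name, the grid size and the assumption that point coordinates are already normalized to [0, 1) are all hypothetical, not taken from the thesis.

```python
import numpy as np

def intensity_image(points, intensities, n=32):
    # points: (N, 2) array of (x, y) coordinates normalized to [0, 1);
    # intensities: N matching values in (0, 255).
    img = np.zeros((n, n))
    counts = np.zeros((n, n))
    ix = (points[:, 0] * n).astype(int)
    iy = (points[:, 1] * n).astype(int)
    # Accumulate intensity sums and point counts per grid cell.
    np.add.at(img, (iy, ix), intensities)
    np.add.at(counts, (iy, ix), 1)
    # Average intensity per cell; cells without points stay zero.
    return np.divide(img, counts, out=img, where=counts > 0)
```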

4.8 CNN Architectures

CNNs have proven to perform well for image classification tasks, as mentioned in section 2.3.1. A CNN is therefore selected as the classification method for the images received from the 2D projection. There exist several choices of CNN architecture, varying in depth, layer types, optimization function and loss function. A well-known architecture is LeNet-5, introduced by Y. LeCun et al. [28] to classify handwritten digits in the MNIST dataset. The size of a cone image and that of a digit from MNIST are almost the same, and the amount of detail in the images is also comparable. The proposed architecture is therefore based on the LeNet-5 architecture, with some modifications.

4.8.1 LeNet-5

LeNet-5 consists of 7 layers: three convolutional layers, two pooling layers, a fully connected layer and an output layer. The architecture is illustrated in figure 4.12. The first convolutional layer, C1, consists of 6 filters of size 5×5 and a stride of one. The S2 and S4 pooling layers use average pooling with a filter size of 2×2 and a stride of two. Average pooling works by taking the average value across the pixels in a grid. The second convolutional layer, C3, consists of 16 filters of size 5×5 and a stride of one. The third convolutional layer, C5, is fully connected with 120 filters of size 1×1. C5 is followed by a fully connected layer, F6, with 84 nodes. The last layer is the output layer of size 10, one node for each output digit. For a more thorough explanation of each layer, the reader is referred to [28].
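As a worked example of how these layer sizes follow from the 32×32 input used in [28], the feature-map widths can be computed as below (a sketch; the helper name is arbitrary):

```python
def conv_out(n, k=5, stride=1):
    # Output width of a 'valid' convolution with k x k filters.
    return (n - k) // stride + 1

n = 32                 # input width used in [28]
c1 = conv_out(n)       # 28 -> C1 output is 28x28x6
s2 = c1 // 2           # 14 -> S2 output is 14x14x6
c3 = conv_out(s2)      # 10 -> C3 output is 10x10x16
s4 = c3 // 2           # 5  -> S4 output is 5x5x16
print(c1, s2, c3, s4)  # 28 14 10 5; C5 (120), F6 (84) and the output (10) follow
```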

All the layers except the output layer use the scaled hyperbolic tangent function, tanh, as activation function. The function is given as

f(a) = A tanh(Sa),    (4.6)

where A is the amplitude of the function, and S determines the slope at the origin.
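As a small illustration of equation (4.6), the values A = 1.7159 and S = 2/3 used in [28] give the following sketch (function name is arbitrary):

```python
import numpy as np

def scaled_tanh(a, A=1.7159, S=2/3):
    # Scaled hyperbolic tangent, eq. (4.6); A and S default to the
    # values used in the original LeNet-5 paper [28].
    return A * np.tanh(S * a)
```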

The output layer is made up of one Euclidean Radial Basis Function (RBF) unit for each of the output classes, computed as

y_i = Σ_j (x_j − w_ij)²,    (4.7)

where x_j is the input from node j, and w_ij is a chosen parameter vector. The parameters correspond to a 7×12 bitmap for each class. If x_j deviates from w_ij, the output y_i grows, which can be used in the loss function to penalize deviations.
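A minimal sketch of equation (4.7) for all classes at once, assuming x holds the 84 F6 activations and each row of W is one class's flattened 7×12 bitmap (names are hypothetical):

```python
import numpy as np

def rbf_outputs(x, W):
    # Eq. (4.7): squared Euclidean distance between the input vector x
    # and each class's parameter vector w_i (one row of W).
    return np.sum((W - x) ** 2, axis=1)
```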

Figure 4.12: Architecture of LeNet-5, adapted from [28]

4.8.2 Modifications to LeNet-5

In order to adapt LeNet-5 to this project, some modifications are suggested. The first modification is to flatten the output from S4, so that a conventional fully connected layer can be used instead of C5. Average pooling is kept, as it is thought to potentially counteract spikes in intensity and generalize the wanted features of a cone. The chosen activation function for each hidden layer is the ReLU, while the output layer uses the softmax function.

The optimizer used is the Adam optimizer [23]. Adam is an abbreviation for adaptive moment estimation; it computes adaptive per-parameter learning rates from estimates of the first and second moments of the gradients. The n-th moment of the gradient is defined as the expected value of the gradient to the power of n. A benefit of Adam is that it is fast, giving lower training times. A thorough introduction to Adam is given by D. Kingma and J. Ba [23].
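As an illustration of how the moment estimates are used, a single Adam parameter update can be sketched as follows (default hyperparameters from [23]; variable names are arbitrary):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # m and v are running estimates of the first and second moments
    # of the gradient; t is the (1-indexed) step number.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```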

The loss function used is the cross-entropy function, explained in "Pattern Recognition and Machine Learning" [9],

H(p, q) = −Σ_i p_i log(q_i).    (4.8)

It measures the performance of a classifier that outputs class probabilities between 0 and 1, which is the case for this project since the output layer uses the softmax activation function. In equation 4.8, p is the class label and q is the predicted probability.
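For example, with a one-hot label and a softmax output, equation (4.8) reduces to the negative log of the probability assigned to the true class (a sketch; the numbers are made up):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # Eq. (4.8): p is the one-hot class label, q the predicted
    # probabilities; eps guards against log(0).
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])  # true class is the second one
q = np.array([0.1, 0.8, 0.1])  # softmax output
print(cross_entropy(p, q))     # -log(0.8) ≈ 0.223
```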

The proposed CNN is summarized in table 4.6. Compared with LeNet-5, the network maintains its simplicity in terms of depth and layers, which should result in fast training times without the need for GPUs. At the same time, it is adapted with more modern features to enhance its performance.


Layers            Parameters
Convolutional     filters: 6, size: 3×3, activation: ReLU
Average pool      size: 2×2
Convolutional     filters: 16, size: 3×3, activation: ReLU
Average pool      size: 2×2
Fully connected   size: 120, activation: ReLU
Fully connected   size: 84, activation: ReLU
Output layer      size: number of outputs, activation: Softmax

Model             Methods
Optimizer         Adam
Loss function     Cross-entropy

Table 4.6: CNN architecture
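A minimal Keras sketch of the architecture in table 4.6 is given below. The 32×32 input shape is an assumption (the thesis works with n×n greyscale images), and the builder function name is hypothetical:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(32, 32, 1), n_classes=2):
    # Modified LeNet-5 from table 4.6: two conv/average-pool stages,
    # a flattened fully connected head, and a softmax output.
    return models.Sequential([
        layers.Conv2D(6, (3, 3), activation="relu", input_shape=input_shape),
        layers.AveragePooling2D((2, 2)),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.AveragePooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dense(84, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```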

4.8.3 Implementation and Datasets

The model is implemented in Python using TensorFlow. For the binary case, the images are labeled 0 or 1, depending on whether they are non-cones or cones, respectively. This network is trained for 5 epochs. For the multi-class case, the images are labeled 0, 1 or 2, depending on whether they are non-cones, blue cones or yellow cones, respectively. This network is trained for 8 epochs.
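A hypothetical training call for the multi-class case could then look as follows, building on the sketch above; x_train and y_train are assumed to hold the projected images and the integer labels 0–2:

```python
model = build_model(n_classes=3)       # non-cone, blue cone, yellow cone
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # cross-entropy on integer labels
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=8)  # 8 epochs for the multi-class net
```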

For neural network based classification methods, datasets are important. The datasets are used to train the method and to evaluate its performance, to see if it has learned relevant features from the data. Neural networks are often developed with thousands of images, split into training/testing sets of different fractions. The gathering and labeling of data can be a tedious process where an eye for detail is important, so that the training process can be effective.

For this project, there were no existing datasets that could be adapted for training purposes. The data has therefore been collected throughout the autumn of 2019 and the spring of 2020, as part of the author's specialization project and for this thesis. For the binary case, this amounts to around 3000 samples of cones and 2000 samples of non-cones. For the multi-class network, it amounts to around 1500 samples of blue cones, 1500 samples of yellow cones and 2000 samples of non-cones with intensity. The datasets are split into 70% for training and 30% for evaluation.

Afterwards, the model is evaluated on scenarios which have not been part of the training/evaluation data; these will be introduced in the next chapter.
