

4.4 The proposed CNN

Several architectures for facial expression recognition have been developed in the last decade. The accuracy results of some of them are shown in Table 4.1. Note that most of these architectures [44, 60, 67, 82, 89, 7] use k-cross-validation (explained in Section 4.5.2) to obtain the accuracy results reported in Table 4.1, with the exception of paper [36], whose tests used 98% of the data for training and only 2% for testing. In [82] the authors used the pre-trained model Face-VGG. In [67] the research team designed a complex architecture using convolutional layers in parallel, combining their outputs to obtain the final result. Papers [44, 60, 89] presented better results using simpler architectures than papers [36, 67, 82], although paper [7] achieved a result similar to that of paper [89] while using a more complex architecture.

Note that paper [60] obtained 96.76% accuracy, but the authors tested with only 1 subject for each partition of the k-cross-validation set and ran the experiment 10 times, selecting the best result. Their method also includes a pre-processing step, tuned using the k-cross-validation method described in Section 4.5.2, for which they report 89.7% accuracy.

[Table 4.1 — columns: Model, Year, Accuracy]

Table 4.1. Results of recent models in the literature. These models have been trained and tested with the CK+ dataset to classify the 6 basic expressions.

These models used the CK+ dataset in their experiments and classified the six basic expressions. The growing interest in creating new CNNs12 to improve results in facial expression recognition has motivated us to delve deeper into this field. In [30], the authors affirm that a network with three hidden layers forms a very good generative model of the joint distribution of the images to be classified and their labels. Starting from this base, and considering the different architectures proposed in papers such as [46, 44, 89, 7], we implement a model with more than three hidden layers to recognize facial expressions. We tune this architecture and its parameters empirically, until we obtain a model that improves these results.

12 A Convolutional Neural Network (CNN) is designed to automatically and adaptively learn spatial hierarchies of features, typically using three types of layers: convolutional, pooling and fully connected layers. The first two perform feature extraction, and the fully connected layers give the output [109].
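To make the footnote concrete, here is a minimal pure-Python sketch of the two feature-extraction operations: a valid-mode (no-padding) convolution layer, implemented as cross-correlation as is conventional in CNN frameworks, and max-pooling. This is an illustrative sketch, not the implementation used in the experiments.

```python
def conv2d_valid(image, kernel):
    # Valid-mode 2-D convolution (cross-correlation, as in CNN layers):
    # slide the kernel over the image and sum element-wise products.
    # The output shrinks to (H - kh + 1) x (W - kw + 1).
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

def max_pool(image, k=2, stride=2):
    # Max-pooling: keep the largest value in each k x k window.
    ih, iw = len(image), len(image[0])
    return [[max(image[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, iw - k + 1, stride)]
            for i in range(0, ih - k + 1, stride)]
```

Applying a 2×2 kernel to a 4×4 image yields a 3×3 feature map, and 2×2 max-pooling with stride 2 then halves each spatial dimension, exactly the behavior the footnote describes.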

Chapter 4. Facial Expression Recognition | 71

Some results of the fine-tuning of the model are shown in Table 4.2, where we test three, four, five, and six convolutional layers, using the CK+ dataset for training and k-cross-validation with k=5 (see Section 4.5.2). The average and standard deviation of the detection accuracies obtained after each step of the cross-validation are shown in Table 4.2. The best results are obtained with 5 convolutional layers, for which we achieve the highest mean accuracy and the lowest standard deviation. Adding more convolutional layers does not improve the results.
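This evaluation protocol can be sketched as follows. Here `evaluate` is a hypothetical placeholder standing in for one full train/test run of the CNN that returns its test accuracy; the real training loop is of course far more involved.

```python
import math
import random

def k_fold_splits(n, k=5, seed=0):
    # Shuffle the n sample indices and partition them into k disjoint folds.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, evaluate, k=5):
    # For each fold: train on the other k-1 folds, test on the held-out
    # fold, then report the mean and standard deviation of the accuracies,
    # as in Table 4.2.
    folds = k_fold_splits(n, k)
    accs = []
    for i, _ in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        accs.append(evaluate(train_idx, folds[i]))
    mean = sum(accs) / k
    std = math.sqrt(sum((a - mean) ** 2 for a in accs) / k)
    return mean, std
```

With k=5, each run trains on 80% of the data and tests on the remaining 20%, and every sample is used for testing exactly once.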

The final CNN model is depicted in Figure 4.7. Our network receives as input a 150×150 grayscale image and classifies it into one of the six classes. The CNN architecture consists of 5 convolutional layers, 3 pooling layers and two fully connected layers. The first layer of the CNN is a convolutional layer that applies a kernel of size 11×11 and generates 32 feature maps of 140×140 pixels. This layer is followed by a max-pooling layer, with a kernel size of 2×2 and stride 2, that reduces the image to half its size. Subsequently, two more convolutional layers are applied, each with a 7×7 kernel and 32 filters. These are followed by another pooling layer, with a kernel size of 2×2 and stride 1, two more convolutional layers, each with a 5×5 kernel and 64 filters, and two fully connected layers of 512 neurons each. The first fully connected layer also applies dropout [28] to avoid overfitting during training. Finally, the network has six output nodes (one for each expression) connected to the previous layer. These output nodes can be changed if more expressions are to be considered, after which the network must be fine-tuned again. The output node with the maximum value determines the expression assigned to the image.

Table 4.3 compares our architecture with recent proposals in the literature. Note that the architectures in [36, 7, 67] are more complex than the rest. In [36] the authors use 6 convolutional layers and 2 residual blocks consisting of 4 convolutional layers. Paper [7] uses 1 convolutional layer and 2 blocks; each block consists of two parallel paths, the first with 2 convolutional layers and the second with 1 pooling layer and 1 convolutional layer. In [67] they use 2 convolutional layers and 3 modules, each consisting of 4 parallel convolutional layers.

Figure 4.7. Architecture of the proposed CNN, with 5 convolutional layers, 3 pooling layers and two fully connected layers.
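The layer sizes above can be checked with the standard valid-convolution arithmetic, output = (input − kernel) / stride + 1. The short sketch below traces the spatial sizes through the five convolutional layers; the position of the third pooling layer is not specified in the text, so the trace stops after the fifth convolutional layer.

```python
def out_size(size, kernel, stride=1):
    # Output side length of a valid (no-padding) convolution or pooling.
    return (size - kernel) // stride + 1

size = 150                   # input: 150x150 grayscale image
size = out_size(size, 11)    # conv1: 11x11 kernel, 32 filters -> 140x140
assert size == 140
size = out_size(size, 2, 2)  # pool1: 2x2 max-pooling, stride 2 -> 70x70
assert size == 70
size = out_size(size, 7)     # conv2: 7x7 kernel, 32 filters -> 64x64
size = out_size(size, 7)     # conv3: 7x7 kernel, 32 filters -> 58x58
size = out_size(size, 2, 1)  # pool2: 2x2 max-pooling, stride 1 -> 57x57
size = out_size(size, 5)     # conv4: 5x5 kernel, 64 filters -> 53x53
size = out_size(size, 5)     # conv5: 5x5 kernel, 64 filters -> 49x49
print(size)                  # -> 49
```

The resulting 49×49×64 feature maps are then flattened and fed to the two 512-neuron fully connected layers and the six output nodes.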

Weight initialization is an important step in neural networks, as a careful initialization of the network can speed up the learning process and provide better accuracy after a fixed number of iterations. Therefore, we carried out a study of the weight initialization techniques most used in CNNs. In Table 4.4 we show accuracy results with different weight initializations. These initializations consist of combinations of the Xavier [45] (used in the experiments whose results are shown in Table 4.2), MSRA [53] and Gaussian [26] methods. The Gaussian method uses a standard deviation of 0.01. We trained our CNN using k-cross-validation (described in Section 4.5.2) with each initialization method. We can see in Table 4.4 that the combination of the Xavier and Gaussian methods, and the combination of the Gaussian and MSRA methods, result in higher average accuracy values (marked in bold).
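As a sketch of what these three fillers compute, the following gives one common formulation of each (the normal-distribution forms of Xavier/Glorot and MSRA/He; the exact filler implementations used in the experiments may differ in scaling details, e.g. some frameworks use a uniform variant of Xavier).

```python
import math
import random

rng = random.Random(0)

def gaussian_fill(fan_in, fan_out, std=0.01):
    # Fixed-std Gaussian filler (std = 0.01, as in the experiments).
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

def xavier_fill(fan_in, fan_out):
    # Xavier/Glorot (normal form): variance scaled by the average fan,
    # keeping activation variance roughly constant across layers.
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

def msra_fill(fan_in, fan_out):
    # MSRA/He: variance scaled by fan-in, derived for ReLU activations.
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

The three differ only in how the standard deviation of the zero-mean Gaussian is chosen: fixed (Gaussian), scaled by the average fan (Xavier), or scaled by the fan-in (MSRA).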

Model    [36]    [60]   [7]      [67]   [82]     [46]     [44]   [89]   Our Model
Images   128×96  32×32  224×224  48×48  224×224  224×224  96×96  96×96  150×150
LRN13    No      No     Yes      No     No       Yes      No     No     No
Conv.    6+2*    2      1+2*     2+3*   13       5        3      4      5
Pooling  3       2      5        4      5        3        3      4      3
Dropout  2       0      1        0      2        2        1      2      1
FC       2       1      1        2      3        3        1      1      2

Table 4.3. Architectures of recent models in the literature, compared with our model. These models have been trained and tested with the CK+ dataset to classify the 6 basic expressions. *The authors use a more complex architecture.

[Table 4.4 — columns: Initialization, Mean, σ]

The low standard deviation of these combinations also indicates that all the accuracy values are close to the average. For these reasons, we have decided to use, in all our experiments, the Gaussian + MSRA initialization (i.e. a Gaussian filler is used for the convolutional layers and an MSRA filler for the fully connected layers). The loss is calculated using a logistic function of the softmax output, as in several related works [82, 103, 68]; the activation function of the neurons is the ReLU, which generally learns much faster in deep architectures [27]; and the method used to update the weights between neurons is the Adam method [46], since it shows better convergence than other methods.
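These three components (softmax log loss, ReLU activation, and the Adam update) can be sketched in isolation. The sketch below is an illustrative pure-Python rendering with a scalar Adam step using the standard default hyper-parameters; it is not the actual training code.

```python
import math

def relu(x):
    # ReLU activation: zero for negative inputs, identity otherwise.
    return max(0.0, x)

def softmax(logits):
    # Convert the six output-node values into class probabilities.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_log_loss(logits, label):
    # Negative log-likelihood of the true class under the softmax.
    return -math.log(softmax(logits)[label])

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update for a single weight w with gradient g: update the
    # biased first/second moment estimates, correct the bias, then step.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

For six equal logits the loss is log 6 ≈ 1.79, the value of total uncertainty over the six expressions; training drives the loss below this by moving each weight against its gradient, with Adam adapting the step size per weight.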

13 Local Response Normalization (LRN) is a layer that square-normalizes the pixel values in a feature map in a local neighborhood [46].
