
Structure of Thesis

The thesis is structured as follows:

Chapter 1 introduces the work in this thesis.

Chapter 2 covers the necessary background theory.

Chapter 3 describes relevant works and datasets for 3D facial reconstruction.

Chapter 4 contains the proposed method implementation.

Chapter 5 presents the evaluation pipeline and results on the MICC Florence dataset.

Chapter 6 provides the conclusion and outlines further work.

Appendix A lists relevant code from our implementation.

Appendix B contains the installation manual for our prototype.

Chapter 2

Background

This chapter covers the necessary background theory for understanding the methodology described later in this thesis. Only relevant theory is covered; to gain further insight, the reader is encouraged to examine the referenced sources.

Section 2.1 covers the basic concepts of Convolutional Neural Networks (CNNs).
Section 2.2 examines two relevant CNN architectures.
Section 2.3 covers relevant data augmentations for image processing.
Section 2.4 looks at example uses of synthetic data.
Section 2.5 introduces generative models for faces.
Section 2.6 describes the FaceGen software.
Section 2.7 defines relevant matrix transformations.
Section 2.8 covers UV mapping and UV position mapping.
Section 2.9 gives a description of a point cloud alignment algorithm.

2.1 Convolutional Networks

If the reader is not familiar with the biological background of neural networks and the basic artificial neural network perceptron, the reader is encouraged to read Nielsen [8] or Goodfellow et al. [2, pp. 164-224]. Unless explicitly stated otherwise, the theory in this section is from Goodfellow et al. [2].

Convolutional networks, also known as convolutional neural networks (CNNs), are neural networks which contain at least one convolutional layer. Typically a CNN contains one or more convolutional layers interspersed with pooling layers, followed by one or more fully connected layers at the end. The following sections 2.1.1-2.1.11 detail the CNN basics.

2.1.1 Convolution operator

The name convolutional network comes from the mathematical operation which these networks use, namely convolution. The convolution operation can be defined as an operation on two functions x and w of a real-valued argument t:

s(t) = \int x(a) w(t - a) \, da. \qquad (2.1)

The operation is typically denoted with an asterisk: s(t) = (x * w)(t). As the data in computer applications usually are discrete, we define a discrete convolution:

s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a) w(t - a). \qquad (2.2)

The x in equations 2.1 and 2.2 is, in CNN terminology, referred to as the input, while w is called the kernel. The output is referred to as a feature map. The input in computer applications is usually a multidimensional array of data, while the kernel is usually a multidimensional array of parameters, or weights. The weights are what the learning algorithm adapts. Assuming that the functions are zero everywhere except at a finite set of points, the infinite summation can be replaced by a summation over a finite number of array elements. Additionally, convolutions are often used over more than one axis at a time, for example over a two-dimensional image. With a two-dimensional image I as input and a two-dimensional kernel K, the convolution is defined as:

s(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n) K(i - m, j - n). \qquad (2.3)

As convolution is commutative, equation 2.3 is equivalent to equation 2.4:

s(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i - m, j - n) K(m, n). \qquad (2.4)

The commutative property arises because the kernel is flipped relative to the input: the input index increases with m as the kernel index decreases. In practice, a more commonly used function is the cross-correlation function. It is the same as a convolution, but without flipping the kernel:

s(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n) K(m, n). \qquad (2.5)

An example of a cross-correlation can be found in figure 2.1. As many neural network libraries implement the cross-correlation function, the two are not differentiated any further, and both are referred to as convolutions.

Figure 2.1: Figure from Goodfellow et al. [2] showing an example of a cross-correlation, with input, kernel and output
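To make equation 2.5 concrete, the following is a minimal sketch of a "valid" two-dimensional cross-correlation written in Python with NumPy. The function name and the choice of valid padding (no zero-padding) are illustrative assumptions, not part of any referenced implementation.

import numpy as np

def cross_correlate2d(image, kernel):
    # Valid 2D cross-correlation, equation 2.5: no kernel flip, no padding.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over m, n of I(i + m, j + n) * K(m, n).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out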

2.1.2 Fully Connected Layer

The traditional feedforward neural networks, also called multi-layer perceptrons (MLPs), predict a category from an input using perceptrons. The input is passed through layers of perceptrons to approximate a function mapping the input to the output. Increasing the number of layers allows the network to approximate more complex functions. Each of the outputs from the previous layer is passed through every perceptron in the following layer. The traditional feedforward neural network uses fully connected layers to predict categories from an input, while CNNs contain at least one convolutional layer.

2.1.3 Convolutional Layer

Convolutional layers use convolutions to compute their output. The convolution operation entails three beneficial properties, namely sparse connectivity, parameter sharing and equivariance to translation.

Sparse connectivity

In fully connected layers every input unit interacts with every output unit through a matrix multiplication. In convolutional layers, however, the convolution kernels are smaller than the input, so each output unit is connected only to the subset of input units covered by the kernel. This results in sparse connectivity. When passing an image through a convolutional layer the input image might be thousands or millions of pixels, while a kernel that can detect meaningful image features might be tens or hundreds of pixels. The result of sparse connectivity is a reduced model size and fewer mathematical operations.

Parameter Sharing

In a fully connected layer each weight is used only once when computing the output, while in convolutional layers each weight is used multiple times. This is called parameter sharing and further reduces the storage needed by the network. As each kernel is used on every input unit, the learning algorithm has to learn far fewer weights.
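As a rough illustration of sparse connectivity and parameter sharing, the sketch below compares the weight count of a small convolutional layer against a fully connected layer operating on the same input. All layer sizes are arbitrary assumptions chosen for the example.

# Illustrative parameter counts; layer sizes are arbitrary assumptions.
input_h, input_w, input_c = 64, 64, 1        # a 64x64 grayscale input
kernel_h, kernel_w, n_kernels = 3, 3, 16     # 16 shared kernels of size 3x3
n_outputs = 64 * 64 * 16                     # same spatial size, 16 feature maps

conv_weights = kernel_h * kernel_w * input_c * n_kernels    # 144 shared weights
dense_weights = (input_h * input_w * input_c) * n_outputs   # 268,435,456 weights

print(conv_weights, dense_weights)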

Equivariance to translation

Convolutional layers also have the property of equivariance to translation. The particular form of parameter sharing found in convolutional layers means that any translation of the input causes the same translation of the output. This property is very useful when detecting edges in an image, as the detected edge locations are simply translated together with the image if the input image changes position.

2.1.4 Pooling layer

A typical convolutional network layer consists of three stages. In the first stage, the layer performs several convolutions to produce a set of linear activations. The outputs of the linear activations are then passed through a nonlinear activation function. In the third and final stage, some form of pooling function further modifies the output of the nonlinear activation function.

The pooling function produces a summary statistic of nonlinear activation outputs. An example of a pooling function can be the max pooling function. This function simply reports the maximum nonlinear activation output within a set neighbourhood.
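A minimal max pooling sketch in NumPy is shown below. The 2x2 window with a non-overlapping stride is a common but illustrative assumption, as is the function name.

import numpy as np

def max_pool2d(feature_map, pool=2):
    # Report the maximum activation within each non-overlapping pool x pool window.
    h, w = feature_map.shape
    out = np.zeros((h // pool, w // pool))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * pool:(i + 1) * pool,
                                    j * pool:(j + 1) * pool].max()
    return out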

2.1.5 Transposed Convolutional Layer

A transposed convolutional layer is similar to a convolutional layer, but uses the transpose of the convolution matrix to compute its output, which upsamples the input rather than downsampling it. Transposed convolutional layers usually take feature maps predicted by a neural network and use them to predict an aspect of the input image that produced those feature maps.

Transposed convolutional layers are for instance used in deconvolutional networks to predict the original input image from a set of feature maps [9].
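Most deep learning libraries provide a transposed convolutional layer directly. The sketch below uses tf.keras as one common choice (not necessarily the library used in this thesis); the filter count, kernel size and input shape are illustrative assumptions.

import tensorflow as tf

# Upsample a batch of 16x16 feature maps with 64 channels to 32x32 with 32 channels.
upsample = tf.keras.layers.Conv2DTranspose(
    filters=32, kernel_size=4, strides=2, padding="same", activation="relu")

feature_maps = tf.random.normal((1, 16, 16, 64))  # a batch of one
output = upsample(feature_maps)                   # shape (1, 32, 32, 32)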

2.1.6 Training a CNN

The weights of a CNN are trained similarly to those of a fully connected neural network. After calculating the network's prediction loss with the selected cost function, typically the squared loss, the update gradient is calculated and passed backwards through the network via backpropagation.
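As a sketch of what this looks like in practice, the snippet below builds a small CNN and trains it with a squared-error loss in tf.keras. The library, the architecture and all sizes are assumptions for illustration only.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

# Squared loss, as mentioned above; gradients are backpropagated per batch.
model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, epochs=10)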

2.1.7 Transfer Learning

Using a previously trained network to initialize the weights of, or predict features for, another network is called transfer learning. Transfer learning is applicable when a prediction application lacks labeled training data or wants to make use of a large existing dataset. It is particularly relevant in image processing problems, as datasets can contain millions of pictures [10] and labeled data can be hard to come by. There are two main transfer learning approaches: using a pretrained CNN as a feature extractor, and fine-tuning a pretrained CNN [3].

When fine-tuning a pretrained CNN, the classification layers of the pretrained CNN are replaced and trained on data relevant to the new problem. Some of the upper layers of the pretrained network might be frozen to reduce overfitting. Using a lower learning rate is crucial to limit large gradient updates in the pretrained layers. If we want to further train a pretrained network on a large dataset, fine-tuning can be an effective approach [3].
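A typical fine-tuning setup is sketched below using tf.keras with an ImageNet-pretrained backbone. The specific backbone, the number of frozen layers, the output size and the learning rate are all illustrative assumptions rather than the configuration used in this thesis.

import tensorflow as tf

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3), pooling="avg")
for layer in base.layers[:100]:       # freeze early layers to reduce overfitting
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),   # new task-specific head
])

# A low learning rate limits large gradient updates in the pretrained layers.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy")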

2.1.8 Generalization, Overfitting and Underfitting

The goal of a CNN is to perform well on new, previously unseen relevant inputs. The ability to do well on unseen inputs is called generalization. Observing the generalization is simply done by labeling a percentage of the input data as validation data, on which the network never optimizes, but simply evaluates. When the network is unable to further lower the error cost of the validation data, or validation loss, the network should not optimize any further.

A network is overfitted if it has a low error on the training data but a high error on new, unseen data; overfitting occurs when the gap between these loss values is too large. If the network is unable to reach a low error on the training data, the network is suffering from underfitting. By increasing or reducing the number of layers and network parameters, we may control the capacity of the network. This capacity affects the network's likelihood of overfitting or underfitting.

2.1.9 CNN Hyper Parameters

This section details hyper parameters relevant for the network implementation described later in this thesis (section 4.3.2). Hyper parameters are the variables defining the network architecture and learning parameters. Selecting the best hyper parameters for a neural network is not a trivial problem; it is done through experimentation and intuition. One approach to finding the best parameters is a grid search, which manually or automatically trains the target network with all relevant hyper parameter combinations and selects the best one. To save resources and time, an alternative approach is to first set the hyper parameters based on experience and intuition, and then grid search only the hyper parameters considered most relevant. A sketch of such a search is given below.
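A grid search over two hyper parameters can be expressed as nested loops over the candidate values. The helper below that trains the network and returns a validation loss is a hypothetical stand-in, and the candidate values are arbitrary assumptions.

import itertools

def build_and_evaluate(learning_rate, batch_size):
    # Hypothetical stand-in: train the target network with these hyper
    # parameters and return the resulting validation loss.
    return learning_rate * batch_size  # dummy value for illustration

learning_rates = [1e-2, 1e-3, 1e-4]
batch_sizes = [16, 32, 64]

best = None
for lr, bs in itertools.product(learning_rates, batch_sizes):
    val_loss = build_and_evaluate(learning_rate=lr, batch_size=bs)
    if best is None or val_loss < best[0]:
        best = (val_loss, lr, bs)

print("best (val_loss, lr, batch_size):", best)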

Network Architecture

When deciding on the architecture of the network, the appropriate number of units and number of layers are important. Increasing the number of layers in a CNN generally produces better results, but at a certain depth the gain in prediction accuracy drops as the network becomes harder to optimize. Increasing the depth of the network also increases the number of weights to optimize, which affects the performance of the network. The size of the different layers affects the network's ability to learn as well. To maximize generalization it is important to find the best ratio between depth and size, which is found through experimentation and careful monitoring of the validation loss. It is also useful to draw inspiration from previously successful network architectures when creating a new network.

Activation Functions

Activation functions define the output value of a unit from its inputs. We will go through the two activation functions relevant for the network implementation later in this thesis. Plots of the activation functions are found in figure 2.2.

Figure 2.2: Plots of the (a) Sigmoid and (b) ReLU activation functions, from the CS231n webpage [3]

Sigmoid

The Sigmoid activation function is defined as σ(x) = 1/(1 + e^{−x}). The function takes a numerical input and squashes it into a number between 0 and 1. Large negative numbers become 0 while large positive numbers become 1. One issue with the sigmoid function is that it saturates and kills gradients, as the gradient near outputs of 0 or 1 is almost zero. The sigmoid function is also not zero centered, which causes undesirable learning behaviour [3].

ReLU and Leaky ReLU

A commonly used activation function in recent years is the Rectified Linear Unit (ReLU). It is defined by the function f(x) = max(0, x): the activation is 0 if the input value is below 0, and equal to the input otherwise. It has been found to greatly speed up learning compared to the sigmoid function, and is computationally inexpensive. A problem with the ReLU function is dead units: if the gradient update to a ReLU unit is too large, the unit may stop activating and never update again [3].

A proposed solution to the dead ReLU unit problem is the Leaky ReLU. Instead of returning zero when the input is smaller than 0, the unit outputs the input scaled by a small slope. This is thought to relieve the dead ReLU unit problem, but results have varied [3].
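The three activation functions discussed above are one-liners in NumPy, as sketched below; the leaky slope of 0.01 is a common but arbitrary choice.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes the input into (0, 1)

def relu(x):
    return np.maximum(0.0, x)              # zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope instead of zero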

Convolution Kernels

For each network layer, the number of kernels and their size need to be specified. Increasing the kernel sizes increases the capacity of the network, but in turn increases its storage requirements. Balancing the size and capacity of the network is important to improve network generalization [11].

When applying the specified kernels to the input of each layer, the stride of the kernels needs to be specified. In the example in figure 2.1, the stride is 1, as the kernel moves one space for each convolution. To reduce the output size, the stride can be increased, for example by moving the kernel 2 steps over the input for each kernel operation. Using a stride of 2 in the example in figure 2.1 would produce 4 outputs instead of 6. Increasing the stride may however discard information, as some inputs are not covered by the kernel. This is useful if the input is of too high resolution and the goal is to down sample the input data, so selecting the correct stride for a network layer is important when down sampling is relevant.

A problem when applying convolution kernels is that some information might be lost, as the kernel is applied fewer times at the perimeter of the input [11]. By padding the input with extra values, typically zeros (zero-padding), the effective size of the input is increased so that the kernel is applied across the entire input. Zero-padding thus preserves information and the spatial dimensions.
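The combined effect of kernel size, stride and zero-padding on the output size can be summarised with the standard formula for the spatial output dimension, sketched below as a small helper; the example values are arbitrary.

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Spatial output size along one axis: floor((n + 2p - k) / s) + 1.
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(64, 3, stride=1, padding=0))  # 62: output shrinks slightly
print(conv_output_size(64, 3, stride=2, padding=0))  # 31: stride 2 roughly halves it
print(conv_output_size(64, 3, stride=1, padding=1))  # 64: zero-padding preserves size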

Learning rate, threshold

When training a network the learning rate also needs to be specified. The learning rate is used by the optimizer to scale the size of the parameter updates. The best learning rate depends on the optimizer, and is found empirically. The learning rate might also change depending on how long the network has trained for, and there are several approaches for modifying it during training. One way is to cut the learning rate in half every X epochs. This makes the update steps smaller and leads to the network making smaller adjustments during the later stages of training.
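Halving the learning rate every fixed number of epochs can be written as a simple schedule. The sketch below uses a tf.keras callback as one possible realisation; the initial rate and the halving interval are illustrative assumptions.

import tensorflow as tf

def halve_every(epochs_per_halving=10, initial_lr=1e-3):
    def schedule(epoch, lr):
        # Halve the initial learning rate once per interval of epochs.
        return initial_lr * 0.5 ** (epoch // epochs_per_halving)
    return tf.keras.callbacks.LearningRateScheduler(schedule)

# model.fit(x_train, y_train, epochs=50, callbacks=[halve_every(10)])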

2.1.10 Regularization

To increase the generalization of the network, several regularization techniques have been proposed. In this section two regularization techniques relevant for the network implementation later in this thesis are covered.

Data augmentation changes the training data to increase the network's performance on new input data. By augmenting the data, the dataset size can be increased and more relevant real world cases can be covered. Data augmentations relevant to the network training implementation in section 4.3.3 are covered in section 2.3.

Monitoring the network's performance on validation data, and stopping the training when the validation loss stagnates, is another regularization technique. This technique is called early stopping and is easy to implement. After each training epoch, the validation loss is measured. If the validation loss is the lowest recorded, the network instance is saved and the training continues. If the validation loss is higher than the lowest recorded validation loss, the network instance is not saved and the training continues. If the network does not improve within a set threshold of epochs, the training is terminated.
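The early stopping procedure described above maps directly onto standard library callbacks. The tf.keras sketch below is one possible realisation; the patience value and checkpoint filename are arbitrary assumptions.

import tensorflow as tf

callbacks = [
    # Stop training if the validation loss has not improved for 10 epochs,
    # and restore the weights from the best recorded epoch.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Save the network instance whenever the validation loss reaches a new low.
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                       save_best_only=True),
]
# model.fit(x_train, y_train, validation_split=0.1, epochs=100, callbacks=callbacks)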

2.1.11 Adam Optimizer

After defining a loss function, an optimization algorithm is applied to minimize the loss. In this thesis the Adam [12] optimizer is used. Adam combines the techniques of previous relevant optimization algorithms and requires little tuning. It is an adaptive learning rate algorithm and is closely related to RMSProp [13] and AdaGrad [14].

Adam uses minibatches to increase performance: the gradient is computed over batches instead of the whole dataset, which supports parallelization. Adam also implements momentum through exponentially weighted moving averages, to reduce the chance of the algorithm getting stuck in a local minimum. The momentum estimates and the user-defined learning rate together determine the scale of the update applied to the network. The algorithm is robust, but may encounter problems if the gradients have significant variance; a solution can be to increase the batch size.
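In practice the optimizer itself is a one-line choice. The tf.keras sketch below configures Adam with an explicit learning rate and batch size; the model, the learning rate and the batch size are illustrative assumptions, not the settings used in this thesis.

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

# Adam with an explicit learning rate; the moment-decay defaults (beta_1, beta_2)
# are usually left untouched.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=optimizer, loss="mse")

# Minibatch gradients: the batch size is passed when fitting the model.
# model.fit(x_train, y_train, batch_size=32, epochs=50)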