Ultra-Low
Image Resolution Human Identification
Akhsarbek Gozoev
Thesis submitted for the degree of Master in Robotics and Intelligent Systems
60 credits
Department of Informatics
Faculty of mathematics and natural sciences
UNIVERSITY OF OSLO
© 2020 Akhsarbek Gozoev
Ultra-Low Image Resolution Human Identification
http://www.duo.uio.no/
Abstract
Human identification based on gait is a rapidly growing field of research and a natural successor to other recognition methods, such as face and fingerprint recognition. Like face recognition, gait identification can identify subjects at a distance, but it can also do so from multiple view angles, something face recognition lacks. One of the challenges of gait recognition is the distance between the subject and the camera, which determines the person's size in the image: the further the subject is from the camera, the smaller he/she appears. Modern neural networks restrict the image size, requiring all input images to be of the same, previously determined size. The proposed method addresses this issue by creating a system able to recognize subjects regardless of the input image resolution. The complete system consists of three subsystems. First, a super-resolution network recreates a scaled version of the input image, which is then fed into an image2vec type of network to produce a feature vector corresponding to each subject. Finally, the person's identity is determined using a nearest neighbour classifier. Multiple neural network architectures for gait recognition with super-low resolution images are designed.
Keywords— SFDEI, GAIT Identification, Multi-resolution image, Siamese neural network, ArcFace, SuperResolution
Acknowledgements
I would like to acknowledge both of my supervisors, Yumi Iwashita and Jim Torresen. I thank Yumi Iwashita for her constant assistance throughout the year, for providing insightful information about this field of research, and for actively and frequently discussing ideas and ways of improvement.
I thank Jim Torresen for creating an opportunity for students at ROBIN (Robotics and Intelligent Systems, University of Oslo) to enrol in the Collaboration on Intelligent Machines (COINMAC) program. Additionally, I would like to thank him for sharing his experience in academic writing and helping me produce a better document than it would ever have been without his assistance.
Finally, I would like to thank my mother, Iveta Bugulova, for supporting me during this year abroad. Thanks to all co-students and friends, both in Oslo and Pasadena, for all the good memories these past two years.
Contents
1 Introduction
1.1 Motivation
1.2 Project goals
1.3 Contributions
1.4 Structure of the thesis
2 Background
2.1 Brief history of deep learning
2.2 Types of Machine Learning
2.3 Black Box approach
2.4 Perceptron
2.4.1 Dense layer
2.4.2 Multi-layer Perceptron (MLP)
2.5 Convolutional Neural Network (CNN)
2.6 Pooling layer
2.7 Activation functions
2.8 Backpropagation
2.9 Data in Machine Learning
2.9.1 Lack of data
2.9.2 Changing number of classes
2.9.3 Few-shot learning
2.10 ML optimisation strategies
2.10.1 Data normalization
2.10.2 Batch normalization
2.10.3 Dropout
2.10.4 Label smoothing
2.10.5 Data augmentation
3 Person Identification System Using Deep Machine Learning
3.1 System overview
3.2 Feature Extraction using Deep Learning
3.2.1 Feature vector
3.2.2 Feature vector extraction
3.3 image2vec Network type
3.3.1 Siamese network
3.3.2 Base Model
3.3.3 ArcFace
3.4 Feature Comparison
3.4.1 Nearest Neighbour Classifier
3.4.2 Support Vector Machine
3.5 Ensemble neural network
4 Improved gait features extraction
4.1 Extracting SFDEI from video
4.1.1 Background subtraction
4.1.2 From silhouettes to Gait energy image (GEI)
4.1.3 SFDEI: New features for gait identification
4.2 Dataset Considerations
4.3 Human ID Gait Challenge Dataset (USF)
5 Achieving scale invariance
5.1 Traditional up-scaling methods
5.2 Super Resolution approach
5.2.1 SRResNet
5.2.2 proSR
6 Experiments and Results
6.1 Testing environment
6.2 Preparation phase
6.3 Siamese vs Triplet Loss network
6.4 Training ResNet and its variations
6.5 ArcFace as feature extractor
6.6 proSR as up-scaling method
6.7 Divide SFDEI into separate channels
6.8 Divide SFDEI into head, torso, legs
6.9 Pooling layers with fixed output size
6.10 Upscale vs Downscale
6.11 Nearest Neighbour vs SVM
6.12 Transfer learning approach
6.13 Network attention heatmaps
7 Conclusion and Future work
7.1 Conclusion
7.2 Future work
Chapter 1
Introduction
1.1 Motivation
The use of biometrics as a security measure is slowly but surely taking over from methods like PIN codes and passwords. Specialists recommend [23] creating a unique password for each new service that requires authentication. In the long run, it becomes troublesome to come up with and remember secure combinations that are hard to guess or brute-force. In the era of the first personal computers (PCs), computational power was limited, and older computers could not perform the same heavy-lifting operations (e.g. speech recognition or image analysis) as today's PCs can. Therefore, when designing text-based authentication methods, people had to find a compromise between usability and complexity. The decision made favoured complexity to match the computational power available at that time. Furthermore, remembering passwords may not have been as big an issue in the past because the typical PC user had nowhere near as many services requiring authentication as there are today. Modern computers have far more processing power and can take on more substantial tasks to ease the user's everyday life, like remembering passwords. Biometric identification (bio-ID) systems are always accessible to users as they use their very own bodies as identification. The need to remember dozens of passwords fades away, allowing us to focus on more important things.
Up until now, fingerprint [20] and face recognition [39] have been dominating the bio-ID market. Fingerprints provide high accuracy, while face recognition can identify subjects at a distance. By introducing new bio-ID methods, one can discover ways to incorporate advantages of the existing methods and come up with new exciting ways to solve the identification problem.
Fingerprint-based methods require the subject to make physical contact with the fingerprint sensor. This action can be troublesome in the long run. Furthermore, multiple people using the sensor throughout the day can lead to the spread of bacteria and even viruses. Additionally, the fingertip used for identification must be completely dry, or the sensor may struggle to identify the target successfully due to interference and bad contact. Even a hint of sweat, e.g. on a sunny day or after a workout, can lead to sensor failure, restricting one's access to an apartment if the sensor controls the main entrance.
Face-based identification (face-ID) methods solved this issue, but in the process introduced another one. The subject has to face the camera for face-ID to work. All face-ID systems share this constraint by the very nature of the method.
Gait-based identification (gait-ID) is another branch of the bio-ID field. It allows for identification at a distance and, in theory, can do so regardless of the angle of the camera relative to the subject. Additionally, it does not depend on having a clear shot of the subject's face, which can be covered by a face mask or a scarf. In summary, gait-ID combines the advantages and solves some shortcomings of previous bio-ID methods, taking humanity closer to easy and secure identification.
Gait-ID may not have reached the accuracy levels of face-ID methods yet. Nevertheless, the ability to identify subjects from various angles makes gait-ID systems superior in various areas, such as medical-assistant robots. These robots are required to follow staff or patients around the premises. Often the robot places itself behind the subject and tries not to interfere. Using face-ID or similar methods here would be problematic as they have no way of identifying people from behind.
Many closed-circuit television (CCTV) cameras are already installed to monitor corporate areas. Typically they are of a lower resolution due to multiple factors. One example is storage: preserving higher-quality footage over extended periods requires more space. Because of the lower resolution of the footage, using conventional deep learning systems results in lower identification accuracy. This thesis addresses the current issue with low-resolution human identification by introducing a new way to extract data from images and modifying existing networks to account for this information.
1.2 Project goals
Verification, identification and re-identification based on gait features are not novel ideas. The very first approaches used classical machine vision. The Support Vector Machine (SVM, described in Section 3.4.2) produces decent results, but is heavily outperformed by the latest approaches using deep machine learning, a rapidly growing field with many exciting breakthroughs. To this day, face and fingerprint recognition methods outperform gait recognition. This high accuracy comes at a cost. In both of these methods, the user must perform specific actions. Touching the fingerprint sensor or facing the camera is the bare minimum.
Thus, the main goal of this project is to overcome these weaknesses by developing a system that can passively identify a subject, without any unwanted actions by the user. Although beating previously mentioned methods may be optimistic, improving the existing results of gait recognition will contribute to the field. Current gait-ID systems struggle to identify subjects that are far away. Depending on the distance to the camera, subjects appear smaller in the video stream, making it challenging even for the human eye.
The goals of this thesis are as follows:
1. Develop a system that uses improved features for gait recognition to increase the identification accuracy of subjects regardless of the distance to the camera.
2. Track and record the accuracy values across multiple distances to the camera and produce a summarised table for future reference.
1.3 Contributions
Contributions of this thesis are three-fold:
• Multiple new model architectures are developed. These models combine ideas from face recognition, image classification and previous gait research to achieve state-of-the-art performance in low-resolution deep learning gait identification.
• New data augmentation methods tailored to the gait identification field are proposed in Section 6.8.
• Additionally, dozens of tests with various architecture combinations were performed to gain an intuitive understanding of the direction in which further research should continue. The unsuccessful experiments are also presented, each followed by an explanation of why it failed. Basic rules are developed that later help to construct a network that achieves higher accuracy on ultra-low resolution images using machine learning.
1.4 Structure of the thesis
This thesis makes no assumptions of the readers’ knowledge in the field.
Therefore, technical terminology will be explained to make the understanding of the text as easy as possible.
Chapter 2 is a background chapter. It presents the field of machine learning, its origins, its historical development until now, and all of the ideas necessary to follow this thesis to the end.
Chapter 3 describes the complete system of person identification using deep machine learning. The system is a combination of three smaller systems: an image upscaling neural network, a feature extractor and a feature classification system. All of these sub-systems are thoroughly described in the corresponding sections.
Chapter 4 gives step-wise instructions for improving existing features for gait recognition systems. Additionally, this chapter compares available datasets for human gait identification.
Chapter 5 explains how new scale-invariant datasets can be created from already existing gait datasets. This step is necessary to achieve the accurate scale-invariant human identification described in this thesis.
Chapter 6 presents experiments and their corresponding results. Each section is divided into three parts: the theory behind the current experiment, its implementation and a summary of the results.
Chapter 7 draws the conclusion and presents ideas for future work.
Chapter 2
Background
2.1 Brief history of deep learning
Understanding how the human brain works has always been a challenge. Many theories have been proposed, but one that stands to this day was made by Warren McCulloch and Walter Pitts in [19]. They introduced a model of a neuron, described how neurons in the brain might work and even modelled it with electrical circuits.
This was a single neuron; our brain is made up of billions of these, where every neuron is connected to about 10,000 others. A single neuron performs a simple operation: it calculates the sum of all of its inputs and compares this sum to a threshold. When the output of one neuron is connected to the inputs of many others, they create a complex network able to perform immensely difficult tasks such as classification, cognitive decision making and other intelligent actions that are not yet possible for computers to replicate. This paper sparked interest in the field. In 1949, Donald Hebb published The Organization of Behavior [11], where he proposed that the pathway connecting neurons gets stronger/thicker the more it is used to propagate signals. In simpler terms, the more someone practises something, the better he/she becomes at it.
An illustration of a neuron can be viewed in Figure 2.1, where x_1 to x_n are inputs (perhaps from previous neurons) and y is the output of the current neuron. Each input is connected to the neuron by a path. As mentioned earlier, this path has an amplification factor. This can be modelled in a computer by assigning a weight to it. Each input value is then multiplied by this weight before it is fed into the perceptron. The perceptron then takes all these values and sums them up; the output of this operation is a scalar that is compared to a threshold. If this scalar is over the threshold, the neuron fires and outputs one. If not, the neuron remains un-fired and outputs zero. The output of a neuron can only be binary, one or zero.
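The thresholding behaviour described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `perceptron` and the input, weight and threshold values are hypothetical and chosen for the example, not taken from the thesis.

```python
import numpy as np

def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = np.dot(inputs, weights)  # multiply each input by its weight and sum
    return 1 if weighted_sum > threshold else 0

# Hypothetical example values, chosen only for illustration.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, 0.9, 0.3])

fired = perceptron(x, w, threshold=0.7)      # weighted sum is 0.8, above the threshold
not_fired = perceptron(x, w, threshold=1.0)  # weighted sum is 0.8, below the threshold
```

The same inputs and weights lead to a different binary output depending only on the threshold, which is the sense in which the neuron either fires or stays silent.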
Source: Simple Wikipedia: Neuron
Figure 2.1: Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals
A single neuron performed well at finding simpler functions, but did poorly at higher abstractions. Classifying images is one example. As images are stored as a series of 2D arrays, there was no good method to feed this data to a neural network. One way of doing it was to flatten the 2D array, which resulted in a huge vector. The flattening operation concatenates every row to the first one, so a 2D array of size 5x3 results in a 1D vector of size 1x15, or just 15. This is shown in Figure 2.4. Doing this to a picture transforms it in such a way that even humans would not be able to classify the objects in the image.
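As a small illustration of the flattening operation, the following sketch uses NumPy (an assumption; the thesis does not name a library at this point) to turn a 5x3 array into a vector of 15 values by concatenating its rows.

```python
import numpy as np

image = np.arange(15).reshape(5, 3)  # a 5x3 array holding the values 0..14
flat = image.reshape(-1)             # rows are concatenated one after another

# The first three entries of `flat` are the first row, the next three the second row, and so on.
second_row_start = flat[3]
```

After flattening, pixels that were vertical neighbours in the image end up three positions apart in the vector, which is one way to see why spatial structure is hard to recover from the flattened form.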
Using this type of transformation and fully connected neural networks, one would expect to achieve around 60% accuracy on Cifar10 (a dataset of 32x32 images covering 10 different classes with a total of 60000 examples). To achieve these results, the depth of the neural networks had to increase, and so did the number of trainable parameters. This led to slower training times. At some point, increasing the network's depth would no longer yield the expected results: accuracy would not increase and could even become worse. This was caused by phenomena such as vanishing gradients. These results were impressive, but there was room for improvement. Other techniques had to be developed.
2.2 Types of Machine Learning
There are three major branches in machine learning. These are unsupervised learning, supervised learning and reinforcement learning.
Unsupervised learning tries to find a pattern in data. Based on this pattern, the data is then grouped into clusters. Usually, one does not have any previous knowledge about the data, making it challenging to support the learning process. Because the data is fed into the system and no additional actions are performed, this method is called unsupervised.
In supervised learning, prior knowledge about the data exists or is obtained through data analysis before the training process begins. Usually, this knowledge is in the form of labels, bounding boxes or similar information describing what is shown in the image/video stream. Combining this information with the output of the neural network helps guide the learning process, hence the name supervised learning.
The last major branch of machine learning is reinforcement learning. This type of machine learning takes a different approach compared to the previous two. Reinforcement learning systems do not operate on pre-collected data. Instead, there is an agent and an environment. The agent starts with little to no knowledge about the environment and how it operates. To learn more about the environment, the agent can perform specific actions and receive the state of the environment after those actions have taken place. Given that the agent can track how the environment reacts to its actions, the agent can derive specific rules. Additionally, the environment produces a reward score for each action the agent takes. The goal of the agent is to maximise this reward. This type of machine learning is best suited when the system is placed in an unknown domain and needs to learn how to behave there. One example of practical use is an agent that plays video games [24] given only visual input from the game screen. Real-world uses of the same idea include robot navigation in dynamic environments [42] or a robot discovering its own configuration/anatomy, e.g. how many joints it has and the type of movements it can perform (rotation/extension), and applying this knowledge to improve its coordination.
2.3 Black Box approach
A neural network can be viewed as a black box. A black box is a complicated system whose internal mechanism is not of any importance to the end-user. A black box has some inputs and produces an output from those. In the case of a neural network, all the complex mathematics involved is abstracted away and the operation of the entire system is viewed from the outside. A visual representation of this concept can be viewed in Figure 2.2. This network takes 196,608 inputs and has three outputs. Each of the outputs represents the probability of the input being a cat, a dog or neither. These outputs range from 0 to 1.0. A value closer to zero means a lower probability and vice versa. Figures 2.3 and 2.4 illustrate how the rows of an image are stacked into a one-dimensional vector and fed into a neural network. After the network's inner system has finished its calculations, three output values are ready. In Figure 2.2 the first output value is 0.97, telling us the network concluded there is a 97% probability that the input image is an image of a cat, which in this case is true.
Figure 2.3: Original image
Figure 2.4: Flattened image
Source: Neural Networks : A 30,000 Feet View for Beginners
Figure 2.2: Neural network as a Black Box
Following sections will describe what can hide in this black box and how the output probabilities are calculated.
2.4 Perceptron
It did not take long from the discovery of the neuron until the first virtual version of it appeared. It was named the perceptron. An ideal perceptron should operate exactly as a neuron in the human brain does. A perceptron has inputs, some inner logic to process the inputs, and an output where the result of the calculation is presented. The perceptron is the basic building block of a neural network, in the same manner as neurons are the basic building blocks of the human brain.
Source: The Fundamentals of Neural Networks
Figure 2.5: Perceptron
2.4.1 Dense layer
The next natural step is to group multiple perceptrons. This further imitates the structure of the human brain. A network of multiple perceptrons allows for more complex computation. One row of perceptrons sharing inputs and each having one output value is called a Dense layer or Fully Connected (FC) layer. Dense layers are often called hidden layers as they are not exposed to the outside world. This refers back to the black box example, where only inputs and outputs are presented, and inner layers are hidden.
2.4.2 Multi-layer Perceptron (MLP)
Stacking multiple dense layers forms an even bigger network called a Multi-layer Perceptron (MLP). An MLP has an input layer, one or more dense layers and finally an output layer. All outputs from the previous layer are connected to every perceptron in the following layer. Figure 2.6 shows a neural network with 5 inputs, 2 dense layers and 2 outputs. This figure also illustrates the high number of connections between nodes in this network.
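A forward pass through such a network can be sketched as follows. The sketch assumes a hidden width of 4 perceptrons per dense layer and a ReLU non-linearity between layers (see Section 2.7); both are illustrative assumptions, as the text only fixes the 5 inputs, 2 dense layers and 2 outputs. The weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [5, 4, 4, 2]  # 5 inputs, two dense layers (width 4 is assumed), 2 outputs

# One weight matrix and one bias vector per pair of consecutive layers.
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Propagate an input vector through all layers of the MLP."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b                # every output depends on every input of the layer
        if i < len(weights) - 1:
            x = np.maximum(0.0, x)   # non-linearity on the hidden layers only
    return x

output = forward(np.ones(5))
n_connections = sum(W.size for W in weights)  # 5*4 + 4*4 + 4*2
```

Even this tiny network already has 44 weighted connections, which hints at how quickly the parameter count grows when the input is a flattened image.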
Source: Convolutional Neural Networks - Full Connection
Figure 2.6: Simple neural network
2.5 Convolutional Neural Network (CNN)
CNNs were developed because MLP networks struggle to achieve high accuracy classifying images. This is partially due to the way images are fed into the MLP network.
Convolution is a term from signal processing and describes an operation where a kernel slides across the signal, multiplying each sample (or pixel) by its corresponding kernel value and summing the products.
Convolution can be 1D, 2D, 3D and so on. A kernel is a vector, in the case of 1D convolution, or a matrix with carefully chosen values that produce the wanted output. One such example is smoothing. Given a graph of arbitrary shape, one could apply convolution with the vector [0.25, 0.5, 0.25] and end up with a smoothed version of it, as shown in Figure 2.9.
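The smoothing effect of the kernel [0.25, 0.5, 0.25] can be checked with a short NumPy sketch. The jagged input signal below is hypothetical and chosen only to make the effect visible; `mode="same"` keeps the output the same length as the input, treating values outside the signal as zero.

```python
import numpy as np

signal = np.array([0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0])  # hypothetical jagged signal
kernel = np.array([0.25, 0.5, 0.25])                    # smoothing kernel from the text

smoothed = np.convolve(signal, kernel, mode="same")
```

In the interior of the signal the alternation between 0 and 4 is flattened to a constant 2, while the two border values are lower because the kernel partly hangs over the edge.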
Mathematically, convolution can be formulated as shown in Equation 2.1. Here y is the output result, x is the input signal and h is the kernel. As you may have noticed, the symbol for convolution is $\ast$.
$$y[n] = x[n] \ast h[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n-k] \tag{2.1}$$
Before the deep-learning era, convolution kernels had to be designed manually, carefully picking values that would produce the expected output images.
A simple example of a convolution kernel is the Sobel operator, a 3x3 matrix with the values shown in Figure 2.7. With the conventional definition, Sobel_x responds to intensity changes in the horizontal direction and therefore highlights vertical edges in the image. In the same manner, Sobel_y highlights horizontal edges. Combining both output images produces a grey-scale image highlighting all edges in the image. A system performing these operations is called a Sobel edge detector. Another example is the Gauss kernel (Figure 2.8), a matrix with a high value in the middle and decaying values towards its edges. Applying this kernel to an image produces a smoothed version of the original. Changing the size of the kernel and the rate at which the values decay controls the amount of smoothing.
Source: An Implementation of Sobel Edge Detection
Figure 2.7: Sobel_x and Sobel_y kernels
Source: Discrete approximation of the Gaussian kernels 3x3, 5x5, 7x7
Figure 2.8: Gauss kernels of three different sizes
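The Sobel edge detector can be sketched on a small synthetic image. The kernel values below are the commonly used Sobel definition and are an assumption here, since the values of Figure 2.7 are not reproduced in the text; the helper `filter2d` is a hypothetical name. As in most deep learning libraries, the kernel is slid over the image without being flipped. With this definition, `sobel_x` responds to intensity changes along the horizontal axis.

```python
import numpy as np

# Commonly used Sobel kernels (assumed; see Figure 2.7 for the thesis version).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def filter2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

gx = filter2d(image, sobel_x)
gy = filter2d(image, sobel_y)
edges = np.sqrt(gx ** 2 + gy ** 2)  # combined edge strength
```

For this image only `gx` is non-zero, and only in the columns where the window straddles the boundary between the dark and bright halves; `gy` stays zero because all rows are identical.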
Combining convolution operations with dense layers gave rise to a new type of network called the Convolutional Neural Network (CNN). This discovery pushed the image classification accuracy of neural networks to new heights. CNNs achieved over 90% accuracy on Cifar10 and over 99% on the MNIST dataset. CNNs pushed accuracy levels higher, but in the end suffered from the same problems as MLP networks, namely vanishing/exploding gradients.
Source: Intuitive understanding of 1D, 2D, and 3D convolutions in convolutional neural networks
Figure 2.9: Graph smoothing with 1D-convolution
Source: Intuitive understanding of 1D, 2D, and 3D convolutions in convolutional neural networks
Figure 2.10: Simplified version of a 2D convolution showing how the output image highlights where exact copies of the kernel are located in the input image
The next big breakthrough came with the introduction of the Residual Network (ResNet) [9]. The novel idea it introduced was the skip-connection shown in Figure 2.11. Placing skip-connections between CNN layers allowed for more robust data flow in the network. Backpropagation (Section 2.8) especially benefited from skip-connections, as the error could propagate more easily to the earlier layers of the network. Using skip-connections made it possible to create deeper networks, up to several hundred layers deep.
While ResNet had a skip-connection from each layer to the next one, it did not take long before a new architecture, DenseNet, emerged in which all layers are interconnected, as shown in Figure 2.12. This made it possible for deeper layers to learn directly from earlier layers, bypassing the features learned in the middle layers.
Source: ResNet block
Figure 2.11: ResNet block, skip-connection highlighted as red arrow
Another variation of ResNet is the ResNeXt network. The main idea is that making networks wider is more beneficial than making them deeper. ResNeXt has multiple parallel paths from one layer to the next. One visualisation of this can be viewed in Figure 2.13.
2.6 Pooling layer
Every learnable parameter in the network needs to be trained and stored. Depending on the number of kernels, their sizes and the number of convolution layers, one network can consist of billions of learnable parameters. Finding a value for each parameter takes time, so techniques to reduce the number of learnable parameters had to be discovered. The pooling layer is one such technique. These layers are usually placed after convolution layers. During initialization, the pooling kernel size must be chosen. This kernel must not be confused with a convolution kernel, as it does not have any learnable parameters. Instead, it slides across the input "looking" at the values under the kernel and decides which values should be propagated to the output. This decision is based on the type of pooling layer. Two types are presented in the following paragraphs.
Max pooling
A max pooling layer finds the maximum value among those covered by the kernel and preserves it in the output, while the others are discarded. Doing this preserves robust features and lowers the dimensionality of the input. Depending on the kernel size, slight rotation or translation invariance is achieved because the maximum value is chosen regardless of where it lies under the kernel.
Figure 2.12: DenseNet architecture
Figure 2.13: ResNeXt architecture
Source: https://www.researchgate.net
Figure 2.14: Example of max pooling and average pooling operations. In this example a 4x4 image is downsampled to a 2x2 by taking the maximum value or the average value of each sub-region
Average-pooling
An average-pooling layer computes the average of all values covered by the kernel. Doing so lowers the dimensionality, while all previous features are preserved in the output to some degree. Average pooling is more commonly used in modern networks.
Computing an average is more computationally expensive than taking a maximum, as multiple additions have to be performed, as well as a final division.
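Both pooling types can be sketched with NumPy on a 4x4 input reduced to 2x2, matching the setting of Figure 2.14. The pixel values below are hypothetical and not those of the figure, and the helper `pool2d` is an illustrative name; it assumes non-overlapping windows and an input size divisible by the window size.

```python
import numpy as np

def pool2d(image, size, reduce_fn):
    """Apply reduce_fn to every non-overlapping size x size block of a 2D array."""
    h, w = image.shape
    # Split rows and columns into (block index, position inside block).
    blocks = image.reshape(h // size, size, w // size, size)
    return reduce_fn(blocks, axis=(1, 3))

# Hypothetical 4x4 input, chosen only for illustration.
image = np.array([[1, 3, 2, 4],
                  [5, 7, 6, 8],
                  [9, 2, 1, 0],
                  [4, 6, 3, 5]], dtype=float)

max_pooled = pool2d(image, 2, np.max)   # keeps the strongest value of each block
avg_pooled = pool2d(image, 2, np.mean)  # keeps the mean of each block
```

Neither operation has learnable parameters: the only choice made by the designer is the window size and the reduction function.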
Stride
Stride in deep learning is the number of pixels the kernel is moved after each operation. Figure 2.15 shows examples of two stride values. In the first one, the stride equals one, so the kernel is shifted by one pixel between consecutive steps. In the second example, the stride equals two. Depending on the stride and the kernel size, the areas of the image under the kernel may overlap from step to step.
(a) Stride = 1
(b) Stride = 2
Figure 2.15: Stride visualization
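The effect of the stride on the number of kernel positions can be sketched as follows, assuming no padding is applied. The function names and the example sizes (an input of width 6 and a kernel of width 2) are hypothetical and not taken from Figure 2.15.

```python
def output_size(input_size, kernel_size, stride):
    """Number of kernel positions along one axis when no padding is used."""
    return (input_size - kernel_size) // stride + 1

def window_starts(input_size, kernel_size, stride):
    """Index of the first pixel covered by the kernel at each step."""
    return list(range(0, input_size - kernel_size + 1, stride))

# Kernel of width 2 on an input of width 6 (hypothetical sizes).
starts_stride_1 = window_starts(6, 2, 1)  # consecutive windows overlap by one pixel
starts_stride_2 = window_starts(6, 2, 2)  # windows touch but do not overlap
```

Windows overlap whenever the stride is smaller than the kernel size, and a larger stride directly reduces the size of the output.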
2.7 Activation functions
Many activation functions have been proposed since the dawn of machine learning. Choosing an activation function heavily depends on the field it is applied to, or more specifically to the model/layer types used in the neural network.
Sigmoid
In image processing, the most commonly known activation functions are Sigmoid, ReLU and Tanh. Sigmoid was mostly used in the past. Newer research tends to use other functions because of Sigmoid's shortcomings, such as saturating and killing the gradient and outputs that are not zero-centred. For historical reasons, Sigmoid is usually the first activation function introduced in machine learning tutorials, perhaps because it was one of the earliest activation functions and is easy to implement.
ReLU
The Rectified Linear Unit (ReLU) is a piecewise-linear activation function defined as y = max(0, x). ReLU is a computationally cheap function which has gained a lot of popularity in the last decade or so. Many variations of ReLU have been proposed, and in some cases they improve the overall accuracy. As a rule of thumb, however, it is advised to find the general model architecture using the original ReLU and only then start tweaking the activation function, trying Leaky ReLU, Parametric ReLU (PReLU), Exponential Linear Units (ELU, SELU), Concatenated ReLU (CReLU) and many others.
Tanh
Tanh is another activation function. The shape of its graph closely resembles that of Sigmoid, but its range is (-1, 1) instead of (0, 1). The advantages this brings are that negative inputs are mapped to negative outputs and that inputs near zero are mapped to outputs near zero.
The Tanh function is defined as follows:
$$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{2.2}$$
It is nonlinear and can be used in all hidden layers. The gradient flow is stronger for Tanh than for Sigmoid (its derivatives are steeper). Like Sigmoid, Tanh also suffers from the vanishing gradient problem. In practice, Tanh is generally preferred over Sigmoid, since its zero-centred output makes optimisation easier. It is also common to use Tanh in state-to-state transition models (recurrent neural networks).
Tanh has been tried in this thesis, but the overall accuracy decreased by 10% compared to ReLU.
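The three activation functions discussed above can be written directly from their definitions. The following NumPy sketch is illustrative only; the Tanh implementation uses the closed form of Equation 2.2, and the sample points in `x` are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Passes positive inputs through unchanged and maps negative inputs to zero."""
    return np.maximum(0.0, x)

def tanh(x):
    """Closed form of Equation 2.2; output range is (-1, 1)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

x = np.linspace(-3.0, 3.0, 7)  # sample points -3, -2, ..., 3
sigmoid_values = sigmoid(x)
relu_values = relu(x)
tanh_values = tanh(x)
```

Comparing the three output arrays shows the differences named in the text: Sigmoid outputs are all positive, Tanh outputs are centred on zero, and ReLU simply clips negative inputs.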
Softmax
Softmax is a function that takes an input vector of n real numbers and translates it into a probability distribution consisting of n probabilities proportional to the exponentials of the input numbers. Equation 2.3 shows how softmax is computed mathematically. Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would. These attributes make softmax a good fit as the activation function for the final layer.
$$s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \tag{2.3}$$
To demonstrate softmax, let us imagine a network that detects shrimps in an image solely based on the number of legs of the creature.
Shrimp anatomy: number of legs
Shrimp have five pairs of jointed legs on the thorax used for walking and five pairs of swimming legs.
That’s a total of 5∗2+5∗2=20 legs.
This network is fed images of a dog, cat, flamingo, shrimp and a starfish, in this particular order. The inner part of the network counts the legs on these animals and produces a vector [4, 4, 2, 20, 5]. This vector is then fed into the Softmax function, and probabilities of each class being a shrimp are generated. The output of softmax is a new vector
[1.12535113e−07, 1.12535113e−07, 1.52299714e−08, 9.99999454e− 01, 3.05902153e−07]. The same information put into table 2.1 for clearer visualization. It is evident that softmax would correctly guess which one of the input images is a shrimp based on the number of legs, with high confidence too. Notice how the sum of all probabilities equals to 1.
Animal      Probability
Dog         0.0000001125
Cat         0.0000001125
Flamingo    0.0000000152
Shrimp      0.9999994538
Starfish    0.0000003059
Table 2.1: Softmax probabilities
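The shrimp example can be reproduced with a few lines of Python. This is only a minimal sketch of Equation 2.3, not the code used in the thesis; the function name `softmax` and the subtraction of the maximum score (a common trick to avoid overflow in the exponential) are assumptions added for illustration.

```python
import math

def softmax(scores):
    """Return softmax probabilities for a list of real-valued scores."""
    # Assumption: subtract the maximum before exponentiating. This does not
    # change the result (the factor cancels in the ratio) but avoids overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

animals = ["dog", "cat", "flamingo", "shrimp", "starfish"]
leg_counts = [4, 4, 2, 20, 5]  # leg counts from the example above

probabilities = softmax(leg_counts)
for animal, p in zip(animals, probabilities):
    print(f"{animal:<9} {p:.10f}")
```

The printed values should agree with Table 2.1 up to rounding, and they sum to 1.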
One-Hot encoding
Using Softmax as the final activation function restricts how the labels should be encoded. One-hot encoding is a way to transform a set of labels into numerical values that resemble the output of the Softmax function. The resulting vector is of size C, where C is the number of classes, and holds binary values (0 or 1). For a vector to be considered one-hot encoded, it should consist of all zeros, except at the index of the class present in the image.
Every image in the dataset should have a corresponding one-hot encoded label. Returning to the dog and cat classifier from Figure 2.2, there are three classes (dogs, cats and others, C=3). During training, every input to the system should be a pair of an input image and a one-hot encoded label, a vector of 3 binary values.
If this label is [0, 1, 0], the input image is an image of a cat, while [1, 0, 0] tells us the input is an image of a dog.
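As a small illustration, the three-class encoding above could be produced as follows. The helper name `one_hot` and the class order (dog, cat, other) are assumptions chosen to match the labels in the text.

```python
def one_hot(class_index, num_classes):
    """Return a list of zeros with a single 1 at class_index."""
    vector = [0] * num_classes
    vector[class_index] = 1
    return vector

# Assumed class order, chosen so that "cat" maps to [0, 1, 0] as in the text.
classes = ["dog", "cat", "other"]
labels = {name: one_hot(index, len(classes)) for index, name in enumerate(classes)}

print(labels["dog"])
print(labels["cat"])
```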
2.8 Backpropagation
Backpropagation is a technique used to adjust network parameters so that classification accuracy gradually increases. During the training process, every forward run of the network is followed by comparing the network output to the ground truth. The error is calculated and propagated back into the network. Backpropagation finds the weights contributing to the error and adjusts them in a way that lowers the output error. Doing this over and over lowers the overall classification error, and the network learns.
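The loop of forward run, error calculation and weight adjustment can be sketched on the smallest possible "network": a single weight w with prediction w * x and a squared error. The toy data, the learning rate and the number of steps are hypothetical values chosen for illustration; a real network applies the same chain-rule update to every weight in every layer.

```python
def train_single_weight(samples, learning_rate=0.05, steps=200):
    """Fit prediction = w * x to (x, target) pairs by gradient descent."""
    w = 0.0
    for _ in range(steps):
        for x, target in samples:
            prediction = w * x             # forward run
            error = prediction - target    # compare with the ground truth
            gradient = 2.0 * error * x     # d(error^2)/dw via the chain rule
            w -= learning_rate * gradient  # adjust the weight to lower the error
    return w

# Toy data generated by target = 2 * x, so the weight should approach 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learned_w = train_single_weight(samples)
print(learned_w)
```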
2.9 Data in Machine Learning
Machine learning is said to be data-driven. This means the programmer develops the system and feeds a massive amount of data to it. From this point on, the programmer does not interfere with the learning process. The neural network itself finds patterns and correlations in the data. In this approach, only two things can affect the outcome: the network architecture and the quality of the data fed into the network.
Labelled datasets are required to train any deep ANN in a supervised manner.
2.9.1 Lack of data
The type and amount of data fed to the network can be the deciding factor between the network achieving state-of-the-art performance and having worse accuracy than a model outputting random values. Perhaps the best-known dataset in ML is MNIST, a database of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. Some digits from MNIST can be seen in Fig. 2.16. Today, classifying MNIST is considered an entry-level problem that many first-year students are challenged to take on. Collecting and labelling 70,000 images can be a time-consuming and draining process. Therefore, better models need to be developed to extract more information from less data.
Data collection is time consuming
The data collection process varies depending on the type of data to be collected. Training a lane detection system requires video capturing the view from a car, so collecting data for this problem would require filming the scene in front of a moving car for an extensive amount of time. Luckily, many videos of this sort are easily accessible on the Internet, but they still need labelling for supervised learning systems. In this case, data collection could take from one day to a couple of weeks. On the other hand, researching how infants learn requires logging some of the infant’s actions throughout their growing period. Collecting such data could take years before it is ready to be fed to an ML system.
Figure 2.16: Digit samples from MNIST dataset
Data augmentation
A common way to generate more data for ML training purposes is data augmentation. There are several ways to augment images, such as zooming in and out, rotation, translation, stretching and skewing. All of these can be applied to some degree without changing the essence of what the image represents. Examples of some face image augmentations are illustrated in Figure 2.17. The human brain can recognize those augmentations and still identify who the person in the image is. Neural networks, in contrast, would struggle unless they are trained on images augmented in the same manner. With a pool of X augmentation operations, one can extend a dataset of N images with N*X new images.
Changing the tone, converting an RGB image to greyscale and multiplying an image with random noise are other types of image augmentation that are important for neural networks to learn from. Otherwise, they can struggle to correctly classify objects if the image is slightly distorted.
Additional specialized augmentation methods can be developed depending on the field ML is applied to, as can be seen in Sections 6.7 and 6.8.
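The N to N*X growth can be illustrated with two simple geometric operations on a tiny image stored as a nested list. This is only a sketch under stated assumptions: the helper names (`horizontal_flip`, `translate_right`, `augment`) and the 2x3 toy image are hypothetical, and real pipelines would use an image library instead of nested lists.

```python
def horizontal_flip(image):
    """Mirror every row of the image."""
    return [list(reversed(row)) for row in image]

def translate_right(image, shift=1, fill=0):
    """Shift the image to the right, padding the left edge with `fill`."""
    width = len(image[0])
    return [([fill] * shift + row)[:width] for row in image]

def augment(dataset, operations):
    """Apply every operation to every image: N images give N * X new images."""
    return [operation(image) for image in dataset for operation in operations]

image = [[1, 2, 3],
         [4, 5, 6]]
dataset = [image]
operations = [horizontal_flip, translate_right]

new_images = augment(dataset, operations)
print(len(dataset), "original image(s),", len(new_images), "new image(s)")
```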
2.9.2 Changing number of classes
Traditional neural networks can be viewed as a black box with a variable number of inputs and a fixed number of outputs, corresponding to the number of classes one wants to differentiate.
Figure 2.17: Geometric transformation examples:
Leftmost: Original image.
First row: horizontal flip, rotation, translation.
Second row: zoom, stretching, change in perspective
If one designed a network to recognize cats and dogs, the final network architecture would look similar to the one shown in Figure 2.18, where yellow circles represent input nodes, green ones represent hidden nodes, and red ones are the outputs of the network. Values in the output nodes tell us how sure the network is that a particular input is a dog or a cat. The closer the number in the dog node is to 1.0, the higher the chance that the input image is a picture of a dog, and vice versa for cats. Training a network like this requires thousands of labelled images of cats and dogs, and hours of training and parameter tweaking to achieve reasonable accuracy.
Now, if one would like to extend the network to recognize pandas, one would need to add a new output node representing the probability of the input image being a picture of a panda. The whole training process has to be redone from the very start, provided one still has all the labelled images of dogs and cats and adds labelled images of pandas to the mix. This would require another chunk of time. Basically, there is little difference between adding a new class to an existing neural network and designing a new architecture from scratch that includes this new class.
A corporation wanting to create a system of automated doors that open only if an employee is standing in front of them will face the issue of a changing number of classes. What happens if the company hires new employees? The whole neural network will need to be redesigned.
Depending on the company size, multiple employees may come and go every week. It would be a disaster, as more time would be spent training the network than actually using it.
One solution to this problem is explored in Section 3.3.
Figure 2.18: Simple neural network to differentiate dogs from cats
2.9.3 Few-shot learning
As presented in Section 2.9.1, it is a known fact that neural networks need thousands of labelled examples for training purposes. Very few such datasets exist for human gait recognition. Those that do exist have a limited number of samples compared to more popular datasets such as MNIST (60,000 samples), ImageNet (14 million) and CIFAR-100 (60,000). Many surveillance cameras film pedestrians around the clock, providing a huge video stream of human gait examples.
However, the videos from those streams still need to be labelled to be used by ML systems. An ideal solution would be a neural network that can learn the gait-to-human mapping from just a few examples of each person.
2.10 ML optimisation strategies
Various strategies can be applied to a neural network to combat effects such as over-fitting, exploding and vanishing gradients, slow training, overconfidence and others. This section presents some of them and explains how they work. Some of these tricks are common, while others were developed only recently and are still making their way to a wider audience.
2.10.1 Data normalization
Data scaling and normalization is a recommended pre-processing step when working with deep learning neural networks. The idea was presented in [41] to speed up back-propagation. The purpose of data normalization is to scale the input data to have zero mean and unit variance.
To perform this operation, one needs to calculate two values: the mean and the standard deviation of the data. These are calculated for the whole dataset and later applied to each image. The mean value (µ) is calculated by finding the sum of all images in the dataset and dividing it by the total number of images (N), as shown in Equation 2.4.
$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \tag{2.4} $$
The formula for the standard deviation (σ) is shown in Equation 2.5.
This formula uses the mean from a previous calculation.
$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \tag{2.5} $$
After finding both these values, one can proceed to the next and final step, normalizing the images in the dataset (Equation 2.6). Dataset normalization is done by subtracting µ from each image and dividing the result by σ. The output value z is a normalized image that is centred around the origin with unit variance for the whole dataset.
$$ z = \frac{x - \mu}{\sigma} \tag{2.6} $$
Figure 2.19 illustrates what happens to the non-normalised data during this process.
Source: CS231n Convolutional Neural Networks for Visual Recognition
Figure 2.19: Input data normalization
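Equations 2.4 to 2.6 translate almost directly into code. The sketch below treats the dataset as a flat list of numbers for brevity, which is a simplifying assumption (real images are multi-dimensional arrays); the function name `normalize` and the sample values are hypothetical.

```python
import math

def normalize(values):
    """Return (normalized values, mean, standard deviation) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n                                     # Equation 2.4
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # Equation 2.5
    normalized = [(v - mean) / std for v in values]            # Equation 2.6
    return normalized, mean, std

pixels = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical raw values
normalized, mean, std = normalize(pixels)
print(mean, std)
print(normalized)
```

After normalization the values have zero mean and unit variance, which is exactly the property described in the text.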
2.10.2 Batch normalization
Batch normalization is another normalization method, applied to each batch before it enters a new layer in the network. This method is essentially the data normalization from the previous section applied to the output of every layer in the network. Training speed is significantly increased without any accuracy loss, as shown in [12]. To intuitively understand batch normalization, one can split the neural network into smaller subsets, as shown in Figure 2.20. Each subset gets some input from the previous layer. Each hidden layer hi (i being the index of the layer) outputs some values ai. This output can be viewed as an input to a new neural network that consists of all the following layers. If it helps to normalize the global input to the whole neural network, it should also help to normalize the input to each hidden layer. There is one difference between normalizing the input and the inner data values: the input is normalized across all training data, while batch normalization is usually applied to individual mini-batches. Batch normalization should be applied before the activation function. It also adds a slight regularisation that helps against over-fitting.
Source: Edited version of: [Applied Deep Learning - Part 1: Artificial Neural Networks]
Figure 2.20: Visualization of how a bigger network can be viewed as a series of small networks connected in series
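The per-mini-batch normalization described above can be sketched as follows. The scale `gamma`, the shift `beta` and the small constant `eps` are assumptions borrowed from the usual formulation of batch normalization and are not discussed in the text; with the default values the function reduces to plain normalization of each feature over the mini-batch.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of a mini-batch given as a list of rows."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(num_features)]
    # eps (assumed) guards against division by zero for constant features.
    return [[gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
             for j in range(num_features)]
            for row in batch]

# Hypothetical layer outputs: two features on very different scales.
activations = [[1.0, 100.0],
               [2.0, 200.0],
               [3.0, 300.0]]
normalized = batch_norm(activations)
for row in normalized:
    print(row)
```

Both columns end up on the same scale, which is the effect the following layer benefits from.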
2.10.3 Dropout
Dropout is another regularisation technique applied during training. It was introduced in [35] with the purpose of preventing neural networks from over-fitting. As the name suggests, it drops some nodes during forward propagation, making them inactive as if they were never a part of the network. This constantly changes the structure of the network. On every forward pass, a new set of nodes is selected to be dropped. Which nodes are selected is random.
Source: Normalizing your data (specifically, input and batch normalization)
Figure 2.21: Training time speed-up with batch normalization
This randomized selection creates a new subnet from the original network. The new subnet is only used during the current training step. One such example can be seen in Figure 2.22, where the original neural network is shown on the left and a subnet generated by applying dropout on the right. By dropping some nodes, one forces the rest of the network to adapt to the lost information. All nodes in the network become less dependent on each other. This method makes each layer act more independently, allowing for more reliable data flow through the network.
Because a random set of nodes is dropped at each training step, over-fitting of the whole network is less likely to occur, further improving network generalization.
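A minimal sketch of dropping nodes at random on each forward pass is given below. Rescaling the surviving activations by 1 / (1 - drop probability), often called inverted dropout, is an assumption that is not described in the text; it keeps the expected activation unchanged so that nothing needs to be rescaled at test time. The activation values and the seed are hypothetical.

```python
import random

def dropout(activations, drop_probability, rng):
    """Zero each activation with probability drop_probability (training only)."""
    keep_probability = 1.0 - drop_probability
    output = []
    for value in activations:
        if rng.random() < drop_probability:
            output.append(0.0)                       # node is dropped
        else:
            output.append(value / keep_probability)  # assumed inverted-dropout scaling
    return output

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
layer_output = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]

# Each call selects a new random set of nodes, i.e. a new subnet.
first_pass = dropout(layer_output, 0.5, rng)
second_pass = dropout(layer_output, 0.5, rng)
print(first_pass)
print(second_pass)
```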
2.10.4 Label smoothing
Softmax-based neural networks (Section 2.7) tend to be overconfident in their predictions, which can lead to unwanted actions. One way to tackle this issue is label smoothing, a technique presented in [25] and further discussed in [21]. Label smoothing (LS) can be applied to supervised learning methods where the system expects a pair of an input and a labelled output. Using softmax as the final activation function to obtain the class probability distribution requires the labels to be one-hot encoded, as described in Section 2.7.
Label smoothing adds some noise to this one-hot encoded vector, basically telling the network there is a slight possibility that the data is mislabelled. Applying label smoothing to a vector [0, 1] may produce [0.05, 0.95]. In this case, the one-hot encoded vector is no longer binary but consists of floating-point numbers (floats, Figure 2.23b).
Source: Dropout in (Deep) Machine learning
Figure 2.22: Network before and after Dropout
In the second graph, one can see a much smoother transition between 0 and 1. Using the network example from Figure 2.18 and choosing values close to, but not exactly, 0 or 1 tells the network that the input image is most likely a picture of a cat (95%), but there is a small chance it is a picture of a dog (5%). By looking at these probabilities, the network will conclude that the image is most likely a picture of a cat, but it will no longer be so sure. Doing this makes it harder for the network to over-fit. Figure 2.23c shows the regions to choose values from (green) that yield a performance increase, while values in the red zone either do nothing or worsen the performance.
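The [0, 1] to [0.05, 0.95] example can be reproduced with the commonly used uniform smoothing rule y * (1 - ε) + ε / K, where K is the number of classes. Both the rule and the value ε = 0.1 are assumptions chosen to match the numbers in the text; the thesis does not state which smoothing formula was used.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Move a fraction epsilon of the probability mass uniformly over all classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / num_classes for y in one_hot]

smoothed = smooth_labels([0, 1], epsilon=0.1)
print(smoothed)
```

The smoothed label still sums to 1, so it remains a valid target for a softmax output.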
2.10.5 Data augmentation
As mentioned earlier, data augmentation is crucial when one does not have enough raw data to feed to the network. Basic augmentation approaches used on RGB images were presented in Section 2.9.1. These include horizontal and vertical flips, rotation, translation, zoom, stretching, change in perspective and others. Applying each method to the dataset generates many new examples and also broadens the domain of the model. Applying a horizontal flip allows the model to recognize the gait features of people walking both from right to left and vice versa.
Changing the perspective emulates capturing the video at a certain angle. While all of these benefit the model, they also slow down training, as more data is available and needs to be traversed during each epoch. The decision was made to keep the augmentation methods to a minimum during network architecture search and testing.
(a) without label smoothing (b) with label smoothing
(c) Common values for LS highlighted
Figure 2.23: Label smoothing
Only a few augmentation methods were tried. One of them is dividing the image into head, torso and leg parts (presented in Section 6.8) and exchanging those image parts between samples of the same class. Dividing the image into body parts resulted in an accuracy increase, but exchanging those parts between samples provided no further improvement. Close inspection of the resulting data after this augmentation was applied showed that the feature images were not aligned properly from sample to sample, leaving sharp edges where merging took place. An example of this is shown in Figure 2.24.
Another augmentation method explored was separating channels of gait feature images. Data in such images is less correlated across channels compared to RGB images. An improved version of the basic gait energy image (GEI) is presented in Section 4. This is a 3-channel image where each channel introduces an additional feature to the previous one.
(a) Multi scale augmentation (b) Sharp neck (c) Sharp legs
Figure 2.24: Augmentation by swapping head, body and legs
Chapter 3
Person Identification System Using Deep Machine Learning
3.1 System overview
The complete system for person identification is shown in Figure 3.1. It consists of three main blocks: feature extractor, database, and comparator. The feature extractor is a deep-learning-based model that accepts a special image as input. This image is the result of preprocessing footage of human gait. Details of how to extract this type of image are explained in Section 4. The purpose of the feature extractor is to analyze the input image and extract a vector of features used for later identification.
After the feature vector is obtained, it is compared to every entry in the database. The database is a collection of all feature vectors of known individuals. This module should contain information about each subject one cares to identify. Comparison is made by looking at the similarities between the obtained feature vector and the ones stored in the database.
When the database entry with the highest similarity score is found, the task is complete, and the search is finished. In Figure 3.1, the input gait image is assigned to the subject with identification number 4.
NB! It is crucial to test the output feature vector from the DL system against every database entry, as those entries are not sorted in any way. If any early stopping is implemented, there is always a possibility of the next feature vector having an even higher similarity score. Additionally, if the probe person’s gait is not stored in the database, the current system will choose the subject with the gait most similar to the one requested. This feature can be used to group subjects by gait.
Figure 3.1: Overview over the system
3.2 Feature Extraction using Deep Learning
3.2.1 Feature vector
A feature is a numeric or symbolic characteristic of something. This could be an object, motion or anything else that can be given a characteristic. An apple has a feature of being round. One can treat this feature as binary.
An object is either round or not. Features can also be continuous, like temperature. It is a known fact that normal human body temperature is around 36.5–37.5 °C (degrees Celsius). These are all numeric features. To give an example of a symbolic feature, one can look at colours. There are basic colours like red, brown and grey, but also more complex colours like toolbox, tangelo and turquoise, examples of which are shown in Figure 3.2. Computers were designed to crunch numbers, so they are naturally better at working with numeric features. To work with symbolic features, one must find a way to convert them into numeric features.
Colours, for example, are prevalent in everyday life, which translates to their frequent usage in machines. Many representations have been developed to convert colours to numbers. The RGB colour space is perhaps the most commonly known in computer graphics. A pixel is the smallest piece of information in a digital image.
Red Brown Grey
Toolbox Tangelo Turquoise
Figure 3.2: Symbolic features of colour
(255, 0, 0) (150, 75, 0) (125, 125, 125)
(116, 108, 192) (249, 77, 0) (0, 255, 239)
Figure 3.3: Numeric (RGB-space) features of color
Every pixel in an image has its RGB value, a vector of 3 elements, e.g. [r, g, b], where each number (r, g or b) represents how much red, green and blue needs to be mixed to get this colour. Black is known as the absence of colour. Therefore, in RGB space it is represented by [0, 0, 0], while red is represented by [255, 0, 0]. 255 is the highest number one can store in this type of vector when each channel uses 8 bits. [255, 0, 0] means 100% red and no green or blue, which results in the colour red. RGB values for the previously named colours (red, brown, grey, toolbox, tangelo, turquoise) are shown in Figure 3.3.
Multiple single features can be put together to form a feature vector. Depending on how discriminative those features are, one can get a good approximation of what the original object is and sometimes even reconstruct it. Having all the measurements of a wooden table would make it trivial to build many similar tables without ever seeing the original one. Feature vectors hold immense power to describe our world. In some cases, they can be viewed as compressed versions of the objects they represent.
3.2.2 Feature vector extraction
A feature vector is a 1-dimensional array of numbers. A simple example from the audio processing field could be an array of 5 numbers representing [genre, audio tempo, voice-to-melody ratio, length of the track, dominant frequency] for a song track. In the image classification case, a feature vector could be composed of numbers representing the count of eyes, ears, noses, arms and legs in the image, the object's hair/fur/eye colour, the shapes the object is composed of, the object's size, etc.
Creating a system to identify every single individual on the planet is a challenging task due to the size of the population and its constant change. Rather than creating a system with 7 billion outputs, a more sustainable solution is to create a system that extracts features of a person and uses those features to compare subjects.
A typical deep neural network architecture for image processing consists of blocks of CNN layers feeding into a block of dense layers. These can be viewed as two independent systems. The first system (CNN blocks) takes an image and encodes/compresses it into a smaller representation that is better suited to the second system (dense blocks), which is less fit to process images directly but excels at classifying lower-dimensional data. The CNN system takes an image and compiles information about what that image represents, extracting information such as shape, size, lines, edges, colours, object placement and object relations. This information can itself be used as a feature vector, or one could feed this feature vector into a chain of dense layers and use the output of the n-th layer as a new feature vector. Experiments show that one or two dense layers generally give better results in terms of accuracy.
A vector in computer science is a number of values grouped in a one-dimensional array, e.g. [1, 5, 3, 7]. A feature vector is such an array in which each number represents a feature of some sort. Ideally, every element in a feature vector is uncorrelated with the rest of the features in the same array, meaning that a change in one feature does not lead to a systematic change in other features. If it does, their representations overlap, wasting computation power and storage space. An example of such correlated features is the length, width and area of a rectangle. The area of a rectangle can be calculated by multiplying its length by its width. There is little to no point in having a 3-element vector to represent a rectangle, such as [length, width, area]; this vector can be written using only two values, [length, width], and the missing information can be calculated on the go.
Figure 3.4: Original Siamese network used by Jane Bromley et al. for signature verification
3.3 image2vec Network type
3.3.1 Siamese network
The Siamese network concept was first introduced in 1994 by Jane Bromley et al. in [2]. The authors used it for a signature verification task where each signature was represented by 800 sets of x, y and pen up-down points.
Siamese network can be viewed as a system with two inputs and one output, as shown in Figure 3.4. The output value is a measurement of how similar the inputs are. If the similarity score is under a pre-defined threshold, inputs are treated as representatives of the same class. If the similarity score exceeds the threshold, the inputs are not equal (e.g. two signatures are not made by the same person in this case).
Inside the Siamese neural network, there are two parallel networks with shared parameters (weights and biases). Each of these networks produces a feature vector for a given input. The feature vectors are then compared, and the similarity score is calculated. In most cases, the comparison is made using a distance metric, e.g. Euclidean distance [1] [18].
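The idea of two parallel branches with shared parameters can be sketched as a single embedding function applied to both inputs, followed by a Euclidean distance and a threshold. Everything concrete here is hypothetical: the "network" is one small linear layer, and the weights, bias, threshold and input vectors are made-up values, not those of the thesis models.

```python
import math

# Hypothetical shared parameters: one linear layer mapping 3 inputs to 2 features.
SHARED_WEIGHTS = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]
SHARED_BIAS = [0.0, 0.1]

def embed(x):
    """Both branches call this same function, so they share weights and biases."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(SHARED_WEIGHTS, SHARED_BIAS)]

def euclidean_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def siamese_distance(x1, x2):
    """Distance between the feature vectors of the two inputs."""
    return euclidean_distance(embed(x1), embed(x2))

def same_class(x1, x2, threshold=0.5):
    """Inputs are treated as the same class when the distance is under the threshold."""
    return siamese_distance(x1, x2) < threshold

a = [1.0, 2.0, 3.0]
b = [1.0, 2.1, 3.0]   # close to a
c = [9.0, -4.0, 0.5]  # far from a
print(same_class(a, b), same_class(a, c))
```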
Contrastive loss
A fully trained Siamese network should output small score values if the inputs are of the same class, and high values otherwise. This is a two-fold operation that should be coded in the loss function. One such loss function is Contrastive Loss. Contrastive Loss has the following mathematical implementation:
$$ L(W, Y, \vec{X}_1, \vec{X}_2) = (1 - Y)\,\frac{1}{2}(D_W)^2 + Y\,\frac{1}{2}\{\max(0, m - D_W)\}^2 \tag{3.1} $$
Here L is the loss function that takes four arguments: W are the weights, Y is the Siamese label, and X1 and X2 are the two samples one tries to compare. DW is the distance between the feature vectors of X1 and X2, and m is the margin. For Equation 3.1 to pull matching pairs together, the label Y must be 0 if the inputs are samples of the same class (e.g. two fingerprints of the same person) and 1 if they are of different classes. The equation is a sum of two products.
The left one, (1−Y)·½(DW)², accounts for samples of the same class. If Y is 1, (1−Y) evaluates to 0 and the left term is nullified. If Y is 0, meaning that the samples are of the same class, (1−Y) evaluates to 1 and ½(DW)² is included in the global output, so a large distance between matching samples is penalised.
The right one, Y·½{max(0, m−DW)}², accounts for samples of different classes. When Y is 0, the term evaluates to zero and makes no contribution to the global output. When Y is 1, the term takes the difference between the margin m and the distance DW, or 0, whichever is higher, and squares it, so non-matching samples are penalised only while they are closer than m.
Contrastive loss introduces a new hyperparameter m, the margin. This is the value Contrastive loss uses for loss calculations. All input pairs of the same class are pushed together and should have a similarity score lower than m. Accordingly, all examples of different classes are pushed away from each other to have a similarity score higher than m.
This margin can theoretically be anything, and one would think that a bigger margin will produce a more significant difference between positive and negative classes. However, in practice, Contrastive loss performs best when this margin is somewhere between 0.0 and 5.0. This is a small range to choose from but still adds another hyperparameter to tune during training.
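Equation 3.1 can be written out term by term as in the sketch below. It follows the equation literally, which implies the convention that label 0 marks a pair of the same class (the squared-distance term is active) and label 1 marks a pair of different classes (the margin term is active); that reading, the use of Euclidean distance for DW and the sample vectors are assumptions made for illustration.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def contrastive_loss(features_1, features_2, label, margin=1.0):
    """Equation 3.1; label 0 = same class, label 1 = different classes (assumed)."""
    distance = euclidean_distance(features_1, features_2)
    # Active when label == 0: penalises distance between matching samples.
    same_class_term = (1 - label) * 0.5 * distance ** 2
    # Active when label == 1: penalises non-matching samples closer than the margin.
    different_class_term = label * 0.5 * max(0.0, margin - distance) ** 2
    return same_class_term + different_class_term

far_apart = ([0.0, 0.0], [3.0, 4.0])  # distance 5
print(contrastive_loss(*far_apart, label=0))              # matching pair, far apart
print(contrastive_loss(*far_apart, label=1, margin=1.0))  # non-matching pair, beyond margin
```

A matching pair that is far apart yields a large loss, while a non-matching pair that is already further apart than the margin contributes nothing.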
3.3.2 Base Model
This thesis is a continuation of Naoki Setoguchi-san's thesis. He presented four network architectures (2in, 3in, diff, 2diff), which are different variations of the parallel neural networks presented in the previous section.
A simplified version of the Base Model (BM) architecture is presented in Figure 3.5. A more detailed version is shown in Figure 3.6. There are four input images. Ideally, there should be four parallel networks, but in this case there are only two, since an increased number of networks leads to increased training time. Some juggling with the input images was done to optimize the training process. The optimized version of the network only requires two networks in parallel, where the four input images are concatenated into two groups of two. These images are fed into the networks, and the outputs of each network are split and then concatenated across networks. In theory, doing this should revert the original concatenation operation made on the input images.
In summary, the network can still be treated as standard Siamese network.
Figure 3.6: BaseModel
Figure 3.7: Softmax distribution of 8 subjects
3.3.3 ArcFace
Figure 3.5: Simplified BaseModel
ArcFace [5] is an improved version of Softmax for the face classification task. The improvement was possible because Softmax does not force any additional separation between different classes. A toy example of the class distributions of 8 subjects using Softmax is illustrated in Figure 3.7. Samples at the edges can be challenging to classify correctly. Figure 3.8 shows the distribution of the same subjects using ArcFace Loss. It is clear that ArcFace Loss creates more distinguishable features compared to Softmax.
ArcFace is currently a state-of-the-art network, outperforming its predecessors like ArcCos, AdaptiveFace and P2SGrad. A novel architecture using ArcFace is presented in Figure 3.9.
Figure 3.8: ArcFace distribution of 8 subjects
Focal loss vs CrossEntropy loss
Focal loss [16] is a novel loss function developed by the Facebook AI Research (FAIR) group.
Focal loss was originally developed for object detection networks where background classes outnumber foreground classes. An image usually contains up to 7 objects. The rest is the background. This uneven distribution makes it hard to train neural networks.
CrossEntropy loss, which is often used for object detection, produces higher loss values for misclassifications of hard (foreground) classes and lower values for easy (background) classes. In theory, this should penalise hard-class misclassification more, forcing the neural network to learn better attributes. In reality, however, the sum of the overwhelming number of loss values for easy classes outweighs the loss of hard classes, confusing the network and affecting overall accuracy. This is shown in Figure 3.10. The loss value highlighted in red (2.3) is the loss value for a hard class; the one highlighted in green (0.1) is the loss value for an easy class. There is a 23-fold difference, but if there are 23 times more easy classes, the network will focus on easy classes as much as on hard ones.
The equation for CrossEntropy loss is shown in Equation 3.2. Focal loss, Equation 3.3, acts as a modulator for CrossEntropy loss, adding the factor (1−pt)^γ to the equation, where γ regulates how much balancing one wants to apply. This should balance the uneven class distribution and the total loss value.
$$ CE(p_t) = -\log(p_t) \tag{3.2} $$
$$ FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t) \tag{3.3} $$
Figure 3.9: BaseModel paired with ArcFace layer
Source: Focal Loss for Dense Object Detection
Figure 3.10: Cross Entropy penalising misclassification of hard classes
The effect of applying FocalLoss is shown in Figure 3.11. In the same scenario as before, FocalLoss assigns different loss values: 2.1 for the hard class and 0.01 for the easy one. This time there is a 210-fold difference, meaning one would need 210 times more easy classes to fool the network.
All of these examples use a gamma value of 1; choosing a higher gamma value, e.g. 5, increases the ratio of the loss values, making it even harder to fool the network.
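The numbers quoted above can be checked with a direct implementation of Equations 3.2 and 3.3. The probabilities below are not taken from the thesis; they are back-computed from the quoted CrossEntropy values (2.3 and 0.1), which is an assumption about how the example was constructed.

```python
import math

def cross_entropy(p_t):
    """Equation 3.2: loss for the probability assigned to the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=1.0):
    """Equation 3.3: cross entropy scaled by the modulating factor (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

# Assumed probabilities, chosen so that cross entropy equals 2.3 and 0.1.
p_hard = math.exp(-2.3)
p_easy = math.exp(-0.1)

for gamma in (0.0, 1.0, 5.0):
    hard = focal_loss(p_hard, gamma)
    easy = focal_loss(p_easy, gamma)
    print(f"gamma={gamma}: hard={hard:.4f} easy={easy:.6f} ratio={hard / easy:.0f}")
```

With gamma = 1 the two losses come out at roughly 2.07 and 0.0095, in line with the rounded 2.1 and 0.01 above. With gamma = 0 the modulating factor is 1 and focal loss coincides with cross entropy.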
The dataset used in this thesis has few samples per class, the lowest being four and the highest being seven. A difference of three images may sound small, but it is almost 50% of the average number of samples per class. That is why the decision was made to use FocalLoss with a margin of 5 for all further experiments.
Using Focal loss will have the most effect in a Siamese type of network. It was discovered empirically, through trial and error, that this type of network achieves the best accuracy when the share of positive pairs is as low as 2%. In the case of the Siamese network, 2% of the pairs are positive, while as much as 98% are negative pairs. This means that for the majority of training the network is fed samples of negative classes, making their loss values overshadow the loss values of the positive classes. This creates an artificially uneven distribution of Siamese classes.
Source: Focal Loss for Dense Object Detection
Figure 3.11: Focal loss vs Cross Entropy loss
Focal loss takes a margin as a parameter when it is initialised (the exponent γ in Equation 3.3). This margin is a small number in the range 0 to 5. Five means a more significant penalty for the wrong classification of hard classes, while 0 means no additional penalisation at all. With a margin of zero, FocalLoss acts exactly as CrossEntropyLoss.
3.4 Feature Comparison
Now that there is a method to extract distinct features of a person, one can start the identification process. Testing the final system requires a pool of people. For every person in this pool, a feature vector needs to be extracted and preserved for later comparison. Depending on the vector size and the number of people, one ends up with an array of size N x M, where N is the number of subjects and M is the feature vector size.
This array will represent our database of features.
When a new identification is required, a new feature vector needs to be extracted for the person one tries to identify. This M-sized vector is then compared to every entry in the database, and the distance for each pair is calculated. Since the system was trained to produce similar features for the same person, the entry with the minimal distance is selected.
Ideally, the person id of the selected database vector should match the id of the person one tries to identify. Of course, this only holds if the person was previously ‘examined’ and his/her vector was added to the database. If not, the system will select the person whose gait is most similar to the one queried.
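The lookup described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the database values, the subject ids and the function name `identify` are hypothetical, and the Euclidean distance is assumed as the comparison metric.

```python
import math

# Hypothetical database of N = 3 subjects with M = 4 features each.
database = [
    [0.9, 0.1, 0.0, 0.2],  # subject "A"
    [0.1, 0.8, 0.3, 0.0],  # subject "B"
    [0.0, 0.2, 0.9, 0.7],  # subject "C"
]
subject_ids = ["A", "B", "C"]


def identify(query, database, subject_ids):
    """Return (subject id, distance) of the database entry closest to the query."""
    distances = [math.dist(query, entry) for entry in database]
    best = min(range(len(distances)), key=distances.__getitem__)
    return subject_ids[best], distances[best]


# A query vector that lies close to the entry stored for subject "B".
query = [0.2, 0.7, 0.4, 0.1]
print(identify(query, database, subject_ids))

# A vector from a person who is not in the database still gets an id:
# the function always returns the closest entry, however far away it is.
print(identify([5.0, 5.0, 5.0, 5.0], database, subject_ids))
```

The second call shows the closed-set behaviour mentioned above: without a distance threshold, an unknown person is always assigned to the most similar known subject.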
3.4.1 Nearest Neighbour Classifier
One way to compare feature vectors of arbitrary length M is to use the nearest neighbour (NN) classifier. NN is a simple, supervised, non-parametric classification model introduced in 1951 by Fix and Hodges. It is straightforward to understand and can be implemented in a few lines of code. It is considered supervised because it expects pairs of data and label values. It has no hidden parameters that need training, as it operates on and compares the data directly; it is therefore called non-parametric. There is a more general version called k-NN, or k-nearest-neighbours, where k is a hyper-parameter that affects the outcome of the classification and must be chosen carefully.
The Nearest Neighbour classifier is a simplified version of k-NN where k=1, meaning only the nearest data point is selected. Since this classifier operates on 1-d vectors, multiple metrics can be used as the distance between them, but the default and most widely used one is the L2 norm.
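The relation between k-NN and NN can be sketched in a few lines. The example below is an illustration under assumptions: the 2-d training points, the labels "J" and "K" and the function name `knn_classify` are hypothetical, and the L2 distance is used as the metric.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k=1):
    """Label the query by majority vote among its k nearest samples (L2 distance)."""
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: subject "J" has one sample, subject "K" has three.
samples = [[0.0, 0.0], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]]
labels = ["J", "K", "K", "K"]

# The query lies right next to the single sample of subject "J".
query = [0.1, 0.1]

print(knn_classify(query, samples, labels, k=1))  # vote of the nearest sample only
print(knn_classify(query, samples, labels, k=3))  # "J" is outvoted by two "K" samples
```

With k=1 the single nearest sample decides the label. With k=3 the two additional neighbours necessarily belong to another subject, because "J" has only one training sample, and they outvote the correct match.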
The L2 norm of a vector is its distance from the origin of the vector space. Applied to the difference of two vectors, it gives the distance between them, which is also known as the Euclidean distance because it is calculated in Euclidean space. The formula for the L2 norm between vector p and vector q is shown below:
\[
d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{3.4}
\]
Because the distance is the square root of a sum of squares, the final result is never negative, and it is zero only when p and q are identical.
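As a quick check of Equation 3.4, consider a worked example with two hypothetical 3-dimensional vectors; the values are chosen only for illustration:

```latex
\[
p = (1, 2, 3), \qquad q = (4, 6, 3)
\]
\[
d(p, q) = \sqrt{(4 - 1)^2 + (6 - 2)^2 + (3 - 3)^2} = \sqrt{9 + 16 + 0} = \sqrt{25} = 5
\]
```

Swapping p and q only changes the sign of each difference before it is squared, so d(q, p) is also 5.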
In this thesis the Nearest Neighbour classifier is used instead of its more complex k-Nearest Neighbour version because there is no guarantee of having multiple examples (n >= k) in the training set for every sample in the test set. If k-NN with k=3 were used while only 1 training sample exists for subject J, the classifier would yield wrong results. It would find the correct training sample. However, it will look further