GIL - GPS Independent Localization


Localization

Thesis submitted for the degree of Master in Robotics and Intelligent Systems for 60 credits

Vegard Bergsvik Øvstegård

June 1, 2021


Abstract

Localization for Unmanned Aerial Vehicles (UAV) is crucial for most tasks, and despite its extensive usage, GPS does have some shortcomings. E.g., GPS signals can be occluded, jammed, and encrypted. A prevalent localization method applicable for various mobile robots is the Monte Carlo localization (MCL) algorithm.

While MCL for UAVs using satellite images has shown promising results in previous works, the most significant shortcoming has been its lack of robustness against adverse environmental conditions such as differences in seasons, weather, lighting, and moving objects. This master's thesis presents a framework that attempts to address these shortcomings by combining Machine Learning (ML) with MCL.

The framework attempts to achieve invariance against the mentioned conditions by converting a dynamic environment into a static one, using machine learning to recognize static objects that are available in a ground truth map. It does so by performing pixel-wise classification of, e.g., buildings. Frames from a downward-looking camera are fed to the ML network, and the segmented images are then passed to the MCL algorithm. The MCL algorithm tries to localize the image in a ground truth map containing the static objects by comparing the segmented camera image to many different locations on the map. The MCL algorithm is a modified version that adjusts the number of particles for more efficient localization and solves the kidnapped robot problem by re-localizing after a failure.

The framework performs well in simulation. It achieves global localization, solves the kidnapped robot problem, and has good accuracy during position tracking. The simulations offer reasonable theoretical grounds for the framework. It had significant problems during de-facto tests due to the poor segmentation performance of the ML network. However, it did manage to localize successfully with some invariance to environmental conditions. In conclusion, the framework works but is highly dependent on a good segmentation model.


Acknowledgments/Preface

Many people deserve thanks and gratitude.

I want to start by thanking my supervisor Jim Tørresen for his guidance and feedback. His keen eye on various problems and issues was of immense help, allowing me to keep track of the research’s essential parts. I would also like to thank him for approving my research problem and recognizing that it had potential.

I would like to express my gratitude to Griff Aviation and Phillip Johan Hofset Holand for backing my research problem and allowing me to come on site for an insight into the work of developing industrial-grade drones.

Further, I would like to express my gratitude to my fellow students. Having this group to discuss the research with has been highly motivating and a great resource.

Finally, I would like to give a loving thanks to my better half, family, and friends for their continuous support.


List of Tables

1 GSD variance of raster maps ordered from Kartverket. Drone height and FOV are approximations based on the camera of a DJI Mavic Pro drone.

2 Examples of different areas.

3 Ground truth issues.

4 Additional maps produced for the dataset.

5 Data augmentation examples.

6 Test evaluations with the U-net segmentation models. The first three columns describe the particular variation of their architecture, while the last two are metrics describing their performance on a test set after training.

List of Figures

1.1 Example of a UAV. The GRIFF 135 from Griff Aviation.

1.2 Global localization is attempted by comparing the segmented image captured by the UAV, c), with the a priori map representing the ground truth, b), which describes the environment, a), by segmenting long-term objects such as buildings and roads.

2.1 A typical outline of a very shallow Convolutional Neural Network. One convolutional layer and one pooling layer. Input images are depicted as a 3D block as they often have three channels (RGB).

2.2 The outline of a convolutional layer.

2.3 Example of max-pooling in a single channel activation map with kernel sizes of 2x2 and stride of 2.

2.4 “ImageNet classification error versus BSs. This is a ResNet-50 model trained in the ImageNet training set using 8 workers (GPUs), evaluated in the validation set.” Source: [47]

2.5 “Normalization methods. Each subplot shows a feature map tensor. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels. Group Norm is illustrated using a group number of 2.” Source: [47]

2.6 “The loss surfaces of ResNet-56 with/without skip connections. The proposed filter normalization scheme is used to enable comparisons of sharpness/flatness between the two figures.” Source: [9]

2.7 U-net architecture designed by Ronneberger et al. Source: [26]

2.8 “Monte Carlo Localization, a particle filter applied to mobile robot localization.” Source: [12]

2.9 The KLD-Augmented-MCL algorithm. Source: [30]

3.1 Simplified overview of the framework.

3.2 Alteration of the conv 3x3, ReLU step of the vanilla U-net model.

3.3 Example of city center.

3.4 Example of an industrial area.

3.5 Example of turfed roofing.

3.6 Example of residential area.

3.7 Rectification issue.

3.8 Missing annotation issue.

3.9 Inaccuracy in water segmentation.

3.10 Random sample from the dataset created from Kartverket's data.

3.11 Winter environment map created using a drone.

3.12 Post rainfall map created using a drone.

3.13 Visualization of the large portion of roads with missing annotations.

3.14 Visualization of Binary Cross-entropy loss. Source: [48]

3.15 Original image

3.16 Brightness decrease

3.17 Brightness increase

3.18 Gaussian noise

3.19 Speckle noise

3.20 Salt Pepper noise

3.21 Saturation decrease

3.22 Saturation increase

3.23 Contrast decrease

3.24 A more detailed description of the segmentation block.

3.25 Drone image prior to pre-processing.

3.26 Drone image after pre-processing.

3.27 Original prediction from the network.

3.28 Prediction after dilation and erosion.

3.29 Visualization of the sampling step. Red dots indicate pixel locations of the sampling step.

3.30 Result of the sampling step. This image is flattened to a binary vector.

3.31 Visualization of the a priori map used by the KLD-Augmented-MCL algorithm in the localization block.

3.32 Visualization of the continuous uniform distribution of particles after initialization. The yellow block represents the average pose of the particles, while the green block represents the actual UAV pose. The colored block areas represent the approximate field of view for a drone at 120 meters above ground level, with a camera setup similar to that of a DJI Mavic Pro.

3.33 Low variance resampling for the particle filter. Source: [12]

3.34 Visualization of different σ values applied to the PDF.

4.1 Segmentation results visualized as white masks in the bottom row.

4.2 Comparison of segmentation result vs ground truth.

4.3 Results from the first simulation with navigation data.

4.4 Visualization of the particle poses versus the ground truth during the first simulation with navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.

4.5 Results from simulation 1 without navigation data.

4.6 Visualization of the particle poses versus the ground truth during the first simulation without navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.

4.7 Results from simulation 2.

4.8 Visualization of the particle poses versus the ground truth during the second simulation with navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.

4.9 Results from simulation 2.

4.10 Visualization of the particle poses versus the ground truth during the second simulation without navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.

4.11 Results from flight 1.

4.12 Visualization of the particle poses versus the ground truth during the first de-facto experiment with navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.

4.13 Results from flight 1.

4.14 Visualization of the particle poses versus the ground truth during the first de-facto experiment without navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.

4.15 Results from flight 2.

4.16 Visualization of the particle poses versus the ground truth during the second de-facto experiment with navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.

4.17 Results from flight 2 without navigation data.

4.18 Visualization of the particle poses versus the ground truth during the second de-facto experiment without navigation data. The ground-truth path is displayed as a continuous line in blue. The average particle poses are scatter-plotted as red dots, while the highest-weighted particle is plotted as a green x.


1 Introduction

1.1 Motivation

In recent years, the popularity and use of Unmanned Aerial Vehicles (UAV) have increased drastically [10]. They have many potential civilian and military applications, ranging from monitoring, crop dusting, and infrastructure inspection to surveillance and accident reporting. Drones have an increasingly prominent presence and influence in various industries. With this in mind, the systems that they depend on must be robust and work as intended.

Figure 1.1: Example of a UAV. The GRIFF 135 from Griff Aviation.

In most classes of robots, in civilian and military applications, localization and navigation are fundamental capabilities. They are crucial for UAVs to execute complex operations. Inertial Navigation System (INS) and Global Positioning Systems (GPS) are often an integrated and critical part of a UAV’s navigation systems, despite their weaknesses.

State estimation such as movement and orientation from INS is rarely used alone and mainly aids position estimation when the GPS signal is unavailable or corrupt. However, due to hardware imperfections [25] that are not easy to avoid, INS will drift over time. Carrol [32] and Caballero [2] et al. state that the number of satellites and the quality of their respective signals play an essential part in estimating a GPS receiver’s position. Few satellites or degraded signals will degrade the estimate. Many factors affect GPS signals and position estimation [42]; among those most relevant to aerial vehicles is signal occlusion, which happens when satellite signals are blocked by buildings, bridges, trees, or other obstacles. Multipath propagation also affects the accuracy of GPS signals and occurs when signals reflect off surfaces such as the walls of buildings. These issues, along with jamming, radio interference, or atmospheric conditions such as major solar storms, cause corrupt signals and inaccuracies. Satellite maintenance or maneuvers that create temporary gaps in coverage are also problematic.

As mentioned, UAVs are heavily dependent on GPS for navigation and localization. The many potential errors related to GPS must be solved to obtain a robust positioning system. In recent years, the Norwegian authorities have claimed that GPS signals were actively jammed [5], which disrupted civilian flights.

With the risks of losing positioning abilities due to the problems mentioned above, having a GPS-independent localization solution is of high importance and, in some situations, crucial. For this reason, there is strong motivation for developing alternative localization methods that can work both alongside and independently of existing ones.

In addition to GPS and INS, cameras are ubiquitous sensors embedded on UAV platforms. With the continuing reduction in weight, price, and size compared to LIDAR or Laser Range Finder, camera usage has increased and is often regarded as a standard sensor. As images contain massive amounts of information about the environment, they are helpful for many tasks. Vision-based localization systems use one or several embedded cameras on the UAV and a map of the environment. They do not rely on other external systems such as ground stations or satellites. Thus, a camera serves as a good candidate for a redundant localization solution or replacement when GPS fails.

Mantelli et al. [35] describe a possible approach to the vision-based UAV localization problem and its challenges. The approach uses a downward-facing camera that provides aerial images of the environment and estimates the position of the images on an a priori known map based on aerial or satellite images. Most of the planet is already mapped, and there are many free online sources providing maps like these, such as Google™ Earth, Bing™ Maps, and others. There are also increasingly more areas with detailed topography information from LIDAR and other sensors. Some of the reported challenges with this approach are the update frequency and resolution of the maps. The images collected by the UAV might also differ significantly from the a priori map due to illumination conditions, transient ground modifications caused by moving objects, weather conditions (in particular rainfall and snow), and long-term static modifications such as new roads or buildings.

Many works propose different maps and measurement models to overcome the challenges related to the UAV localization problem. However, they come with caveats: some only work in specific scenarios and lack robustness outside them, while others have high computational cost or lack precision. Hence, a novel approach that localizes the UAV with invariance to environmental conditions and low power usage has its merits.

1.2 Problem statement

This master's thesis proposes a somewhat novel strategy inspired by Mantelli et al. [35], Masselli et al. [43], and Nassar et al. [22]. The approach uses a downward-facing camera, a vision-based measurement model on an a priori map containing only static objects, and an extra step using machine learning to improve robustness and decrease computational costs. Such a map is freely available via the Norwegian Mapping Authority. The idea is to include an image segmentation network such as U-net [26] and use a very simple binary image descriptor on the segmented images from the camera and the a priori map. Robustness is induced by training the network to be invariant to some of the challenges mentioned earlier. E.g., it should segment out stationary objects contained in the a priori map, such as buildings, regardless of illumination conditions, weather, and seasonal changes such as snow. To estimate the UAV pose in 4 degrees of freedom (DoF), the vision-based localization framework applies the measurement model in a particle filter approach such as Monte Carlo Localization (MCL) [36]. In short, a segmented image of the UAV’s view is compared to several random patches on the a priori segmented map, and the measurement model describes their similarity. With enough particles over time and a sufficiently good segmentation model, the framework should find a patch that is similar enough to provide a likely position.
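The comparison of a segmented camera view against random map patches can be illustrated with a minimal sketch. It is not the measurement model used in this thesis: the pixel-agreement score, the function names, and the synthetic map are assumptions made for illustration, and only the two translational degrees of freedom are sampled (heading and altitude are ignored).

```python
import numpy as np


def patch_similarity(observed: np.ndarray, candidate: np.ndarray) -> float:
    """Fraction of pixels on which two binary masks agree (1.0 = identical)."""
    return float(np.mean(observed == candidate))


def score_random_patches(seg_map, observed, n_particles, rng):
    """Score the observed mask against randomly placed map patches.

    Each (row, col) pair plays the role of one particle hypothesis.
    """
    h, w = observed.shape
    map_h, map_w = seg_map.shape
    rows = rng.integers(0, map_h - h + 1, size=n_particles)
    cols = rng.integers(0, map_w - w + 1, size=n_particles)
    scores = np.array([
        patch_similarity(observed, seg_map[r:r + h, c:c + w])
        for r, c in zip(rows, cols)
    ])
    return rows, cols, scores


# Synthetic binary "building" map standing in for the a priori map (assumption).
rng = np.random.default_rng(0)
seg_map = (rng.random((200, 200)) > 0.7).astype(np.uint8)

# Pretend the UAV sees the 32x32 patch whose top-left corner is (60, 120).
true_row, true_col = 60, 120
observed = seg_map[true_row:true_row + 32, true_col:true_col + 32].copy()

rows, cols, scores = score_random_patches(seg_map, observed, 500, rng)
best = int(np.argmax(scores))
print("best hypothesis:", rows[best], cols[best], "score:", scores[best])
```

In a particle filter, scores of this kind would be turned into particle weights and followed by resampling, so that hypotheses concentrate around well-matching locations over time.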


Figure 1.2: Global localization is attempted by comparing the segmented image captured by the UAV, c), with the a priori map representing the ground truth, b), which describes the environment, a), by segmenting long-term objects such as buildings and roads.

As per Figure 1.2, the suggested framework compares the image observed and segmented by the drone, c), with a patch from the ground truth, b). As the observed image is passed through a network, there might be a need to pre-process it. However, as the segmentation reduces the dimensionality of the images, the computational cost of comparing the observed image to a patch may be lowered. E.g., the measurement model could compare classes, such as static objects versus background, and not 3x255 different pixel values. This, in turn, should result in higher robustness and hopefully lower computational cost. The approach is proposed as a redundant framework to estimate the UAV’s pose. There are nevertheless some situations where the framework will have trouble producing a precise position, such as when passing over areas with few or no buildings or roads. There might also be errors when flying over homogeneous regions. To implement and evaluate such a framework, the following topics must be addressed:

• Obtain a dataset capable of training a segmentation model that meets the framework’s needs. The segmentation component of the framework is supposed to extract static objects from an orthogonal aerial viewpoint. Hence, a dataset capable of training a network on such a task must either be procured or produced.

• Develop and train a segmentation model able to learn the objective from the dataset. The segmentation model used in the framework must learn the given goal from the dataset and its features. It must also provide satisfactory results when used in the framework.

• Develop and program the framework. Produce a framework capable of meeting the set requirements. The framework must be able to segment frames from a video feed and localize them via the MCL algorithm, in real-time on limited computational resources.

• Evaluate the performance of the framework. With all these requirements met, determine if the framework is at all able to localize, and if so, compare its performance to other works and the gold standard, GPS. The framework is intended to run in real-time on a UAV. Hence, a balance must be struck between the performance and computational cost of each component of the framework. Therefore, the evaluation must also include whether the framework can run on a small computer with weight and power consumption within the boundaries of a UAV’s payload capabilities.

Three research questions will also be examined:

• Will a measurement model work after the proposed segmentation?

By only segmenting out the static objects, one reduces the dimensionality and removes many details that distinguish various locations. Will a possible measurement model work in such a simplified environment?

• Is it possible to achieve proper segmentation of static objects such as buildings? How well can the segmentation task be done with respect to the ground truth, and can it be performed in the various conditions previously mentioned? Also, can it reach the performance needed for the proposed measurement model?

• Is the framework able to achieve localization in various environmental conditions? Can it localize despite changes in weather or seasons?


1.3 Main contributions

The main contributions of this thesis are several. It introduces a method of producing a dataset capable of training a segmentation model to segment static objects, such as buildings, in an orthogonal view using existing georeferenced data. Investigations are made into the performance gains of using group normalization versus batch normalization in the U-net architecture. It also tests whether said segmentation network is capable of segmenting buildings under environmental conditions it has not seen.

The main goal of this work is to introduce a framework that is capable of localizing a UAV using a camera as a sensor and is invariant to changes in weather and seasons. Investigations are hence made into the possibilities of using a measurement model on data with far fewer details than in previous work, essentially introducing a novel method of preprocessing the environment before localization with an MCL algorithm. It does so by converting the environment from dynamic to static and shows that the suggested framework is viable.

In a sense, this thesis also paves the way for numerous other uses of the MCL algorithm as it proves it can utilize the pattern recognition traits of deep neural networks.


1.4 Outline

• Chapter 2: Background

This chapter gives an opening to the technical background information of the project. It provides context for the project by describing techniques, concepts, and methods from earlier works done in the field.

• Chapter 3: A framework for UAV localization utilizing object segmentation

This chapter gives an overview of the framework proposed and then describes the framework in parts.

• Chapter 4: Experiments

This chapter gives insight into the different experiments and tests performed on the framework and its components. It contains descriptions of the setup of the tests and the execution of the experiments. Lastly, the results are presented and discussed.

• Chapter 6: Conclusion

This chapter summarizes the master thesis and reviews the conclusions drawn from the results. It also discusses future works.


2 Background

This chapter contains an overview and description of former work and research that this thesis builds upon and utilizes. The chapter aims to provide enough information for the reader to understand and appreciate the rest of the thesis.

Concisely, it summarizes previously developed techniques and approaches that this dissertation uses, including descriptions of concepts and terminology adopted here. The first three sections outline the machine learning concepts: first Convolutional Neural Networks in subsection 2.1, then Fully Convolutional Neural Networks in subsection 2.2, and lastly the selected architecture for this work in subsection 2.3. Mobile Robot Localization is outlined in subsection 2.4, some machine vision topics are mentioned in ??, and the last section, subsection 2.5, contains an overview of related work and its influence on this thesis.

2.1 Convolutional Neural Networks

A Convolutional Neural Network (CNN), in short, is a deep learning algorithm commonly used to assign importance to and differentiate between various objects and aspects of an image fed into it. It does so by changing and updating inherent weights and biases based on a ground truth via supervised learning.

To some extent, they are very similar to regular neural networks that use Multilayer perceptrons (MLPs); both consist of learnable weights. Nevertheless, contrary to MLPs, CNNs have an architecture that explicitly assumes their inputs have a structure like that of images. This assumption allows encoding said property into the architecture by sharing the weights for each location in the image and having neurons respond only locally. I.e., the convolutional part of a CNN is composed of convolutional layers, without the fully connected layers or MLPs that are usually found at the end. This provides efficiency for the forward pass implementation and, most importantly, reduces the number of parameters in the network compared to a fully connected network (FCN). E.g., if a 3-channel image of size 256 by 256 pixels were fed into an FCN, each neuron in the first hidden layer would require 256 × 256 × 3 = 196608 input weights.
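The parameter saving from weight sharing can be made concrete with a short calculation. The 3x3 kernel size and the count of 64 hidden units or filters are hypothetical values chosen for illustration; biases are left out.

```python
# Input image dimensions from the example in the text.
height, width, channels = 256, 256, 3

# Fully connected: every neuron is wired to every input value.
weights_per_dense_neuron = height * width * channels

# Convolutional: every filter only sees a small window, shared across positions.
kernel_size = 3  # hypothetical 3x3 kernel
weights_per_conv_filter = kernel_size * kernel_size * channels

# Hypothetical layer width: 64 neurons versus 64 filters.
units = 64
dense_layer_weights = weights_per_dense_neuron * units
conv_layer_weights = weights_per_conv_filter * units

print("weights per dense neuron:", weights_per_dense_neuron)
print("weights per conv filter: ", weights_per_conv_filter)
print("dense layer total:       ", dense_layer_weights)
print("conv layer total:        ", conv_layer_weights)
```

Under these assumptions the convolutional layer needs several thousand times fewer weights than the fully connected one, and its weight count does not grow with the image size.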


Figure 2.1: A typical outline of a very shallow Convolutional Neural Network. One convolutional layer and one pooling layer. Input images are depicted as a 3D block as they often have three channels (RGB).

The preprocessing required in a CNN is also much lower compared to other classification algorithms. While filters are hand-engineered in primitive methods, CNNs can learn these filters with enough training. The CNN architecture is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of such fields overlap to cover the entire visual area. In particular, nodes in the convolutional layers only connect to small regions of their previous layers, contrary to nodes in an FCN, which connect to all nodes in their next layer. As mentioned, this vastly reduces the number of parameters.


2.1.1 Forward propagation

Forward propagation in a CNN architecture consists primarily of convolutional layers, pooling layers, and activation functions. Convolution is the first layer to extract features from an input image or a previous layer. Convolution preserves the relationship between pixels by learning image features using small filters, known as kernels, that are passed over the image. I.e., values from small areas in previous layers or input images are used to calculate new values in the current layer. A kernel of trainable weights covers the small area. A compact area allows the new value in the current layer to retain surrounding information from that area in the former layer.

Pooling layers offer an approach to downsample feature maps by summarizing the values in patches of the feature map. Two standard pooling methods are average pooling and max pooling, which summarize the average presence of a feature and the most activated presence of a feature, respectively. In short, pooling is an operation whose objective is to reduce the spatial size of the input by keeping a representative value for each patch and discarding the rest.

Activation functions are usually somewhat simple functions implemented at the end of a layer. They are critical as they introduce non-linearity to the networks, which allows the layers and neurons to learn and pass answers down the pipeline.

Commonly, the last layers of CNNs are fully connected layers that make predictions. However, in this work, other modules are used in the later parts of the architecture. For this reason, there are no further mentions of FCNs. The output layer is generally a Softmax layer that maps the class scores to values between 0 and 1. The layers mentioned above are described in this subsection.

2.1.1.1 Convolutional layer

Convolutional layers are the essential building blocks used in CNNs, hence their name [45]. In the context of CNNs, convolution is a linear operation that involves multiplying a set of weights with the input, much like a traditional neural network. Given that the technique was designed for two-dimensional input, the multiplication is performed between an array of input data and a two-dimensional array of weights, called a filter or a kernel.


Figure 2.2: The outline of a convolutional layer.

The filter is smaller than the input data, and the multiplication applied between a filter-sized patch of the input and the filter is a dot product, which produces a single scalar. Applying the filter at every position across the input yields a two-dimensional activation map. The spatial size of this activation map depends on whether the input was padded or not. Padding refers to the pixels added around the border of an image before the kernel processes it. Padding lets the kernel also be centered on border pixels, so the spatial size of the resulting activation map can be the same as that of the input.
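The sliding dot product and the effect of padding can be sketched in a few lines of NumPy. This is an illustrative single-channel implementation with stride 1, not the code used in this work; the mean filter stands in for learned weights, and, as in most deep learning frameworks, the kernel is not flipped.

```python
import numpy as np


def conv2d(image: np.ndarray, kernel: np.ndarray, pad: int = 0) -> np.ndarray:
    """Slide the kernel over the image (stride 1) and take a dot product
    between the kernel and each kernel-sized patch."""
    if pad > 0:
        # Zero padding around the border of the image.
        image = np.pad(image, pad, mode="constant")
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0  # mean filter as a stand-in for trained weights

valid = conv2d(image, kernel)         # no padding: the map shrinks
same = conv2d(image, kernel, pad=1)   # one pixel of padding: size is preserved

print(valid.shape, same.shape)
```

Without padding, a 3x3 kernel on a 5x5 input gives a 3x3 activation map; with one pixel of zero padding, the activation map keeps the 5x5 size of the input.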

2.1.1.2 Pooling Layer

Pooling is a common component of CNN architectures, and the main idea behind these layers is to summarize features from maps generated by previous convolution layers [45]. Formally, its function is to reduce the spatial size of the activation maps, such that it reduces the number of parameters and hence computation in the network. The most common form is max pooling. In a relatively simple operation, a kernel of a chosen size is applied as a sliding window across each activation map individually, and the representative for each area is the largest value within the kernel area in the activation map. The kernel size and the stride are hyperparameters chosen by the designer of the architecture. The stride specifies how many pixels the kernel jumps after each application. If the stride is equal to the number of rows in the kernel, none of the areas the kernel is applied to will overlap. In a max-pool layer, this is the most common approach.


Figure 2.3: Example of max-pooling in a single channel activation map with kernel sizes of 2x2 and stride of 2.

Max pooling is done in part to help reduce over-fitting by providing an abstracted form of the activation maps. It also reduces the computational cost by reducing the number of parameters to learn, and it provides basic translation invariance to the internal representations. When max pooling, a max filter is applied over non-overlapping subregions of the initial activation maps. The idea is to retain the information that best describes the context of the image from each region and discard the information that is not essential.
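As a small worked example of the operation described above, the following sketch applies 2x2 max pooling with a stride of 2 to a single-channel activation map. The input values are arbitrary and chosen only for illustration.

```python
import numpy as np


def max_pool2d(activation: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """Keep the largest value in each size x size window of the activation map."""
    out_h = (activation.shape[0] - size) // stride + 1
    out_w = (activation.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=activation.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = activation[r:r + size, c:c + size].max()
    return out


activation = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
])

pooled = max_pool2d(activation)  # stride equals kernel size, so windows do not overlap
print(pooled)
# [[6 5]
#  [7 9]]
```

The 4x4 map is reduced to 2x2, so every later layer processes a quarter as many values per channel.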

2.1.1.3 Rectified Linear Unit

The Rectified Linear Unit (ReLU) is a non-linear activation function used in multi-layer neural networks or deep neural networks. For input values of x, the function can be represented as:

$$f(x) = \max(0, x) \tag{2.1}$$

According to equation 2.1, the output of ReLU is the maximum value between zero and the input value. The output is equal to zero when the input value is negative, and equal to the input value otherwise. I.e.,

$$f(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \geq 0 \end{cases} \tag{2.2}$$


Traditionally, some prevalent non-linear activation functions, like the sigmoid and hyperbolic tangent functions, have been used in neural networks to get activation values for each neuron. However, the ReLU function has become a more popular activation function because it can accelerate the training of deep neural networks compared to traditional activation functions. This is because the derivative of ReLU is 1 for positive input. Owing to this constant derivative, deep neural networks do not need additional time for computing error terms during the training phase.

The ReLU function mitigates the vanishing gradient problem when the number of layers grows because the function does not saturate for positive inputs, i.e., it has no asymptotic upper bound. Thus, the earliest layer (the first hidden layer) can receive the errors coming from the last layers to adjust all weights between layers. By contrast, a traditional activation function like sigmoid is restricted between 0 and 1 and its derivative is at most 0.25, so the errors become small for the first hidden layer. This will lead to a poorly trained neural network.
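The difference in gradient flow can be illustrated numerically. The sketch below is a deliberate simplification: it multiplies only the activation derivatives along a chain of layers and ignores the weight matrices, and the depth of 20 is a hypothetical value.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def relu_grad(x):
    # Derivative is 1 for positive input and 0 otherwise.
    return np.where(x > 0, 1.0, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0


depth = 20  # hypothetical number of stacked layers

# Product of activation derivatives seen by the first layer.
relu_chain = float(relu_grad(2.0) ** depth)        # positive pre-activations
sigmoid_chain = float(sigmoid_grad(0.0) ** depth)  # best case for sigmoid

print("ReLU chain factor:   ", relu_chain)
print("sigmoid chain factor:", sigmoid_chain)
```

Even in the best case for sigmoid, the factor shrinks by at least a factor of four per layer, while active ReLU units pass the error signal through unchanged.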

2.1.1.4 Normalization Layer One of the most common Normalization techniques used nowadays is Batch Normalization (BN). It is a strategy that normalizes interlayer outputs in a neural network. DN hence resets the distribution of the output from previous layers to be more efficiently processed by the subsequent layers [40].

It relieves numerous problems with properly initializing neural networks. In practice, networks that use BN are significantly more robust to poor initialization.

Additionally, BN can also be described as a preprocessing function integrated at layers in the network itself [38].

The method leads to faster learning, as normalization ensures that there are no extreme activation values. It also allows each layer to learn more independently of the others, and it reduces the amount of data lost between processing layers.

BN improves learning accuracy throughout the network. Ioffe and Szegedy [29] report a state-of-the-art classification model that achieved the same accuracy while requiring 14 times fewer learning iterations when using BN.

However, according to Wu et al. [47], using batch sizes (BS) less than 32 with BN results in a dramatically increased model error. There are situations where one has to settle for a BS below 32, e.g., when the memory consumption of each data sample is too high, when the network is extensive, or when the available hardware is insufficient. This work handles somewhat high-resolution images. For this reason, alternatives to BN that work well with small BS are needed. Group Normalization (GN), proposed by Wu et al. [47], is one of the latest normalization methods that avoids exploiting the batch dimension and is thus independent of BS.

As visualized in Figure 2.4, they reported that with a ResNet-50 model trained on the ImageNet training set using 8 GPUs, a reduction in BS from 32 to 2 resulted in little to no change in error with GN, contrary to an increase in error when using BN. GN has substantially lower error (by 10%) than BN with a BS of 2.

Figure 2.4: “ImageNet classification error versus BSs. This is a ResNet-50 model trained in the ImageNet training set using 8 workers (GPUs), evaluated in the validation set.” Source: [47]

As shown in Figure 2.5, GN's computation is independent of BS and is stable over a wide range of BS. Instance Normalization (IN) and Layer Normalization (LN) are also independent of BS. In the extreme case where the number of groups equals the number of channels, i.e., one channel per group, GN is equivalent to IN. However, as shown in [47], GN achieves better training and validation errors than IN and LN on a ResNet-50, which is very similar to the downsampling path of the U-net.

All in all, Wu et al. [47] showed that GN has benefits over BN when using a BS lower than 16 per worker (GPU). Our hardware also has 8 GPUs, and they do not have the memory needed to train with such a BS.


Figure 2.5: “Normalization methods. Each subplot shows a feature map tensor. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels. Group Norm is illustrated using a group number of 2.” Source: [47]
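The independence of GN from the batch size can be shown with a minimal pure-Python sketch. It assumes a hypothetical data layout where `batch[n][c]` is a flat list of pixel values for sample `n` and channel `c`, and it omits the learnable per-channel scale and shift that a full GN layer has.

```python
import math


def group_norm(batch, num_groups, eps=1e-5):
    """Normalize every sample on its own, one channel group at a time."""
    out = []
    for sample in batch:
        channels = len(sample)
        assert channels % num_groups == 0
        per_group = channels // num_groups
        normed = []
        for g in range(num_groups):
            group = sample[g * per_group:(g + 1) * per_group]
            # Mean and variance over all pixels of all channels in the group.
            values = [v for channel in group for v in channel]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            scale = 1.0 / math.sqrt(var + eps)
            normed.extend([[(v - mean) * scale for v in channel] for channel in group])
        out.append(normed)
    return out


sample_a = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
sample_b = [[0.0, 5.0], [5.0, 0.0], [-1.0, 1.0], [2.0, -2.0]]

alone = group_norm([sample_a], num_groups=2)[0]
in_batch = group_norm([sample_a, sample_b], num_groups=2)[0]
print(alone == in_batch)
```

Because the statistics are computed per sample, the result for `sample_a` does not depend on which other samples share its batch. Setting `num_groups` to the number of channels gives the IN case mentioned above.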

2.1.1.5 Softmax layer In many neural networks, the last layer is a softmax layer [40]. It transforms the class scores into numbers ranging from 0 to 1. The class-wise predictions sum to 1, so the output can be interpreted as a probability distribution over the classes. The function looks like this:

\[ f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}} \tag{2.3} \]

where z is the set of scores to be squashed. As one can see, the formula raises e to the power of each class score and divides it by the sum of e raised to the power of every score in the set.
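Equation 2.3 can be sketched in a few lines of Python. Subtracting the largest score before exponentiating is a common numerical-stability step that is not part of the equation itself; it leaves the result unchanged. The example scores are arbitrary.

```python
import math


def softmax(scores):
    # Shift by the maximum score so that exp() cannot overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The outputs are positive, sum to 1, and preserve the ordering of the input scores.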

2.1.2 Backpropagation

Backpropagation, short for backward propagation of errors, is arguably the most important part of the training process. Backpropagation is where the learnable parameters, weights, and biases in the network are updated to improve predictions [44]. When a network is run through the training loop, a loss function is calculated, representing the network’s predictions and its distance from the true labels. Backpropagation allows us to calculate the gradient of the loss function, proceeding backward throughout the network from the last layer to the first.

Calculation of the gradient at a particular layer requires combining the gradients of all following layers via the chain rule of calculus. This operation enables each weight to be updated individually to gradually reduce the loss function over many training iterations. Loss functions are described in more detail further down in this section.

The loss function provides the loss, or error, L at each output node, and the objective is to minimize this value. The input to this function is x, which consists of the input values in the training data and the learnable parameters in the network.

The input data is static and cannot be altered to minimize L, but the parameters can. The backpropagation method calculates how much the parameters should be altered by finding their gradients relative to the error output from the loss function. Said gradient is calculated by taking the partial derivatives of the output with respect to the input values. E.g., if an output depends on three input values, the output has three partial derivatives. The gradient itself is the vector consisting of all these partial derivative values.

To attain the gradient, the partial derivative of the error output with respect to each parameter that contributed to it must be calculated. For this to be possible, the process must start at the output layer. Every node obtains input values from a set of nodes in the previous layer. In the forward pass, each node calculates its output value from its inputs, along with the local gradient of that output value with respect to its input values. During backpropagation, moving backward from the output layer to the input layer, every node in turn learns the gradient of the final error output with respect to its own output value. This upstream gradient is then multiplied with the local gradients that the node has obtained, per the chain rule. The process then repeats for the nodes that feed the particular node, which thereby learn the gradient of the final error output with respect to their own outputs. In this manner, every parameter from the output layer to the input layer obtains its gradient relative to the final error output it contributed to and can be modified to minimize said error based on the gradient value.
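The chain-rule bookkeeping described above can be sketched for a single neuron. The example assumes one weight, one bias, a ReLU activation, and a squared-error loss chosen only to keep the arithmetic short; the names and values are hypothetical. A finite-difference estimate is included as a sanity check on the analytic gradient.

```python
def forward(w, b, x, y):
    z = w * x + b            # pre-activation
    a = max(0.0, z)          # ReLU output
    loss = (a - y) ** 2      # squared error, an assumption for this sketch
    return z, a, loss


def backward(w, b, x, y):
    z, a, _ = forward(w, b, x, y)
    dloss_da = 2.0 * (a - y)          # upstream gradient coming from the loss
    da_dz = 1.0 if z > 0 else 0.0     # local gradient of ReLU
    dloss_dz = dloss_da * da_dz       # chain rule
    # Local gradients of z: dz/dw = x and dz/db = 1.
    return dloss_dz * x, dloss_dz


w, b, x, y = 0.5, 0.1, 2.0, 3.0
grad_w, grad_b = backward(w, b, x, y)

# Central finite difference on the weight.
eps = 1e-6
numeric_w = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)

# One gradient-descent step with a hypothetical learning rate.
lr = 0.1
w_new, b_new = w - lr * grad_w, b - lr * grad_b
print(grad_w, numeric_w, forward(w_new, b_new, x, y)[2])
```

The analytic and numerical gradients agree, and the loss after the update step is lower than before it.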

2.1.2.1 Loss functions The loss function determines the error, or in other words, the loss, between a network prediction and a given target value. It expresses how far off the network is at making a correct prediction. It is often expressed as a scalar that increases by how far off the model is. As the goal of the model is to perform correct predictions, the main objective of the training process is to minimize this error. A common loss function is the cross-entropy loss function. Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. The function can be described as:

\[ -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \tag{2.4} \]

where M is the number of classes, y_{o,c} is the binary indicator of whether class label c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o belongs to class c. With binary classification, where the number of classes M equals 2, the cross-entropy can be calculated as:

\[ -\left( y \log(p) + (1 - y) \log(1 - p) \right) \tag{2.5} \]

If M > 2, we calculate a separate loss for each class label per observation and sum the result.

In practice, this gives a high loss value to confident wrong predictions and a loss value close to 0 to correct predictions. This is the behavior wanted in a loss function, since minimizing it gives better predictions. This is the basis of the loss function used in this work.
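Equations 2.4 and 2.5 can be sketched as follows. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the example probabilities are made up for illustration.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    # Equation 2.4 for one observation: -sum_c y_c * log(p_c).
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


def binary_cross_entropy(y, p, eps=1e-12):
    # Equation 2.5, with p clipped away from 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


confident_right = cross_entropy([0, 1, 0], [0.05, 0.90, 0.05])
confident_wrong = cross_entropy([0, 1, 0], [0.90, 0.05, 0.05])
print(confident_right, confident_wrong)
```

A confident correct prediction gives a loss near zero, while a confident wrong one gives a much larger loss. With two classes, both functions return the same value.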

2.2 Fully Convolutional Network

The goal of semantic image segmentation is to classify each pixel of an input image with a corresponding class of what is represented. Because we predict every pixel in the image, this task is commonly referred to as dense prediction.

The expected output in semantic segmentation is a complete high-resolution image in which all the pixels are classified. In a typical convolutional network, the height and width of the input gradually reduce, i.e., downsampling, because of pooling. This downsampling helps the filters in the deeper layers to focus on a larger receptive field. However, the depth and number of filters used gradually increase, which aids in extracting more complex features from the image. From the pooling layers, one can somewhat conclude that the model better understands the context presented in the image by downsampling. However, it loses the information of locality, i.e., where said context is located. Thus if one were to use a regular convolutional network with pooling layers and dense layers, the locality information would be lost, and one only retains the contextual information.


A Fully Convolutional Network (FCN) is a CNN-based network that progresses from coarse to fine inference to predict every pixel, as semantic segmentation requires [3].

2.2.1 Deconvolution

In order to attain the output expected from semantic segmentation, there is a need to convert, or upsample, the low-resolution information provided by a typical CNN to high resolution and recover the locality information. Deconvolution, sometimes also called transposed convolution or fractionally strided convolution, is a technique to perform upsampling of an image with learnable parameters [14].

On a high level, transposed convolution is precisely the opposite process of a standard convolution, i.e., the input volume is a low-resolution image, and the output volume is a high-resolution image. A regular convolution can be described as a matrix multiplication of input image and filter to produce the output image.

In short, by taking the transpose of the filter matrix, it is possible to reverse the convolution process. Hence the name transposed convolution.
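The matrix view can be sketched in one dimension. The kernel, signal, and stride below are arbitrary illustration values. Note that multiplying by the transposed matrix restores the spatial size of the input rather than its original values, which is why the filter weights are learned during training.

```python
def conv_matrix(kernel, input_len, stride):
    # Each row places the kernel at one output position of a strided 1-D convolution.
    k = len(kernel)
    out_len = (input_len - k) // stride + 1
    rows = []
    for i in range(out_len):
        row = [0.0] * input_len
        for j, weight in enumerate(kernel):
            row[i * stride + j] = weight
        rows.append(row)
    return rows


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


kernel = [1.0, 2.0, 3.0]
C = conv_matrix(kernel, input_len=5, stride=2)

signal = [1.0, 0.0, 2.0, 0.0, 1.0]
down = matvec(C, signal)           # convolution: length 5 -> length 2
up = matvec(transpose(C), down)    # transposed convolution: length 2 -> length 5
print(down, up)
```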

2.2.2 Skip connections

Skip connections have become a standard module in most convolutional networks. They provide an alternative path for the gradient during backpropagation. In short, a skip connection skips some layers and feeds the output of one layer to a later layer rather than only to the next one. As we go backward during backpropagation, the resulting gradient often becomes very small in deep networks with many layers. This vanishing gradient is caused by the long chain of multiplications with numbers less than one. In the worst cases, gradients might become zero, resulting in no update to the early layers.

Skip connections are often divided into short and long skip connections. Short skip connections are used along with consecutive convolutional layers that do not change the input dimension. In contrast, long skip connections are used in symmetrical encoder-decoder architectures. These are architectures where the spatial dimensionality is reduced during encoding and gradually increases an equal amount during decoding. During decoding, the dimensionality of a feature map is increased via transpose convolutional layers described in subsubsection 2.2.1. Simply put, said layers form the same connectivity as a standard convolution layer but in a backward direction.

The motivation behind this type of connection is to have an uninterrupted flow of gradients from the first to the last layer and hence tackle the vanishing gradient problem. Also, as shown in Figure 2.6, Li et al. [9] report that the loss surface changes dramatically with skip connections: it has a wider global minimum and fewer local minima.

Figure 2.6: “The loss surfaces of ResNet-56 with/without skip connections. The proposed filter normalization scheme is used to enable comparisons of sharpness/flatness between the two figures.” Source: [9]

2.3 U-net

The neural network U-net is built upon FCN and adopts an encoder-decoder architecture, which consists of a contracting path to capture context and a symmetric expanding path to enable accurate localization. It was first introduced by Ronneberger et al. [26] to segment medical images and achieved good results with few training images. These types of images, like orthophotography images, are very close to two-dimensional images as they are taken orthogonal to the plane of the subjects. With this in mind, some studies have shown that U-net is very suitable for remote sensing images [24]. From the experiments of Hu et al. [8], U-net was among the top two networks alongside Deeplab developed by Chen et al. [4]. Although U-net had a slightly lower performance than Deeplab, according to [16], U-net outperforms Deeplab when it comes to inference time.


Figure 2.7: U-net architecture designed by Ronneberger et al. Source: [26]

The architecture of U-net builds upon the concept of FCN. As previously mentioned, the network is structured as an encoder-decoder, as shown in Figure 2.7, and consists of three main parts. The first, leftmost blocks are called the contracting or downsampling path. It consists of 4 blocks, where each block applies a double 3x3 convolution step, with BN applied after each convolution, followed by 2x2 max-pooling. Each down-step to a new pooling layer doubles the number of feature maps. The bottom-most horizontal layer is called the bottleneck and consists of a double 3x3 convolution followed by a 2x2 up-convolution. The right-most layers are called the expanding or upsampling path. It is complementary to the contracting path, consisting of an equal number of blocks. Each block comprises a double 3x3 convolution followed by a 2x2 upsampling, also known as a transpose convolution. The number of feature maps in this path halves after every block.

In order to localize, the U-net also has skip connections between the downsampling and upsampling paths. The upsampled output is concatenated with the corresponding cropped feature maps from the downsampling path. The feature maps are cropped because border pixels are lost in every convolution. Said concatenation allows the upsampling path to utilize the features learned during downsampling.
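The need for cropping can be made concrete by tracking the spatial size of the feature maps through the network. The sketch below assumes unpadded 3x3 convolutions, each of which removes one border pixel per side, and uses the 572-pixel input size from the original U-net paper [26] as an example. The function name is hypothetical.

```python
def unet_sizes(input_size, depth=4):
    """Track the spatial size through a U-net with unpadded 3x3 convolutions."""
    size = input_size
    skips = []
    for _ in range(depth):
        size -= 4                  # two unpadded 3x3 convolutions lose 2 pixels each
        skips.append(size)         # feature map handed to the skip connection
        assert size % 2 == 0, "size must be even before 2x2 max-pooling"
        size //= 2                 # 2x2 max-pooling halves the size
    size -= 4                      # bottleneck: double 3x3 convolution
    crops = []
    for skip in reversed(skips):
        size *= 2                  # 2x2 up-convolution doubles the size
        crops.append((skip, size)) # skip map is cropped from `skip` to `size`
        size -= 4                  # double 3x3 convolution
    return size, crops


output_size, crops = unet_sizes(572)
print(output_size, crops)
```

Under these assumptions, a 572x572 input yields a 388x388 output, and the first skip connection has to be cropped from 568 to 392 pixels before concatenation.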

2.4 Mobile robot localization

Mobile robot localization is the problem of determining a robot's pose relative to a given map of the environment. It is often called position estimation. It is an instance of the general localization problem, which is among the most basic perceptual problems in robotics. Nearly all robotics tasks require knowledge of the location of objects susceptible to manipulation. Put simply, a robot is given a map of its environment, and its goal is to determine its position relative to this map, given its perceptions of the environment and its own movements.

Examples of perceptions are readings from sensors such as optical radar (LIDAR), cameras, or wheel odometry. Mathematically, this is called a problem of coordinate transformation. Usually, a robot cannot sense its pose directly. The errors in pose estimation often stem from the fact that most sensors are not noise-free, and single sensor readings are usually insufficient, especially in homogeneous environments, for instance, a robot navigating a building with very similar corridors. For this reason, the robot's pose must be inferred from data over time.

2.4.1 Different localization problems

Different situations require different tools for a robot to localize its pose. Localization problems are therefore divided according to their inherent objectives and properties, such as the nature of the environment and the initial knowledge the robot may possess.

Global versus Local Localization: Based on the type of information known at the start and during run-time, three main localization problems are distinguished, listed here in order of increasing difficulty:

Position tracking assumes a known robot pose at startup. Localization is thus achieved by adapting to the noise in the robot's motion. The effect of such noise is usually small, i.e., the estimated pose is highly likely to be close to the robot's true pose. Methods for position tracking often rely on the assumption that the pose error is small, and a unimodal distribution, e.g., a Gaussian distribution, is used to approximate the pose uncertainty. Said localization problem is defined as a local localization problem, as the uncertainty is local and confined to a small region around the robot's true pose.

Situations where the robot must localize with an unknown initial pose are called global localization. The robot's pose is somewhere in the environment, but its exact location is unknown. As a solution must both localize and track, this problem subsumes the position tracking problem and is more complicated.

The most challenging localization problem is the kidnapped robot problem. It is a variant of the global localization problem, where the robot can get kidnapped and teleported to some other location. It is more complex, as the robot might believe it knows where it is while it does not. In global localization problems, the robot usually has some idea that it does not know its location.

One might have to visit the universe of Star Wars to teleport a robot somewhere else. However, this problem is of great practical importance. Recovery from failures is critical for autonomous robots, and kidnapping a robot tests a localization algorithm's ability to recover from global localization failures.

The environments the robots move in can also provide challenges. Static environments are, in short, environments where only the robot moves and nothing else. All other objects are static and remain in the same location. Dynamic environments contain other objects that have measurable movements and rearrange the environment's configuration as a whole. Examples of changes relative to this project are cars, people, daylight, and weather. The latter environment is the most difficult to localize in, as one has to take dynamic objects into account.

The last topic that characterizes different localization problems is whether the localization algorithm has input into the robot's motion. Passive localization only observes the robot and its sensors, while active localization affects the robot's movements to acquire or improve the probability of the robot's estimated pose. The latter usually yields better results than the former. However, it is not always desirable, or possible, to affect the movement of a robot.

2.4.2 Monte Carlo Localization

Consider a robot moving around trying to determine its pose on a map of its environment using its sensors. As previously mentioned, most sensor readings come with some form of error, so the robot generates several random guesses of where it is on the map. Each guess, called a particle, contains a complete description of a possible pose. At every scan with its sensors, it compares its reading with that of the particles and discards the particles inconsistent with its observations, while generating more particles close to those that are consistent with its readings. After some time, hopefully, most of the particles converge towards the robot's true pose. This form of localization is called Monte Carlo Localization (MCL). It is easy to implement, works well on many localization problems, and solves local and global localization.

To get a basic understanding of the algorithm, Figure 2.8 visualizes the iterative process of localizing with MCL. Each picture depicts a robot's position in a hallway, along with its belief bel(x) of where it is, represented by a histogram over a grid. Each grid cell may be considered a particle, or a possible position of the robot. In step (a), the algorithm starts by initializing a uniformly distributed random set of particles with equal weights, where a weight is the probability of the particle having the same pose as the robot. In short, the robot has no idea where it is, as its bel(x) is equal over the whole hallway. Next, in step (b), the robot runs a sensor update with the measurement model p(z|x), i.e., it checks whether it is next to a door and compares its reading (True) with that of the random particles. A function thus applies a new weight to each particle based on the similarity of the readings and updates the belief bel(x). Particles next to a door get larger weights, while those not next to a door get smaller weights. Step (c) shows the motion update applied to both the robot and the particles. In step (d), a new sensor update is made, and in step (e), the particles with the highest weights are located around the robot's actual pose.


Figure 2.8:“Monte Carlo Localization, a particle filter applied to mobile robot localization.” Source: [12]

MCL can also be described as follows: the Probability Density Function (PDF) of the robot's pose is represented by a set of particles. This is the bel(x) function in Figure 2.8. Each particle is a sample of the PDF and codifies a possible pose of the robot. Particles are distributed according to the PDF, i.e., the PDF regions with a higher probability will have a higher concentration of particles. The vanilla MCL algorithm (also known as particle filter localization) described in [12] solves both the local and global localization problems. However, the most complex localization problem, the kidnapped robot problem, cannot be solved with the basic MCL. The algorithm is also relatively inefficient during position tracking, as it uses the same number of particles as during global localization.
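The hallway example can be sketched as a toy one-dimensional particle filter. Everything in the sketch is an assumption made for illustration: the circular hallway of 20 cells, the door positions, the 0.9/0.1 measurement model, the motion-noise probabilities, and the noise-free robot. It is not the implementation used in this work.

```python
import random
from collections import Counter

HALLWAY = 20        # number of cells in a circular hallway (assumption)
DOORS = {2, 3, 9}   # cells next to a door (assumption)


def sense(pos):
    # Binary sensor: is this cell next to a door?
    return pos % HALLWAY in DOORS


def motion_update(particles, step, rng):
    # Move every particle, occasionally off by one cell to model motion noise.
    noise = rng.choices([-1, 0, 1], weights=[0.02, 0.96, 0.02], k=len(particles))
    return [(p + step + n) % HALLWAY for p, n in zip(particles, noise)]


def sensor_update(particles, z):
    # Measurement model p(z|x): high weight if the particle agrees with the reading.
    return [0.9 if sense(p) == z else 0.1 for p in particles]


def resample(particles, weights, rng):
    # Draw a new particle set with probability proportional to the weights.
    return rng.choices(particles, weights=weights, k=len(particles))


def run_mcl(steps=50, n=500, seed=0):
    rng = random.Random(seed)
    robot = 0
    particles = [rng.randrange(HALLWAY) for _ in range(n)]  # uniform initial belief
    for _ in range(steps):
        robot = (robot + 1) % HALLWAY
        particles = motion_update(particles, 1, rng)
        weights = sensor_update(particles, sense(robot))
        particles = resample(particles, weights, rng)
    return robot, particles


robot, particles = run_mcl()
estimate = Counter(particles).most_common(1)[0][0]
print(robot, estimate)
```

Starting from a uniform belief, the particles concentrate around the robot's cell after it has passed the doors a few times, mirroring steps (a) to (e) in Figure 2.8.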

2.4.2.1 Augmented MCL MCL may solve the global localization problem in its vanilla form but cannot recover from robot kidnapping or global localization failures. As the position is acquired, particles at places other than the most likely pose gradually disappear. At some point, particles only “survive” near a single pose, and the algorithm cannot recover if this pose happens to be incorrect.

In practice, any stochastic algorithm such as MCL may accidentally discard all particles near the correct pose during the resampling step. This problem is particularly pronounced when the number of particles is small and the particles are spread out over the environment during global localization. A relatively simple heuristic can solve this problem. The idea of this heuristic is to add random particles to the particle sets. Such an injection of random particles can be justified mathematically by assuming that the robot might get kidnapped with a small probability, thereby generating a fraction of random states in the motion model. Even if the robot does not get kidnapped, the random particles add a level of robustness. However, this is not optimal during position tracking or in situations with a high likelihood of correct localization, so an adaptive manner of drawing particles is a better solution. As mentioned in [12], Augmented MCL is an adaptive variant of MCL that adds random samples when the likelihood of localization is low. The algorithm adds random particles to the set based on the probabilities of sensor measurements.

Thus, the lower the average sensor measurement probabilities, the higher the number of random particles added to the set. In our case, if the particles' images do not compare well to the drone's image, we add random particles.
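A common way to realize this heuristic, following the Augmented MCL formulation in [12], is to track a slow and a fast exponential average of the mean particle weight and to inject random particles with probability max(0, 1 - w_fast / w_slow). The sketch below assumes this formulation; the class name, decay rates, and example weights are hypothetical.

```python
class AugmentationState:
    """Tracks short- and long-term averages of the mean particle weight."""

    def __init__(self, alpha_slow=0.05, alpha_fast=0.5):
        # alpha_slow must be much smaller than alpha_fast (values are assumptions).
        self.alpha_slow = alpha_slow
        self.alpha_fast = alpha_fast
        self.w_slow = 0.0
        self.w_fast = 0.0

    def update(self, w_avg):
        # Exponential moving averages of the mean measurement likelihood.
        self.w_slow += self.alpha_slow * (w_avg - self.w_slow)
        self.w_fast += self.alpha_fast * (w_avg - self.w_fast)
        if self.w_slow == 0.0:
            return 0.0
        # Probability of replacing a resampled particle with a random one.
        return max(0.0, 1.0 - self.w_fast / self.w_slow)


state = AugmentationState()

# Good localization: the mean weight stays high, so nothing is injected.
for _ in range(100):
    p_tracking = state.update(0.8)

# Sudden drop in measurement likelihood, e.g. after a kidnapping.
p_kidnapped = state.update(0.1)
print(p_tracking, p_kidnapped)
```

While the measurements agree with the particles, the injection probability is zero. When the short-term average drops below the long-term average, it becomes positive.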

2.4.2.2 KLD-Sampling: Adapting the Size of Sample Sets In the mentioned MCL algorithms, the number of particles is fixed. However, during position tracking, the number of particles could be much lower. Therefore, adapting the particle set size improves the efficiency of MCL algorithms. KLD-sampling [7] is a variant of MCL that adapts the number of particles over time.

This paper does not provide a mathematical derivation of KLD-sampling but only provides a short description. The name KLD-sampling is derived from the Kullback-Leibler divergence, which is a measure of the difference between two probability distributions. The idea behind KLD-sampling is to determine the number of particles based on a statistical bound on the sample-based approximation quality. Specifically, at each iteration, it determines the number of samples such that, with probability 1−δ, the error between the true posterior and the sample-based approximation is less than ε.
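As a rough illustration, the bound from the KLD-sampling literature [7] can be evaluated directly. It relates the required number of particles to the number k of histogram bins that currently contain at least one particle. The sketch below assumes that formulation; the function name and default values for the error bound and δ are hypothetical, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def kld_sample_size(k, epsilon=0.05, delta=0.01):
    """Particles needed so that, with probability 1 - delta, the KL divergence
    between the sample-based estimate and the true posterior is below epsilon.
    k is the number of histogram bins that contain at least one particle."""
    if k < 2:
        return 1
    z = NormalDist().inv_cdf(1.0 - delta)   # upper (1 - delta) normal quantile
    a = 2.0 / (9.0 * (k - 1))
    n = (k - 1) / (2.0 * epsilon) * (1.0 - a + math.sqrt(a) * z) ** 3
    return math.ceil(n)


for bins in (2, 10, 100):
    print(bins, kld_sample_size(bins))
```

When the particles occupy few bins, as in position tracking, few particles are needed. When they are spread over many bins, as in global localization, the required number grows.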

2.4.2.3 KLD-Augmented-MCL Gamallo et al. [30] combined KLD-sampling and Augmented MCL into one algorithm that solves global localization and the kidnapped robot problem, while the KLD-sampling improves efficiency. They reported excellent results running the algorithm in real time on a robot.

Figure 2.9: The KLD-Augmented-MCL algorithm. Source: [30]


The algorithm represented in Figure 2.9 takes in the previous PDF, represented by the particle set X_{t−1}, the motion command (u_t), the set of detected features in the present iteration (F), and the environment map (M). The different steps of the algorithm can be divided into three groups:

• MCL: is the core of the algorithm (lines 12–17). It samples the previous particle set, updates the particles using the motion model, and, finally, estimates the weight of each particle through the measurement model.

• Augmentation: corresponds to the insertion of random particles when the present measurements do not match with the expected ones (lines 3–8, 18, 24, and 25). This allows the algorithm to recover from localization failures, for example, in the kidnapping problem.

• KLD: calculates the number of particles that are necessary to appropriately represent the PDF of the pose of the robot (lines 9–11, 19–22, 26–29).

A detailed description of the algorithm will not be provided in this paper, as it is described very well in [30]. However, the implementation of the algorithm will be outlined in more detail later. In short, it includes the random particle generation from Augmented MCL at the beginning and adapts the size of the particle distribution at the end.

2.5 Related works

This section outlines some of the work that inspired this thesis and what this work intends to improve. The works mainly concern localizing UAVs, and many of their aspects are adapted into this work.

Mantelli et al. [35] propose a new localization strategy for a UAV equipped with a downward-facing camera, using a robust vision-based measurement model. The proposed measurement model computes the likelihood of the robot pose with the aid of an improved descriptor called abBRIEF [35], based on BRIEF [28]. The abBRIEF descriptor differs from BRIEF in two points: the color space used and the image noise reduction strategy. Their vision-based localization system applies the new measurement model in a Monte Carlo Localization (MCL) approach [36] that estimates the UAV pose in 4 degrees of freedom (DoF). In their paper, the UAV is located within a short period, outperforming previous measurement models and yielding low errors. Still, it is not proven to be robust against environmental changes like lighting and seasonal changes. This is what we seek to improve by introducing machine learning and changing the measurement model.

Masselli et al. [43] also attempt UAV localization with a downward-facing camera. They instead use a particle filter with terrain classification through feature extraction for localization. Their solution provides global localization and an average error of 5.2 m but is not robust against all environmental changes. We believe that segmentation through Deep Learning will yield a much more accurate result that is more robust against environmental changes.

Viswanathan et al. [33] demonstrate a working implementation of semantic segmentation with a Bayesian localization algorithm for ground vehicles across seasons, successfully localizing in satellite maps from summer, winter, and spring. Inherently, solving the localization problem is harder for an Unmanned Ground Vehicle (UGV) than for a UAV due to the drastic shift in perspective from the ground images to satellite map images. Although this paper also uses segmentation with LIDAR to locate roads, it supports the idea that invariance across seasons can be achieved using semantic segmentation and a particle filter. An important note is that their “winter” environment contained no snow, but this can be included when training the network.

Nassar et al. [22] show successful segmentation of satellite imagery using U-Net but use a custom Semantic Shape Matching algorithm to establish the location in the satellite map. While the segmentation is mainly successful, localization is sub-par, and robustness against environmental changes is not proven. Their framework also uses SIFT [34] registration, making it more computationally heavy, and it does not inherently provide global localization.


3 A framework for UAV localization utilizing object segmentation

This chapter presents the localization framework developed in this project. The segmentation network used in it and the dataset created for its purpose are also described. The framework, visualized in Figure 3.1, operates in iterations and is intended to produce a pose estimate in real time. It is divided into two primary blocks, one performing image segmentation and the other localizing with the MCL algorithm. The segmentation block segments static objects from the images taken by the drone's downward-facing camera. Such objects may include buildings, roads, and bodies of water, depending on the dataset and framework setup. In this setup, only buildings are included. The latter block uses the KLD-Augmented-MCL algorithm described in paragraph 2.4.2.3 to localize said frames in the ground-truth map. The framework also obtains heading and altitude from the drone, if available, so that the particles have a pose more similar to the drone's real-world pose. This added information aids the localization algorithm in localizing the UAV. As mentioned, the framework is intended to operate in real time. However, this depends on the available hardware and on tuning the hyperparameters, such as the number of particles.

Figure 3.1:Simplified overview of the framework.


The chapter begins by describing the segmentation network, its training procedure, and the dataset produced for its purpose. Lastly, an overview of the framework is provided, and each block is presented in more detail. The specific implementation of the KLD-Augmented-MCL algorithm described in paragraph 2.4.2.3 will also be outlined.

3.1 Segmentation network - U-net

The segmentation network used in the segmentation block is almost an exact copy of the vanilla U-net described in subsection 2.3. Although many have tried adapting the U-net architecture to their specific tasks, the vanilla model provides excellent results. The only part where this project has modified the original architecture is changing the batch normalization in the contracting path to group normalization. More specifically, the change affects the normalization step after the convolution in the conv 3x3, ReLU step shown in Figure 2.7. As described in paragraph 2.1.1.4, although BN is still a milestone technique enabling neural networks to train, its error increases rapidly with smaller batch sizes, specifically with BS less than 16. The increase in error is caused by inaccurate estimation of the batch statistics, and GN, introduced by Wu et al. [47], attempts to solve this issue. When training on higher-resolution images, BS must decrease due to memory limitations on GPUs. GN allows us to use lower BS with an increase in performance.

Figure 3.2:Alteration of the conv 3x3, ReLU step of the vanilla U-net model.

As mentioned in subsection 2.3, U-net is chosen for its state-of-the-art performance in segmentation accuracy, good inference time, and ease of implementation. PyTorch [31] in combination with PyTorch Lightning [37] is used to implement the network and its training framework. The Python libraries allow ease of use, simple implementation of data augmentation, and distributed training over several GPUs.

3.2 Dataset

The purpose of our model is to robustly segment out non-moving and long-lasting objects such as buildings, roads, and water, through seasons and in different weather conditions. For the model to accomplish this, it needs to be shown examples of such objects and learn them; this task is the sole purpose of the dataset. It is a large set of images, accompanied by ground truth images, where each pixel is annotated as one of the classes defined in the dataset.

Such open datasets do exist. However, they lack variance concerning seasons and weather. Also, most provide satellite images with an insufficient ground sampling distance (GSD), i.e., too coarse a resolution. As the Norwegian Civil Aviation Authority states, the maximum flight altitude in Norway for uncertified and private drones is 120 meters relative to the ground. Given the specifications of a typical drone such as the DJI Mavic 2 Pro, this results in a ground footprint of approximately 158 m by 89 m at a 16:9 aspect ratio and a GSD of 2.95 cm.

Previous work using U-net, such as Sahu et al. [11], employed a dataset with a GSD as coarse as 100 cm and obtained decent results on a somewhat similar task. On the other hand, Horwath et al. [17] mention improvements in accuracy for high-resolution electron microscopy images, but not without challenges.

However, their task is not directly transferable to segmenting buildings, as they were looking for microscopic particles with a radius of circa 4-6 pixels in a 1024x1024 image. In comparison, with a GSD of 5 cm, a small shed with a footprint of 1 m² would take up about 20x20 pixels, and most static buildings have a larger footprint than 1 m². To sum up, U-net seems adaptable to different resolutions, but we want to train the network on data as similar as possible to the data it will be used on.
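The pixel counts quoted above follow from dividing a ground length by the GSD. The short sketch below reproduces the arithmetic; the function name pixels_covered is hypothetical and only serves the illustration.

```python
def pixels_covered(length_m: float, gsd_cm: float) -> float:
    """Number of pixels spanned by a ground length at a given GSD."""
    if gsd_cm <= 0:
        raise ValueError("gsd_cm must be positive")
    return length_m * 100.0 / gsd_cm


if __name__ == "__main__":
    # Side of a 1 m x 1 m shed at a GSD of 5 cm: 20 pixels per side.
    print(pixels_covered(1.0, 5.0))
    # Width of the 158 m drone footprint at a GSD of 2.95 cm: about 5356 pixels.
    print(round(pixels_covered(158.0, 2.95)))
```

At a GSD of 100 cm the same shed would cover a single pixel, which illustrates why the resolution of the training data matters for small structures.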

3.2.1 Producing a new dataset

The solution is to create a dataset specifically for the needs of this project. The Norwegian mapping organization, Kartverket, provides aerial imagery that has been geometrically corrected and contains geospatial information about where it is. Such images, or maps, are often called orthophotos. The geometrical correction, a.k.a. rectification, is used to project images onto a common image plane. This process is how multiple aerial photos are patched together to make a larger map. Kartverket also provides vectorized maps annotating buildings, roads, and some other classes. Kartverket did not have any orthophotos with a GSD as low as the example given for the DJI drone. Despite this, orthophotos were acquired with a GSD ranging from 4.0 cm to 20.0 cm, as listed in Table 1.

Scale     GSD        Field of view   Drone height
1:1000    4.0 cm     20 m            43 m
1:2000    8.0 cm     41 m            87 m
1:2000    10.0 cm    51 m            109 m
1:5000    12.5 cm    64 m            136 m
1:5000    20.0 cm    102 m           217 m

Table 1: GSD variance of raster maps ordered from Kartverket. Drone height and FOV are approximations based on the camera of a DJI Mavic Pro drone.
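As a rough check of Table 1, the field-of-view values appear to match the ground width of a 512-pixel-wide sample at each GSD. This is an observation from the numbers, not something the text states explicitly, and the constant and function names below are hypothetical.

```python
SAMPLE_SIZE_PX = 512  # assumed sample width, taken from the 512x512 samples of this dataset


def footprint_m(gsd_cm: float, size_px: int = SAMPLE_SIZE_PX) -> float:
    """Ground width in metres covered by size_px pixels at the given GSD."""
    return gsd_cm * size_px / 100.0


if __name__ == "__main__":
    for gsd_cm in (4.0, 8.0, 10.0, 12.5, 20.0):
        # Rounded values reproduce the field-of-view column: 20, 41, 51, 64, 102.
        print(f"GSD {gsd_cm:5.1f} cm -> {footprint_m(gsd_cm):6.2f} m")
```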

To provide variance in location and building types, orthophotos were extracted from different locations in each county in Norway. The orthophotos cover areas ranging from industrial and residential to urban, as displayed below in Table 2.

Cabin fields were also extracted, as buildings in such areas in Norway often have turfed roofs. Turfed roofs might be the most challenging rooftops for a segmentation network to classify, as they look very similar to plain ground.

These are the chosen locations, listed by county:

• Agder: Kristiansand

• Innlandet: Hamar

• Møre og Romsdal: Ålesund

• Nordland: Bodø

• Oslo: Oslo

• Rogaland: Stavanger

• Vestfold og Telemark: Rjukan

• Troms og Finnmark: Kirkenes

• Trøndelag: Trondheim

• Vestland: Bergen

• Viken: Sarpsborg


Table 2: Examples of different areas.

Figure 3.3: Example of city center.

Figure 3.4: Example of an industrial area.

Figure 3.5: Example of turfed roofing.

Figure 3.6: Example of residential area.

A combination of QGIS [39] and Python [6] was used to produce the ground truth for each orthophoto. The script imports all the orthophotos from Kartverket into QGIS and exports ground truth images from the vectorized map with the same spatial extent as the orthophotos. Upon inspection, some issues arose, see Table 3.

Mainly, the rectification of the orthophotos was not perfect, and some buildings had imperfect annotations. Also, some buildings lacked annotations completely.

Annotations of rivers were also highly erroneous due to riverbeds naturally shifting. Even though the coast of Norway is somewhat susceptible to changes between high and low tide, I saw little to no issues with the coastal borders. The Python script created samples by iterating over the orthophotos and ground truth maps, extracting 512x512 images from both. Some samples were excluded, such as orthophotos with black backgrounds and images containing only water. The reasoning behind this was to reduce the sheer size of the dataset and skip samples that would not positively affect the model. Nearest-neighbor interpolation was applied to the ground truth examples to provide better masks for training, e.g., hard edges on class borders.
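The sample extraction and exclusion described above could look roughly like the following sketch. It is written in plain Python on nested lists to stay self-contained; the names, the water class id, the non-overlapping tiling, and the black-pixel threshold are all assumptions, since the text does not give these details.

```python
from typing import Iterator, List, Sequence, Tuple

TILE = 512   # sample size stated in the text
WATER = 3    # hypothetical class id for water in the ground truth

Grid = Sequence[Sequence[int]]


def tile_origins(height: int, width: int, tile: int = TILE) -> Iterator[Tuple[int, int]]:
    """Yield top-left corners of non-overlapping tiles that fit inside the raster."""
    for row in range(0, height - tile + 1, tile):
        for col in range(0, width - tile + 1, tile):
            yield row, col


def crop(grid: Grid, row: int, col: int, tile: int = TILE) -> List[List[int]]:
    """Cut a tile x tile window out of a 2D grid."""
    return [list(r[col:col + tile]) for r in grid[row:row + tile]]


def keep_sample(image: Grid, mask: Grid, water: int = WATER,
                max_black_fraction: float = 0.05) -> bool:
    """Reject tiles dominated by black (no-data) pixels or labelled only as water.

    image holds grayscale intensities here for simplicity (an assumption);
    max_black_fraction is an illustrative threshold.
    """
    pixels = [p for r in image for p in r]
    labels = [c for r in mask for c in r]
    black_fraction = sum(1 for p in pixels if p == 0) / len(pixels)
    only_water = all(c == water for c in labels)
    return black_fraction <= max_black_fraction and not only_water


if __name__ == "__main__":
    # A 1024 x 1500 raster yields four full 512 x 512 tiles.
    print(list(tile_origins(1024, 1500)))
```

The same origins are applied to the orthophoto and to its ground truth map so that each image tile stays aligned with its mask.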

Table 3: Ground truth issues.

Figure 3.7: Rectification issue.

Figure 3.8: Missing annotation issue.

Figure 3.9: Inaccuracy in water segmentation.

A makeshift solution for the rectification issues was to adjust the ground truth layer by expanding all building annotations. This expansion does entail that some building annotations extend beyond their target, but the idea was that all buildings are covered. A solution to the missing annotation issue was to use the trained network to scrub the dataset and remove samples with many false positives. The idea is that the network will segment the buildings with missing annotations, producing a large number of false positives when said segmentations are compared to the ground truth, so that these samples can be excluded from the dataset.
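Both workarounds can be illustrated with a small sketch on binary masks. It assumes a square dilation neighbourhood and a false-positive threshold of 10 % of the pixels; neither value comes from the text, and the function names are hypothetical. A real pipeline would likely use an image-processing library rather than nested lists.

```python
from typing import List

Mask = List[List[int]]  # 1 = building, 0 = background


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Expand every building pixel by radius pixels in a square neighbourhood."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def false_positive_fraction(pred: Mask, truth: Mask) -> float:
    """Share of pixels predicted as building but annotated as background."""
    total = len(pred) * len(pred[0])
    false_positives = sum(
        1
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
        if p and not t
    )
    return false_positives / total


def keep_after_scrub(pred: Mask, truth: Mask, max_fp_fraction: float = 0.1) -> bool:
    """Keep a sample only if the prediction has few false positives.

    max_fp_fraction is an illustrative threshold, not a value from the text.
    """
    return false_positive_fraction(pred, truth) <= max_fp_fraction


if __name__ == "__main__":
    truth = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(dilate(truth))  # the single building pixel grows to a 3x3 block
```

A sample whose prediction contains a building that the annotation misses produces a high false-positive fraction and is dropped by keep_after_scrub.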

Figure 3.10: Random sample from the dataset created from Kartverket's data.