University of Bergen Department of informatics

Selecting Maximally Informative Frequency Subsets for Acoustic Surveys

Author: Knut Thormod Aarnes Holager Supervisors: Ketil Malde, Nils Olav Handegard

June, 2022

Abstract

In this thesis, we sought the most informative subset of frequencies to be utilized when classifying sandeel in acoustic data. The intention was to help identify an optimal choice of frequencies if the choice of transducers were restricted by, for example, price or carrying capacity in autonomous vessels. To measure the information lost, we started by producing pseudo labels with an existing automatic acoustic classifier trained to identify sandeel. Then we trained new classifiers based on the same architecture on the same data, but varying subsets of frequencies were used. We could then measure how well these new models could reconstruct the pseudo labels. The F1-score of the highest performing subset, per subset size, increased from a size of one frequency in use (0.34) to two (0.46), and then drastically to three (0.65), after which only marginal improvements were seen. In particular, the subset containing 18kHz, 38kHz, and 200kHz achieved close to the same performance as using the complete set of six frequencies (0.67). Furthermore, the three frequencies mentioned exhibit unique performance compared to the other subsets of equal size.

Acknowledgements

I would foremost like to express my gratitude towards my friends and fellow master’s students: Hans Martin Johansen, Mathias Madslien, Halvor Barndon, Emir Zamwa, Johanna Jøsang, John Isak Villanger, and Gunnar. Thanks to you all for the comradeship, both in the reading room and outside the university, for making each day of this degree enjoyable, and for supporting me throughout these two years. Particular thanks go to my fellow sandwich makers, whose ignorance of conventional mayonnaise limits motivated me to continually hit the gym.

Thanks to my supervisor Ketil Malde and co-supervisor Nils Olav Handegard for helping me formulate my thesis and providing invaluable feedback. Gratitude to all the people who provided support from the University of Bergen and IMR, particularly Ibrahim Umar and Tomasz Furmanek, for without your expert knowledge and willingness to help, this thesis would not have been possible. Finally, a big thanks to my family and girlfriend for your unfaltering support.

Knut Thormod Aarnes Holager 01 June, 2022

List of Figures

1.1 Sandeel
2.1 Echosounder
2.2 Echosounder
2.3 The perceptron and multi-layer perceptron
2.4 ReLU and sigmoid
2.5 Convolutional neural network example
2.6 Horizontal edge detector example
2.7 Receptive field
2.8 Same convolution example
2.9 The max pool operation
2.10 1×1 convolution
2.11 Transposed convolution
2.12 Difference between semantic and instance segmentation
2.13 Learning rates
2.14 Over/under-fit
2.15 Two data augmentation examples
2.16 U-Net architecture
3.1 U-Net architecture
4.1 Module overview
4.2 Module outputs illustration
4.3 Pseudo label
4.4 Data preparation process
4.5 Missing values and bottom
4.6 Data, label and bottom crop extraction and interaction
4.7 Criteria and folder structure
5.1 Loss and F1 score during training
5.2 Example output, threshold and label
5.3 Best frequency combination - F1-score
5.4 Best frequency combination - Precision
5.5 Best frequency combination - Recall
5.6 Error bars per combination
5.7 With and without unique subset
5.8 Performance trend per frequency

List of Tables

4.1 2018 data distribution
4.2 2019 data distribution
4.3 Data loading scheme
4.4 Data augmentation summary
4.5 Experiment hyperparameters
5.1 Summary exhaustive search

Chapter 1 Introduction

This chapter presents the context of the research conducted in this thesis, along with the motivation and focus of the work. Finally, the research question is stated, followed by a short description of each chapter.

1.1 Marine Advisory Work

The International Council for the Exploration of the Sea (ICES) is an intergovernmental science organization and the primary provider of advice on marine ecosystems to the governments and international bodies that manage the North Atlantic Ocean and adjacent seas. It consists of over 6000 marine scientists, over 300 institutes, and 20 member countries, including Norway. Advice is formulated by expert groups who work towards a better understanding of marine ecosystems and their sustainable use [17]. The Norwegian Institute of Marine Research (IMR) is a significant contributor, and one of its activities is to give input to this advice through research and monitoring. An important research area is the monitoring of a species known as the lesser sandeel, which is the area to which this thesis seeks to contribute [42].

1.2 Why the Lesser Sandeel?

The lesser sandeel (Ammodytes marinus), hereafter just sandeel, is a small pelagic¹ fish which resides in sandy-bottomed coastal and shallow ocean waters and feeds mainly on plankton.

It plays a key role in the North Sea, as it has a critical function in the marine ecosystem as forage fish² and as the catch for fisheries. Due to the high level of predation, it lives large parts of its life buried in a sandy seabed, but during the spring feeding season, adults emerge each dawn to create large schools in the upper pelagic layer. Historical data show that changes in their abundance cause bottom-up effects in the ecosystem, causing, for example, breeding failure among several species of seabird [21].

The North Sea is under pressure from various factors, including fishing, coastal construction, maritime transport, oil and gas exploration and production, tourism and recreation, navigation dredging, aggregate³ extraction, military activities, and wind farm construction [18]. Because of the importance of sandeel, the Norwegian IMR has conducted annual acoustic trawl missions since 2005 in sandeel areas of the North Sea [21]. The goal is to monitor the sandeel stock and to provide input and data that help the ICES give scientifically backed advice, including recommendations for fishing quotas [20].

Figure 1.1: A sandeel buried in the sand.

Credit: Original image by Mandy Lindeberg [7]

¹ Pelagic fish inhabit the pelagic zone of ocean or lake waters, living neither close to the bottom nor near the shore.

² Forage fish are small pelagic fish which are preyed on by larger predators.

³ Aggregates are raw materials such as gravel, crushed stone, or sand which are obtained from natural sources.

1.3 IMR’s Acoustic Trawl Surveys

IMR’s acoustic trawl surveys combine acoustic data from echo sounders with biological data from trawl catches. When fisheries conduct acoustic surveys of fish, they use echo sounders to detect and observe targets in the water remotely. Echo sounders are a special variety of sonar, where the acoustic beam produced by a transducer is directed vertically downwards from the measuring platform at a set frequency [52]. The echo sounders in use by the IMR usually capture acoustic data at multiple frequencies in parallel to maximize the information acquired [25]. The data are scrutinized in a post-processing software called Large Scale Survey System (LSSS), where acoustic target classification is done manually with the biological samples as an aid [24]. Afterward, the output LSSS data and biological samples are further processed to estimate the fish abundance and age distribution to support ICES advice [22].

1.4 Acoustic Classification of Sandeel

Acoustic target classification is a significant challenge in most acoustic surveys, and the sandeel is known to be difficult to classify [20, 21]. The earlier mentioned Norwegian sandeel surveys use the 200kHz frequency to delineate schools, while 18kHz and 38kHz are the most important frequencies for classification when measurements are taken at 18kHz, 38kHz, 120kHz, and 200kHz [20]. Some earlier works have had trouble identifying sandeel at 38kHz and 120kHz [13, 30, 39], while others have had success using combinations of the four frequencies 18kHz, 38kHz, 120kHz, and 200kHz to specifically separate sandeel from herring and mackerel [38].

Brautaset et al. [3] successfully applied deep learning methods to automatically classify sandeel in multi-frequency acoustic data from the Norwegian sandeel surveys. They extracted and utilized the four frequencies 18kHz, 38kHz, 120kHz, and 200kHz from the original data, but a problem within deep learning is that feature importance can be difficult to interpret. Hence, it is hard to establish precisely which frequencies contributed the most to their results.

1.5 Unmanned Vehicles in Marine Science

In 2019, Verfuss et al. [55] reviewed the current status of unmanned vehicles suitable for monitoring marine life. The different types of vehicles can be stationary or moving, and can operate on the ocean surface, in the air, or submerged. They can be remotely controlled, autonomous, or a combination of both. Unmanned vehicles can use a wide array of sensors, but will often have restrictions on which sensors they can carry. For example, the number or types of transducers installed on acoustic platforms can quickly become a structural problem, as the size and weight of a transducer increase drastically at lower frequencies⁴. Many of the vessels reviewed are commercially available, which has led to a growth in number and use.

The increase in unmanned vehicles and the possibility of gathering more complex data will lead to an exponential rise in the data collected in marine science [31]. This increases the need to make automated programs, or better tools to aid in the current manual data processing, which today is a major chokepoint. However, the nature of the data gathered may change, possibly in decreasing quality, as humans will have fewer or no opportunities to actively engage in the autonomous information gathering processes. This change in information gathering raises the need to optimize the quality of the data gathered by the unmanned vessels.

During the summer of 2020, an unmanned vessel was tested in Årdalsfjorden in Norway. The vessel was a kayak with an electric motor, fitted with a single 200kHz echosounder to measure sprat abundance. Earlier surveys had indicated that large numbers of sprat live close to the surface, where the traditional research vessels have an acoustic blind zone⁵. The kayak survey showed that the small unmanned vessel managed to measure high densities of sprat in the blind zone. It was also less prone to scare away the fish, as the kayak produced significantly less noise, and its size allowed access to shallower waters. The result was positive for the continued use of unmanned vessels, but the manned vessels are still needed for biological samples [23].

⁴ Two sample transducers sold by Kongsberg Maritime for fisheries at the frequencies 18kHz and 333kHz weigh 85kg and 2kg, respectively [32].

⁵ Echosounders are usually mounted on the bottom of the vessel, creating a blind zone up to the surface. They are often placed on a retractable keel to get underneath a layer of bubbles that can be detrimental to the echosounder’s performance [26].

1.6 Research Question

This thesis aims to create scientific advice concerning which echo sounders should be installed on lightweight unmanned vessels, with regard to the classification of sandeel in acoustic data. The unmanned vessels may have reduced carrying capacity or other limitations for echo sounders, and the choice of which to install to maximize the information collected is essential. To achieve this, we will expand upon the work of Brautaset et al. [3] and measure the information stored in all possible subsets of the frequencies contained in the original data from the Norwegian sandeel surveys in the years 2018 and 2019: 18kHz, 38kHz, 70kHz, 120kHz, 200kHz, and 333kHz. The research question is stated as follows:

“As lightweight unmanned vessels may not have the capacity to carry all six echo sounders the IMR usually deploys on the Norwegian sandeel surveys, which ones should be prioritized with regard to classifying sandeel in multi-frequency acoustic data?”

1.7 Chapter Overview

This section describes the outline of the thesis and, in short, the content of each chapter.

• Chapter 2 - Background: Describes the concepts used in this thesis. First marine acoustics in section 2.1, then machine learning and artificial neural networks in the following sections.

• Chapter 3 - Basis and related work: Describes the work that this thesis builds upon in section 3.1, and details more recent related research in section 3.2.

• Chapter 4 - Material and Methods: Describes the approach to answering the research question. First, our data and the tools utilized, then how the labels were produced combined with data generation, and finally, our experiment.

• Chapter 5 - Results: Describes the results from the experiment, which starts with an evaluation of the training process. Then the results are summarized, followed by different findings in the results for our analysis.

• Chapter 6 - Discussion: Analyses the results and discusses their implication.

• Chapter 7 - Conclusion and Future Work: Describes the answer to the research question, summarizes key findings, and states recommendations for future work.

Chapter 2 Background

This chapter introduces the concepts used during the experiments, within both acoustics and machine learning.

2.1 Acoustics

The echo sounder consists of a transmitter that produces a burst of electrical energy at some set frequency. A transducer then receives the output from the transmitter and converts it to an acoustic signal, also called a ping, that is propagated through the water. This forms a directional beam akin to the light from a handheld flashlight. Targets in the water backscatter (reflect) parts of the energy back towards the transducer. The transducer detects the backscattered sound, or echo, which is converted back to electric energy as the received signal and further amplified. The time elapsed between the transmission of the ping and the reception of the echo determines the range to the target [52].

Pings are usually represented as columns in a 2-dimensional image, also called an echogram, with range along the y-axis and time of ping along the x-axis. The columns represent how the acoustic reflections vary for each ping. Any targets detected in the ping will be visualized as a mark in the echogram, usually with different colors depending on the echo strength. In multi-frequency sonars, individual echograms are produced in parallel for each frequency in use, and this is visualized in figure 2.2. Because the echograms are constructed as time×range, the vertical extent of a mark indicates the height of the target, while the horizontal position illustrates changes in time if the echosounder is stationary, or in space if it is moving. When moving, the echogram thus represents a vertical cross-section of the water column as the transducer moves through the water at constant speed in one direction [52].

[Figure 2.1 diagram labels: Transmitter, Transducer, Receiver/Amplifier, Timer, Display, Ping, School, Seabed; echogram axes: Time, Range]

Figure 2.1: Concept of an echosounder: Pings generate echoes from a school of fish and the seabed, and the echoes are displayed in an echogram.

Figure 2.2: Example echograms from multiple frequencies.

2.1.1 Acoustics and Fish

To measure the strength of the backscattered sound received from a target, the backscattering cross-section σ_bs or the target strength (TS) is used. They are defined as:

$$\sigma_{bs} = r^2 \frac{I_b}{I_i}, \tag{2.1}$$

$$TS = 10 \log_{10}(\sigma_{bs}), \tag{2.2}$$

where I_b is the sound intensity backscattered from the target, I_i is the intensity of the ping at some arbitrary distance (usually 1m), and r is the distance away from the transducer (σ_bs in units of m²). σ_bs can vary greatly depending on the frequency used, the composition, angle and shape of the target, absorption through sound being converted to heat, and several other factors as described in Simmonds and MacLennan [52].
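As a minimal sketch of equations 2.1 and 2.2, the two quantities can be computed as below. The function names and the numerical values are hypothetical and chosen only for illustration; they do not come from the survey data.

```python
import math


def backscattering_cross_section(r, i_b, i_i):
    """Equation 2.1: sigma_bs = r^2 * I_b / I_i (units of m^2)."""
    return r ** 2 * i_b / i_i


def target_strength(sigma_bs):
    """Equation 2.2: TS = 10 * log10(sigma_bs), expressed in decibels."""
    return 10.0 * math.log10(sigma_bs)


# Hypothetical intensities and distance, for illustration only.
sigma = backscattering_cross_section(r=10.0, i_b=1e-6, i_i=1.0)
ts = target_strength(sigma)
print(f"sigma_bs = {sigma:.1e} m^2, TS = {ts:.1f} dB")
```

Because TS is logarithmic, a tenfold increase in σ_bs corresponds to an increase of 10 dB in TS.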

2.1.2 The Volume Backscattering Coefficient

Individual targets in some sampled volume may be small and plentiful, resulting in their echoes combining and forming a continuous backscattered signal with varying amplitude. Single targets are no longer possible to distinguish, but the signal itself is a measure of the biomass in the water column. This is measured using the volume backscattering coefficient (s_v), defined as:

$$s_v = \sum \frac{\sigma_{bs}}{V_0}, \tag{2.3}$$

where the sum is taken over all discrete targets returning echoes in the sampled volume (V_0). There is a linear relationship between the abundance of fish and s_v as long as the species or group of species is known. For more details on s_v, see Simmonds and MacLennan [52].
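The linear relationship between abundance and s_v can be illustrated with a short sketch of equation 2.3. The target count, the per-target σ_bs, and the sampled volume below are hypothetical values, and all targets are assumed to be identical.

```python
import math


def volume_backscattering_coefficient(sigma_bs_values, sampled_volume):
    """Equation 2.3: s_v = sum(sigma_bs) / V_0 (units of m^-1)."""
    return sum(sigma_bs_values) / sampled_volume


# Hypothetical school: 200 identical targets in a 10 m^3 sampled volume.
single = volume_backscattering_coefficient([1e-4] * 200, 10.0)

# Doubling the number of identical targets doubles s_v.
double = volume_backscattering_coefficient([1e-4] * 400, 10.0)
print(single, double)
```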

Furthermore, it is important to exclude the bottom echo when fish are being surveyed. Some fish, like the sandeel, may be found in schools close to the bottom, and if the discrimination of the bottom is poor, there will be large errors in the estimated fish density. For more details on bottom detection and its implications, see Simmonds and MacLennan [52].

2.2 Machine Learning

Machine learning can be split into four parts: the algorithm, empirical data, a task, and a performance measure. A machine learning algorithm is designed to increase performance on a task, given data. During this process, also called training, the algorithm is said to be learning by fitting a model to the data. The two machine learning approaches used in this thesis are supervised learning and unsupervised learning [12].

2.2.1 Algorithm Approaches

2.2.2 Supervised Learning

Supervised learning algorithms are defined by the data consisting of an input and the desired output [12]. The algorithm has to learn a function that maps each input to the correct output. In classification problems, the output is a class label, for example, when distinguishing pictures of cats from pictures of other animals. In regression problems, the output is a value within a numerical range, for example, when predicting the height of a person.

2.2.3 Unsupervised Learning

Unlike supervised learning, unsupervised learning algorithms only receive the input and learn properties contained in the data [12]. A practical example is clustering, where the samples in a dataset are divided into clusters of similar properties.

2.2.4 Data and Features

The quality of the input data to a machine learning algorithm will likely affect its performance [40]. Data must be gathered, integrated, cleaned of errors, preprocessed, and features often extracted before being used in learning. Hence, time allocated to prepare and increase the data quality can exceed the time spent learning. The process of extracting features is often referred to as feature engineering. It constructs a representation of the data with the most important factors to solve the task. This is often domain specialized and usually requires human involvement. In the next section, artificial neural networks (ANN) are introduced, which is one avenue within machine learning that can automate the extraction of complicated feature representations during learning.

2.3 Artificial Neural Networks

This section introduces the basic components of an ANN and how these are combined to create a deep learning network/model.

2.3.1 Perceptron

The ANN’s fundamental building block is called an artificial neuron or perceptron. It consists of a linear regression with the tunable parameters w and b inside a non-linear activation function, explained later in section 2.3.3. The perceptron is formulated in the following way [47]:

$$y = g\left(\sum_{i=1}^{D} w_i x_i + b\right) \tag{2.4}$$

where D is the dimension of the input space, x is the input vector, w is a set of weights of the same size as x, b is the bias, and g is the activation function. The single output value y, also called the neuron’s activation, is the weighted sum of the inputs plus the bias, transformed by the activation function. The perceptron is illustrated in figure 2.3.
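A minimal sketch of equation 2.4 in plain Python follows. The choice of ReLU as the activation g (introduced in section 2.3.3) and the example weights are assumptions made for illustration.

```python
def relu(z):
    """ReLU activation, see section 2.3.3."""
    return max(0.0, z)


def perceptron(x, w, b, g=relu):
    """Equation 2.4: y = g(sum_i w_i * x_i + b)."""
    if len(x) != len(w):
        raise ValueError("x and w must both have length D")
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return g(weighted_sum + b)


# Hypothetical input, weights, and bias with D = 3.
y = perceptron(x=[1.0, 2.0, 3.0], w=[0.5, -1.0, 2.0], b=0.5)
print(y)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```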

2.3.2 Multi-Layered Perceptron

The neurons presented in section 2.3.1 are organized together in layers to form an ANN, which in turn forms what is called a multi-layer perceptron (MLP) [47]. If all neurons in each layer are connected to every neuron in the next layer, they form a type of ANN called fully connected networks. An MLP is depicted in figure 2.3.

Figure 2.3: Illustration of a single neuron and a fully connected deep learning network of neurons. (a) Perceptron/neuron. (b) An MLP with four inputs in the input layer (arrows to the left), two hidden layers (h), and three outputs in the output layer (ŷ).

Credit: Razavi [47]

The architecture of any ANN consists of an input layer, a user-defined number of hidden layers, and finally, an output layer [47]. More hidden layers form a deeper network, hence the name deep learning. An MLP is a type of network called a feed-forward ANN because the data flow from the input to the output layer, and each layer is a function of the previous layer. During training, the weights between the neurons and the biases are optimized in a process that is further explained in section 2.4. In the MLP depicted in figure 2.3, different neurons will activate with varying strengths depending on the input, resulting in different outputs. The architecture of the MLP in figure 2.3 can be expressed as [12]:

$$h^{(1)} = g^{(1)}\left(W^{(1)T} x + b^{(1)}\right) \tag{2.5}$$

$$h^{(2)} = g^{(2)}\left(W^{(2)T} h^{(1)} + b^{(2)}\right) \tag{2.6}$$

$$\hat{y} = g^{(3)}\left(W^{(3)T} h^{(2)} + b^{(3)}\right) \tag{2.7}$$

where for each layer h is a vector of activations, W is a matrix of weights, b is a vector of biases, g is an activation function applied element-wise, and ŷ is a vector of outputs.
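The layered structure of equations 2.5 to 2.7 can be sketched as a loop over layers. In this illustration, each weight matrix is stored with shape inputs × outputs so that the layer computes Wᵀx + b; the hidden layer sizes, the weights, and the identity output activation are hypothetical choices, not values from the thesis.

```python
def relu(v):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, z) for z in v]


def identity(v):
    """Placeholder output activation that leaves the values unchanged."""
    return list(v)


def layer(W, x, b, g):
    """One layer g(W^T x + b), with W stored as (inputs x outputs)."""
    n_in, n_out = len(W), len(W[0])
    if len(x) != n_in or len(b) != n_out:
        raise ValueError("shape mismatch between W, x and b")
    pre = [sum(W[i][j] * x[i] for i in range(n_in)) + b[j] for j in range(n_out)]
    return g(pre)


def mlp_forward(x, params):
    """Feed x through each (W, b, g) triple in turn (equations 2.5-2.7)."""
    h = x
    for W, b, g in params:
        h = layer(W, h, b, g)
    return h


# Hypothetical network: 4 inputs, two hidden layers of size 2, 3 outputs.
params = [
    ([[0.5, 0.5]] * 4, [0.0, 0.0], relu),                         # h1
    ([[1.0, -1.0], [1.0, -1.0]], [0.0, 0.0], relu),               # h2
    ([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]], [0.0, 0.0, 0.0], identity),  # y_hat
]
y_hat = mlp_forward([1.0, 2.0, 3.0, 4.0], params)
print(y_hat)  # [10.0, 20.0, 30.0]
```

In the second hidden layer, the second neuron receives a negative pre-activation and is set to zero by ReLU, so it contributes nothing to the output.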

2.3.3 Activation Functions

The activation function enables ANNs to learn non-linear features [47]. It is needed because a network consisting of only linear layers is equivalent to a single linear layer. Hence, it would not be able to capture non-linearities in the data, and therefore an activation function is required in the hidden layers. Some activation functions can also be applied to the network’s output, in which case they must be suited to the task the network is set to perform [12]. The activation function commonly used in the hidden layers is ReLU, which stands for rectified linear unit [51]. ReLU takes a real number as input and outputs this number if it is above zero; otherwise, it outputs zero. Letting g denote the activation function and x the input, the ReLU activation function can be formulated as follows:

g(x) = max(0, x) (2.8)

ReLU is also visualized in figure 2.4. ReLU activates some neurons to propagate their input while preventing others from doing so. This can result in greater efficiency and faster training, as not all neurons are active, as further detailed in Sharma [51].

Another activation function is the logistic sigmoid, which transforms all input values to values between 0 and 1 [51]. It can be applied to the output layer to solve binary classification problems, as the values can be treated as probabilities. The formula for the logistic sigmoid is:

$$g(x) = \frac{1}{1 + e^{-x}} \tag{2.9}$$

Figure 2.4: A ReLU function (blue) and a sigmoid function (red).
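Equations 2.8 and 2.9 can be written as the short sketch below. The branch on the sign of x in the sigmoid is an implementation detail added here to avoid overflow for large negative inputs; both branches are algebraically equal to equation 2.9.

```python
import math


def relu(x):
    """Equation 2.8: g(x) = max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Equation 2.9: g(x) = 1 / (1 + e^-x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


for value in (-5.0, 0.0, 5.0):
    print(value, relu(value), sigmoid(value))
```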

The Softmax Function

The softmax can be viewed as a multivariate version of the logistic sigmoid activation function, which allows the softmax to be applied to problems containing multiple classes [51]. For each data point, it calculates the probability of every class and can be expressed as:

$$g(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}} \quad \text{for } j = 1, \ldots, K, \tag{2.10}$$

where K is the number of classes, and the outputs sum to 1 over all classes.

For a network solving multiclass classification, the output layer will have size equal to K. This corresponds to 3 classes in figure 2.3. The softmax would then be used as the last transformation (g⁽³⁾ in equation 2.7) and output the probability of an input belonging to each of the three classes.
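A minimal sketch of equation 2.10 for K = 3 is given below. Subtracting the maximum before exponentiating is a common numerical safeguard that is not part of the equation itself; it leaves the result unchanged because the same factor cancels in the numerator and denominator. The input scores are hypothetical.

```python
import math


def softmax(x):
    """Equation 2.10: g(x)_j = e^{x_j} / sum_k e^{x_k}."""
    m = max(x)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output-layer values for three classes.
probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

With K = 2 and one input fixed at zero, the first softmax output equals the logistic sigmoid of the other input, which is the sense in which softmax generalizes the sigmoid.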

2.3.4 Convolutional Neural Network

The convolutional neural network (CNN) is a type of ANN primarily used in machine learning tasks concerning images [43]. One of the reasons why the CNN emerged is that images input to a regular ANN produce a large number of learnable parameters. For example, a low-resolution image with 512×512 pixels passed to a hidden layer containing only one neuron would require 1×512×512 = 262144 weights alone. To solve this issue and have fewer learnable parameters, the modern CNN is built around three main components: a convolutional layer, a pooling layer, and a fully-connected layer. An example CNN is illustrated in figure 2.5, and each main component will be explained later in this section.

[Figure 2.5 diagram labels: Input, Convolutional layer + ReLU (kernel, stride, activation maps from different kernels), Pooling layer, Flatten, Fully connected layer, Softmax, Classes]

Figure 2.5: Illustrations of the main components in a CNN.

Convolutional Layer

O’Shea and Nash [43] describe the convolutional layer as consisting of many learnable multidimensional weight matrices that slide over the input. We will refer to such a matrix as a kernel. The kernel’s height and width are parameters defined by the user, but the depth will always be equal to the number of channels in the input. This results in kernels being described only by height×width. The kernel slides over the input and is applied to different locations of the input; the region currently covered is called the receptive field. Each time the kernel is applied, a single scalar value is computed, which is the weighted sum of the kernel’s weights and the corresponding values in the receptive field. If we have a 2-dimensional image I(i, j) as input, a convolutional operation is expressed as [12]:

$$(I * K)(i, j) = \sum_{\text{height}} \sum_{\text{width}} I(i - \text{height},\, j - \text{width})\, K(\text{height}, \text{width}) \tag{2.11}$$

where ∗ is the convolutional operation, I(i,j) is the image pixel at (i,j), and K is the kernel.

The output scalar value from the convolutional operation is usually fed through a non-linear activation function like ReLU and is then called the activation. The sliding operation is based on a value called stride, which is the number of positions to move the kernel in the input between each calculation. Suppose we start from the left and move horizontally to the right until the kernel no longer fits inside the input. In this case, the kernel will, if possible, move down vertically by a number of rows equal to the stride and then continue horizontally, starting afresh from the left. After sliding over the entire input, a complete 2-dimensional activation map, also called a feature map, has been created, one such map for each kernel applied. The idea is that each applied kernel will learn to identify different features in the input. An example of a horizontal edge detector can be seen in figure 2.6 [43].
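The sliding procedure can be sketched for a single channel as below. Note that this sketch computes the sliding weighted sum described in the text without flipping the kernel (a cross-correlation), which is how deep learning libraries usually implement the operation, whereas equation 2.11 indexes the kernel in flipped order. The input and kernel are assumptions chosen to resemble a horizontal edge detector on a 6×6 input, and no activation function is applied.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image and return the activation map.

    Only positions where the kernel fits entirely inside the image are used.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for top in range(0, ih - kh + 1, stride):        # move down by the stride
        row = []
        for left in range(0, iw - kw + 1, stride):   # move right by the stride
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    acc += image[top + a][left + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


# Assumed 6x6 input: three rows of 0 above three rows of 10.
image = [[0] * 6 for _ in range(3)] + [[10] * 6 for _ in range(3)]
# Assumed 3x3 horizontal edge detector.
kernel = [[-1, -1, -1],
          [0, 0, 0],
          [1, 1, 1]]

activation_map = conv2d_valid(image, kernel, stride=1)
for row in activation_map:
    print(row)
```

The resulting 4×4 map has large values only in the two rows whose receptive fields straddle the border between the 0 and 10 regions.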

Figure 2.6: Illustration of a valid (detailed later in this section) convolutional operation. The kernel is applied repeatedly across the input. The input size is 6×6, the kernel size is 3×3, and the stride is 1, resulting in overlapping operations and an output size of 4×4. The figures to the right show the input, kernel, and output (activation) as color gradings, where the color gets darker if the values are low. This example is a horizontal edge detector, and the result is large values in the activation map along the border between the values of 0 and 10 in the input, which could have been colors in a picture.

The receptive field will start as small regions, but as we apply additional convolutional layers, it will have access to increasing context in regard to the input [12]. This is illustrated in figure 2.7: kernels in early layers learn to identify simple features, while later layers combine these to identify complex features. The kernels utilize parameter sharing, as the same weights are repeatedly used across the input. Furthermore, the kernels are often smaller than the input, resulting in sparse connections as opposed to fully connected networks. The parameter sharing results in the CNN having another useful attribute called equivariance to translation, which means that if the input is shifted, the output shifts in the same way.

Figure 2.7: The activation maps from two convolutional layers with 3×3 kernels and stride 1. The first convolution’s receptive field is marked as red. On its activation map, a new convolutional layer is applied. Its first receptive field is outlined in green, which translates to a larger area in the input.

Credit: Original image by Nick Hobgood [6], used as input picture above (Edited with colored grid).

The application of CNNs on acoustic data can be motivated by the echograms being similar to regular images, with the RGB color channels replaced by the different frequency channels measured. However, the same object will be represented differently in echograms at varying ranges from the transducer, possibly similar to objects of another class [52]. This likely increases the complexity of target classification in acoustic data, as the benefits of equivariance are not as applicable.

Reductions in the spatial size will normally occur with the convolutional operation described in this section, and such operations are called valid convolutions [43]. By applying padding with zeros around the input, we can retain the dimensions of the input. The effect is that more convolutional operations fit in the new padded input, so the output size equals the input size. This is called a same convolution, illustrated in figure 2.8.
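The difference between valid and same convolutions can be checked with the standard output-size relation ⌊(n + 2p − k)/s⌋ + 1, where n is the input size, k the kernel size, s the stride, and p the padding on each side. This relation is not stated in the text above and is included here as a sketch along one spatial dimension.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1


# Valid convolution from figure 2.6: 6x6 input, 3x3 kernel, stride 1.
print(conv_output_size(6, 3))             # 4

# Same convolution from figure 2.8: 3x3 input padded by 1, 3x3 kernel.
print(conv_output_size(3, 3, padding=1))  # 3
```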

Figure 2.8: Illustration of a same convolutional operation. The input size is 3×3, but after padding with zeros, the size is 5×5; the kernel size is 3×3, and the stride is 1. This results in an activation map size of 3×3, conserving the input size.

The Max Pool Layer

Figure 2.9: Illustrates the max pool operation with size 2×2 and stride 2.

The max pool layer reduces the height and width of its input [43]. Like a convolutional operation, the max pool looks at an input region but instead applies a max operation. The pooling kernel size is given in height×width and is applied individually to each channel of the input. This reduces the height and width but preserves the number of channels. The most common max pool layer is a 2×2 with a stride of 2. On its own, the max pool has no learnable parameters and is applied to decrease the computational complexity of the CNN.
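A minimal sketch of the max pool operation follows. The 4×4 channel values are hypothetical and are not taken from figure 2.9; the volume is assumed to be indexed as volume[channel][row][column].

```python
def max_pool2d(channel, size=2, stride=2):
    """Max pool a single 2-dimensional channel."""
    h, w = len(channel), len(channel[0])
    return [
        [
            max(channel[top + a][left + b] for a in range(size) for b in range(size))
            for left in range(0, w - size + 1, stride)
        ]
        for top in range(0, h - size + 1, stride)
    ]


def max_pool(volume, size=2, stride=2):
    """Apply the pooling independently to each channel of the input."""
    return [max_pool2d(channel, size, stride) for channel in volume]


# Hypothetical 4x4 channel.
channel = [[1, 4, 2, 1],
           [5, 2, 1, 7],
           [2, 8, 5, 1],
           [2, 9, 1, 2]]

pooled = max_pool2d(channel)
print(pooled)  # [[5, 7], [9, 5]]

# Two channels in, two channels out: the depth is preserved.
print(len(max_pool([channel, channel])))
```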

Fully Connected Layer

[Figure 2.10 panel labels: Input (3×3×3), Kernel (1×1×3), Output (3×3)]

Figure 2.10: Illustration of the 1×1 convolution with stride 1. Yellow cells demonstrate the application of a 1×1 kernel along all channels of the input, producing one activation in the output.

Lin et al. [29] proposed the convolutional layer with kernel size 1×1 and stride 1, followed by an activation function. The 1×1 layer takes the weighted sum along a 1×1 slice through all channels of the input, as illustrated in figure 2.10. This is equivalent to applying a fully connected layer to the same values. As this preserves the resolution, it can be used to alter the depth of the output feature maps by specifying the desired number of kernels, while also introducing non-linearity. In figure 2.10, we have only one kernel, but if two were applied instead, the output feature map would have a depth of two. In this work, it is used mainly to map high-dimensional feature maps to the desired number of classes.
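The following sketch illustrates how a 1×1 convolution changes the depth of a feature map while preserving its resolution. The function name `conv1x1`, the input values, the kernel weights, and the choice of ReLU as the activation function are all assumptions made for illustration.

```python
def relu(value):
    return max(0.0, value)


def conv1x1(feature_map, kernels, biases):
    """Apply 1x1 kernels to feature_map[channel][row][col].

    Each kernel holds one weight per input channel and produces one
    output channel: a weighted sum through the depth, plus a bias,
    followed by the activation function.
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    output = []
    for kernel, bias in zip(kernels, biases):
        plane = [
            [
                relu(sum(kernel[c] * feature_map[c][r][col]
                         for c in range(channels)) + bias)
                for col in range(width)
            ]
            for r in range(height)
        ]
        output.append(plane)
    return output


# Hypothetical input: 3 channels of size 2x2.
feature_map = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
    [[2, 2], [2, 2]],
]
# Two kernels, so the output depth is two.
kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
biases = [0.0, 0.0]

output = conv1x1(feature_map, kernels, biases)
```

Each output pixel only depends on the input values at the same position, which is why the layer behaves like a fully connected layer applied per pixel.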


Transposed Convolutions

[Figure 2.11 panels: Input, Kernel, Output]

Figure 2.11: Illustration of the transposed convolution operation with kernel size 2×2 and stride 1. The green color shows one of the intermediate computations. The center value of each crop is outlined to illustrate the summation step as these overlap.

A transposed convolution takes an input and a kernel similar to that described in 2.3.4, but instead maps the input to a higher resolution [9]. In the example in figure 2.11, a 2-dimensional 2×2 input is fed to a transposed convolutional layer with kernel size 2×2. Each input value is multiplied with the whole kernel, and the products are placed in a temporary matrix initialized with zeros, denoted by empty cells in the figure.

We do not use the temporary values in practice, but they are used here to illustrate the intermediate computations. The calculated values in the temporary matrix are situated correctly relative to the input. These temporary matrices are then summed over, producing the output. This operation is repeated for all channels, retaining the depth of the input.
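The summation of overlapping intermediate results can be sketched for a single channel as follows. Instead of storing one temporary matrix per input value, the sketch accumulates the products directly in the output, which gives the same result. The input and kernel values are hypothetical and not those of figure 2.11.

```python
def transposed_conv2d(image, kernel, stride=1):
    """Transposed convolution of one 2D channel given as a list of rows."""
    in_h, in_w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + kh
    out_w = (in_w - 1) * stride + kw
    output = [[0] * out_w for _ in range(out_h)]
    for r in range(in_h):
        for c in range(in_w):
            # Scale the whole kernel by one input value and add it to the
            # output at the position that corresponds to that input value.
            for i in range(kh):
                for j in range(kw):
                    output[r * stride + i][c * stride + j] += image[r][c] * kernel[i][j]
    return output


image = [[1, 2], [3, 4]]    # hypothetical 2x2 input
kernel = [[1, 0], [0, 1]]   # hypothetical 2x2 kernel

upsampled = transposed_conv2d(image, kernel, stride=1)    # 2x2 -> 3x3, overlaps are summed
upsampled_s2 = transposed_conv2d(image, kernel, stride=2) # 2x2 -> 4x4, no overlap
```

With stride 1 the scaled kernels overlap, so the center of the 3×3 output receives contributions from more than one input value. With stride 2 and a 2×2 kernel there is no overlap and the resolution is doubled.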


Segmentation

[Figure 2.12 panels: Input picture; Semantic segmentation (red = clown fish class, blue = background class); Instance segmentation (red = clown fish instance 1, yellow = clown fish instance 2)]

Figure 2.12: Illustration of the difference between semantic and instance segmentation.

Credit: Original image (Input picture above) by Nick Hobgood [6]

Segmentation is a task where the objective is to assign one or several classification masks to the input, usually a picture [14]. This is further split into two categories: semantic and instance segmentation. In semantic segmentation, we assign each pixel in the input to one of a set of predefined classes. The output has the same resolution as the input but with a number of channels equal to the number of classes. A softmax is then calculated for each pixel across the depth, and the pixel is assigned to the class with the highest probability, producing a mask for each class. In instance segmentation, we increase the complexity by applying semantic segmentation while simultaneously assigning a bounding box to each object, as visualized in figure 2.12.
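The per-pixel softmax and class assignment used in semantic segmentation can be sketched as below. The layout `score_map[class][row][col]`, the function names, and the score values are assumptions made for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segment(score_map):
    """Turn score_map[class][row][col] into per-pixel labels and one mask per class."""
    n_classes = len(score_map)
    height, width = len(score_map[0]), len(score_map[0][0])
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            # Softmax across the depth (the class channels) of one pixel.
            probs = softmax([score_map[k][r][c] for k in range(n_classes)])
            row.append(probs.index(max(probs)))
        labels.append(row)
    masks = [
        [[int(label == k) for label in row] for row in labels]
        for k in range(n_classes)
    ]
    return labels, masks


# Hypothetical network output for 2 classes on a 2x2 image.
score_map = [
    [[2.0, 0.1], [0.3, 1.5]],   # class 0
    [[1.0, 3.0], [0.2, 1.6]],   # class 1
]
labels, masks = segment(score_map)
```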

2.4 Training Neural Networks

This section will describe the main concepts behind how neural networks are trained to perform on various tasks.

2.4.1 Forward-Propagation and The Loss Function

The objective of an ANN is to approximate some optimal function f, and in this thesis, we focus on classifiers, y = f(x), which map an input x to an output category y. The ANN approximates this function by defining a mapping, ŷ = f̂(x, θ), and learns the values of the parameters θ (weights and biases) through training using examples. In supervised tasks, the labels instruct the output layer exactly how to perform given the input data. However, the data does not inform the individual hidden layers how to behave to produce this desired output. When the data flows through the network using the parameters θ, it produces the outputs ŷ; this is called the forward-propagation step. How the parameters are initialized can heavily affect the training process, and different strategies are further described in Goodfellow et al. [12].

Using a loss function to compare the true values y to the estimated values ŷ, we measure the network's error, also called the loss. The network then uses this loss to alter θ to best approximate f, which is explained in section 2.4.3. In classification tasks, the network is trained to output the probability of each class given an input [15]. We can use a loss function called weighted cross entropy to train such a model. This function outputs a loss based on the predicted probabilities and weights the classification of certain classes differently. It is often used when dealing with data containing class imbalance, as more weight can be applied to the minority class. It is expressed as:

\[
\text{loss}(x, y) = -\sum_{k=1}^{K} w_k \, y_k \log(\hat{y}_k), \qquad \hat{y} = \hat{f}(x, \theta) \tag{2.12}
\]

where K is the number of classes, w_k is the weight for class k, y_k is the target label for class k, and ŷ_k is the predicted probability for class k.

More examples of loss functions can be viewed in Mishra and Gupta [37].
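As a small worked example of equation 2.12, the sketch below computes the weighted cross entropy for one sample. The class weights, the target, and the predicted probabilities are hypothetical, and the predicted probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def weighted_cross_entropy(y_true, y_pred, weights):
    """Weighted cross entropy for one sample.

    y_true:  one-hot target label, one entry per class
    y_pred:  predicted class probabilities (assumed strictly positive)
    weights: one weight per class
    """
    return -sum(w * y * math.log(p) for w, y, p in zip(weights, y_true, y_pred))


y_true = [0, 1, 0]           # the sample belongs to class 1
y_pred = [0.2, 0.5, 0.3]     # hypothetical network output
uniform = [1.0, 1.0, 1.0]
weighted = [1.0, 2.0, 1.0]   # class 1 treated as the minority class

loss_uniform = weighted_cross_entropy(y_true, y_pred, uniform)
loss_weighted = weighted_cross_entropy(y_true, y_pred, weighted)
```

Doubling the weight of the true class doubles the loss for this sample, so mistakes on that class contribute more to the parameter updates.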

2.4.2 Mini-batch Stochastic Gradient Descent

“A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.” - (Goodfellow et al. [12] 2016, page 147)

Calculating the total loss over the whole dataset is often infeasible, and depending on the hardware, this could lead to a crash or slow learning due to heavy memory demands. A solution is to sample a set of examples, called a mini-batch, from the entire dataset, with the intent of approximating the true loss using this smaller fraction of the dataset. Then we update the parameters of our network based on this estimate and repeat on a new mini-batch. This is called mini-batch stochastic gradient descent (SGD), a common optimization algorithm. The size of the mini-batch can vary from one example to hundreds. When we have run this process on all the data, we say that an epoch has passed [12].

2.4.3 Back-Propagation and Gradient-based Learning

To update the network parameters, we use the loss from a mini-batch and iteratively step back through the layers in a process called back-propagation [49]. In each step, we calculate the gradient of the loss function with respect to the parameters of the current layer by using the chain rule. This determines how changes to each parameter will affect the loss. Using the gradient, SGD performs gradient descent [12] by updating all parameters in the opposite direction of the gradient to reduce the loss. How much a parameter is adjusted by the optimization algorithm is determined by the learning rate, usually a value between 0 and 1. The parameter update is expressed as [19]:

\[
\theta^{(j)} \leftarrow \theta^{(j)} - \eta \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \, \text{loss}(x_i, y_i)}{\partial \theta^{(j)}} \tag{2.13}
\]

where θ is the parameters of the network, m is the mini-batch size, η is the learning rate, and j is the layer. Figure 2.13 illustrates an example loss function with one global minimum and different learning rates applied with SGD. Low learning rates usually lead to long training times and may cause SGD to converge to a local minimum instead of the global one [10]. However, too high values can overshoot the global minimum and diverge. Both can be mitigated by applying a method that adapts the update step to the topography of the loss function. This thesis applied momentum, which acts as a velocity on the update step. The velocity is based on past steps, and the update steps in the direction of the velocity, not of the current gradient. More detail on momentum can be found in Sutskever et al. [53].
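The idea of stepping along a velocity can be sketched with one common formulation of classical momentum, in which the velocity is a decaying sum of past gradient steps. The exact variant and hyperparameters used in this thesis are not stated in this section, so the function `sgd_momentum_step`, the momentum coefficient 0.9, the learning rate 0.1, and the toy loss θ² are all assumptions.

```python
def sgd_momentum_step(theta, velocity, grad, learning_rate=0.1, momentum=0.9):
    """One parameter update with classical momentum (assumed formulation)."""
    # The velocity keeps a fraction of its previous value and adds the new gradient step.
    velocity = momentum * velocity - learning_rate * grad
    # The parameter moves along the velocity, not along the current gradient alone.
    return theta + velocity, velocity


# Toy loss(theta) = theta**2 with gradient 2 * theta and its minimum at theta = 0.
theta, velocity = 5.0, 0.0
for _ in range(200):
    grad = 2.0 * theta
    theta, velocity = sgd_momentum_step(theta, velocity, grad)
```

Setting `momentum=0.0` recovers the plain gradient descent update of equation 2.13 for a single parameter.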

[Figure 2.13 panels: (a), (b), and (c), each with the starting point marked Start]

Figure 2.13: Three different applications of SGD on a loss function. Each arrow is an imagined learning step taken by the algorithm for (a) a low learning rate, (b) a high learning rate, and (c) momentum.

In summary, the entire training process using SGD can be described as the following algorithm [10]:

Mini-batch SGD, one epoch. Loop:

1. Sample a mini-batch of data.

2. Forward propagate the mini-batch through the network and compute the loss.

3. Back propagate to calculate the gradients.

4. Update the parameters based on the gradients.
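The four steps above can be sketched for a deliberately small model: a single weight w fitted to data generated from y = 3x with a mean squared error loss. The model, the data, the loss, and the hyperparameters are hypothetical and were chosen only to keep the example self-contained; the thesis trains a CNN with a weighted cross entropy loss instead.

```python
import random


def run_epoch(data, w, learning_rate, batch_size, rng):
    """Run mini-batch SGD over all the data once and return the updated weight."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    loss = 0.0
    for start in range(0, len(indices), batch_size):
        # 1. Sample a mini-batch of data.
        batch = [data[i] for i in indices[start:start + batch_size]]
        # 2. Forward propagate and compute the loss (mean squared error).
        predictions = [w * x for x, _ in batch]
        loss = sum((p - y) ** 2 for p, (_, y) in zip(predictions, batch)) / len(batch)
        # 3. Calculate the gradient of the loss with respect to w.
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(predictions, batch)) / len(batch)
        # 4. Update the parameter in the opposite direction of the gradient.
        w -= learning_rate * grad
    return w, loss


rng = random.Random(0)
data = [(k / 10, 3.0 * k / 10) for k in range(1, 21)]   # samples of y = 3x

w = 0.0
for epoch in range(50):
    w, last_loss = run_epoch(data, w, learning_rate=0.05, batch_size=4, rng=rng)
```

Shuffling the indices and slicing them into consecutive batches ensures that every example is used exactly once per epoch.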

2.5 Model Evaluation

To evaluate a machine learning algorithm, we need a performance measure. First, the performance metrics themselves will be described, followed by a technique applied to make unbiased measurements of the model by leaving out parts of the data. Finally, we describe two central challenges that appear in machine learning.


2.5.1 Performance Metrics

This section describes the performance metrics used in this thesis. Consider a binary classification system that classifies samples into either the positive or negative class. Predictions by the classifier can thus be sorted into the following four categories [45]:

• True positive (TP): A correct classification of a positive example.

• True negative (TN): A correct classification of a negative example.

• False positive (FP): A negative example incorrectly classified as positive.

• False negative (FN): A positive example incorrectly classified as negative.

We can now calculate the performance of the classifier from these values, and the simplest is accuracy [45]:

\[
\text{accuracy} = \frac{\text{correct predictions}}{\text{total number of predictions}} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2.14}
\]

This metric does not handle class imbalance well, as it only measures the percentage of correct predictions [45]. For example, if 95% of the data belongs to one class, then always predicting this class will give us an accuracy of 95%.

We calculate two new metrics to better deal with class imbalance: precision and recall [45]. Precision is the percentage of positive predictions made by the model that are correct. Recall is the percentage of all positive samples the model managed to classify correctly.

\[
\text{precision} = \frac{TP}{TP + FP} \tag{2.15}
\]

\[
\text{recall} = \frac{TP}{TP + FN} \tag{2.16}
\]

Then, by using precision and recall, we calculate the F1-score [45]. It combines these metrics and is designed to work well on imbalanced data. The F1-score formula:

\[
\text{F1-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.17}
\]
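Equations 2.14 to 2.17 can be checked on the class imbalance example from the text with the sketch below. The label vectors are hypothetical, and returning 0 when a denominator is zero is an assumed convention, not something prescribed by the thesis.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, and FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    # Assumed convention: a metric with a zero denominator is reported as 0.
    return numerator / denominator if denominator else 0.0


def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return accuracy, precision, recall, f1


# 95% of the samples are negative, and the classifier always predicts negative.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
acc_naive, prec_naive, rec_naive, f1_naive = metrics(y_true, always_negative)

# A hypothetical classifier that finds 4 of the 5 positives with 2 false alarms.
better = [0] * 93 + [1] * 2 + [1] * 4 + [0]
acc_better, prec_better, rec_better, f1_better = metrics(y_true, better)
```

The always-negative classifier reaches an accuracy of 0.95 but an F1-score of 0, while the second classifier has an only slightly higher accuracy and a clearly higher F1-score.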


2.5.2 Train-Validation-Test Split

When a machine learning model is learning, the goal is to achieve the lowest generalization error. This means performing well not only on data seen during training, but also on new, unseen data. To measure this error, it is normal to split the data into three parts: the training, validation, and test datasets, and we measure some error or metric on each. As the name suggests, the training data is used during the training process of the model. The validation dataset is extracted from the training dataset, gives an unbiased estimate of the model's performance, and can be used to guide the training process, for example by selecting the best model among several candidates. Finally, the test dataset is used to get an unbiased estimate of the final model's generalization error [12].
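A minimal sketch of such a split is shown below. It assigns sample indices to the three datasets at random; the function name, the fractions, and the random strategy are assumptions for illustration, and other strategies, such as splitting by survey year, follow the same idea of keeping the three sets disjoint.

```python
import random


def split_indices(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Randomly split sample indices into disjoint train, validation, and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    n_val = round(n_samples * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(100)
```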

2.5.3 Overfitting Vs. Underfitting

A model’s performance depends on the difference between its training and test error. Underfitting happens when a model fails to achieve a low training error, while overfitting happens when the training error is significantly lower than the test error. To manipulate this behavior, we adjust the model’s capacity. Capacity represents the variety of functions the model can learn, and by adjusting it, one can increase and decrease the likelihood of underfitting and overfitting. The capacity can be controlled by, for example, changing the number of layers in a neural network, and further details can be seen in Goodfellow et al. [12]. Low capacity means that the model may fail to capture patterns in the data. High capacity translates to adjusting to the training data to such an extent that the model performance may be worse when given unseen test data. The optimal solution, depicted in the center plot of figure 2.14, is to have a model with a balanced capacity that is as close to the true function as possible [12].

[Figure 2.14 panels: Underfit, Just right, Overfit; axes: x, y; legend: Trained model, True function, Training data]

Figure 2.14: Training data is generated with random noise around a sine wave (True function). Model capacity increases from left to right. The center plot illustrates a model that has learned an almost perfect fit to the true function.

2.6 Regularization

Regularization, as described by Kukačka et al. [28], is any supplemental technique with the goal of increasing the model’s generalization performance. Two techniques used in this thesis will be described here: batch normalization and data-augmentation.

2.6.1 Batch-Norm

Batch normalization is a technique applied to reduce what is called internal covariate shift [19]. This is defined as the change in the distribution of activations in hidden layers caused by the change in the network's parameters during training. During training, each hidden layer depends on the activations of all layers before it. As each layer changes its output distribution, the other layers must adapt to this change. Research has shown that this slows down and destabilizes the training process, and a solution to this problem is the implementation of batch normalization. Using the activations from all the neurons in a hidden layer, a mean and variance are calculated per mini-batch. These values are then used to normalize the activations of the hidden layer. Each hidden layer is given two additional learnable parameters, γ and β, that perform a linear transformation of the normalized activations, defined as:

\[
\hat{Z}^{(i)} = \gamma Z_{\text{norm}}^{(i)} + \beta \tag{2.18}
\]

where Ẑ^(i) is the batch normalized activations and Z_norm^(i) is the normalized activations for the i-th hidden layer. The learnable parameters enable the ANN to scale and shift the distribution through the training process. The result may be a faster and more stable training process.
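The normalization and the linear transformation in equation 2.18 can be sketched for the activations of a single neuron over one mini-batch. Computing the statistics per neuron and adding a small constant `eps` to the variance for numerical stability are assumptions here, as are the values of γ, β, and the activations.

```python
import math


def batch_norm(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalize one neuron's activations over a mini-batch."""
    m = len(activations)
    mean = sum(activations) / m
    variance = sum((a - mean) ** 2 for a in activations) / m
    # Normalize to approximately zero mean and unit variance.
    normalized = [(a - mean) / math.sqrt(variance + eps) for a in activations]
    # Learnable scale (gamma) and shift (beta), as in equation 2.18.
    return [gamma * z + beta for z in normalized]


# Hypothetical activations of one neuron for a mini-batch of four samples.
activations = [1.0, 2.0, 3.0, 4.0]
output = batch_norm(activations, gamma=2.0, beta=0.5)

out_mean = sum(output) / len(output)
out_variance = sum((o - out_mean) ** 2 for o in output) / len(output)
```

After the transformation, the mean of the output is close to β and its variance is close to γ², regardless of the scale of the incoming activations.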

Recently, the understanding of why batch normalization works has been called into question, and Santurkar et al. [50] have stressed that more investigation should be put into understanding its effectiveness. Their findings show that its effectiveness might not stem from reducing internal covariate shift but likely from other factors.


2.6.2 Data-Augmentation

Data-augmentation is a regularization method applied directly to the training dataset by applying some transformation [28]. Several methods are available, but the two used in this work are adding noise and flipping. One example of applying noise is to add Gaussian values with a mean of 0 and some user-defined variance to each pixel. This adds more randomness to the data, making the model learn more general features instead of specific ones. The network is less prone to overfit on certain samples, which in turn might increase generalization performance. Other methods of applying noise are described in Kukačka et al. [28]. Flipping is a simple transformation where we flip the input and label along a particular axis, thus mapping the data to a new representation. An example of each method mentioned can be seen in figure 2.15:

(a) Gaussian noise is added to each pixel.

(b) Vertical flipping.

Figure 2.15: Two augmentation methods applied to the same image.

Credit: Original image (Both left pictures above) by Nick Hobgood [6]
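The two augmentation methods can be sketched on a small single-channel image stored as a list of rows. The function names, the noise level, and the image values are hypothetical, and a vertical flip is assumed here to mean reversing the order of the rows. The noise is added to the input only, while the flip is applied to both the input and its label so that they stay aligned.

```python
import random


def add_gaussian_noise(image, std, rng):
    """Add zero-mean Gaussian noise with standard deviation `std` to each pixel."""
    return [[pixel + rng.gauss(0.0, std) for pixel in row] for row in image]


def vertical_flip(image, label):
    """Flip the input and its label mask upside down (assumed meaning of vertical flip)."""
    return image[::-1], label[::-1]


rng = random.Random(0)
image = [[1, 2], [3, 4], [5, 6]]   # hypothetical 3x2 input
label = [[0, 0], [0, 1], [1, 1]]   # hypothetical label mask of the same size

noisy = add_gaussian_noise(image, std=0.1, rng=rng)
flipped_image, flipped_label = vertical_flip(image, label)
```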


2.7 U-Net

In this section, we introduce the architecture of the model that is the backbone of the work in this thesis. U-Net is a fully convolutional, state-of-the-art [46] semantic segmentation CNN initially developed for biomedical image analysis by Ronneberger et al. [48].

[Figure 2.16 diagram: input image (572×572, 1 channel); feature channels 64, 128, 256, 512, and 1024 along the contracting path and back down to 64 along the expanding path; output segmentation map (388×388, 2 classes). Legend: conv 3×3, ReLU; conv 1×1; max pool 2×2; transpose conv 2×2; crop, copy and concatenate.]

Figure 2.16: The U-Net architecture, the downwards facing arrow illustrates the contracting path and the one facing upwards is the expanding path. For each block, the vertical number is the resolution, while the horizontal is the number of feature channels. The color gets darker as the channels increase.

Credit: Ronneberger et al. [48]

U-Net utilizes what Ronneberger et al. [48] called a contracting path to identify what is in a picture, while an expanding path localizes where it is. These two branches are more or less symmetrical, and together they form a U-shape, giving the network its name. The contracting path can be viewed as five different stages of processing, from top to bottom, in figure 2.16. Each stage consists of two 3×3 valid convolutions, each followed by a ReLU activation function. Initially, the feature channels are increased to 64, and the channels are doubled for each subsequent contracting stage. The convolutions are followed by a 2×2 max pooling operation with stride 2 to further decrease the resolution of the output from the convolutional operations, which then passes a feature map to the next stage. After the bottom stage, the max pool operation is replaced with transposed convolutions to instead increase the resolution. At each subsequent stage traveling back up the expanding path, the number of feature channels is halved by the convolutional steps, down to 64.

In the expanding path, the previous stage’s output is concatenated with a crop of the output feature map from the contracting-path stage with the same number of channels; the cropping is needed because the resolutions differ. This step allows the following expanding path convolutional operation to access both the high-resolution feature map from the contracting path and the upsampled feature map, which in combination helps with the localization of features [48].

At the final layer in the expanding path, a 1×1 convolution maps the 64 feature channels to the user-defined number of classes, which is two in figure 2.16. Finally, the softmax is calculated across these classes, giving each pixel a probability distribution over the classes, with one channel for each class. This gives us a segmentation map [48].
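The resolutions in figure 2.16 follow from simple arithmetic: each 3×3 valid convolution removes two pixels, each 2×2 max pool halves the size, and each 2×2 transposed convolution with stride 2 doubles it. The sketch below traces this for one spatial dimension; the function name and its parameters are hypothetical.

```python
def unet_output_size(input_size, stages=4, same_padding=False):
    """Trace one spatial dimension through a U-Net-style network."""
    # Two 3x3 valid convolutions per stage remove 2 pixels each; same convolutions remove none.
    shrink = 0 if same_padding else 4
    size = input_size
    for _ in range(stages):          # contracting path
        size -= shrink
        if size % 2:
            raise ValueError("size must be even before each 2x2 max pool")
        size //= 2                   # 2x2 max pool with stride 2
    size -= shrink                   # bottom stage
    for _ in range(stages):          # expanding path
        size = size * 2 - shrink     # transposed convolution, then two convolutions
    return size


# Feature channels per contracting stage: doubled from 64 up to 1024.
channels = [64 * 2 ** stage for stage in range(5)]

original = unet_output_size(572)                      # valid convolutions
padded = unet_output_size(256, same_padding=True)     # same convolutions
```

With valid convolutions, a 572×572 input gives a 388×388 segmentation map, which is why the contracting-path feature maps must be cropped before concatenation. With same convolutions the resolution is conserved and no cropping is needed.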

When released, U-Net outperformed other networks in multiple biomedical challenges [48]. Its performance inspired new models that use the U-Net architecture as their backbone, as seen in Unet++ [58] in 2018 and NAS-Unet [56] in 2019. U-Net has also been applied successfully in other fields, such as road extraction in satellite images by Zhang et al. [57] in 2018 and acoustic classification by Brautaset et al. [3] in 2020, which will be explained in the following chapter.


2.8 Knowledge Distillation

Knowledge distillation (KD) is a method within deep learning which focuses on transferring the knowledge from one teacher network with strong capability to a student network. One example of KD is using a pretrained teacher network with high performance to label the data and then training the student using these labels¹. Alternatives have been to extract knowledge from the layers of the teacher and use this during training of the student with or without teacher labels. Generally, the methods are applied in three principal situations [1]:

• Create a new, less complex model for platforms with computation power limitations.

• Enhance the accuracy of an existing model.

• Train a model with limited or constrained data.

In some instances, KD has been applied to train student models even more complex than the teacher network, with some methods using an ensemble of teacher models [1].

The research field of KD has attracted much attention in recent years but is still under development. As a result, applications of KD do not follow a strict set of rules; rather, applying KD is a creative process adapted to each domain. However, recent studies have shown promising results for KD, as detailed in Alkhulaifi et al. [2].
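The labeling approach described at the start of this section, together with the hard and soft labels mentioned in the footnote, can be sketched as follows. The teacher and student probabilities are hypothetical, and using an unweighted cross entropy as the student loss is an assumption made to keep the example short.

```python
import math


def hard_labels(teacher_probs):
    """One-hot vector marking the teacher's most probable class."""
    best = teacher_probs.index(max(teacher_probs))
    return [1.0 if k == best else 0.0 for k in range(len(teacher_probs))]


def cross_entropy(target, student_probs):
    """Cross entropy between a target distribution and the student's prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, student_probs) if t > 0)


# Hypothetical softmax outputs for one sample and three classes.
teacher = [0.7, 0.2, 0.1]
student = [0.6, 0.3, 0.1]

loss_hard = cross_entropy(hard_labels(teacher), student)   # teacher output as a class decision
loss_soft = cross_entropy(teacher, student)                # teacher output as a distribution
```

With hard labels, the student only learns which class the teacher chose, whereas soft labels also convey how the teacher distributed probability over the remaining classes.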

¹ The teacher's labels can be either binary class values, called hard labels, or the class probability distribution from the softmax function, called soft labels [11].


Chapter 3 Basis and Related Work

In this chapter, we introduce applications of deep learning on acoustic data. First, we present the work this thesis uses as a basis, and then we describe some recent developments.

3.1 Acoustic Classification in Multifrequency Echosounder Data using Deep Convolutional Neural Networks

Brautaset et al. [3] aimed to propose a deep learning method to classify and segment multi-frequency acoustic data gathered during acoustic trawl surveys, without using predefined features. Their work is the basis for the work contained in this thesis. Their architecture is visualized in figure 3.1, and the differences from the original U-Net implementation are the use of same convolutions, a reduction in input resolution to 256×256, and the use of batch normalization. Furthermore, as the layers in the contracting and expanding paths have the same resolution, cropping is no longer needed when concatenating.

The Data

The data used originated from the ongoing Norwegian acoustic trawl surveys, where they used the data spanning 2007-2018. The years 2011-2016 were set as training and validation data, and

[Figure 3.1 diagram: input image (256×256, 4 channels); feature channels 64, 128, 256, 512, and 1024; output segmentation map (256×256, 3 classes: sandeel, other, and background). Legend: conv 3×3, batch norm, ReLU; conv 1×1; max pool 2×2; transpose conv 2×2; copy and concatenate; softmax.]

Figure 3.1: Modified version of the original U-Net architecture (illustrated in figure 2.16) made by Brautaset et al. [3].

Credit: Brautaset et al. [3]
