Acoustic Recognition with Deep Learning


Experimenting with Data Augmentation and Neural Networks

Torstein Anton Berle Gombos

Thesis submitted for the degree of

Master in Electronics and Computer Technology

Program option: Cybernetics

30 credits

Department of Physics

Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO


Abstract

In recent years, classification through deep learning has seen a large increase in popularity. In particular, Deep Convolutional Neural Networks (CNNs) have shown a great deal of promise in image recognition, with a performance that under certain conditions occasionally surpasses that of humans. This has prompted much exploration of what other applications deep learning can accomplish. In sound recognition, the field has mostly been restricted to Acoustic Speech Recognition (ASR). However, recent work has shown potential for Acoustic Event Recognition (AER) applications as well. In this field, it is common to manually extract spectrograms from audio files and apply them as input to a CNN. This approach has been shown to produce decent results, reaching typical object recognition accuracy levels [39]. However, the AER field has progressed more slowly than object recognition when it comes to creating labelled data for supervised learning tasks.

This thesis studies the effectiveness of CNNs with several different spectrograms extracted from AER data. It shows that the most robust spectrogram for this type of data is Mel Spaced Log Filter Banks (MSFB) with a high number of filters, while the Wavelet-based Scalogram excels at higher-pitched audio. The research also demonstrates how the constraint posed by the limited amount of AER data can be reduced by applying audio augmentation methods directly to the audio time signal. The constraint can be reduced further by utilizing CNNs with no downsampling in the spectrogram's frequency domain, and by extracting and adding the first-order dynamics of the spectrograms.


List of Figures

2.1 Spectrograms are applied as input to CNN models; softmax activation is used as prediction of the output
2.2 Cross entropy loss function; less accurate predictions result in exponentially higher losses
2.3 Rectified Linear Unit (ReLU) and Leaky Rectified Linear Unit (LeakyReLU)
2.4 Illustration of the VGG network architecture [20]
2.5 Model describing the two stages of the signal processing model. The first stage operates within the time-domain of the signal. The second stage operates within the time-frequency domain
2.6 Effect of pre-emphasis on a signal compared to unfiltered signal. α=0.95
2.7 Showing two spectrograms from a street music signal sampled at 44.1 kHz before and after pre-emphasis is applied. The lower energy from the higher frequencies contains a stronger representation after pre-emphasis. Color mapping is the 'hot' map from Python's matplotlib
2.8 Amplitude over time for the Siren and Jackhammer signals
2.9 The spectrogram of an audio file with a siren; the rise and fall in frequency could very well be Doppler effect visible in the spectrogram
2.10 From top-down: time-signal, Log Spectrogram, MSFB and Mel Frequency Cepstral Coefficients (MFCC)
2.11 The hamming window
2.12 Four figures showing the process from framing the signal, finding the Short-Time Fourier Transform (STFT), calculating the power spectrum and creating the logarithmic spectrogram
2.13 Mel Scale
2.14 10 filter banks
2.16 Scalogram for a car engine; a clear visual line reveals the low frequency notes of the engine
2.17 An intuitive understanding of the resolution differences between the Wavelet transform and the STFT [61]
2.18 The Morlet wavelet
2.19 Variations of the Morlet wave through the scales: 0.5, 1, 1.5
3.1 Class distribution of the urban sound dataset
3.2 MSFBs of a signal with the class children_playing. Using colormapping hot for a more distinct visual
3.3 MFCC extracted from the children_playing class. 20 Coefficients are kept
3.4 Top: MSFB of a siren signal, bottom: the mel dynamics of the MSFB
3.5 First left: normal log spectrogram, second left: added noise, second right: -2 pitch shift, first right: 0.85 time shift (It is a little wider)
3.6 Separation of folds into training and validation
3.7 The process of extracting a batch of features for the network
4.1 Box plot explanation
4.2 Confusion matrix example
4.3 Test 1; validation accuracy after 150 epochs for the classes: jackhammer, street_music and siren
4.4 Test 1; validation loss after 150 epochs for the classes: jackhammer, street_music and siren
4.5 Test 2; accuracy after 150 epochs for the classes: air_conditioner, engine_idling and drilling
4.6 Test 2; loss after 150 epochs for the classes: air_conditioner, engine_idling and drilling
4.7 Test 3.1; accuracy after 60 epochs for the classes: main_class and secondary_classes
4.8 Test 3.1; loss after 60 epochs for the classes: main_class and secondary_classes
4.9 Test 3.2; Accuracy after 60 epochs for the classes: main_class and secondary_classes
4.10 Test 3.2; loss after 60 epochs for the classes: main_class and secondary_classes
4.11 Test 4.1; validation accuracy after 100 epochs for applying mel dynamics to the classes: air_conditioner, jackhammer, car_horn, engine_idling
4.12 Test 4.1; validation loss after 100 epochs for applying mel dynamics to the classes: children_playing, jackhammer, siren, street_music
4.13 Test 4.2; validation accuracy after 100 epochs for applying mel dynamics to the classes: air_conditioner, jackhammer, car_horn, engine_idling
4.14 Test 4.2; validation accuracy after 100 epochs for applying mel dynamics to the classes: children_playing, jackhammer, siren, street_music
4.15 Test 5.1; accuracy for MSFB-128 with augmentation that replaces the original sample
4.16 Test 5.1; accuracy for log spectrogram with augmentation that replaces the original sample
4.17 Test 5.1; loss after 60 epochs for the classes: air_conditioner, jackhammer, car_horn, engine_idling
4.18 Test 5.1; loss after 60 epochs for the classes: air_conditioner, jackhammer, car_horn, engine_idling
4.19 Test 5.2; accuracy for MSFB-128, Adds the augmented samples to the training set without removing the original sample
4.20 Test 5.2; loss for MSFB-128, Adds the augmented samples to the training set without removing the original sample
5.1 Loss metrics of the 5 features over 150 epochs
5.2 The difference between the two augmentation approaches; Noise augmentation shows an improvement in favor of replacing compared to adding. Pitch shifting shows slightly lower loss for adding compared to replacing


List of Tables

2.1 Architecture of Novel CNN (N-CNN)
3.1 Architecture of N-CNN
3.2 Architecture of Novel CNN with Asymmetric Kernels (ASYM-CNN)
4.1 Parameters for preprocessing and CNN, test 1
4.2 Dataset size and distribution for Test 1
4.3 Test 1; confusion matrix for classes with temporal variations for the validation set
4.4 Dataset size and distribution for Test 2
4.5 Test 2; confusion matrix for classes with little frequency-temporal variations
4.6 Dataset size and distribution for Test 3
4.7 Test 3.1; confusion matrix for separating one class from several others
4.8 Test 3.2; confusion matrix for separating one class from several others
4.9 Dataset size and distribution for Test 4
4.10 Dataset size and distribution for Test 5


Abbreviations

AER        Acoustic Event Recognition
ASR        Acoustic Speech Recognition
ASC        Acoustic Scene Classification
AI         Artificial Intelligence
ANN        Artificial Neural Network
ASYM-CNN   Novel CNN with Asymmetric Kernels
CPU        Central Processing Unit
CNN        Convolutional Neural Network
DL         Deep Learning
DCT        Discrete Cosine Transform
DFT        Discrete Fourier Transform
FCN        Fully Connected Neural Network
FFT        Fast Fourier Transform
FFI        Forsvarets Forskningsinstitutt
GPU        Graphics Processing Unit
GUI        Graphical User Interface
GMM        Gaussian Mixture Model
HMM        Hidden Markov Model
IoU        Intersection over Union
IQR        Interquartile Range
LeakyReLU  Leaky Rectified Linear Unit
ML         Machine Learning
MIoU       Mean Intersection over Union
MFCC       Mel Frequency Cepstral Coefficients
MSFB       Mel Spaced Log Filter Banks
N-CNN      Novel CNN
ReLU       Rectified Linear Unit
RNN        Recurrent Neural Network
RGB        Red, Green, Blue
STFT       Short-Time Fourier Transform
WAV        Waveform Audio File Format


Preface

This work concludes the two years I have spent studying for my Master of Science in Cybernetics at the University of Oslo. The degree was conducted from August 2017 to May 2019. The thesis topic was provided by Forsvarets Forskningsinstitutt (FFI), which has expressed a desire to apply Acoustic Sound Recognition in their research applications. These past two years have been challenging, yet extremely rewarding, much to the credit of the extremely supportive environment at the Institute of Technology Systems at Kjeller, where I have spent most of my two years.

As such, I would first and foremost like to thank my supervisor Idar Dyrdal, who has inspired and engaged in my research. His willingness to test and compare his own results with mine has been a tremendous help in overcoming challenges in the work. His collaboration has been far beyond expectation.

I would also like to thank Torbjørn Kringeland, Andreas Hagen and Jonas Rød, and everyone in the class, for all their support through long days and late evenings. They were filled with intense studying, only interrupted by the occasional table-tennis match.

Finally, I would like to give a huge thank you to my father-in-law Tom Arsenovic, who has spent countless hours correcting and analyzing my different thesis drafts. Without his help, this thesis would not have made as much progress as it did.


Chapter 1

Introduction

This chapter gives an overview of the thesis. Mainly, it describes the motivation and the questions the research aims to answer. It also explains the theoretical background and the methodical choices for performing the thesis experiments. Lastly, it outlines the structure of the thesis.

1.1 Background and Related Work

1.1.1 Acoustic Event Recognition and Traditional Methods

Acoustic Event Recognition (AER) is the field of recognizing the source of acoustic sound waves. It involves recognizing acoustic events over a duration of time. In many acoustic recognition tasks, it is common to extract a signal's spectral features for a more semantic analysis of the signal. These features, known as spectrograms, are an image-based representation of a signal's frequency magnitude over time. Early analyses of spectrograms meant manually detecting and classifying events that occurred in the spectrogram. Eventually, spectrograms were used as input for Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [37][26], enabling automated statistical classification of the spectrogram's information.

1.1.2 Acoustic Event Recognition with Deep Learning

Recently, Deep Learning with CNNs has applied spectrograms as input for various AER tasks, achieving good results [65] [25]. Acoustic recognition research has so far progressed the most within the ASR field. As such, extensive annotated datasets have surfaced, providing ASR with a large working community. This data progression has been slower for AER, and the amount of annotated data is relatively small compared to that of ASR. This has encouraged many AER applications to work in tandem with computer vision techniques, as in [19], [5], [3] and [67]. Recently, AER data have become more available, which could be due to good progress made with unsupervised learning algorithms. For instance, the model named SoundNet [3] uses image recognition to label sound and create datasets. Recently, by using metadata from films, Google developed the AudioSet, which contains over 70 million annotated sound recordings from YouTube [25]. This article demonstrated that large-scale audio classification with CNNs is a viable approach, and the AudioSet is a major contribution to AER data availability. In addition, their trained deep CNN model may be used as an external feature extractor, to create high-dimensional embeddings of raw audio.

1.1.3 Spectro-Temporal Features

There are four spectral features which are typical candidates for both ASR and AER. Three of these spectral features utilize the Discrete Fourier Transform (DFT) with overlapping windows extracted from the audio signal; a method known as the Short-Time Fourier Transform (STFT) which creates a time resolution for a signal’s frequencies. In AER and ASR, one typically finds three variations of spectrograms extracted from the STFT and one from applying the Wavelet transform, known as a Scalogram.

The first feature is the Log Spectrogram, extracted from the logarithm of the power spectral density of the STFT output vectors. The second is the Mel Spaced Log Filter Banks (MSFB), a compressed spectrogram with overlapping filters made to mimic the non-linear way humans perceive sound. The third is the Mel Frequency Cepstral Coefficients (MFCC), which are decorrelated MSFBs. The reasoning behind the decorrelation is to retain phonetic information regardless of variations in pitch and duration [18]. This is typically a problem in ASR, where a vocabulary could have similar monosyllabic utterances with strong syntactic and duration variations. The fourth feature, the Scalogram, utilizes the Wavelet transform instead of the DFT, which allows the resolution in time and frequency to vary across scales rather than being fixed by a single window length. It has been used in recent deep learning experiments with promising results [6] [47].
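As an illustration of the first feature, the sketch below computes a Log Spectrogram with NumPy only. It is a minimal example and not the thesis implementation: the function name `log_spectrogram`, the frame length of 1024 samples, the hop length of 512 samples, the Hamming window and the 16 kHz test tone are all assumptions chosen for the example.

```python
import numpy as np

def log_spectrogram(signal, frame_length=1024, hop_length=512, eps=1e-10):
    """Log power spectrogram from overlapping, windowed frames (an STFT).

    Returns an array of shape (frequency bins, frames).
    All parameter values are illustrative assumptions.
    """
    window = np.hamming(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    frames = np.stack([
        signal[i * hop_length:i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)           # DFT of each frame
    power = (np.abs(spectrum) ** 2) / frame_length   # power spectrum
    return np.log(power + eps).T                     # eps avoids log(0)

# Hypothetical test signal: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = log_spectrogram(tone)
print(spec.shape)
```

With these assumed settings each frequency bin covers 16000 / 1024 = 15.625 Hz, so the energy of the tone is concentrated around bin 64.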


1.2 Thesis Purpose

This study researches how different spectral features can be extracted from audio signals and applied in Deep Convolutional Neural Networks. By extracting spectrograms, the research aims to determine the optimal spectral feature for classifying acoustic audio events. Given the limited data available, experimentation with various techniques to boost CNN performance on AER data is required. In addition, the thesis will study three types of audio augmentation applied to the audio time series. The augmentations are pitch shift and time shift in the audio, as well as adding random Gaussian noise. This research aims to lay the groundwork for developing industrial AER applications with limited available data.

The thesis proposes a series of experiments for using Deep Convolutional Neural Networks with input spectral features in Acoustic Event Recognition. Specifically, the experiments aim to answer the following questions:

1. For AER-like data, what is potentially the optimal spectral feature to extract for a CNN? Different spectrograms most likely provide different outcomes based on the audio signal source, and audio from industrial events and environments often operates at low frequencies. As such, different spectrograms might represent this information in ways which alter the CNN's performance. Also, by applying Scalograms in a CNN, this thesis will explore whether this is a reliable feature compared to traditional spectral features.

2. By training manually designed networks, can the results be enhanced by applying kernels which retain frequency-domain information through the network?

3. How can limited data be enhanced by data augmentation methods? In image classification there is little reason not to implement augmented data, as it helps increase data variance. However, acoustic sound requires some consideration. Small changes in the sound might change the signal to a point where it is no longer representative of the label. The experiments are meant to determine which augmentations are reliable and robust, and how much there is to gain from implementing them.

4. When working with ASR, it is not uncommon to extract the first-order Delta of the spectral feature from mel-based spectrograms [37] to increase performance. This thesis will explore whether CNN models are able to learn from Delta features with AER-typical data. In addition, it will verify the merits of Delta features for non-mel-based spectrograms as well.

1.2.1 Proposed method of progress

As a setup for completing the experiments, the thesis proposes a two-part classification model. The first part applies signal processing to extract any of the four spectral features to use as input. Each sample goes through downsampling and a pre-emphasis filter, after which either the STFT or the Wavelet transform is applied to extract the audio spectrogram. The implementation also includes extracting the spectrogram's first-order Delta and appending it to the spectrogram input. The second part of the model is a Deep Convolutional Neural Network, with batches of spectrograms as input to train an AER classifier. The thesis proposes two CNN model variations based on the setup from Stanford's machine learning lectures [13]. The first CNN architecture uses small localized receptive fields to be able to learn spectro-temporal patterns through the audio's additive noise [49]. The second model applies asymmetric kernels which only perform downsampling in the temporal domain while keeping the frequency sample rate intact. The rationale for the second model is to learn even smaller localized features in the frequency domain while trying to make the model invariant to temporal variations in the audio.

1.3 Thesis Structural Setup

The thesis is structured as follows:

• Chapter 2 - Theoretical Background explains the theoretical background of the thesis. It is divided into three parts. The first part covers the CNN as a concept, how the architecture is built up and various techniques that will be seen in the thesis experiments. The second part explains why Spectrograms and Scalograms are extracted and how. In addition, it describes several signal preprocessing elements which enhance features of the signal. The last part describes the data augmentations which are common for acoustic sounds, and discusses considerations for their implementation.

• Chapter 3 - Method describes the methodology of the thesis experiments. It describes the data used in the experiments, as well as the signal processing methods, spectral extraction, CNN setup and data augmentations.

• Chapter 4 - Tests and Results presents the thesis tests and results. It sets the criteria for the tests and how performance is evaluated. For each test result, the setup parameters and the motivation of the test are described.

• Chapter 5 - Discussion analyzes the results of the thesis tests. How do they compare and what can be concluded from the tests? The chapter also discusses potential weaknesses in the results and how to potentially improve upon them.

• Chapter 6 - Conclusion provides a summary of the work carried out in the thesis. This chapter revisits how the questions asked were answered and presents options for further work in the field.


Chapter 2

Theoretical Background

This chapter gives an in-depth description of the various concepts implemented for the thesis experiments. The description is given in a general sense and does not necessarily match the exact implementations used in the experiments. A full setup of the implementation for the experiments can be found in chapter 3 (Method). This chapter first describes the deep learning concepts of CNNs, followed by signal processing to extract spectral features and then augmentation of audio data.

Deep Learning: The field of Machine Learning and Deep Learning is very extensive. The introduction to Neural Networks gives a brief overview of the concept before diving into the specific concepts that define CNNs. For an intuitive understanding of how neurons and neural networks work, the reader is referred to notes [24], [57] and [28]. The theory mainly covers the concepts that are specific to this particular type of deep learning.

Signal Processing: Signal processing can branch out in various directions based on the task at hand. The theory in this section focuses on how to extract spectral features from signal time series. This also includes techniques for enhancing features from the data, as well as all the parameters needed to apply these to the thesis' experiments.

Data Augmentation: Audio augmentation is subject to some limitations compared to, for instance, image augmentation. This section describes those limitations and the theory behind implementing some of the possible augmentations to expand the data available for training.

2.1 Convolutional Neural Networks for Acoustic Classification

CNN is a deep neural network architecture which falls under the field of Artificial Neural Networks (ANNs). It is a network type that performs well on image-like data, due to the way it handles the large number of parameters that arises from using such data as input. By stacking CNN layers, the network is able to use non-linearity to learn spatially local features from an input volume. This format suits the task of classifying spectral features, since they represent a compressed 2-dimensional mapping of the signal's energy. Spectrograms are usually viewed as grayscale images, but are in this work visualized through different color mappings for illustration purposes. Figure 2.1 shows a simplified version of the CNN model. The input in this case is a Log Spectrogram of a signal and the output is a softmax score Y ∈ [0, 1] for each class. The score signifies how certain the network is of its prediction.

Figure 2.1: Spectrograms are applied as input to CNN models; softmax activation is used as prediction of the output

In mathematical terms the model aims to learn a function that takes an input X, given a set of parameters (weights and biases, referred to as θ), and maps it to an output class Y. The function output Y = f(X;θ) represents the standard supervised machine learning algorithm, where the task is to adjust the learnable parameters θ to approximate the function [24].

2.1.1 Supervised Learning

Optimizing the parameters θ of the network means minimizing a defined loss function between the class predicted by the network and the actual ground truth. One way to optimize (train) the parameters is to subject the model to a substantial number of examples of what the model is trying to predict [24, p. 150]. These examples are referred to as the training set. The number of samples needed may vary with the network architecture or the number of classes. However, the largest factor is usually the type of data applied. For each training sample, a differentiable loss function calculates the error L between the ground truth Y and the prediction the network makes from the input data X. The loss is then backpropagated through the neurons of the network, where each neuron calculates its own delta error and adjusts its weights towards minimizing the total loss [57].


2.1.2 Difference between Fully Connected and CNN

Where regular Fully Connected Neural Networks (FCNs) use single vectors as input, a CNN uses a 3-dimensional input volume of width, height and depth. The main difference is that the input volume to the CNN layer is divided into smaller regions, instead of connecting every neuron of a layer to each of the neurons of the next layer. The weights applied to such a region are often called a filter. The filters are spatially local (in width and height) but run all the way through the depth of the input. They are slid across the input volume, and the dot product of the input and the filter is computed at each stride [13]. This reduces the number of parameters of the network by a substantial amount, which helps against the curse of dimensionality [4] and lowers the risk of overfitting the network. Overfitting can be described as the point where the model is unable to classify new and unseen data, because it has adjusted the function Y = f(X;θ) to completely fit the training data.
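The reduction in parameters can be illustrated with a small calculation. The sketch below counts the learnable parameters of one fully connected layer and one convolutional layer; the 128 x 128 single-channel input, the 64 units or filters and the 3 x 3 kernel are hypothetical values chosen for the example and are not the settings used in the thesis.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_depth, n_filters):
    """Weights plus one bias per filter; independent of input width and height."""
    return (kernel_h * kernel_w * in_depth + 1) * n_filters

# Hypothetical input: a 128 x 128 single-channel spectrogram.
height, width, depth = 128, 128, 1
fc_count = dense_params(height * width * depth, 64)
conv_count = conv_params(3, 3, depth, 64)
print(fc_count, conv_count)  # 1048640 640
```

Because the filter weights are shared across all positions of the input, the convolutional layer in this example needs 640 parameters where the fully connected layer needs more than a million.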

2.1.3 Multi-Class Classification

The final layer of the network uses the softmax activation function. The softmax function maps the outputs of the previous layer to a probability distribution over the different classes. This type of classification is known as multi-class classification [45]. In this scenario, the network is given samples of audio signals and its job is to assign one specific class to each sample. That means that if the network is given the task of listening for a siren in a sample with other sounds, it will only tell whether the siren has occurred or not. Softmax is defined as in equation 2.1:

$$ Y(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{C} e^{s_j}}, \quad \text{for } i = 1, \ldots, C \text{ and } s = (s_1, \ldots, s_C) \in \mathbb{R}^C \tag{2.1} $$

Here s_i is the score of a given class i, C is the number of classes and s_j are the scores for each of the classes.
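Equation 2.1 can be sketched in a few lines of NumPy. The function name `softmax` and the example scores are assumptions made for illustration. Subtracting the largest score before exponentiating is a common numerical safeguard that does not change the result, since the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax(scores):
    """Map a vector of class scores to a probability distribution."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)   # avoids overflow in exp()
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Hypothetical scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs, probs.sum())
```

The outputs are positive, sum to one, and preserve the ordering of the input scores, so the class with the highest score receives the highest probability.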

2.1.4 Minimizing the loss

The network model uses a gradient search to minimize a differentiable cost function. Minimizing this function for any relevant input is the goal of any supervised model. Defining the cost function generally means having a performance metric P that is optimized indirectly by reducing a cost function J(θ), defined as in equation 2.2:

$$ J(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} \, L(f(x;\theta), y) \tag{2.2} $$

The loss is calculated from the output of the softmax activation of the final layer in the network and the ground-truth label. Since the softmax output is a vector with elements in the interval (0, 1), the ground-truth labels need to be one-hot encoded. In a multi-class classification scenario this means redefining the label as a binary vector where the position of the positive class is represented as 1 and the other, negative classes as 0. For example, if the network is trying to classify images of dogs, cats and birds, the one-hot encoded vectors would be [1, 0, 0], [0, 1, 0] and [0, 0, 1]. The standard loss function for multi-class classification is a variation of cross-entropy called categorical cross-entropy, which is defined in equation 2.3 [45]:

$$ L = -\sum_{i=1}^{C} y_i \log\big(Y(s)_i\big) \tag{2.3} $$

where y_i is the ground-truth label and Y(s)_i is the softmax output defined in equation 2.1. Defining the loss function with the log of the softmax output creates a penalty that increases steeply as predictions get poorer: the loss grows without bound as the predicted probability of the true class approaches zero, as shown in figure 2.2.

Figure 2.2: Cross entropy loss function; less accurate predictions result in exponentially higher losses

Categorical cross-entropy is differentiable, which is required because the error is backpropagated through the neurons of the network by an update function (optimizer) [15].
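The one-hot encoding and the loss in equation 2.3 can be sketched as follows. The helper names `one_hot` and `categorical_cross_entropy`, the clipping constant `eps` and the example predictions are assumptions made for illustration.

```python
import numpy as np

def one_hot(label, n_classes):
    """Binary vector with a 1 at the index of the positive class."""
    vec = np.zeros(n_classes)
    vec[label] = 1.0
    return vec

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Equation 2.3; eps keeps log() away from zero (an added safeguard)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

y = one_hot(1, 3)  # ground truth is the second of three classes
good = categorical_cross_entropy(y, np.array([0.05, 0.90, 0.05]))
poor = categorical_cross_entropy(y, np.array([0.70, 0.10, 0.20]))
print(good, poor)
```

Only the predicted probability of the true class contributes to the sum, so the confident correct prediction gives a loss of -log(0.9) while the poor prediction gives the much larger -log(0.1).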

2.1.5 Optimizer

Stochastic Gradient Descent

The optimizer function defines how to update the parameters θ of the network based on the differentiable loss function defined in equation 2.3 [7, p. 58]. The update function backpropagates the loss through the network and adjusts each weight by its error term scaled with a learning rate factor. In general terms, the error terms can be defined as in equation 2.4 [58, p. 101]:

$$ \begin{aligned} \delta_k &= (y_k - t_k)\, g'(a_k) && \text{for the output layer} \\ \delta_j &= g'(u_j) \sum_k \delta_k w_{jk} && \text{for the hidden layer} \end{aligned} \tag{2.4} $$

This yields the following general update functions for the weights:

$$ \begin{aligned} w_{jk} &\leftarrow w_{jk} - \eta\, \delta_k z_j \\ \nu_{ij} &\leftarrow \nu_{ij} - \eta\, \delta_j x_i \end{aligned} \tag{2.5} $$

Here η is the learning rate, which decides how much impact the update has on the weights in each iteration, w and ν are the weights of the network, and z_j and x_i are the outputs of the neurons feeding those weights. There are several types of optimization functions which utilize variations of stochastic gradient descent. Finding the minimum of a function analytically means finding the points where the gradient is 0 and then choosing the smallest function value among these points. This will be the global minimum of that function [7, p. 48].

$$ Y_{\min} = \min \{ f(x) : \nabla f(x) = 0 \} $$

However, finding all the points that meet this criterion can be challenging. A gradient search will often find a local minimum instead of the global one. The task also becomes intractable for CNNs, which often have several million parameters. The solution is to estimate the gradient by computing the loss on random batches of data. Each batch gives an estimate of the loss surface over the high-dimensional weight parameter space.
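The idea of estimating the gradient from random batches can be sketched on a deliberately small problem. The example below fits a straight line with mini-batch stochastic gradient descent; the synthetic data, the mean squared error loss, the learning rate and the batch size are all assumptions chosen for illustration and are unrelated to the thesis experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 3x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 16

for epoch in range(200):
    order = rng.permutation(len(x))              # new random batches each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        grad_w = 2.0 * np.mean(error * x[idx])   # batch estimate of dL/dw
        grad_b = 2.0 * np.mean(error)            # batch estimate of dL/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```

Each update uses only 16 samples, so the gradient is a noisy estimate of the full gradient, yet the parameters still settle close to the values used to generate the data.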

Adam optimizer

The Adam optimizer is a stochastic gradient descent optimizer made by Diederik Kingma and Jimmy Lei Ba [30]. It utilizes an adaptive learning rate for each parameter, estimated from the first and second moments of the gradients. Adaptive learning rates are beneficial for networks that train over longer durations. In the beginning the error will be high, and a large learning rate is required to reduce the error quickly. Later in the training session the error will be smaller and a smaller learning rate is required. Otherwise, a large learning rate might overcorrect the weights during backpropagation. Adam provides a dynamic learning rate for this purpose. There are other optimization algorithms with adaptive learning rates, such as Adadelta [66]. This thesis will, however, mostly utilize the Adam optimizer.
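A single Adam update can be sketched as below, following the update rule as it is commonly stated, with the frequently used default values for the decay rates and the stabilizing constant. The function name `adam_step` and the quadratic toy objective are assumptions made for illustration; in practice a deep learning framework provides the optimizer.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (hypothetical): f(theta) = sum(theta ** 2), gradient 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
print(theta)
```

Because the step is the ratio of the two moment estimates, its size is roughly bounded by the learning rate regardless of the gradient scale, and it shrinks as the gradients become small relative to their recent history.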

2.1.6 Convolutional Layer Components

The convolutional layer is made up of learnable filters. Each filter covers a small spatial region of height (H) and width (W), but stretches all the way through the depth (D) of the input data. There are four hyperparameters that decide how the filters look and behave: the number of filters, the receptive field (often called kernel size), the step size (stride) and the amount of padding used.


Convolving the signal

The neurons in the layer only connect to a local area of the input volume, the kernel. In height and width this area is usually around 3x3 or 5x5, but it might vary based on the image features [13]. The kernel is slid along an input volume x, and at each position the dot product between the receptive field of x and the kernel's weights is calculated: h(W^T X + b), where b is the added bias and h() is the activation function. Each filter thereby produces a 2-dimensional activation map. Through backpropagation the filters eventually learn to activate when they recognize various visual features in the input volume. On a low level these features might be blobs or edges, but at higher levels later in the network the filters might recognize stronger and more complex patterns [13].
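The sliding dot product can be sketched with explicit loops. This is a minimal, unoptimized illustration for a single filter without padding; the function name `conv2d_single_filter`, the choice of ReLU as h() and the all-ones test data are assumptions made for the example.

```python
import numpy as np

def conv2d_single_filter(volume, kernel, bias=0.0, stride=1):
    """Slide one filter over an input volume and return its activation map.

    volume: array of shape (H, W, D); kernel: array of shape (kh, kw, D).
    No padding is applied; ReLU is used as the activation h().
    """
    H, W, D = volume.shape
    kh, kw, kd = kernel.shape
    assert kd == D, "the filter must span the full depth of the input"
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = volume[r:r + kh, c:c + kw, :]       # receptive field
            out[i, j] = np.sum(patch * kernel) + bias   # dot product plus bias
    return np.maximum(out, 0.0)

# Hypothetical 5 x 5 single-channel input and a 3 x 3 filter of ones.
volume = np.ones((5, 5, 1))
kernel = np.ones((3, 3, 1))
activation_map = conv2d_single_filter(volume, kernel)
print(activation_map.shape)  # (3, 3)
```

A layer with several filters repeats this for each filter and stacks the resulting activation maps along the depth dimension.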

Output size

The output volume size is decided by three hyperparameters: depth, stride and padding. Depth corresponds to the number of filters that the layer is set up with. The set of neurons that all look at the same region of the input is often called a depth column, or fibre [13]. The stride decides how many cells the filter moves between each dot product. Usually three or less is considered a good stride value. The last parameter is padding, or zero-padding, which pads zeros around the border of the input. This gives control of the spatial size of the output, most commonly so that it remains the same as that of the input.
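Along one spatial dimension, the relation between these hyperparameters and the output size is commonly written as (W - F + 2P) / S + 1, where W is the input size, F the kernel size, P the padding and S the stride. The sketch below evaluates this relation; the function name `conv_output_size` and the example sizes are assumptions made for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension: (W - F + 2P) / S + 1."""
    span = input_size - kernel_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("kernel, stride and padding do not tile the input evenly")
    return span // stride + 1

same = conv_output_size(128, 3, stride=1, padding=1)    # 128, size is preserved
valid = conv_output_size(128, 3, stride=1, padding=0)   # 126, border is lost
strided = conv_output_size(7, 3, stride=2, padding=0)   # 3
print(same, valid, strided)
```

With a 3 x 3 kernel and a stride of 1, one row and column of zeros on each side is enough to keep the output the same size as the input.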

Activation function

The activation function of a CNN layer contributes to making the filters non-linear. Previously, the most common activation function would be sigmoid (which saturates at 0 and 1) and tanh (saturates between -1 and 1). These days most CNN activation functions are variations of the ReLU [40]. The function that defines ReLU is given by:

f(x) =max(x, 0) (2.6)

This means that the output of the neuron is 0 for all non-positive inputs. Positive values then increase linearly without saturation. Saturation, as with sigmoid or tanh, has the disadvantage that it can "kill" the gradient [33], meaning that the gradient becomes close to zero and the weights barely change between updates.

ReLU does not saturate when its output is above zero. It also converges much faster than sigmoid or tanh [14]. A problem that sometimes arises with ReLU, however, is that a neuron can "die". This occurs when the dot product between the input and the weights never becomes positive, so the output remains zero forever [14]. A variation of ReLU that combats this is called LeakyReLU, which is defined as follows:

$$ f(x) = \max(\alpha x, x), \quad 0 < \alpha < 1 \tag{2.7} $$


(a) ReLU activation plot (b) LeakyReLU activation plot.

Figure 2.3: ReLU and LeakyReLU

This allows a small gradient for negative inputs, which keeps the neuron from dying out while retaining the fast convergence. Figure 2.3 shows the two activations side by side.
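Equations 2.6 and 2.7 can be sketched in a few lines of NumPy. The default value α = 0.01 is an assumption made for illustration and is not taken from the thesis experiments.

```python
import numpy as np


def relu(x):
    """Equation 2.6: max(x, 0), applied element-wise."""
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.01):
    """Equation 2.7: max(alpha * x, x) with 0 < alpha < 1 (alpha is assumed)."""
    return np.maximum(alpha * x, x)


def relu_grad(x):
    """Gradient of ReLU: zero for non-positive inputs, which lets a neuron 'die'."""
    return (x > 0).astype(float)


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of LeakyReLU: never exactly zero, so updates keep flowing."""
    return np.where(x > 0, 1.0, alpha)


x = np.array([-2.0, 0.0, 3.0])
print(relu(x), leaky_relu(x, alpha=0.1))
print(relu_grad(x), leaky_relu_grad(x, alpha=0.1))
```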

2.1.7 Architecture

The architecture of a neural network is defined by what layers are stacked together and how the layers are designed. There are several ways to set up the layers of the network, and the choice often depends on a few different factors: data type, amount of data, classification problem, computational limitations, etc.

The models in this thesis follow the principles set out in the Stanford lecture for Visual Recognition [13]. The lecture looks at popular networks such as AlexNet [31], VGG [54] from the Visual Geometry Group and LeNet [32]. These nets contain many variations, but they follow a similar structure in their setup:

INPUT → [[CONV → RELU]∗N → POOL?]∗M → [FC → RELU]∗K → FC

Where ∗N means repeating the layers N times and POOL? means that there can be an optional pooling function. N ≥ 0, but usually less than 3. K and M are positive integers and K is normally less than 3. Very often the first layers have few filters in depth and a large spatial area. The spatial area is then downsampled while the filter depth increases. Figure 2.4 illustrates an example of this downsampling of the spatial area while the number of filters increases.

The network is based on the VGG architecture [54].

2.1.8 Pooling Operations

Figure 2.4: Illustration of the VGG network architecture [20]

Pooling is a common technique to add between the neural layers. Its purpose is to reduce the number of parameters in the network based on a simple assumption: for a data point in the input volume, adjacent data points will have a strong correlation. The data point and its adjacent points can therefore be combined into one point. While it is still common to use max pooling, there is research that supports dropping these operations in the future [13].
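As an illustration of how adjacent points are combined into one, the sketch below performs max pooling on a single 2-dimensional activation map. The 2x2 window with stride 2 is an assumed, commonly used configuration and not necessarily the one used in the thesis models.

```python
import numpy as np


def max_pool_2x2(activation_map):
    """Max pooling with a 2x2 window and stride 2 on one 2-D activation map."""
    height, width = activation_map.shape
    out_h, out_w = height // 2, width // 2
    # Drop a trailing row or column if the size is odd.
    trimmed = activation_map[:out_h * 2, :out_w * 2]
    # Group the map into 2x2 blocks and keep the largest value in each block.
    blocks = trimmed.reshape(out_h, 2, out_w, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.arange(16).reshape(4, 4)
print(max_pool_2x2(feature_map))
```

The spatial area is reduced by a factor of four, while the depth of the volume is unchanged since every activation map is pooled independently.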

2.1.9 Regularization Techniques

Overfitting is a common problem in supervised learning. In short, it means that a network has become too specialized on the training data and is unable to generalize to new and unseen data. This could happen if the network trains too long on the same data, if some weights become too dominant, if the training set is too small or if the data exists in high dimensional spaces (Curse of Dimensionality [4]). There are other factors that could contribute as well. Regularization describes methods used to combat the overfitting effect and some of these will be described in this section. The most common ways to increase regularization and avoid overfitting are the following [7]:

1. Adding additional data

2. Normalizing data or applying batch normalization

3. Applying dropout

4. Varying the network architecture and reducing network capacity

5. Applying weight regularization

While there are many methods of regularization, they do not necessarily apply to every type of problem. Therefore, only the regularization methods used in the models of this thesis are described in detail.


L2 Regularization

Google developer’s course on machine learning defines regularization as a way to penalize complex models [46]. Previously the goal was to minimize the loss of the model:

minimize(Loss(Data|Model))

The goal is now to minimize the loss with complexity added to the loss, which Google calls Structural Risk Minimization:

minimize(Loss(Data|Model) +complexity(Model))

One way to quantify complexity is to square all the weights and sum them up. This makes high valued weights add more complexity to the system, while low valued weights matter less.

$$ L_2 = \|w\|_2^2 = w_1^2 + w_2^2 + \dots + w_n^2 $$

This expression is added to the loss function with a factor λ, where usually 0 < λ ≤ 1. The factor defines how heavily the dominant weights are penalized. L1 is a similar regularization technique, but it sums the absolute values of the weights instead of squaring them.
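The penalty terms can be sketched as follows. The function names and the value of λ are hypothetical and only serve to illustrate how the complexity term enters the loss.

```python
import numpy as np


def l2_penalty(weights):
    """Sum of the squared weights, i.e. the squared L2 norm."""
    return float(np.sum(np.square(weights)))


def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return float(np.sum(np.abs(weights)))


def regularized_loss(data_loss, weights, lam=0.01):
    """Loss with complexity added: Loss(Data|Model) + lambda * complexity(Model)."""
    return data_loss + lam * l2_penalty(weights)


w = np.array([3.0, -4.0])
print(l2_penalty(w), l1_penalty(w))
print(regularized_loss(1.0, w, lam=0.1))
```

Because the weights are squared, the weight of magnitude 4 contributes almost twice as much to the L2 penalty as the weight of magnitude 3, whereas in L1 the contributions grow only linearly.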

2.1.10 Common Regularization techniques

The following are common regularization techniques. While popular, neither was implemented in this work.

Batch Normalization

Batch normalization normalizes the output of a network layer. The goal is a stable distribution of activation values [29]. It is usually applied after the output of FCN or CNN layers, but before the non-linearity. The following are suggested benefits of batch normalization from [29] and [34]:

1. Improved gradient flow through the network.

2. Allows higher learning rates.

3. Reduces the strong dependence on initialization.

4. While not strictly a regularization method, it does provide some regularization which might reduce the need for dropout.

Dropout

Dropout is a regularization method that can be applied to the output of any layer. It effectively switches off neurons based on a probability for each neuron. The idea is to force the network to use alternative nodes to extract the best features. As a result, weights or paths in the network will not become too dominant when exposed to certain features.

This is a powerful aid to combat overfitting in the network and increase regularization.
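For illustration only, the switching-off of neurons can be sketched as below. The sketch assumes the common "inverted dropout" variant, where the surviving activations are rescaled during training so that nothing needs to change at test time. The text does not say which variant is meant.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations with probability drop_prob (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    # Each neuron is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation stays the same.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
layer_output = np.ones((4, 5))
print(dropout(layer_output, drop_prob=0.5, rng=rng))
```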


Data Normalization

All input data should be normalized before being fed to the network.

Normalization forces various data of different values to be comparable to each other by mapping them to numbers between 0 and 1. It is defined by the following equation:

$$ \hat{x}_i = \frac{x_i - x_{min}}{x_{max} - x_{min}} $$

Where $\hat{x}_i$ is a normalized data sample, $x_i$ is a sample of the dataset and $i = 0, 1, \dots, n$. $x_{max}$ is the highest value of the dataset and $x_{min}$ is the smallest.
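A minimal sketch of this normalization is shown below. The handling of a constant dataset, where the denominator would be zero, is an assumption added for robustness and is not discussed in the text.

```python
import numpy as np


def min_max_normalize(data):
    """Map the samples to the range [0, 1] using the dataset minimum and maximum."""
    data = np.asarray(data, dtype=float)
    x_min, x_max = data.min(), data.max()
    if x_max == x_min:
        # Assumed fallback: a constant dataset carries no scale information.
        return np.zeros_like(data)
    return (data - x_min) / (x_max - x_min)


print(min_max_normalize([2.0, 4.0, 6.0]))
```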

Class weighting

Some data can be skewed toward a few selected classes, meaning that some classes are better represented than others. This is typical for many real-world datasets. An example could be a model that is trying to discriminate between two types of birds: sparrows and pigeons. The data could be skewed towards pigeons and consist of 90% pigeon sounds. This could make the accuracy seem quite high when, in reality, the model has learned to always predict pigeons, which makes it correct 90% of the time. This problem can be solved by biasing the class weights towards under-represented classes, which makes these classes matter more.

This is a rather standard solution since class imbalance is quite common. The weight for each class is given by equation 2.8 [55].

$$ w_i = \frac{N_{data}}{C \cdot \mathrm{bincount}(data)_i} \tag{2.8} $$

Here $N_{data}$ is the total number of samples in the dataset, C is the number of classes, $w_i$ is the weight for a defined class i and bincount(data) gives the number of occurrences of each class.
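The pigeon and sparrow example can be worked through with a short sketch of equation 2.8. It assumes integer class labels starting at 0 and that every class occurs at least once. The function name is chosen for illustration.

```python
import numpy as np


def balanced_class_weights(labels):
    """Equation 2.8: number of samples / (number of classes * count per class)."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)      # occurrences of each class
    n_classes = len(counts)
    return len(labels) / (n_classes * counts)


# 90 pigeon recordings (class 0) and 10 sparrow recordings (class 1).
labels = [0] * 90 + [1] * 10
print(balanced_class_weights(labels))
```

The under-represented sparrow class receives a weight nine times larger than the pigeon class, so a misclassified sparrow costs as much as nine misclassified pigeons.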


2.2 Signal Processing of Acoustic Sounds

This section will explain the process of extracting spectral samples for the Convolutional Neural Network and why spectrograms are a common feature extracted from acoustic audio. The signal processing model is divided into two stages. Each audio sample will undergo signal processing in the time domain and the time-frequency domain before being fed into the Neural Network. In the time domain, the signal will be downsampled, converted from stereo to mono and filtered through pre-emphasis.

The augmentations of audio data are also applied at the time domain stage.

When the signal is mapped from time to frequency, different techniques are applied based on the spectral information to extract. In addition, some tests will look at adding the first order delta of the spectral information to the spectrogram. The model is illustrated in figure 2.5.

Figure 2.5: Model describing the two stages of the signal processing model. The first stage operates within the time-domain of the signal. The second stage operates within the time-frequency domain

2.2.1 Sampling the signal

Most microphones record sounds within 20 Hz to around 20000 Hz. A sampling rate of 40000 Hz therefore covers any normal use. CD audio was at some point standardized to 44100 Hz, which is why many audio files are sampled at 44100 Hz [26] to meet the Nyquist-Shannon criterion:

Theorem 1 If a function $h(t)$ contains no frequencies higher than $f_c$ cycles per second, it is completely determined by giving its ordinates at a series of points spaced $\frac{1}{2f_c}$ seconds apart [27]

Here $f_c$ is known as the Nyquist Critical Frequency, which is the bandwidth of the signal. In short, the theorem says that the sampling rate of the signal must be $f_s > 2f_c$. The high sampling rate of 44100 Hz guarantees that the signal covers the entire spectrum that humans can perceive, though this is unnecessary in practice. Most sound signals exist on frequencies far below 22050 Hz. By representing a signal at 16000 Hz it is possible to accurately reconstruct any signal content up to 8000 Hz. This will most likely describe all relevant information in the data used in this work.

2.2.2 Pre-Emphasis

When reaching higher frequencies, the power often drops to a point where it is negligible. [26] estimates that about 80% of the power is contained within the first 1000 Hz. From 1 kHz to 8 kHz, there is a drop of almost −12 dB/octave [26, p 154]. Important features are sometimes embedded in these higher frequencies, which could be valuable to the model.

Applying a pre-emphasis filter to the time series of the signal helps improve the signal-to-noise ratio for the higher frequencies. Pre-emphasis in the discrete time domain with a single-zero filter [26, p 155] can be defined as in equation 2.9:

$$ H_p(z) = 1 - \alpha z^{-1} \tag{2.9} $$

Where $H_p(z)$ is the transfer function describing the filter and $\alpha$ is the filter coefficient. Rewriting the transfer function in terms of its input $X(z)$ and output $Y(z)$ yields:

$$ Y(z) = X(z) - \alpha X(z) z^{-1} \tag{2.10} $$

Applying the inverse z-transform to 2.10 gives the following time-domain relation [37, p 73]:

$$ y(t) = x(t) - \alpha x(t-1) \tag{2.11} $$

Where $\alpha$ lies in the range 0 < α < 1, but is almost always between 0.9 and 1.

Figure 2.6 shows the difference in energy before and after pre-emphasis. The resulting spectrograms can be viewed in figure 2.7. One should note that this type of pre-emphasis is typically used for ASR. Audio under 200 Hz could get unwanted attenuation [26]. If one were to use pre-emphasis for fields like AER, it might be useful to find a more sophisticated pre-emphasis filter.
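The time-domain relation in equation 2.11 can be sketched directly in NumPy, together with the stereo-to-mono step of the first processing stage. Averaging the channels and keeping the first sample unchanged (treating x(−1) as zero) are assumptions made for this sketch. α = 0.95 is the value shown in figure 2.6.

```python
import numpy as np


def to_mono(stereo):
    """Average the channels of a (samples, channels) array (assumed convention)."""
    stereo = np.asarray(stereo, dtype=float)
    return stereo.mean(axis=1)


def pre_emphasis(signal, alpha=0.95):
    """Equation 2.11: y(t) = x(t) - alpha * x(t - 1)."""
    signal = np.asarray(signal, dtype=float)
    # The first sample has no predecessor and is passed through unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])


# A constant (zero frequency) signal is almost removed by the filter.
print(pre_emphasis(np.ones(4)))
```

A constant input is attenuated to 1 − α after the first sample, which illustrates why the filter suppresses low frequencies relative to high ones.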


Figure 2.6: Effect of pre-emphasis on a signal compared to the unfiltered signal, α = 0.95

Figure 2.7: Two spectrograms from a street music signal sampled at 44100 Hz, before and after pre-emphasis is applied. The lower energy of the higher frequencies has a stronger representation after pre-emphasis. The color mapping is the 'hot' map from Python's matplotlib


2.2.3 What are Spectral Features

Acoustic audio contains a large amount of information that varies in many different ways. It could be distance to microphone, change of acoustic scene, different recording equipment, etc. The discrete time-signal is represented with its amplitude along samples in time as figure 2.8 shows.

The audio is an example of two classes from the UrbanSound dataset.

Figure 2.8: Amplitude over time for the Siren and Jackhammer signals

Raw signals can be challenging inputs for a classifier; the vector is of high dimensionality and perceptually similar signals are not necessarily neighbours in the vector space [60]. It can also be difficult to distinguish the events embedded in the noise. The solution is to map the signal from a time series to a representation of both time and frequency. By creating a time resolution for the signal's frequency content, it can be determined which frequencies existed at a given point in the audio. This is known as a spectrogram, which is a 3-dimensional representation of the power spectral energy of a signal. When visualizing signals as spectrograms it becomes easier to distinguish between the various classes, as they show the magnitude of the signal's frequencies over a time axis. An example can be viewed in figure 2.9, which visualizes an event of a siren from the UrbanSound dataset. It is a Log Spectrogram, which is extracted through the Short-Time Fourier Transform (STFT). The range of the frequency axis corresponds to the critical frequency ($f_c$) of the signal, which is its highest frequency component [26]. Spectrograms are often viewed in greyscale, where the higher the energy is, the darker the shade of the component. The Log Spectrogram applies the STFT to the signal to create its resolution in time and frequency. There is also the Scalogram, which relies on the Wavelet Transform over different scales of the wavelet for its time and frequency resolution. Where the Fourier transform analyzes the function as a sum of sinusoids that extend to infinity, the Wavelet Transform analyzes it with a finite wavelet function. Both are explained further in this section.


Figure 2.9: The spectrogram of an audio file with a siren. The rise and fall in frequency visible in the spectrogram could very well be the Doppler effect.

2.3 Short-Time Spectral Analysis

Acoustic audio can be considered a quasi-stationary signal [37, p 136], which means that its transfer function remains unchanged over short time intervals. Short-time spectral analysis is the method of splitting the signal into statistically stationary time frames and then calculating each frame's power spectrum. The power spectral density of a Fourier transformed frame is known as a periodogram, and when stacked together the periodograms form the spectrogram. Figure 2.10 displays a Log Spectrogram, Mel Spaced Log Filter Banks (MSFB) and Mel Frequency Cepstral Coefficients (MFCC) for a signal with street music. The first is a direct representation of the logarithm of the power spectral density of the signal. MSFB and MFCC are the same power spectral representation with frequencies mapped to the Mel-Scale. This will be further described in sections 2.3.4 and 2.3.5.


Figure 2.10: From top-down: time-signal, Log Spectrogram, MSFB and MFCC

2.3.1 Framing the signal

To obtain a resolution in time the signal needs to be divided into overlapping frames. The signal is divided by using a windowing function, usually a Hamming Window [37]. A Hamming window is illustrated in figure 2.11 and the Hamming window function is described by equation 2.12:

$$ w[n] = \begin{cases} 0.54 - 0.46\cos\left(\frac{2\pi n}{N_w}\right), & 0 \le n \le N_w \\ 0, & \text{otherwise} \end{cases} \tag{2.12} $$

The frames are usually around 20-40 ms in length, so that the audio changes little over the duration of the frame. A typical frame length of 25 ms and a sampling frequency of 16 kHz yield 400 samples per frame. The frame step length decides how much overlap there should be between each set of samples. For instance, a frame step of 10 ms means that the window moves 160 samples at a time:

0.025 ∗ 16000 = 400 samples per window
0.01 ∗ 16000 = 160 samples per step

As the first 400 samples begin at 0, the next 400 samples will begin at 160, meaning an overlap of 240 samples. It should be noted that Beigi in [26] believes that the Hamming window is not the most elegant function for this purpose, but that it has remained the most popular out of habit.
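The framing described above can be sketched as follows, using the 25 ms frame length and 10 ms step from the example. Dropping the incomplete frame at the end of the signal is an assumption of this sketch. `np.hamming(M)` evaluates equation 2.12 with $N_w = M - 1$.

```python
import numpy as np


def frame_signal(signal, sample_rate=16000, frame_len_s=0.025,
                 frame_step_s=0.010, window=True):
    """Split a 1-D signal into overlapping frames, optionally Hamming windowed."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(frame_len_s * sample_rate))    # 400 samples at 16 kHz
    frame_step = int(round(frame_step_s * sample_rate))  # 160 samples at 16 kHz
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Assumption: trailing samples that do not fill a whole frame are dropped.
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    starts = frame_step * np.arange(n_frames)
    index = starts[:, None] + np.arange(frame_len)[None, :]
    frames = signal[index]
    if window:
        frames = frames * np.hamming(frame_len)
    return frames


one_second = np.arange(16000, dtype=float)
frames = frame_signal(one_second, window=False)
print(frames.shape)                 # (number of frames, samples per frame)
print(frames[0, 0], frames[1, 0])   # start sample of the first two frames
```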

Figure 2.11: The Hamming window

2.3.2 Power Spectral Estimation

Short-Time Fourier Transform

The Fourier transform describes which frequencies exist in a signal, but does not localize them in time. To account for this, the Discrete Fourier Transform (DFT) is applied to each signal frame to find which frequencies exist in that frame. This creates a resolution in time for the framed audio. There is a trade-off in this method between the resolution on the frequency axis and on the time axis. The windows need to be small enough for the required time resolution, but not at the cost of the representation of the frequency's power spectrum. The Short-Time Fourier Transform for a time signal [26, p 747] is given by equation 2.13:

$$ H(n,k) = \sum_{n=1}^{N} h(n)\,w(n)\,e^{-j2\pi kn/N} \tag{2.13} $$

Where $w(n)$ is the Hamming window from 2.12 and the DFT is defined as:

$$ H(k) = \sum_{n=1}^{N} h(n)\,e^{-j2\pi kn/N}, \quad 0 \le k \le N \tag{2.14} $$

Here:

h(n) = time-domain signal
N = length of the window in samples
K = length of the DFT


Power Spectrum of a Signal

The next step is to find the energy for each frame. This is found by mapping the output of the STFT to the power spectrum in the time-frequency domain. It is calculated by squaring the magnitude of the frequency components from 2.13 [26, p. 735], as shown in 2.15:

$$ S(n,k) = |H(n,k)|^2 \tag{2.15} $$

Calculating the power spectrum for a frame returns a feature vector of length $\frac{N}{2}+1$, which is known as a periodogram. Stacking the periodograms together creates a matrix of size $N \times (\frac{N}{2}+1)$, which is the spectrogram of the signal.
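A minimal sketch of equations 2.13 and 2.15 is given below, assuming that the frames have already been windowed and are stored as rows. The DFT length of 512 is an assumed value. `np.fft.rfft` returns only the non-negative frequencies, which gives the N/2 + 1 components mentioned above.

```python
import numpy as np


def power_spectrogram(frames, n_fft=512):
    """Squared magnitude of the DFT of each windowed frame (one periodogram per row)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)  # zero-pads each frame to n_fft
    return np.abs(spectrum) ** 2


# One 25 ms frame of a 1000 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 1000 * t) * np.hamming(400)

power = power_spectrogram(frame[None, :])
peak_bin = int(np.argmax(power[0]))
print(power.shape, peak_bin * sample_rate / 512)  # shape and frequency of the peak
```

The peak lands in the bin whose center frequency is 1000 Hz, which shows how each column of the periodogram maps back to a physical frequency.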

2.3.3 The Log Power Spectrogram

In ASR it is common to take the logarithm of each vector to create the Log Spectrogram, as shown in equation 2.16 [37]:

$$ S_{\log}(n,k) = 20\log_{10}|H(n,k)|^2 \tag{2.16} $$

This is inspired by how humans perceive loudness, which is not linear [26]. Humans perceive changes in low frequencies much more clearly than changes in higher frequencies. Taking the log of each feature vector brings the representation closer to this perception. In many AER tasks, this could be considered even more important to implement, as much of the spectral energy lies in lower frequency ranges [53].

The process so far towards the Log Spectrogram can be viewed in figure 2.12, which shows the extraction of the Log Spectrogram from a siren time signal. The last plot in figure 2.12 shows the Log Power Spectral Density. Compared to just the Power Spectral Density, these features are significantly more distinct. In both spectrograms it is clearly possible to see the increase and decrease of the frequency over time, which is typical for a siren sound. The Log Spectrogram is one of the spectral features used in the thesis tests.

2.3.4 Mel Scaled Log Filter Banks

Humans perceive changes in sound more easily at low frequencies. Our hearing is almost linear up to about 1000 Hz; for higher frequencies it becomes more logarithmic [26], which is why the log of the spectrogram is taken. The Mel-Scale imitates this behavior even further by introducing mel-spaced filters which discriminate against the higher frequencies in the spectrum [59]. Conversion to the Mel Scale is performed by calculating the dot product between the power spectral density and a number of triangular filters. The filters are spaced across the mel-scale counterpart of the frequencies. This spectral feature is known as the MSFB. The motivation for the mel-scale stems from the ASR field, which tried to model the cochlea of the inner ear [distant].

The Mel Scale can be described as such:


Figure 2.12: Four figures showing the process from framing the signal, finding the STFT, calculating the power spectrum and creating the logarithmic spectrogram

Definition 1 (Melody (mel)). Mel, an abbreviation of the word melody, is a unit of pitch. It is defined to be equal to one thousandth of the pitch (Φ) of a simple tone with a frequency of 1000 Hz and an amplitude of 40 dB above the auditory threshold [26].

When applying MSFB in ASR, it is common to create between 26 and 60 filter banks [26, p. 170]. However, the number of filter banks can also be defined by the number of Fourier bins. In that case, it is common to convert 1/2 or 1/4 of the bins to the mel space. For 512 Fourier bins, this means either 256 or 128 mel spaced filters. Equation 2.17 converts frequency to the Mel Scale, which is shown in figure 2.13. Equation 2.18 converts back to Hz again. Figure 2.13 plots how the changes in lower frequencies are much more distinct than for higher frequencies.

$$ M(f) = 1125\ln\left(1 + \frac{f}{700}\right) \tag{2.17} $$

$$ M^{-1}(m) = 700\left(e^{m/1125} - 1\right) \tag{2.18} $$

The filters are triangular and each filter starts from the center of its left adjacent filter. The filter then increases linearly to 1 before decreasing linearly to 0 again at the center of its right adjacent filter, as shown in figure 2.14. The triangular filters can be modelled as in equation 2.19 [37, p. 141]. Once the filters are created, the MSFB is calculated by taking the log of the dot product between the filters and the spectrogram, as seen in equation 2.20.
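The conversion and the triangular filters can be sketched as follows. The choice of 26 filters, a 512-point DFT, a 16 kHz sampling rate and the rounding of filter centers to DFT bins are assumptions made for illustration, as is the small constant that avoids taking the log of zero.

```python
import numpy as np


def hz_to_mel(f):
    """Equation 2.17."""
    return 1125.0 * np.log(1.0 + f / 700.0)


def mel_to_hz(m):
    """Equation 2.18, the inverse of equation 2.17."""
    return 700.0 * (np.exp(m / 1125.0) - 1.0)


def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    # Assumed mapping from filter edge frequencies to DFT bin indices.
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising slope
            fbank[m - 1, k] = (k - left) / (center - left)
        fbank[m - 1, center] = 1.0               # peak of the triangle
        for k in range(center + 1, right + 1):   # falling slope
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def log_filterbank_energies(power_spectrogram, fbank, eps=1e-10):
    """Log of the dot product between the spectrogram and the filters."""
    return 20.0 * np.log10(power_spectrogram @ fbank.T + eps)


fbank = mel_filterbank()
print(fbank.shape, hz_to_mel(1000.0))
```

Each row of `fbank` is one triangular filter, so multiplying a spectrogram of shape (frames, 257) with its transpose gives one energy per filter and frame.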


Figure 2.13: Mel Scale

Figure 2.14: 10 filter banks

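$$ F_m(k) = \begin{cases} 0, & k < f(m-1) \\ \frac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k < f(m) \\ 1, & k = f(m) \\ \frac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \tag{2.19} $$

$$ \mathrm{MSFB} = 20\log_{10}\left(S(n,k) \cdot F_m(k)\right) \tag{2.20} $$

2.3.5 Mel Frequency Cepstral Coefficients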

MFCC is frequently used for ASR, and it is created by de-correlating the filter banks with the type 2 Discrete Cosine Transform (DCT) [36]. It was introduced by Davis and Mermelstein in the 1980s [18]. Its motivation stems from trying to retain significant phonetic information from an utterance which could have syntactic and duration variations. MFCC extracts features which are more independent of such variations [26]. The MSFBs are highly correlated since the triangular filters overlap with each other. The de-correlation of the filters is performed by applying the DCT, which is defined as in equation 2.21:

$$ S_{MFCC_i} = \sum_{k=1}^{M} X_k \cos\left[i\left(k - \frac{1}{2}\right)\frac{\pi}{M}\right], \quad i = 1, 2, \dots, M \tag{2.21} $$

Where $X_k$ is the kth output of the MSFB filters. The DCT projects the filter bank outputs onto an orthogonal basis. Applying this to the filter banks yields the Mel Cepstral Coefficients. There is one coefficient for each filter bank, though usually about half of them are discarded since the higher numbered coefficients represent fast changes in the filter bank energies. MFCC is also considered to be less robust towards additive noise [12] [63]. So, while the MFCC might be suitable for speech, which has strong temporal differences, it might be less consistent with signals like car engines or air conditioners. Some measures can be taken to combat this issue, such as applying a cepstral lifter. The lifter increases the magnitude of the high frequency coefficients and is given by equation 2.22:

$$ MFCC_{lf} = \left(1 + \frac{L}{2}\sin\left(\frac{\pi n_c}{L}\right)\right) MFCC \tag{2.22} $$

Where $L$ is the number of liftering coefficients and $n_c = [0, 1, \dots, c]$, where c is the number of MFCCs. Figure 2.15a shows the filter banks for an audio signal of street music while figure 2.15b shows its corresponding cepstral coefficients.

(a) MSFB of a signal with street music; events are more correlated and easier to detect

(b) The MFCC of the same signal which now is de-correlated.
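Equations 2.21 and 2.22 can be written out directly as a matrix product. The sketch below assumes that the log filter bank energies are stored with one frame per row. Keeping 13 coefficients and using L = 22 are assumed, commonly seen values and not necessarily those of the thesis. Library DCT routines may differ from equation 2.21 in scaling and in including a zeroth coefficient.

```python
import numpy as np


def mfcc_from_filterbanks(log_fbank, n_keep=13):
    """Equation 2.21 applied to each frame of log filter bank energies."""
    n_filters = log_fbank.shape[1]                 # M in equation 2.21
    i = np.arange(1, n_filters + 1)[:, None]       # coefficient index
    k = np.arange(1, n_filters + 1)[None, :]       # filter index
    basis = np.cos(i * (k - 0.5) * np.pi / n_filters)
    coefficients = log_fbank @ basis.T
    # Only the lower coefficients are kept; the rest are discarded.
    return coefficients[:, :n_keep]


def lifter(mfcc, n_lifter=22):
    """Equation 2.22: sinusoidal liftering of the cepstral coefficients."""
    n_c = np.arange(mfcc.shape[1])
    return (1.0 + (n_lifter / 2.0) * np.sin(np.pi * n_c / n_lifter)) * mfcc


rng = np.random.default_rng(0)
log_fbank = rng.normal(size=(5, 26))   # placeholder data: 5 frames, 26 filters
mfcc = lifter(mfcc_from_filterbanks(log_fbank))
print(mfcc.shape)
```

Since the basis in equation 2.21 starts at i = 1, a frame with identical energy in every filter yields coefficients that are all zero, which reflects that the coefficients describe the shape of the spectrum and not its overall level.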

2.3.6 Spectral Dynamics

The dynamics of the spectral information are used to model the local dynamic change of sound. [21], [26] and [37] recommend dynamic information as useful for speaker verification models (ASR), since it has been shown to be more resilient towards additive noise. [37] also mentions that dynamic features are immune to short-time convolutional distortion, which is a constant offset in the Log Spectral and Cepstral domain. These dynamics are often referred to as Delta and Delta-Delta, and they represent the first and second order differences of the cepstral coefficients. In general, it is common to use as many Delta or Delta-Delta coefficients as cepstral coefficients. However, [26] recommends using a smaller number of Delta coefficients and an even smaller number of Delta-Delta coefficients, the most important reason being that they add parameters and run-time for too little gain in performance [26, p. 175]. They are also mostly applied to mel-based features. As mentioned in chapter 1, an experiment will be held to verify their effect with the Wavelet Transform as well. The Delta features are included in the training process by stacking the dynamic matrix onto the spectrogram matrix, which means that the delta features must have the same length as the spectrogram.
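The text does not give a formula for the Delta features. The sketch below assumes the commonly used regression formula over N neighbouring frames, with the edges padded by repeating the first and last frame, and stacks the result as a second channel. Both the formula and the stacking axis are assumptions and may differ from the thesis implementation.

```python
import numpy as np


def delta(features, n_neighbours=2):
    """First order dynamics of a (frames, coefficients) matrix.

    Assumed formula: d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2).
    """
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    padded = np.pad(features, ((n_neighbours, n_neighbours), (0, 0)), mode="edge")
    denominator = 2 * sum(n * n for n in range(1, n_neighbours + 1))
    result = np.zeros_like(features)
    for n in range(1, n_neighbours + 1):
        ahead = padded[n_neighbours + n:n_neighbours + n + n_frames]
        behind = padded[n_neighbours - n:n_neighbours - n + n_frames]
        result += n * (ahead - behind)
    return result / denominator


def stack_with_delta(features):
    """Stack the features and their delta as two channels of equal length."""
    return np.stack([features, delta(features)], axis=-1)


ramp = np.arange(10.0)[:, None]        # a feature rising by 1 per frame
print(delta(ramp).ravel())
print(stack_with_delta(ramp).shape)
```

For a feature that rises by one unit per frame, the delta is 1 everywhere except near the padded edges, which matches the interpretation as a local slope.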

2.4 Wavelet Transform and Scalograms

Scalograms are defined as the absolute value of the output matrix of the Wavelet Transform over different scales. Its two dimensions (i.e. axes) are time and scale, where those of the spectrogram were time and frequency. The scales relate directly to the frequencies of the signal, so it is possible to look at the scalogram as another representation of frequency and time as well.

Figure 2.16 shows the output of a Scalogram and a Log Spectrogram for the audio of two successive gun shots. The Wavelet Transform is made to analyze time-frequency resolution more precisely [16]. It is accomplished by analyzing the signal at different scales. An analyzing function, often called the mother wavelet, is slid across the signal with a given scale. The chosen scales extend the wavelet or narrow it, which alters the wavelet's frequency. As mentioned in the section about STFT, the STFT gives a fixed resolution in time, but the resolution can be somewhat lacking. The wavelet changes its time-frequency resolution as the scale changes. An illustration of the different resolutions is given in figure 2.17. The Wavelet Transform, contrary to the Fourier transform, is localized in time, whereas the Fourier transform is a sum of sine or cosine waves that stretch from $-\infty$ to $+\infty$. To transform the signal, the mother wavelet is convolved with the time signal.

Figure 2.16: Scalogram for a car engine; a clear visual line reveals the low frequency notes of the engine

Figure 2.17: An intuitive understanding of the resolution differences between the Wavelet transform and the STFT [61]

The wavelet is translated through the signal in time. When it is finished, it is translated again but at a different scale. Larger scales stretch the wavelet, which lowers its frequency and lengthens its extent in time. If the extent in time is too large, however, it diminishes the knowledge about where certain frequencies exist in the signal. Decreasing the scale compresses the wavelet, which raises its frequency and improves the time resolution at the cost of frequency resolution. The result is that small scales catch more rapid details while larger scales catch longer and coarser details [11]. This trade-off between time and frequency resolution can be seen as the same trade-off that defines the uncertainty principle [17]. Analyzing the signal at different scales is known as multi-resolution or multi-scale analysis [61].

2.4.1 The Wavelet Transform

There are two different transforms which have different properties. They are the discrete and continuous Wavelet Transform. Mathematically the Continuous Wavelet Transform can be described by equation 2.23:

$$ X_w(a,b) = \frac{1}{|a|^{1/2}} \int_{-\infty}^{\infty} x(t)\,\psi\left(\frac{t-b}{a}\right) dt \tag{2.23} $$


Where $a$ is a scaling factor and $b$ is the translation. For the continuous transform, both $b$ and $a$ are continuous, which means there can be an infinite number of wavelets. The discrete transform has discrete values for the scaling $a$ and translation $b$. The time series signal $x(t)$ is convolved with an analyzing function (mother wavelet) $\psi$ [61].

2.4.2 Morlet Wavelet

The mother wavelet is the analyzing function ψ that is convolved with the time signal over time and scale. There are several variations, but it has two mathematical requirements: the wavelet must have finite energy, which localizes it in time and frequency, and it needs zero mean in the time domain. This is to ensure that it is integrable so the signal can be reconstructed from the transformed signal with the inverse Wavelet transform. The wavelet applied in the experiments is called the Morlet wavelet. It is a wavelet mostly utilized for Continuous Wavelet Transforms, and it is defined as the product of a complex sine wave and a Gaussian window [10] in equation 2.24. Figure 2.18 shows the plot of the Morlet wavelet.

$$ \psi(t) = e^{2i\pi ft}\,e^{-\frac{t^2}{2\sigma^2}} \tag{2.24} $$

Figure 2.18: The Morlet wavelet
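Equations 2.23 and 2.24 can be combined into a naive scalogram computation. This is a sketch and not an optimized or library implementation: the integral is approximated by a discrete sum, the wavelet is truncated at four standard deviations, and the scales are expressed in seconds so that the analyzed frequency is f/a. `np.correlate` conjugates the wavelet, which is the usual convention and does not change the magnitude for real signals.

```python
import numpy as np


def morlet(t, f=1.0, sigma=1.0):
    """Equation 2.24: a complex sine wave multiplied by a Gaussian window."""
    return np.exp(2j * np.pi * f * t) * np.exp(-t ** 2 / (2.0 * sigma ** 2))


def cwt_scalogram(signal, scales, dt, sigma=1.0):
    """Magnitude of a discretized version of equation 2.23 for each scale."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for a in scales:
        # Assumed truncation: sample the wavelet over +-4 sigma of its envelope.
        half = int(np.ceil(4.0 * sigma * a / dt))
        tau = np.arange(-half, half + 1) * dt
        wavelet = morlet(tau / a, sigma=sigma) / np.sqrt(abs(a))
        # Sliding the wavelet along the signal gives one coefficient per shift b.
        coefficients = np.correlate(signal, wavelet, mode="same") * dt
        rows.append(np.abs(coefficients))
    return np.array(rows)


sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 50 * t)              # 50 Hz test tone
scales = [0.01, 0.02, 0.04]                    # correspond to 100, 50 and 25 Hz
scalogram = cwt_scalogram(tone, scales, dt=1.0 / sample_rate)
print(scalogram.shape, int(np.argmax(scalogram[:, 500])))
```

The strongest response in the middle of the signal is found at the scale whose wavelet oscillates at the frequency of the tone, which is the behaviour the scalogram relies on.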

2.4.3 Scaling the wavelet

Scaling the wavelet with different scales means stretching or compressing the wavelet in time. It can be expressed mathematically with the function $\psi(\frac{t}{a})$, with $a > 0$ being the scaling factor. $\psi()$ is the wavelet function as used in equation 2.23. The scaling factor is inversely proportional to the frequency. That means scaling a wavelet by two will reduce its frequency by half, also known as an octave [64]. Figure 2.19 illustrates how the frequency changes with the change of scale.

Figure 2.19: Variations of the Morlet wave through the scales: 0.5, 1, 1.5

The relationship between scale and frequency can be described with a constant of proportionality known as the center frequency. The center frequency is defined as the arithmetic or geometric mean of the lower and upper cut-off frequencies of a band-pass or band-stop system. This property can be examined with the used scales to find the pseudo-frequencies that are analyzed in the signal. The relation between scale and frequency is seen as follows [64]:

$$ F_{eq} = \frac{C_f}{a\,\delta t} \tag{2.25} $$

Where F_eq is the related frequency, C_f is the central frequency of the wavelet, a is the scale and δt is the sampling interval. Table 2.1 shows the relationship between scaling and frequency with a constant proportionality, the center frequency.

Wavelet scale      2         4         8         16
Equivalent freq    F_eq/2    F_eq/4    F_eq/8    F_eq/16

Table 2.1: Architecture of N-CNN
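Equation 2.25 and the relationship in the table can be checked with a few lines of code. The center frequency of 0.8125 and the 16 kHz sampling rate are placeholder values assumed for illustration. The actual center frequency depends on the chosen wavelet.

```python
def scale_to_frequency(scale, center_frequency, sampling_interval):
    """Equation 2.25: pseudo-frequency analyzed at a given wavelet scale."""
    return center_frequency / (scale * sampling_interval)


center_frequency = 0.8125           # assumed value, depends on the wavelet
sampling_interval = 1.0 / 16000     # assumed 16 kHz sampling rate

for scale in (2, 4, 8, 16):
    frequency = scale_to_frequency(scale, center_frequency, sampling_interval)
    print(scale, round(frequency, 2))
```

Each doubling of the scale halves the pseudo-frequency, so the four scales in the table are one octave apart.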

2.4.4 Vanishing Moments

Each wavelet family has different properties, and one should choose the wavelet based on what type of data is going to be analyzed. Most importantly, the different families have different numbers of vanishing moments.
