• No results found

BACHELOR THESIS

N/A
N/A
Protected

Academic year: 2022

Share "BACHELOR THESIS"

Copied!
84
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

FACULTY OF SCIENCE AND TECHNOLOGY

BACHELOR THESIS

Study programme / specialisation:

Datateknologi The spring semester, 2022 Open

Author: Linas Vidziunas

& Ørjan Kløvfjell Thoresen ………

(signature author) Course coordinator: Álvaro Fernández Quílez

Supervisor(s): Álvaro Fernández Quílez, Ketil Oppedal

Thesis title:

Unsupervised lesion detection with multi-view MRI and autoencoders

Credits (ECTS): 20 Keywords:

Autoencoder, Unsupervised

learning, Multi-view MRI, Transfer learning, Prostate cancer, Anomaly detection

Pages: 47 + appendix: 28

Stavanger, 14 May/2022 date/year

(2)

Abstract

Prostate Cancer (PCa) is one of the most common health threats in the developed world and the most diagnosed cancer among men in Norway. Detecting a lesion is a challenging task, and early diagnosis and treatment can be crucial for the patient.

Computer-assisted systems using Deep Learning (DL) to analyze Magnetic Reso- nance Imaging (MRI) data could help improve diagnostic accuracy if used together with current practices. However, the DL systems require large amounts of annotated data which are rarely available. In addition, the presence of factors such as bias in the form of a higher prevalence of one class in classification tasks can harm the final results of the developed system.

This thesis explores methods for performing anomaly detection on MRI image data of the prostate, where lesions are the anomalies. We tackle the lack of annotated data by using unsupervised learning and a type of neural network called Autoencoders (AEs). Specifically, Convolutional Autoencoders (CAEs), Variational Autoencoders (VAEs), and multi-view CAEs are compared.

We evaluate the methods by means of Area Under the Curve (AUC) to measure the ability to separate the images with a lesion from those that do not. The results of this thesis show that using a multi-view CAE offers advantages over the single-view based CAE and VAE in terms of ROC-AUC and that a fully unsupervised approach holds potential for lesion detection in prostate MRI.

(3)

Acknowledgements

This thesis was conducted during the spring semester of 2022. It marked the end of our bachelor’s degree at the University of Stavanger. We must admit that it has been challenging but also educational and fun.

We wish to thank all the staff and co-students at the University of Stavanger for making it such a great place to be a student during the last three last years and throughout the pandemic.

We owe our supervisers, Álvaro Fernández Quílez and Ketil Oppedal, a special thanks. In particular, we want to express our gratitude to Álvaro for the countless meetings and email conversations, as well as his continuous support and guidance.

(4)

Contents

Abstract i

Acknowledgments ii

Abbreviations viii

1 Introduction 1

1.1 Motivation . . . 1

1.2 Problem Definition . . . 3

1.2.1 Objectives . . . 3

1.2.2 Proposed Method Overview . . . 3

1.3 Related Work . . . 4

2 Technical Background 5 2.1 Artificial Neural Networks . . . 5

2.2 Convolutional Neural Networks . . . 6

2.2.1 Convolutional Layer . . . 6

(5)

CONTENTS

2.2.2 Pooling Layers and Transposed Convolutional Layer . . . 7

2.3 Autoencoder . . . 8

2.3.1 Variational Autoencoder . . . 9

2.4 Training . . . 10

2.4.1 Loss function . . . 10

3 Materials & Methods 12 3.1 Introduction . . . 12

3.2 Software . . . 14

3.3 Dataset and Data Pre-processing . . . 15

3.3.1 Data Pre-processing . . . 16

3.3.2 Multi-view Pre-processing . . . 18

3.4 Model Design . . . 20

3.4.1 Convolutional Autoencoder . . . 20

3.4.2 Variational Autoencoder . . . 21

3.4.3 Multi-view Autoencoder . . . 22

3.5 Evaluation . . . 23

3.5.1 Area Under the Curve based on Reconstruction Loss . . . 23

3.5.2 Transfer Learning and Fine-tuning . . . 24

3.5.3 Classification and Threshold Selection . . . 25

3.5.4 Bootstrapping with Replacement . . . 27

(6)

CONTENTS

3.6 Learning Rate and Batch Size Selection . . . 28

3.7 Tuning . . . 28

3.7.1 Convolutional Autoencoder . . . 29

3.7.2 Variational Autoencoder . . . 29

3.7.3 Selecting Epochs . . . 30

4 Results 31 4.1 Using Reconstruction Loss . . . 32

4.2 Using Transfer Learning Approach . . . 34

4.3 Sensitivity Analysis of the Multi-view Autoencoders . . . 35

5 Discussion 37 5.1 Effectiveness of Methodology . . . 37

5.2 Future Work . . . 38

5.3 Evaluation of Deep Learning Models . . . 39

5.4 Comparison to Related Work . . . 40

6 Conclusion 41

Bibliography 41

Appendix 48

A Setting - Using pre-trained weights 48

(7)

CONTENTS

B Anatomical planes 51

C Coding and Computational Resources 52

C.1 Python code . . . 53

D Model Description 54

D.1 VGG16 inspired . . . 55 D.2 U-Net . . . 59

E Learning Rate 64

E.1 U-Net . . . 65 E.2 VGG16 . . . 66

F Batch Size 67

F.1 U-Net . . . 68 F.2 VGG16 . . . 69

G Convolutional Autoencoder 70

G.1 Graphical illustration of the effect of latent constraint . . . 70 G.2 Tuning U-Net . . . 72 G.3 Tuning VGG16 AE . . . 73

H Varitaional Autoencoder 74

H.1 Tuning VGG16 AE . . . 74

(8)

Abbreviations

AE Autoencoder

VAE Variational Autoencoder

CNN Convolutional Neural Network ANN Artificial Neural Network MRI Magnetic Resonance Imaging AUC Area Under the Curve

DL Deep Learning

CAE Convolutional Autoencoder MSE Mean Squared Error

KL Kullback-Leibler

DICOM Digital Imaging and Communications in Medicine GPU Graphics Processing Unit

UiS University of Stavanger

TL Transfer Learning

(9)

CONTENTS

IQR Inter Quartile Range PCa Prostate Cancer 2D 2 Dimensional

PSA Prostate Specific Antigen

ROC Receiver Operating Characteristic TPR True Positive Rate

FPR False Positive Rate

API Application Programming Interface MR Magnetic Resonance

T2w T2-weighted

PD-w Proton Density-weighted TP True Positives

TN True Negatives

FP False Positives

FN False Negatives

3D 3 Dimensional

(10)

Chapter 1

Introduction

1.1 Motivation

Prostate Cancer (PCa) is the second most diagnosed cancer among men and the fifth leading cause of death worldwide [1]. In Norway, PCa is the most diagnosed cancer among men [2]. During the 1990s, there was a massive upswing in PCa diagnosis in Norway. During this time, screening and testing were introduced to the primary health system using Prostate Specific Antigen (PSA), followed by biopsies. The increase in testing is considered the most plausible explanation for the increase in the diagnosis of PCa [2]. Prostate cancer may be asymptomatic in its early stages, and therefore regular screening is crucial to give early diagnosis and treatment. The use of PSA remains controversial as a screening technique due to over-diagnosis of indolent tumors, potentially unnecessary biopsies, and treatment [3].

After undergoing a PSA test, Magnetic Resonance Imaging (MRI) can be used to support the conclusion reached by the PSA test [4]. Furthermore, it can be used as a standalone technique. On the other hand, an experienced clinician can usually discern between normal and abnormal cases after seeing a few cases and by mere comparison. Nevertheless, this approach has several problems; it suffers from intra- reader variability and can be a time expensive process.

Machine learning has in recent years been leveraged by the healthcare industry and has the potential to transform many aspects of the medical landscape [5]. Deep Learning (DL) holds a promise in automating MRI analysis, though traditional DL methods rely on labeled data. Labeled data means that the data has annotations,

(11)

1.1 Motivation

less labeled data exists in the public domain, and acquiring more is time-intensive and expensive from an economic point of view [6], [7].

Autoencoders (AEs) leverage neural networks to learn efficient encodings of unla- beled data (unsupervised learning). Autoencoders have previously been shown to be successful in data compression and outlier detection [8], [9]. To work around the lack of labeled data, this thesis explores the use of AE architectures to detect when a lesion is present in an MRI image of the prostate. There is a higher prevalence of cases that are non-PCa when compared to the number of cases with PCa, and unsupervised learning helps exploit that bias in a positive way. A multi-view Con- volutional Autoencoder (CAE) such as the one proposed in this thesis combines the MRI images of all views, potentially leading to more efficient use of the available data.

(12)

1.2 Problem Definition

1.2 Problem Definition

The primary goal is to explore the use of multi-view CAEs in the task of anomaly detection, where lesions in the prostate are the anomalies. The models will train and predict on 2 Dimensional (2D) MRI image data. In essence, the task is abinary classificationproblem because the data consists of two classes; PCa and non-PCa. In order to offer a fair comparison between the multi-view CAEs and their single-view counterparts, the model architecture and the hyperparameter optimization process will be similar. The list of objectives is listed below.

1.2.1 Objectives

• Build and train CAEs and Variational Autoencoders (VAEs) and quantify the overall performance of their ability to detect when a lesion is present using unsupervised learning.

• Build and train multi-view CAEs based on the same architecture as the single- view counterparts to compare its performance.

1.2.2 Proposed Method Overview

First, the data (MRI) is loaded, and the images are pre-processed into different datasets and given labels indicating whether the image contains a lesion. The pro- posed CAE and VAE models are trained exclusively on healthy (non-PCa) data and fine-tuned on a dataset containing both healthy and unhealthy data (validation set).

The models are evaluated on their overall performance in detecting when a lesion is present in a given MRI image of the prostate on the test set.

Figure 1.1: An illustration of the proposed methodology in this thesis

(13)

1.3 Related Work

1.3 Related Work

As previously introduced, this thesis focuses primarily on using an unsupervised learning paradigm approach and AEs to detect lesions in MRI image data containing both healthy and unhealthy data, also referred to as anomaly detection. In [9] AEs are used for anomaly detection using normal training data only and demonstrates its ability to successfully detect abnormalities for a range of various objects.

Convolutional autoencoders and VAEs are used in this work, the VAE was first introduced in this paper [10] by Diederik P. Kingma and Max Welling. Our imple- mentation draws a lot of inspiration from the Keras AE code examples [11], [12].

The inspiration to build a multi-view AE structure comes from [13] and [14]. [13]

uses deep convolutional networks combining all views of the image data in one model, more specifically a U-Net architecture to do segmentation of the prostate. [14] com- bines MRI image data of all views to classify PCa and demonstrates better classi- fication results using the multi-view than single-view for several architectures. Our work also draws inspiration from this master’s thesis [15].

The AEs used in this work are inspired by U-Net [16] and VGG16 [17]. Although not used for the same purposes, the models are designed for tasks that indicate that they should work well for the tasks presented in this thesis. U-Net is a type of AE that has obtained good results in medical imaging tasks [16].

(14)

Chapter 2

Technical Background

2.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) are computational processing systems based on the processes of the brain, i.e., biological nervous systems [18]. A typical structure has multiple inputs and outputs, and in between, there are hidden layers. The layers in between are called hidden because the data does not show the desired output for these [19]. The hidden layers are typically vector-valued; each unit resembles a neuron because its value depends on the value of other units in the networks and the connections between them [19]. The structure of these connections can come in many different forms, but all of them contain something called weights. The weights between the neurons are what is trained when training the network [20].

x0

x1 x2

x3 Input

layer

h(1)1

h(1)2 h(1)3

h(1)4 Hidden

layer 1

h(2)1

h(2)2 h(2)3 Hidden layer 2

ˆ y1

ˆ y2 Output

layer

Figure 2.1: Illustration of a simple ANN [21].

(15)

2.2 Convolutional Neural Networks

something called fully connected layers, also referred to as dense layers. Each neuron in a fully connected layer has connections to all the neurons in the previous and the following layer [18].

2.2 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are ANNs that use a linear operation called convolution [22]. Convolutional neural networks are often used when dealing with images, one of the reasons being that they can drastically reduce the number of parameters necessary to train the network well [23].

In CNNs, feature maps are generated by applying filters to the input image or the prior layer. Feature map visualization gives insights into where certain features are likely to appear [24].

2.2.1 Convolutional Layer

The following explanation of the convolutional function in the context of CNN is inspired by [22].

Convolutional neural networks are commonly used with image data; the operation is then 2D convolutional. An example of a formula for 2D convolution is given in Equation 2.1.

Y(i, j) =X

n

X

k

X(n, k)K(i−n, j−k) (2.1)

Using Equation 2.1 in the context of an image undergoing 2D convolution, the Y represents the output and is sometimes referred to as the feature map, andX rep- resents the input layer. For the output i and j specifies the specific neuron and n and k specifies the position of the pixel for the input layer. For this to result in an overall reduction of parameters, the Kernal function (K) needs to be 0 for all negative arguments.

For one input image, multiple functions for K can be trained, resulting in multiple channels (feature maps). When speaking about the number of filters of a convo- lutional layer in a neural network, this is the instances for function K. Kernel size

(16)

2.2 Convolutional Neural Networks

refers to the size of the function K, in Figure 2.2, a kernel size of 3x3 is used.

0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0

I

1 0 1 0 1 0 1 0 1

K

=

1 4 3 4 1 1 2 4 3 3 1 2 3 4 1 1 3 3 1 1 3 3 1 1 0

I∗K 1 0 1

0 1 0 1 0 1

×1 ×0 ×1

×0 ×1 ×0

×1 ×0 ×1

Figure 2.2: Illustration of 2D convolution [25].

2.2.2 Pooling Layers and Transposed Convolutional Layer

Pooling layers are typically used with convolutional layers and are a nonlinear way to downsample [26]. Two popular pooling layers are max pooling and average pooling.

A max-pooling layer reduces the input dimension by keeping the maximum value, whereas an average pooling calculates the average value [26]. This is done by applying a filter and a stride. The stride defines how that filter moves with every iteration.

Using a filter size of 2x2 will reduce the dimension by a factor of 2.

Figure 2.3: Illustration of max pooling with stride=2 and a 2x2 filter [27].

On the other hand,transposed convolutional layersare typically used for upsampling the data. Meaning the output size of the layer is greater than that of the input. This is done by adding zeros between the rows and columns of the input. The input can also be enlarged by adding a padding around the input, often containing zeros.

After that, the kernel slides across the picture and generates an output based on the

(17)

2.3 Autoencoder

2.3 Autoencoder

An AE is a type of neural network, and it can be trained to try to reconstruct its input to its output [29]. Autoencoders consist of two parts, the encoder, and the decoder [29]. The encoder takes an input and typically carries out dimensional reduction using hidden layers. The decoder is trained to then recreate the original input, given the lower-dimensional representation as input. By first encoding the data, the model prioritizes which features of the input should be kept (or which features of the input carry the most essential information), this often results in the network learning useful properties of the data [29]. Convolutional autoencoder is an autoencoder that makes use of a CNN as part of the architecture. Autoencoders have many different use cases, traditionally it has been used for dimension reduction and feature learning [29]. Although not the most effective method of data compression, some use cases for autoencoders include image denoising or generative tasks such as creating faces never seen before. In this work, we explore autoencoders for anomaly detection.

Figure 2.4: Typical structure of an autoencoder [30].

(18)

2.3 Autoencoder

2.3.1 Variational Autoencoder

Variational autoencoders are also ANNs. Whereas the regular autoencoder is a deterministic model, the variational autoencoder is a probabilistic model. When the encoder with functionq(z|x)is given an inputx, the encoder produces a distribution for z described by the function p(z). From the distribution p(z), a specific value z is sampled, and the decoder with function p(x’|z)attempts to reconstruct the input givenz as an input [31].

Figure 2.5: Illustration of a typical VAE model architecture [32].

Through training, the model is not only trained to reconstruct the input but also trained to keep the distance loss between the two distributionsq(z|x) andp(x|z) as small as possible. This can be done by using Kullback–Leibler divergence [31] and something known as the reparameterization trick [10]. This results in that different features end up in different places in what is often referred to as the latent space.

By simply giving the trained decoder values forz it can be used for generative tasks.

(19)

2.4 Training

2.4 Training

When training a deep neural network to a specific task, the goal is to adjust the parameters in the network to make it perform well on the training data but with the idea of learning a structure that generalizes to images beyond that training data [20]. Training a neural network comes down to finding a minimum of a cost function [20]. The output of the cost function indicates how well the network is performing [33]. A learning algorithm attempts to optimize the parameters such that the scalar produced by the cost function is as small as possible [33] [20].

Forward propagation is when a feedforward neural network is given an input, and the information flows forward in the network to produce an output [34]. Back- propagation does this in reverse order, the information flows from the cost backward throughout the network to produce a gradient of the cost function [35] [36]. The gradient of a function tells the direction of the steepest descent, by taking steps in that direction, the output of the cost function will become smaller [20]. Since the starting position when training is often random, there is no guarantee that the local minimum that the network goes into is the global minimum [20].

Back-propagation is often misunderstood to be the whole learning algorithm, but it is in fact, only used to compute the gradient. Another algorithm, such as stochas- tic gradient descent, is performing the learning by using the gradients from back- propagation [35].

2.4.1 Loss function

Binary cross-entropy is a loss function that is used for binary application problems [37]. In a case where normalized (having a value between 0 and 1) pixel arrays are used to represent the input image, the loss function could work to minimize the difference between the input pixel value and the predicted pixel value by the autoencoder [38]. Binary cross-entropy is the particular case of cross-entropy with only two classes. Consider pi as the true label and qi as the predicted distribution of the predicted value. By using the notationsp∈ {y,1−y}and q∈ {ˆy,1−y}ˆ the equation can be written as shown in Equation 2.2 [39].

H(p, q) =−X

i

pilog(qi) =−ylog(ˆy)−(1−y)log(1−y)ˆ (2.2)

(20)

2.4 Training

Categorical cross-entropy uses cross-entropy as a loss function in a similar manner as the binary cross-entropy function [40]. The main difference is that the sum of the predicted labels is equal to 1, this can be thought of as using a softmax [41]

activation across multiple binary cross-entropy losses. When categorizing between multiple classes, the predictions can be interpreted as probabilities.

The Kullback-Leibler (KL) divergence gives a measure of how one probability distri- bution is different from another probability distribution. The equation in its discrete form with the distributions P and Q are defined on the same probability space χ, the equation can be written as in Equation 2.3 [42].

DKL(P∥Q) =X

x∈χ

P(x)log

P(x) Q(x)

(2.3)

When KL is implemented as part of a cost function for a variational autoencoder, as in this thesis, the probability distributions are not in the same probability space.

Hence, tricks to compute the KL measure are required. More on how it can be accomplished in [10], [43].

The Mean Squared Error (MSE) is a statistical method used to measure the difference between an observed value and a predicted value [44]. The MSE can be calculated by the Equation 2.4. In Equation 2.4yi is theith observed (or true value) value, yˆi is the corresponding predicted value andn is the number of observations.

M SE = 1 n

n

X

i=n

(yi−yˆi)2 (2.4)

(21)

Chapter 3

Materials & Methods

3.1 Introduction

Figure 3.1: An overview of the methodology.

(22)

3.1 Introduction

As mentioned in the first chapter, this thesis uses unsupervised learning and vari- ous AEs to perform anomaly detection on MRI image data, where lesions are the anomalies. The experimental part of the thesis investigates whether using a multi- view AE structure is a valid approach and if it offers any improvements compared to its single-view counterparts on the task of classifying the MRI image data (PCa or non-PCa).

First, the data is pre-processed, and the models are trained exclusively on healthy data only from the training set. The validation set containing both unhealthy and healthy data is used to fine-tune the hyperparameters of the models. This is done by considering MSE losses between the input images and the reconstructed images by the autoencoder as anomaly scores to calculate an Area Under the Curve (AUC) score (see section 3.5.1). The AUC score is used to measure the model’s overall ability to separate between healthy and unhealthy images.

The proposed CAE and VAE models are fine-tuned for each view, by testing different dimensions in the latent dimension. This is done because lower-dimension embed- dings can improve performance for AEs [8]. The multi-view CAE is built while keeping the common parts from the fined tuned models. In the final stage, the mod- els are used to give anomaly scores on the healthy and unhealthy images of the test set, which is not previously used for neither training nor tuning of hyperparameters.

We use two methods for anomaly detection: the first method is based on reconstruc- tion losses as outlined in section 3.5.1, and the second applies TL and training of a categorical layer, 3.5.2. For both methods, bootstrapping with replacement (see section 3.5.4) is used for one hundred repetitions to account for potential variabilities of the results.

In addition, a third evaluation method is carried out by using a specific threshold based on reconstruction losses from the validation set to classify the images of the test set. The results are documented with several metrics as specified in section 3.5.3.

When referring to the axial, sagittal, or coronal plane (view) throughout this thesis, we are referring to the anatomical planes (an illustration is shown in the appendix B).

(23)

3.2 Software

3.2 Software

The programming language Python has been used for this thesis. Python is a popular high-level programming language with a comprehensive library and a large commu- nity [45]. This section will explain our most used external Python libraries. All of the libraries are free and open-sourced. The specific versions of the used libraries can be found on our Github repository in the file namedrequirements.txt.1

Tensorflow [46] (TF) is a library for machine learning developed by Google and released in 2015. TF is supported for a wide range of other programming languages, such as JavaScript and C++ [47]. Specifically, we used TF 2 which is the predecessor to TF 1, released in 2019.

Keras is an Application Programming Interface (API) for neural networks written in Python. Keras focuses on being user-friendly by being modular and easily extensible [48]. Keras acts like a wrapper and does not by it self handle low-level computations thus it requires a backend, such as TF. As of 2017 Keras has been integrated into Tensorflow as module.

Numpy is used for numerical computations, and is actually the most used python library [45]. It supports a wide variety of mathematical functions and is used for a wide range of tasks, from generating random numbers to computations of multi- dimensional arrays, and much more [49].

Matplotlib is a library for visualizing data. In this thesis we use Matplotlib to plot and export MRI images and more. Matplotlib also supports animated and interactive visualization, and allows to easily change visual styles and layouts [50].

Scikit-learn is a library for predictive data analysis, and offers tools to easily and effectively perform classification, regression, data preprocessing, and more [51].

Pydicom is a library to read, modify and create data encoded in the Digital Imaging and Communications in Medicine standard [52].

1Link to the Github repository: https://github.com/LinasVidziunas/Unsupervised-lesio n-detection-with-multi-view-MRI-and-autoencoders

(24)

3.3 Dataset and Data Pre-processing

3.3 Dataset and Data Pre-processing

The prostate image dataset is obtained from the PROSTATEx Challenge [53]–[55]

(November 21, 2016, to February 16, 2017)2. The challenge is a collection of prostate Magnetic Resonance (MR) studies curated for research in computer-aided diagnosis of the prostate. The dataset contains T2-weighted (T2w), proton density-weighted (PD-w), dynamic contrast enhanced (DCE), and diffusion-weighted (DW) imaging, from which we only use T2w. The challenge also provides supplementary spread- sheets containing image and lesion information, such as the dimensions of the images and where the lesions are located. From the challenge, only the training dataset is used. The testing cohort is kept as an external validation set in case we consider submitting to the public results table. The training cohort consists of 204 patients with a total of 330 lesions [55] and contains a total of 12845 T2w slices (2D images) where 1109 of the total are classified as containing a lesion.

View No. without lesions No. with lesions Sum

Axial 4798 412 5210

Coronal 3108 339 3447

Sagittal 3830 358 4188

Sum 11736 1109 12845

Table 3.1: Distribution of images with and without lesions in the for the T2W.

x and y dim. Axial Coronal Sagittal Sum

256 21 21

320 296 3426 4169 7891

384 4742 4742

512 21 19 40

640 151 151

Sum 5210 3447 4188 12845

Table 3.2: Frequencies of image dimensions for each view.

Table 3.2 shows that the image dimensions slightly vary, but they have a clearly prevailing dimension for each view, 384x384 for axial and 320x320 for coronal and sagittal. The images are encoded using the Digital Imaging and Communications in Medicine (DICOM) standard. The DICOM standard is an international standard for medical images and their relative information and is used in almost every radiology, cardiology imaging, and radiotherapy device [56].

2Download link: https://www.cancerimagingarchive.net/nbia-search/?CollectionCrite ria=PROSTATEx

(25)

3.3 Dataset and Data Pre-processing

3.3.1 Data Pre-processing

This subsection explains how the dataset has been pre-processed and the techniques used. All of the following data pre-processing has been accomplished using Python.

Some of the patients in the PROSTATEx dataset contain multiple sets of T2w images, and this is because the radiologist has decided to retake the scans due, to, for example, the image being blurry because the patient moved during the scan. If a patient had multiple sets of T2w images in one direction (view), we only used one of them and discarded the rest. The image dimensions used to feed the networks are 384x384 for the axial view, and 320x320 for the coronal and sagittal view. No registration of images between the different has been performed.

Splitting Datasets

The original dataset has been split by patients into three different sets; train, vali- dation, and test, with a distribution of 70%/10%/20% respectively. As a result of splitting the dataset by patients, a patient and their images cannot exist in multiple datasets (train, validation, test), i.e., if a patient is part of the training set for axial, they are also a part of training for the coronal and sagittal sets.

View Train Validation Test Sum

Axial 2679 425 845 3939

Coronal 2143 347 690 3180

Sagittal 2489 399 709 3597

Table 3.3: Total images after limiting each patient to one set of images for each view.

View Validation Test

Axial 31 58

Coronal 33 58

Sagittal 33 60

Table 3.4: Number of images containing lesions of the total amount.

Using Python, every image has been checked for whether it has a lesion in the supplementary spreadsheet, and a tag has been added to the DICOM file containing a boolean value of either true or false, true for containing a lesion, false for not containing a lesion. Reading, modifying, and saving DICOM files was accomplished using the pydicom library [52]. Then, the modified DICOMs were saved on a new path, containing information about which set and view it belongs to, with a file name

(26)

3.3 Dataset and Data Pre-processing

that includes whether the DICOM contains a lesion, the patient ID, DICOM series number, and its slice position (in the z-axis).

Figure 3.2: An example of a used file path to an image.

Further Pre-processing

When loading the images into memory for use in an AE, empty lists are created for the pixel arrays for each dataset (train, validation, test) and for the true labels of the pixel arrays. For each DICOM file in our custom directory structure, the DICOM file is read, and information such as the pixel array and the previously added tag of whether the image contains a lesion or not, are extracted.3

The previously added tag is retrieved and appended to the list of true labels. The pixel arrays are further pre-processed via normalization and reshaping. Normaliza- tion ensures that all pixel intensities range from 0 to 1 (see Equation 3.1). The now normalized pixel arrays are reshaped to the most prevailing image dimension for their view, 384x384 for axial and 320x320 for coronal and sagittal. Reshaping is necessary because the proposed models contain convolutional layers which require a constant dimension for its input. Finally, the reshaped normalized pixel arrays are appended to the pixel array list for their corresponding dataset (train, validation, or test).

xnorm= x−xmin

xmax−xmin (3.1)

In Equation 3.1: x is the pixel intensity,xmin and xmax are, respectively, the mini- mum and maximum pixel intensities for the current image, andxnorm is the resulting normalized pixel [15].

3Link to the Github repository: https://github.com/LinasVidziunas/Unsupervised-lesio n-detection-with-multi-view-MRI-and-autoencoders

(27)

3.3 Dataset and Data Pre-processing

3.3.2 Multi-view Pre-processing

Our proposed implementation of a multi-view CAE has an extra set of requirements for the data than the single-view implementations due to containing multiple inputs:

it requires an equal amount of axial, coronal, and sagittal images to train. Such a condition comes from the fact that models need a constant batch size to be trained on (same amount of data coming from all the different sources). As a result of this requirement, some images had to be discarded. In particular, we discard the images that appear at the start and end of the patient sequence of each view. The rationale underlying this choice is that we want to keep the images containing the whole prostate (middle of the sequence) and potentially contain more information that can be used at training time.

We assume that the order of input images for each view matters and that feeding the same patient into all of the model’s inputs will result in better detection performance.

Furthermore, we want to retrieve the images for each view for each patient in the correct order. We hypothesize that feeding the model images in sequential order of the same patient in each input, will enforce the AEs to make connections between the different views, resulting in an improved detection. Additionally, the approach has a relatively low computational cost (as well as time-wise) since the order of the images is obtained directly from the data retrieved from the MR systems and, in this case, provided directly by the challenge.

We tackle all the previous issues by implementing a heuristic in Python which allows, for each view of each patient, to retrieve a slice/part of the list of the images of equal length to the number of images in the view containing the fewest.4 Additionally, the retrieved slice contains the elements in the middle of the patient sequence, as previously introduced. See Figure 3.3 for a visual representation of how the lists are modified.

4Link to the Github repository: https://github.com/LinasVidziunas/Unsupervised-lesio

(28)

3.3 Dataset and Data Pre-processing

Figure 3.3: Illustration of the returned lists from the algorithm for retrieving slices of lists equal to the length of the shortest list, where the rectangles represent an element in a list.

(29)

3.4 Model Design

Python code 3.1: Pseudocode of the algorithm for the axial view.

1 # Number of images in the view that contains the least

2 minimum = min(len(axial), len(coronal), len(sagittal))

3

4 a = len(axial)//2-(minimum//2)

5 b = len(axial)-(ceil(len(axial)/2)-ceil(minimum/2))

6

7 axial = axial[a:b]

Using this approach results in 2119 images for each view in the train dataset, 345 for each view in the validation dataset, and 682 for each view in the test dataset. A total of 87 images containing a lesion are present in the validation set, and 155 are present in the test set out of the total number of images present in those sets. For the numbers of lesions present in each dataset, keep in mind that an actual lesion in three dimensions can be calculated up to three times, once for each view.

View Train Validation Test

Normal Abnormal Normal Abnormal

Axial 2119 319 26 636 46

Coronal 2119 312 33 624 58

Sagittal 2119 317 28 631 51

Table 3.5: Distribution of normal and abnormal (without lesion and with lesion) images in the datasets for only the multi-view architecture.

3.4 Model Design

This thesis explores two different architectures; one based on the widely-used U-Net architecture and one based on VGG16. The VAE and multi-view CAE models are extensions of the proposed CAE, meaning they use the same encoder and decoder architecture. All the AE models have been trained to reconstruct their input, using Adam as the optimizer from the Keras library [48].

3.4.1 Convolutional Autoencoder

The original paper describing VGG architecture used it to perform classification and, therefore, contains a fully connected classification layer at the end [17]. Our proposed CAE model inspired by VGG16 follows the standard AE structure: an encoder composed of the VGG16 architecture and a decoder mirroring the encoder.

(30)

3.4 Model Design

The encoder is the same as VGG16 except without the fully connected layers and the last max-pooling layer, and the decoder is the mirrored image of the encoder with upsampling layers instead of max-pooling. Detailed figures of the encoder and decoder structures used for all models including an illustration of the implemented U-Net CAE can be found in the appendix D. Onwards in this thesis, when referring to VGG16, we are referring to our implementation of VGG16 as an AE.

64

128 256 512 512 512 512 256

128 64 1

Sigmoid

Figure 3.4: The implemented autoencoder inspired by VGG16. Yellow boxes represent convolutional layers, red boxes represent max-pooling layers, and magenta boxes represent upsampling layers. The right-most layer represents a convolutional layer with sigmoid as its activation function. The numbers under the layers represent the number of filters.

The used loss function for single-view CAEs is binary cross-entropy. An explanation of binary cross-entropy is available in subsection 2.4.1.

3.4.2 Variational Autoencoder

In [12], before the bottleneck, they flatten the output from the encoder and feed it into a fully connected layer. This fully connected layer will be referred to as thepre- bottleneck layer. Similarly, after the bottleneck, they have also used a fully connected layer; this layer will be referred to as the post-bottleneck layer. See the Figure 3.5 for a visual representation of how the layers are connected.

Our VAE implementations are inspired by [12] and use both previously introduced layers,pre-bottleneck, andpost-bottleneck. Otherwise, the encoder and decoder parts of the implemented VAEs are precisely the same as their fully convolutional coun-

(31)

3.4 Model Design

Figure 3.5: Illustration of the layers in the bottleneck for our implementation of VAE.

We have used the sametrain_step as described in [12]. Thetrain_step specifies the loss as a combination of binary cross-entropy and KL loss. Explanations of both loss functions are available in section 2.4.

3.4.3 Multi-view Autoencoder

Our proposed implementation of a multi-view CAE features an encoder and a decoder for each view, all connecting to a single shared bottleneck.

The explored way to interconnect all encoders and decoders into a single shared bot- tleneck used a concatenation layer. The concatenation layer combines the previous layers into one layer.

Figure 3.6: Proposed multi-view model architecture. The arrows illustrate a concatenation operation.

Multi-view CAEs have been trained to minimize the binary cross-entropy reconstruc- tion loss for each view simultaneously.

(32)

3.5 Evaluation

3.5 Evaluation

The three evaluation methods in the following subsections have been used to evaluate the AE models. The evaluation method’s in section 3.5.1 and 3.5.3 were explained and introduced to us by our supervisor in the first and second meeting [6], [7] and also draws inspiration from this master’s thesis [15].

3.5.1 Area Under the Curve based on Reconstruction Loss

The classification performance of an autoencoder is evaluated through AUC. Area under the curve is commonly used to measure a model’s overall performance and ability to detect anomalies [7]. AUC is the area under the Receiver operating char- acteristic curve (ROC curve).

The ROC curve is a graph that shows the model’s performance of binary classification at all classification thresholds [57]. The curve plots two parameters; True Positive Rate (TPR) and False Positive Rate (FPR) (See Equation 3.2 and Equation 3.3).

T P R= T P

T P +F N (3.2)

F P R= F P

F P +T N (3.3)

When calculating the AUC based on MSE losses in this thesis, the MSE losses between the input images and the reconstructed images by the autoencoder are considered a prediction of abnormality. This is done by normalizing the MSE losses such that the value range is [0,1]. To calculate the AUC and plot the ROC curve, the normalized losses and the true labels (indicating whether the image contains a lesion or not) are given as inputs to functions belonging to the Scikit library [51].

(33)

3.5 Evaluation

3.5.2 Transfer Learning and Fine-tuning

This detection method is greatly inspired by [58]. TL simply means to transfer some learned weights from one model to another.

Applying TL to help with anomaly detection is done in the following three stages.

1. An AE model is trained on the training set (which contains exclusively healthy data).

2. The encoder and the embedding of the trained model are reused with learned weights, and a new categorical layer (dense layer with softmax [59] as activation function) is put on top of that. The encoder and the embedding weights are then frozen, and the new layer is trained to categorize normal and abnormal slices on the validation set, training it to recognize abnormal slices as the vector [1,0] and normal slices as the vector [0,1].

3. The weights of the whole new model are unfrozen, and the model is fine-tuned (training on a low learning rate and few epochs) on the validation set one more time.

Figure 3.7

When given inputs (image data), the model predicts a vector containing the proba- bilities of belonging to a class (example: [0.65, 0.35] meaning that 65% probability of being healthy and 35% probability of being unhealthy). The values that the model predicts when given an input from the test set are then normalized such that the value range is [0,1]. The normalized values and the true labels (PCa or non-PCa) are used to plot a ROC curve and calculate AUC using functions belonging to Scikit

(34)

3.5 Evaluation

library [51]. In both steps 2 and 3, the models are trained with categorical cross- entropy. In step 2, we have used the same learning rate as for the base autoencoder.

In step 3, we have used a learning rate of 1e-5, the same as used in [58]. For the CAEs, the same number of epochs was used as in [58], 20 for step 2 and 10 for step 3. For the VAEs, the number of epochs used in step 3 was extended to 100 because it is a probabilistic model.

3.5.3 Classification and Threshold Selection

Finally, we also explore a threshold-based method to classify between healthy and unhealthy images. Thresholds have been set on reconstruction losses for images in the validation dataset by considering the MSE on the difference between the input images and the predicted images. When classifying the images in the test set, all images with an MSE greater than the threshold are classified as abnormal and all images with an MSE lesser than the threshold are classified as normal.

In the paper [60], the author stated several statistical methods for outlier detection.

However, in this thesis, Tukey’s Method has been used because the method does not assume an underlying Gaussian or normal distribution. Interquartile Range (IQR) is a measure of the data’s variability about the median. IQR is given by Equation 3.4. Q1 is the first quartile of the data, representing 25% of the data lies between minimum and Q1. Q3 is the third quartile and is where 75% of the data is lying between minimum and Q3.

IQR=Q3−Q1 (3.4)

A threshold for outlier detection could then be set by using Equation 3.5 or 3.6

T hreshold=Q1−(1.5∗IQR) (3.5) T hreshold=Q3 + (1.5∗IQR) (3.6) Because the threshold is set on the reconstruction losses, the threshold given by the Equation 3.6 has been used, as we expected the abnormal images to have a higher reconstruction error [6]. The thresholds were determined from the validation set and applied to the test set.

(35)

3.5 Evaluation

We used sensitivity, specificity, F1, and accuracy metrics [61] to evaluate the detec- tion for abnormal and normal slices. The formulas are described in terms of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) (see Equation 3.7, Equation 3.8, Equation 3.9, and Equation 3.10).

Sensitivity = T P

T P +F N (3.7)

Specif icity= T P

T P +F N (3.8)

F1 = 2T P

2T P +F P +F N (3.9)

Accuracy= T P +T N

T P +T N +F P +F N (3.10)

These metrics were used to evaluate the performance with the threshold set as given by Equation 3.6. The effect of the threshold selection is documented with the metrics from equations 3.7, 3.8, 3.9 and 3.10. For all of the models, the tests have been carried out for every view (axial, sagittal and coronal) independently.

(36)

3.5 Evaluation

3.5.4 Bootstrapping with Replacement

Bootstrapping with replacement is a statistical method that mimics the effects of running the evaluation multiple times [62], which can be further used to carry out statistical testing or obtain confidence intervals, giving a measure of uncertainty around the results.

It is done by sampling n number of data points from the original data to create a new dataset. The resampling process is typically done with several repetitions; each new resampled dataset is called a bootstrapping set. Bootstrappingwith replacement means that if a data point is sampled into one set, it can also reappear in future sets. For all our tests, we have used bootstrapping with replacement for one hundred repetitions, where each bootstrapping set has been stratified such that it contains the same ratio PCa to non-PCa. Other works such as [14] use a similar approach.

In our work, we make use of theresample function imported from the Scikit library to carry out this task.

Python code 3.2: Pseudocode of how therasamplefunction is called along with the passed arguments. Detailed explanation of this function and it’s parameters can be found at [63].

1 bootstrap_data, bootstrap_labels = resample(test_data,

2 test_labels, replace=True,

3 n_samples=len(test_data), stratify=test_labels)

(37)

3.6 Learning Rate and Batch Size Selection

3.6 Learning Rate and Batch Size Selection

We experimented with different learning rates and batch sizes using both proposed CAE models. We used AUC scores (plotted in a graph every 10th epoch) on the validation set to evaluate the effect of different hyperparameters. The tested learning rates were 1e-3, 1e-4, 1e-5, 1e-6 and tested batch sizes were of 4, 8, 16, 32 samples.

All the experiments were run for 1500 epochs. The graphs can be found in appendix F and E.

Batch size Learning rate 32 samples 1×10−4

Table 3.6: Chosen hyperparameters. These hyperparameters are obtained only using the validation set.

The hyperparameters in Table 3.6 were used for all tests in the results chapter except for the multi-view CAE models because of computational limitations encountered on the batch size. For the multi-view models, a batch size of 8 was used.

3.7 Tuning

The rationale underlying the tuning of latent dimensions for each view is to adapt each model to achieve the best possible results. As mentioned in the introduction of this chapter, the reason for doing this is that lower dimensions have been shown to improve AE performance [8]. The tested dimensions differ for the different model architectures (CAE and VAE) and are presented in the following subsections. After brief experimentation, we determined to run the models for 300 epochs. AUC scores were calculated on the validation dataset at a set of epochs, including; 10, 25, 50, 100, 200, and 300, using the evaluation method outlined in section 3.5.1 to calculate the AUC score. The highest average AUC score determined the filter dimension we used for all further testing. If several dimensions resulted in the same average results, one was chosen randomly. The results from this tuning process were also used to choose the number of epochs to run on the final models. A visual illustration of the effect of tuning latent dimension is shown in appendix G.

(38)

3.7 Tuning

3.7.1 Convolutional Autoencoder

In order to tune U-Net and VGG16 for each view (axial, coronal, and sagittal), we experimented with the number of filters in thelast convolutional layer of the encoder structure. The following number of filters were tested: 128, 64, 32, 16, 8, and 4. See appendix G for tables on the calculated AUC scores throughout the set of different epochs on the validation set for the CAE models.

U-Net: Chosen number of filters

Axial 128

Coronal 32

Sagittal 16

Table 3.7: Chosen number of filters for the CAE models with U-Net architecture.

VGG16: Chosen number of filters

Axial 16

Coronal 32

Sagittal 4

Table 3.8: Chosen number of filters for the CAE models with VGG16 architecture.

3.7.2 Variational Autoencoder

This subsection does not contain the U-Net architecture because it failed to optimize the KL loss in the loss function. This is further discussed in chapter 5.

We fixed the size of the pre-bottleneck fully connected layer to the largest latent dimension we planned to test. This was done so that the pre-bottleneck fully con- nected layer does not become the main constraint in the VAE model. The largest latent dimension tested is 4096. Therefore, thepre-bottleneck layer had a dimension of 4096 units in all tests. The post-bottleneck fully connected layer is fixed to the same number of units as the latent image dimension in the encoder multiplied by 32, as 32 is the highest number of filters in the last convolutional layer of the encoder.

The number of filters in the last convolutional layer of the encoder was set to the same as the chosen number of filters presented in tables 3.7 and 3.8. The dimensions ofµ,σ, and sampling layers were tuned with the following number of units for each view: 64, 128, 256, 1024, and 4096.

For more extensive tables on tuning the VAE latent dimensions, see appendix H.

(39)

3.7 Tuning

Axial Coronal Sagittal

128 4096 4096

Table 3.9: Chosen latent dimensions for each view.

3.7.3 Selecting Epochs

The tables in appendix G were used to determine the appropriate number of epochs to run the final single- and multi-view CAE models. In addition, the figures in appendix E were used. From those graphs, it seems models with the chosen learning rate start producing diminishing results at around 150 epochs. One hundred epochs were chosen for simplicity as this seemed to be working well for all CAE models.

From the tables in appendix H it seemed the VAE models in general needed more epochs to get better results. Two hundred epochs were chosen and used on all VAE models to obtain the results in the chapter 4.

(40)

Chapter 4

Results

This chapter presents the results obtained on the test set for the previously tuned models, with parameters found in section 3.7.

The sectionUsing Reconstruction Loss, 4.1, presents results obtained by using recon- struction loss (see subsections 3.5.1 and 3.5.3) as a predictor for whether the image contains a lesion. In the section Using Transfer Learning Approach, 4.2, the AUC scores are calculated using the predictions from the fine-tuned model (see subsection 3.5.2).

The following AUC results are obtained using bootstrapping with replacement for one hundred repetitions. Measures such as F1, sensitivity, and specificity are not acquired through bootstrapping and are only present in the sectionUsing Reconstruction Loss 4.1 as they were calculated using the reconstructed losses. Again, AUC scores and other measures are acquired on the test dataset in this chapter.

The CAEs were trained for 100 epochs, and VAEs were trained for 200 epochs on the training set.

View Normal Abnormal Sum

Axial 787 58 845

Coronal 632 58 690

Sagittal 649 60 709

Table 4.1: Distribution of normal and abnormal images in the test set for the single-view models.

(41)

4.1 Using Reconstruction Loss

View Normal Abnormal Sum

Axial 636 46 682

Coronal 624 58 682

Sagittal 631 51 682

Table 4.2: Distribution of normal and abnormal images in the test set for the multi-view models.

4.1 Using Reconstruction Loss

Table 4.3 shows the mean and the standard deviation of the AUC score on the test dataset for the previously tuned models. The overall highest AUC score achieved is obtained for the coronal view using multi-view VGG16 CAE.

Model architecture View Mean AUC Std. AUC

CAE U-Net Axial 0.5406 0.0350

Coronal 0.5373 0.0352 Sagittal 0.5364 0.0366

VGG16 Axial 0.5394 0.0367

Coronal 0.5756 0.0395 Sagittal 0.5355 0.0380

VAE VGG16 Axial 0.5058 0.0334

Coronal 0.5583 0.0322 Sagittal 0.5420 0.0378 Multi-view CAE U-Net Axial 0.5090 0.0438 Coronal 0.5407 0.0441 Sagittal 0.5387 0.0481

VGG16 Axial 0.5003 0.0444

Coronal 0.5872 0.0424 Sagittal 0.5015 0.0437

Table 4.3: AUC scores based on reconstruction loss calculated as the MSE between input and output loss.

(42)

4.1 Using Reconstruction Loss

Model View Threshold TP TN FP FN Sensitivity Specificity F1 Accuracy

CAE U-Net Axial 1.9790e-05 2 770 17 56 0.0345 0.9784 0.0519 0.9136

Coronal 2.6411e-05 0 627 5 58 0.0 0.9922 0.0 0.9087

Sagittal 2.0121e-05 3 709 19 57 0.05 0.9739 0.0732 0.9036

VGG16 Axial 0.003801 3 771 16 55 0.0517 0.9797 0.0779 0.9160

Coronal 0.005697 0 632 0 58 0.0 1.0 0.0 0.9159

Sagittal 0.054743 2 717 11 58 0.0333 0.9849 0.0548 0.9124

VAE VGG16 Axial 0.024534 1 774 13 57 0.0172 0.9834 0.0278 0.9172

Coronal 0.027695 1 719 9 59 0.0167 0.9876 0.0286 0.9137

Sagittal 0.048382 2 712 16 58 0.0333 0.9780 0.0513 0.9061

mCAE U-Net Axial 2.2866e-06 1 619 17 45 0.0217 0.9733 0.0313 0.9091

Coronal 2.5542e-06 0 620 4 58 0.0 0.9936 0.0 0.9091

Sagittal 2.5843e-06 2 620 11 49 0.0392 0.9826 0.0625 0.9120

VGG16 Axial 0.003677 3 619 17 43 0.0652 0.9733 0.0909 0.9120

Coronal 0.008466 0 622 2 58 0.0 0.9968 0.0 0.9120

Sagittal 0.002426 1 629 2 50 0.0196 0.9968 0.0370 0.9238 Table 4.4: Measures of the detection performance for each model, obtained on the test

dataset without bootstrapping.

Table 4.4 shows that there are a lot of missed lesions. The models are good at detecting when an image does not contain a lesion but not as good at detecting lesions. The thresholds used are calculated using Tukey’s method (see subsection 3.5.3) and are not optimized for any specific metric. The best sensitivity score acquired is 0.0652 and is obtained on the axial view using the multi-view CAE implementation of the VGG16 model architecture. A sensitivity score of 0.0652 means that 6.52% of the lesions in the test dataset were correctly classified.

(43)

4.2 Using Transfer Learning Approach

4.2 Using Transfer Learning Approach

This section presents bootstrapped AUC score results using the transfer learning ap- proach. This method uses the same AEs presented above, in section 4.1, transferring the weights from their already pre-trained (on healthy data) encoders.

Model architecture View Mean AUC Std. AUC

CAE U-Net Axial 0.6190 0.0309

Coronal 0.5540 0.0361 Sagittal 0.5338 0.0314

VGG16 Axial 0.5842 0.0327

Coronal 0.5927 0.0377

Sagittal 0.5 0.0

VAE VGG16 Axial 0.5193 0.0384

Coronal 0.5501 0.0337 Sagittal 0.4992 0.0404 Multi-view CAE U-Net Axial 0.4831 0.0455 Coronal 0.5810 0.0434 Sagittal 0.5501 0.0340

VGG16 Axial 0.5155 0.0395

Coronal 0.6096 0.0403 Sagittal 0.5621 0.0361

Table 4.5: AUC scores obtained by using the transfer learning method for anomaly detec- tion on the test dataset.

This semi-supervised approach to detection improves the mean AUC score for most models. As a reminder, this method of detection further trains the models on the validation dataset. The validation dataset for the single-view AEs contains 399-425 images (depending on the view direction), where only a few of these contain a lesion, 31-33 (again, depending on the view direction). While the validation dataset for the multi-view AEs contains even less data for each view. See tables 3.3, 3.4, and 3.5 for the exact number of images in the validation dataset and the distribution of lesions amongst them.

(44)

4.3 Sensitivity Analysis of the Multi-view Autoencoders

4.3 Sensitivity Analysis of the Multi-view Autoencoders

For the multi-view CAEs, every view has an input and an output. This section ex- plains the processes followed to experiment with favoring one view above the others.

The hypothesis for this experiment was that by increasing the loss coefficients for one view, we would see an increased AUC score for that view. Further, we would be able to achieve a better AUC score on the best-performing view, coronal, and potentially reach an AUC score for the axial view comparable to the performance of the single-view CAE.

Favoring one view above others has been achieved by changing the defaultloss_weights parameter when calling Keras’s model compile function. Theloss_weights parame- ter accepts a list of coefficients for each output and, by default, is 1 for each output.

These loss coefficients are used to weight the loss contributions of different model outputs, where the loss values that will be minimized by the model will then be the weighted sum of all individual losses, weighted by the loss coefficients [64].

For this experiment, we have used the multi-view CAE with VGG16 model architec- ture, with the same parameters used in the in section 4. The chosen coefficients to experiment with were 1.2, 1.5, and 2. Each chosen coefficient has been tested for both the axial and the coronal view. The coronal view was chosen because it performs best for this architecture. The axial view was chosen because it performed better than the sagittal view in single-view VGG16, but underperformed in the multi-view.

This experiment consisted of 6 model runs. The presented results are the AUC scores using reconstruction loss (see section 3.5.1) for each view, and they are obtained through bootstrapping with replacement for one hundred iterations. The presented results also contain a multi-view with default loss coefficients to serve as a baseline.

Epochs Learning rate Batch size 100 1×10−4 8 samples Table 4.6: Summary of the used hyperparameters.

AUC scores shown in the Table 4.7, do not show a clear correlation with the coeffi- cients used. We attribute the variation of the AUC scores primarily to the random- ness of initialized weights and randomness of sampled points in the bootstrapping.

To further investigate how loss weights affect the separation of normal and abnormal data for abnormality detection, we recommend experimenting with a broader and

(45)

4.3 Sensitivity Analysis of the Multi-view Autoencoders

Coefficients Axial Coronal Sagittal

Axial Coronal Sagittal Mean AUC AUC std Mean AUC AUC std Mean AUC AUC std

1.0 1.0 1.0 0.5003 0.0444 0.5872 0.0424 0.5015 0.0437

1.2 1.0 1.0 0.5019 0.0445 0.5892 0.0445 0.5040 0.0439

1.5 1.0 1.0 0.5000 0.0446 0.5810 0.0443 0.5118 0.0449

2.0 1.0 1.0 0.5006 0.0445 0.5933 0.0445 0.5043 0.0444

1.0 1.2 1.0 0.5015 0.0445 0.5828 0.0420 0.5065 0.0438

1.0 1.5 1.0 0.5004 0.0440 0.5797 0.0440 0.5081 0.0450

1.0 2.0 1.0 0.5025 0.0446 0.5857 0.0437 0.5065 0.0441

Table 4.7: Comparison of AUC scores for multi-view models with modified loss coefficients.

The AUC scores are based on the MSE of the reconstruction loss for each view and are obtained through bootstrapping with replacement for hundred iterations.

(46)

Chapter 5

Discussion

This chapter discusses the methodology’s effectiveness (section 5.1) used in this thesis and reflects on what could be done in future works (section 5.2). Secondly, we evaluate the DL models used (section 5.3) and compare them to related works (section 5.4).

5.1 Effectiveness of Methodology

All patients in the PROSTATEx dataset [53] had at least one lesion. The supplemen- tary spreadsheet with information about where the lesions are located only specifies the center point of the lesion in each view. The slices (2D images) removed in the training dataset and the slices labeled as containing an abnormality in other sets only used this single point. Therefore it is not guaranteed that parts of the lesion are not present in nearby slices that were not labeled as abnormal. Preferably, the base AEs should have been trained on images from patients without PCa.

Both anomaly detection methods were able to differentiate to some degree between unhealthy and healthy data. The semi-supervised (using TL) method performed better than the fully unsupervised in almost all cases. None of the methods used in this work gave good enough results to be applicable in a clinical setting diagnosing patients.

For this thesis, we used the standard ROC-AUC functions from the Scikit package in Python and gave it all data as input, meaning we did not optimize the range of

(47)

5.2 Future Work

threshold for better AUC scores. Nevertheless, the AUC scores serve as a quantized way to compare the models. It is also important to note that we did not use a balanced dataset, which could have affected the results.

The latent dimensions of all models were tuned on the validation dataset. The rationale is based on the assumption that different views and models may need different latent dimensions to achieve an optimal latent representation. Too few tests were done to evaluate the effectiveness of this strategy, although it did reduce the amount of RAM necessary to run the models, which was critical when running the multi-view CAEs.

When attempting to build a VAE out of the U-Net architecture, it failed to opti- mize the KL loss out of the loss function; we believe the skip connections of that architecture may have caused this.

Reconstruction losses from the validation set was used to select and apply a threshold for classification on the test set. The threshold was calculated using Tukey’s method, and the threshold was likely too far out to the right of the distribution to give good results. The best sensitivity score of 0.0652 was acquired on the axial view using multi-view implementation of the VGG16 model architecture. Sensitivity measures the rate of true positives, and a score of 0.0652 means that only 6.5% of images containing lesions were correctly classified.

The multi-view models were built in such a way that they needed three inputs and gave out three outputs. The number of images in all datasets (train, validation, and test) were reduced due to the models’ requirement of equal amounts of data for each input. The reduction of data might have negatively impacted the models’ detection performance, especially for the axial view. In addition, changes to the test set might have affected the AUC score slightly.

5.2 Future Work

The data pre-processing for the multi-view CAEs in this work is not optimal mainly because some data had to be discarded. A solution that applies to our particular approach could include upsampling the data of the views containing less MRI image data. Another variation of training the model could be to mix the order of images between the views. In this work, the inputs given to the models were of the same patient. A more optimal approach to the multi-view models would be such that a lesion would appear simultaneously at each view either by using a 3 Dimensional (3D) input-output or registering the data in another way.

(48)

5.3 Evaluation of Deep Learning Models

The way that the threshold in the classification evaluation method was chosen could be improved. An alternative to Tukey’s method is to optimize based on a chosen metric on the validation dataset.

The parameters in the TL and fine-tuning were never properly tuned. The cho- sen learning rates and the number of epochs were primarily inspired by [58]. Al- though not quantified, our observations during testing of the TL approach indicated a substantial variability of AUC scores between epochs. We believe that this semi- supervised method for detection is far from being completely optimized and that a thorough tuning would significantly improve its detection ability.

We failed to build a working VAE model that functioned adequately with the U-Net architecture. One approach to solving the problem of the optimizer ignoring the KL-loss in the loss function could be to use a coefficient in front of the KL-loss that decreases with the number of epochs run.

In the future, when experimenting with the view loss coefficients for the multi-view models. We recommend experimenting with a larger and broader range of values than tested in this thesis.

5.3 Evaluation of Deep Learning Models

The U-Net models generally achieve higher AUC scores on the axial view, while VGG16 models performed better on the coronal view. Compared to the VAE, the CAE performed better overall. However, we observed a tendency for bettered per- formance for the VAE when the number of units in the bottleneck increased, but due to computational limitations, the number of units in the bottleneck could not be increased beyond 4096.

With both architectures (U-Net and VGG inspired), the multi-view AE models per- formance was better than the CAE and VAE for the coronal view with both pro- posed anomaly detection methods. The results were also better for the sagittal view with the multi-view CAEs than the single-view based (CAE and VAE) with the semi-supervised approach for detecting outliers. In addition, the multi-view CAE model with U-Net architecture performed better than the single-view CAE models for the sagittal view when considering the fully unsupervised method of anomaly detection. A plausible explanation for why the multi-view models performed worse on the axial view could be because of training data was significantly reduced in the training set from 2679 images to 2119.

(49)

5.4 Comparison to Related Work

Additional experimentation using pre-trained ImageNet weights (see appendix A) and tuning the loss coefficients for the multi-view architectures had no significant impact on the AUC scores.

Limitations

A limitation encountered for the VAE and multi-view architectures was depletion of random access memory (RAM) on the graphics processing unit (GPU). The graphics processing units used were the state-of-the-art Nvidia A100s with 40 gigabytes of RAM, so it is hard to blame the depletion of RAM on the GPU. The original plan for the VAE was to experiment with a broader and higher range of units in the bottleneck. The solution reached for the VAE was to reduce the range of tested units in the bottleneck. For multi-view models, the chosen solution to solve RAM depletion was to reduce the number of samples in a batch to 8 samples.

5.4 Comparison to Related Work

To the best of our knowledge, no existing work applies unsupervised learning with multi-view AEs to do anomaly detection on MRI image data of the prostate, and therefore, we do not have any direct comparisons.

Similar to the work in this thesis, [14] compared the performance of a multi-view approach including all image data for the task of classifying between PCa and non- PCa using the PROSTATEx dataset. [14] concluded that the use of a multi-view approach resulted in better results. The best-demonstrated results were with the multi-view approach using VGG16 architecture (0.843 AUC). However, it is worth noting that they used a fully supervised approach without autoencoders in their work. The best results obtained in this thesis with the fully unsupervised method was also with the multi-view VGG inspired CAE model (AUC 0.5872 on the coronal view, Table 4.3).

Referanser

RELATERTE DOKUMENTER