FACULTY OF SCIENCE AND TECHNOLOGY
BACHELOR THESIS
Study programme / specialisation:
Datateknologi The spring semester, 2022 Open
Author: Linas Vidziunas
& Ørjan Kløvfjell Thoresen ………
(signature author) Course coordinator: Álvaro Fernández Quílez
Supervisor(s): Álvaro Fernández Quílez, Ketil Oppedal
Thesis title:
Unsupervised lesion detection with multi-view MRI and autoencoders
Credits (ECTS): 20 Keywords:
Autoencoder, Unsupervised
learning, Multi-view MRI, Transfer learning, Prostate cancer, Anomaly detection
Pages: 47 + appendix: 28
Stavanger, 14 May/2022 date/year
Abstract
Prostate Cancer (PCa) is one of the most common health threats in the developed world and the most diagnosed cancer among men in Norway. Detecting a lesion is a challenging task, and early diagnosis and treatment can be crucial for the patient.
Computer-assisted systems using Deep Learning (DL) to analyze Magnetic Reso- nance Imaging (MRI) data could help improve diagnostic accuracy if used together with current practices. However, the DL systems require large amounts of annotated data which are rarely available. In addition, the presence of factors such as bias in the form of a higher prevalence of one class in classification tasks can harm the final results of the developed system.
This thesis explores methods for performing anomaly detection on MRI image data of the prostate, where lesions are the anomalies. We tackle the lack of annotated data by using unsupervised learning and a type of neural network called Autoencoders (AEs). Specifically, Convolutional Autoencoders (CAEs), Variational Autoencoders (VAEs), and multi-view CAEs are compared.
We evaluate the methods by means of Area Under the Curve (AUC) to measure the ability to separate the images with a lesion from those that do not. The results of this thesis show that using a multi-view CAE offers advantages over the single-view based CAE and VAE in terms of ROC-AUC and that a fully unsupervised approach holds potential for lesion detection in prostate MRI.
Acknowledgements
This thesis was conducted during the spring semester of 2022. It marked the end of our bachelor’s degree at the University of Stavanger. We must admit that it has been challenging but also educational and fun.
We wish to thank all the staff and co-students at the University of Stavanger for making it such a great place to be a student during the last three last years and throughout the pandemic.
We owe our supervisers, Álvaro Fernández Quílez and Ketil Oppedal, a special thanks. In particular, we want to express our gratitude to Álvaro for the countless meetings and email conversations, as well as his continuous support and guidance.
Contents
Abstract i
Acknowledgments ii
Abbreviations viii
1 Introduction 1
1.1 Motivation . . . 1
1.2 Problem Definition . . . 3
1.2.1 Objectives . . . 3
1.2.2 Proposed Method Overview . . . 3
1.3 Related Work . . . 4
2 Technical Background 5 2.1 Artificial Neural Networks . . . 5
2.2 Convolutional Neural Networks . . . 6
2.2.1 Convolutional Layer . . . 6
CONTENTS
2.2.2 Pooling Layers and Transposed Convolutional Layer . . . 7
2.3 Autoencoder . . . 8
2.3.1 Variational Autoencoder . . . 9
2.4 Training . . . 10
2.4.1 Loss function . . . 10
3 Materials & Methods 12 3.1 Introduction . . . 12
3.2 Software . . . 14
3.3 Dataset and Data Pre-processing . . . 15
3.3.1 Data Pre-processing . . . 16
3.3.2 Multi-view Pre-processing . . . 18
3.4 Model Design . . . 20
3.4.1 Convolutional Autoencoder . . . 20
3.4.2 Variational Autoencoder . . . 21
3.4.3 Multi-view Autoencoder . . . 22
3.5 Evaluation . . . 23
3.5.1 Area Under the Curve based on Reconstruction Loss . . . 23
3.5.2 Transfer Learning and Fine-tuning . . . 24
3.5.3 Classification and Threshold Selection . . . 25
3.5.4 Bootstrapping with Replacement . . . 27
CONTENTS
3.6 Learning Rate and Batch Size Selection . . . 28
3.7 Tuning . . . 28
3.7.1 Convolutional Autoencoder . . . 29
3.7.2 Variational Autoencoder . . . 29
3.7.3 Selecting Epochs . . . 30
4 Results 31 4.1 Using Reconstruction Loss . . . 32
4.2 Using Transfer Learning Approach . . . 34
4.3 Sensitivity Analysis of the Multi-view Autoencoders . . . 35
5 Discussion 37 5.1 Effectiveness of Methodology . . . 37
5.2 Future Work . . . 38
5.3 Evaluation of Deep Learning Models . . . 39
5.4 Comparison to Related Work . . . 40
6 Conclusion 41
Bibliography 41
Appendix 48
A Setting - Using pre-trained weights 48
CONTENTS
B Anatomical planes 51
C Coding and Computational Resources 52
C.1 Python code . . . 53
D Model Description 54
D.1 VGG16 inspired . . . 55 D.2 U-Net . . . 59
E Learning Rate 64
E.1 U-Net . . . 65 E.2 VGG16 . . . 66
F Batch Size 67
F.1 U-Net . . . 68 F.2 VGG16 . . . 69
G Convolutional Autoencoder 70
G.1 Graphical illustration of the effect of latent constraint . . . 70 G.2 Tuning U-Net . . . 72 G.3 Tuning VGG16 AE . . . 73
H Varitaional Autoencoder 74
H.1 Tuning VGG16 AE . . . 74
Abbreviations
AE Autoencoder
VAE Variational Autoencoder
CNN Convolutional Neural Network ANN Artificial Neural Network MRI Magnetic Resonance Imaging AUC Area Under the Curve
DL Deep Learning
CAE Convolutional Autoencoder MSE Mean Squared Error
KL Kullback-Leibler
DICOM Digital Imaging and Communications in Medicine GPU Graphics Processing Unit
UiS University of Stavanger
TL Transfer Learning
CONTENTS
IQR Inter Quartile Range PCa Prostate Cancer 2D 2 Dimensional
PSA Prostate Specific Antigen
ROC Receiver Operating Characteristic TPR True Positive Rate
FPR False Positive Rate
API Application Programming Interface MR Magnetic Resonance
T2w T2-weighted
PD-w Proton Density-weighted TP True Positives
TN True Negatives
FP False Positives
FN False Negatives
3D 3 Dimensional
Chapter 1
Introduction
1.1 Motivation
Prostate Cancer (PCa) is the second most diagnosed cancer among men and the fifth leading cause of death worldwide [1]. In Norway, PCa is the most diagnosed cancer among men [2]. During the 1990s, there was a massive upswing in PCa diagnosis in Norway. During this time, screening and testing were introduced to the primary health system using Prostate Specific Antigen (PSA), followed by biopsies. The increase in testing is considered the most plausible explanation for the increase in the diagnosis of PCa [2]. Prostate cancer may be asymptomatic in its early stages, and therefore regular screening is crucial to give early diagnosis and treatment. The use of PSA remains controversial as a screening technique due to over-diagnosis of indolent tumors, potentially unnecessary biopsies, and treatment [3].
After undergoing a PSA test, Magnetic Resonance Imaging (MRI) can be used to support the conclusion reached by the PSA test [4]. Furthermore, it can be used as a standalone technique. On the other hand, an experienced clinician can usually discern between normal and abnormal cases after seeing a few cases and by mere comparison. Nevertheless, this approach has several problems; it suffers from intra- reader variability and can be a time expensive process.
Machine learning has in recent years been leveraged by the healthcare industry and has the potential to transform many aspects of the medical landscape [5]. Deep Learning (DL) holds a promise in automating MRI analysis, though traditional DL methods rely on labeled data. Labeled data means that the data has annotations,
1.1 Motivation
less labeled data exists in the public domain, and acquiring more is time-intensive and expensive from an economic point of view [6], [7].
Autoencoders (AEs) leverage neural networks to learn efficient encodings of unla- beled data (unsupervised learning). Autoencoders have previously been shown to be successful in data compression and outlier detection [8], [9]. To work around the lack of labeled data, this thesis explores the use of AE architectures to detect when a lesion is present in an MRI image of the prostate. There is a higher prevalence of cases that are non-PCa when compared to the number of cases with PCa, and unsupervised learning helps exploit that bias in a positive way. A multi-view Con- volutional Autoencoder (CAE) such as the one proposed in this thesis combines the MRI images of all views, potentially leading to more efficient use of the available data.
1.2 Problem Definition
1.2 Problem Definition
The primary goal is to explore the use of multi-view CAEs in the task of anomaly detection, where lesions in the prostate are the anomalies. The models will train and predict on 2 Dimensional (2D) MRI image data. In essence, the task is abinary classificationproblem because the data consists of two classes; PCa and non-PCa. In order to offer a fair comparison between the multi-view CAEs and their single-view counterparts, the model architecture and the hyperparameter optimization process will be similar. The list of objectives is listed below.
1.2.1 Objectives
• Build and train CAEs and Variational Autoencoders (VAEs) and quantify the overall performance of their ability to detect when a lesion is present using unsupervised learning.
• Build and train multi-view CAEs based on the same architecture as the single- view counterparts to compare its performance.
1.2.2 Proposed Method Overview
First, the data (MRI) is loaded, and the images are pre-processed into different datasets and given labels indicating whether the image contains a lesion. The pro- posed CAE and VAE models are trained exclusively on healthy (non-PCa) data and fine-tuned on a dataset containing both healthy and unhealthy data (validation set).
The models are evaluated on their overall performance in detecting when a lesion is present in a given MRI image of the prostate on the test set.
Figure 1.1: An illustration of the proposed methodology in this thesis
1.3 Related Work
1.3 Related Work
As previously introduced, this thesis focuses primarily on using an unsupervised learning paradigm approach and AEs to detect lesions in MRI image data containing both healthy and unhealthy data, also referred to as anomaly detection. In [9] AEs are used for anomaly detection using normal training data only and demonstrates its ability to successfully detect abnormalities for a range of various objects.
Convolutional autoencoders and VAEs are used in this work, the VAE was first introduced in this paper [10] by Diederik P. Kingma and Max Welling. Our imple- mentation draws a lot of inspiration from the Keras AE code examples [11], [12].
The inspiration to build a multi-view AE structure comes from [13] and [14]. [13]
uses deep convolutional networks combining all views of the image data in one model, more specifically a U-Net architecture to do segmentation of the prostate. [14] com- bines MRI image data of all views to classify PCa and demonstrates better classi- fication results using the multi-view than single-view for several architectures. Our work also draws inspiration from this master’s thesis [15].
The AEs used in this work are inspired by U-Net [16] and VGG16 [17]. Although not used for the same purposes, the models are designed for tasks that indicate that they should work well for the tasks presented in this thesis. U-Net is a type of AE that has obtained good results in medical imaging tasks [16].
Chapter 2
Technical Background
2.1 Artificial Neural Networks
Artificial Neural Networks (ANNs) are computational processing systems based on the processes of the brain, i.e., biological nervous systems [18]. A typical structure has multiple inputs and outputs, and in between, there are hidden layers. The layers in between are called hidden because the data does not show the desired output for these [19]. The hidden layers are typically vector-valued; each unit resembles a neuron because its value depends on the value of other units in the networks and the connections between them [19]. The structure of these connections can come in many different forms, but all of them contain something called weights. The weights between the neurons are what is trained when training the network [20].
x0
x1 x2
x3 Input
layer
h(1)1
h(1)2 h(1)3
h(1)4 Hidden
layer 1
h(2)1
h(2)2 h(2)3 Hidden layer 2
ˆ y1
ˆ y2 Output
layer
Figure 2.1: Illustration of a simple ANN [21].
2.2 Convolutional Neural Networks
something called fully connected layers, also referred to as dense layers. Each neuron in a fully connected layer has connections to all the neurons in the previous and the following layer [18].
2.2 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are ANNs that use a linear operation called convolution [22]. Convolutional neural networks are often used when dealing with images, one of the reasons being that they can drastically reduce the number of parameters necessary to train the network well [23].
In CNNs, feature maps are generated by applying filters to the input image or the prior layer. Feature map visualization gives insights into where certain features are likely to appear [24].
2.2.1 Convolutional Layer
The following explanation of the convolutional function in the context of CNN is inspired by [22].
Convolutional neural networks are commonly used with image data; the operation is then 2D convolutional. An example of a formula for 2D convolution is given in Equation 2.1.
Y(i, j) =X
n
X
k
X(n, k)K(i−n, j−k) (2.1)
Using Equation 2.1 in the context of an image undergoing 2D convolution, the Y represents the output and is sometimes referred to as the feature map, andX rep- resents the input layer. For the output i and j specifies the specific neuron and n and k specifies the position of the pixel for the input layer. For this to result in an overall reduction of parameters, the Kernal function (K) needs to be 0 for all negative arguments.
For one input image, multiple functions for K can be trained, resulting in multiple channels (feature maps). When speaking about the number of filters of a convo- lutional layer in a neural network, this is the instances for function K. Kernel size
2.2 Convolutional Neural Networks
refers to the size of the function K, in Figure 2.2, a kernel size of 3x3 is used.
0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0
I
∗
1 0 1 0 1 0 1 0 1
K
=
1 4 3 4 1 1 2 4 3 3 1 2 3 4 1 1 3 3 1 1 3 3 1 1 0
I∗K 1 0 1
0 1 0 1 0 1
×1 ×0 ×1
×0 ×1 ×0
×1 ×0 ×1
Figure 2.2: Illustration of 2D convolution [25].
2.2.2 Pooling Layers and Transposed Convolutional Layer
Pooling layers are typically used with convolutional layers and are a nonlinear way to downsample [26]. Two popular pooling layers are max pooling and average pooling.
A max-pooling layer reduces the input dimension by keeping the maximum value, whereas an average pooling calculates the average value [26]. This is done by applying a filter and a stride. The stride defines how that filter moves with every iteration.
Using a filter size of 2x2 will reduce the dimension by a factor of 2.
Figure 2.3: Illustration of max pooling with stride=2 and a 2x2 filter [27].
On the other hand,transposed convolutional layersare typically used for upsampling the data. Meaning the output size of the layer is greater than that of the input. This is done by adding zeros between the rows and columns of the input. The input can also be enlarged by adding a padding around the input, often containing zeros.
After that, the kernel slides across the picture and generates an output based on the
2.3 Autoencoder
2.3 Autoencoder
An AE is a type of neural network, and it can be trained to try to reconstruct its input to its output [29]. Autoencoders consist of two parts, the encoder, and the decoder [29]. The encoder takes an input and typically carries out dimensional reduction using hidden layers. The decoder is trained to then recreate the original input, given the lower-dimensional representation as input. By first encoding the data, the model prioritizes which features of the input should be kept (or which features of the input carry the most essential information), this often results in the network learning useful properties of the data [29]. Convolutional autoencoder is an autoencoder that makes use of a CNN as part of the architecture. Autoencoders have many different use cases, traditionally it has been used for dimension reduction and feature learning [29]. Although not the most effective method of data compression, some use cases for autoencoders include image denoising or generative tasks such as creating faces never seen before. In this work, we explore autoencoders for anomaly detection.
Figure 2.4: Typical structure of an autoencoder [30].
2.3 Autoencoder
2.3.1 Variational Autoencoder
Variational autoencoders are also ANNs. Whereas the regular autoencoder is a deterministic model, the variational autoencoder is a probabilistic model. When the encoder with functionq(z|x)is given an inputx, the encoder produces a distribution for z described by the function p(z). From the distribution p(z), a specific value z is sampled, and the decoder with function p(x’|z)attempts to reconstruct the input givenz as an input [31].
Figure 2.5: Illustration of a typical VAE model architecture [32].
Through training, the model is not only trained to reconstruct the input but also trained to keep the distance loss between the two distributionsq(z|x) andp(x|z) as small as possible. This can be done by using Kullback–Leibler divergence [31] and something known as the reparameterization trick [10]. This results in that different features end up in different places in what is often referred to as the latent space.
By simply giving the trained decoder values forz it can be used for generative tasks.
2.4 Training
2.4 Training
When training a deep neural network to a specific task, the goal is to adjust the parameters in the network to make it perform well on the training data but with the idea of learning a structure that generalizes to images beyond that training data [20]. Training a neural network comes down to finding a minimum of a cost function [20]. The output of the cost function indicates how well the network is performing [33]. A learning algorithm attempts to optimize the parameters such that the scalar produced by the cost function is as small as possible [33] [20].
Forward propagation is when a feedforward neural network is given an input, and the information flows forward in the network to produce an output [34]. Back- propagation does this in reverse order, the information flows from the cost backward throughout the network to produce a gradient of the cost function [35] [36]. The gradient of a function tells the direction of the steepest descent, by taking steps in that direction, the output of the cost function will become smaller [20]. Since the starting position when training is often random, there is no guarantee that the local minimum that the network goes into is the global minimum [20].
Back-propagation is often misunderstood to be the whole learning algorithm, but it is in fact, only used to compute the gradient. Another algorithm, such as stochas- tic gradient descent, is performing the learning by using the gradients from back- propagation [35].
2.4.1 Loss function
Binary cross-entropy is a loss function that is used for binary application problems [37]. In a case where normalized (having a value between 0 and 1) pixel arrays are used to represent the input image, the loss function could work to minimize the difference between the input pixel value and the predicted pixel value by the autoencoder [38]. Binary cross-entropy is the particular case of cross-entropy with only two classes. Consider pi as the true label and qi as the predicted distribution of the predicted value. By using the notationsp∈ {y,1−y}and q∈ {ˆy,1−y}ˆ the equation can be written as shown in Equation 2.2 [39].
H(p, q) =−X
i
pilog(qi) =−ylog(ˆy)−(1−y)log(1−y)ˆ (2.2)
2.4 Training
Categorical cross-entropy uses cross-entropy as a loss function in a similar manner as the binary cross-entropy function [40]. The main difference is that the sum of the predicted labels is equal to 1, this can be thought of as using a softmax [41]
activation across multiple binary cross-entropy losses. When categorizing between multiple classes, the predictions can be interpreted as probabilities.
The Kullback-Leibler (KL) divergence gives a measure of how one probability distri- bution is different from another probability distribution. The equation in its discrete form with the distributions P and Q are defined on the same probability space χ, the equation can be written as in Equation 2.3 [42].
DKL(P∥Q) =X
x∈χ
P(x)log
P(x) Q(x)
(2.3)
When KL is implemented as part of a cost function for a variational autoencoder, as in this thesis, the probability distributions are not in the same probability space.
Hence, tricks to compute the KL measure are required. More on how it can be accomplished in [10], [43].
The Mean Squared Error (MSE) is a statistical method used to measure the difference between an observed value and a predicted value [44]. The MSE can be calculated by the Equation 2.4. In Equation 2.4yi is theith observed (or true value) value, yˆi is the corresponding predicted value andn is the number of observations.
M SE = 1 n
n
X
i=n
(yi−yˆi)2 (2.4)
Chapter 3
Materials & Methods
3.1 Introduction
Figure 3.1: An overview of the methodology.
3.1 Introduction
As mentioned in the first chapter, this thesis uses unsupervised learning and vari- ous AEs to perform anomaly detection on MRI image data, where lesions are the anomalies. The experimental part of the thesis investigates whether using a multi- view AE structure is a valid approach and if it offers any improvements compared to its single-view counterparts on the task of classifying the MRI image data (PCa or non-PCa).
First, the data is pre-processed, and the models are trained exclusively on healthy data only from the training set. The validation set containing both unhealthy and healthy data is used to fine-tune the hyperparameters of the models. This is done by considering MSE losses between the input images and the reconstructed images by the autoencoder as anomaly scores to calculate an Area Under the Curve (AUC) score (see section 3.5.1). The AUC score is used to measure the model’s overall ability to separate between healthy and unhealthy images.
The proposed CAE and VAE models are fine-tuned for each view, by testing different dimensions in the latent dimension. This is done because lower-dimension embed- dings can improve performance for AEs [8]. The multi-view CAE is built while keeping the common parts from the fined tuned models. In the final stage, the mod- els are used to give anomaly scores on the healthy and unhealthy images of the test set, which is not previously used for neither training nor tuning of hyperparameters.
We use two methods for anomaly detection: the first method is based on reconstruc- tion losses as outlined in section 3.5.1, and the second applies TL and training of a categorical layer, 3.5.2. For both methods, bootstrapping with replacement (see section 3.5.4) is used for one hundred repetitions to account for potential variabilities of the results.
In addition, a third evaluation method is carried out by using a specific threshold based on reconstruction losses from the validation set to classify the images of the test set. The results are documented with several metrics as specified in section 3.5.3.
When referring to the axial, sagittal, or coronal plane (view) throughout this thesis, we are referring to the anatomical planes (an illustration is shown in the appendix B).
3.2 Software
3.2 Software
The programming language Python has been used for this thesis. Python is a popular high-level programming language with a comprehensive library and a large commu- nity [45]. This section will explain our most used external Python libraries. All of the libraries are free and open-sourced. The specific versions of the used libraries can be found on our Github repository in the file namedrequirements.txt.1
Tensorflow [46] (TF) is a library for machine learning developed by Google and released in 2015. TF is supported for a wide range of other programming languages, such as JavaScript and C++ [47]. Specifically, we used TF 2 which is the predecessor to TF 1, released in 2019.
Keras is an Application Programming Interface (API) for neural networks written in Python. Keras focuses on being user-friendly by being modular and easily extensible [48]. Keras acts like a wrapper and does not by it self handle low-level computations thus it requires a backend, such as TF. As of 2017 Keras has been integrated into Tensorflow as module.
Numpy is used for numerical computations, and is actually the most used python library [45]. It supports a wide variety of mathematical functions and is used for a wide range of tasks, from generating random numbers to computations of multi- dimensional arrays, and much more [49].
Matplotlib is a library for visualizing data. In this thesis we use Matplotlib to plot and export MRI images and more. Matplotlib also supports animated and interactive visualization, and allows to easily change visual styles and layouts [50].
Scikit-learn is a library for predictive data analysis, and offers tools to easily and effectively perform classification, regression, data preprocessing, and more [51].
Pydicom is a library to read, modify and create data encoded in the Digital Imaging and Communications in Medicine standard [52].
1Link to the Github repository: https://github.com/LinasVidziunas/Unsupervised-lesio n-detection-with-multi-view-MRI-and-autoencoders
3.3 Dataset and Data Pre-processing
3.3 Dataset and Data Pre-processing
The prostate image dataset is obtained from the PROSTATEx Challenge [53]–[55]
(November 21, 2016, to February 16, 2017)2. The challenge is a collection of prostate Magnetic Resonance (MR) studies curated for research in computer-aided diagnosis of the prostate. The dataset contains T2-weighted (T2w), proton density-weighted (PD-w), dynamic contrast enhanced (DCE), and diffusion-weighted (DW) imaging, from which we only use T2w. The challenge also provides supplementary spread- sheets containing image and lesion information, such as the dimensions of the images and where the lesions are located. From the challenge, only the training dataset is used. The testing cohort is kept as an external validation set in case we consider submitting to the public results table. The training cohort consists of 204 patients with a total of 330 lesions [55] and contains a total of 12845 T2w slices (2D images) where 1109 of the total are classified as containing a lesion.
View No. without lesions No. with lesions Sum
Axial 4798 412 5210
Coronal 3108 339 3447
Sagittal 3830 358 4188
Sum 11736 1109 12845
Table 3.1: Distribution of images with and without lesions in the for the T2W.
x and y dim. Axial Coronal Sagittal Sum
256 21 21
320 296 3426 4169 7891
384 4742 4742
512 21 19 40
640 151 151
Sum 5210 3447 4188 12845
Table 3.2: Frequencies of image dimensions for each view.
Table 3.2 shows that the image dimensions slightly vary, but they have a clearly prevailing dimension for each view, 384x384 for axial and 320x320 for coronal and sagittal. The images are encoded using the Digital Imaging and Communications in Medicine (DICOM) standard. The DICOM standard is an international standard for medical images and their relative information and is used in almost every radiology, cardiology imaging, and radiotherapy device [56].
2Download link: https://www.cancerimagingarchive.net/nbia-search/?CollectionCrite ria=PROSTATEx
3.3 Dataset and Data Pre-processing
3.3.1 Data Pre-processing
This subsection explains how the dataset has been pre-processed and the techniques used. All of the following data pre-processing has been accomplished using Python.
Some of the patients in the PROSTATEx dataset contain multiple sets of T2w images, and this is because the radiologist has decided to retake the scans due, to, for example, the image being blurry because the patient moved during the scan. If a patient had multiple sets of T2w images in one direction (view), we only used one of them and discarded the rest. The image dimensions used to feed the networks are 384x384 for the axial view, and 320x320 for the coronal and sagittal view. No registration of images between the different has been performed.
Splitting Datasets
The original dataset has been split by patients into three different sets; train, vali- dation, and test, with a distribution of 70%/10%/20% respectively. As a result of splitting the dataset by patients, a patient and their images cannot exist in multiple datasets (train, validation, test), i.e., if a patient is part of the training set for axial, they are also a part of training for the coronal and sagittal sets.
View Train Validation Test Sum
Axial 2679 425 845 3939
Coronal 2143 347 690 3180
Sagittal 2489 399 709 3597
Table 3.3: Total images after limiting each patient to one set of images for each view.
View Validation Test
Axial 31 58
Coronal 33 58
Sagittal 33 60
Table 3.4: Number of images containing lesions of the total amount.
Using Python, every image has been checked for whether it has a lesion in the supplementary spreadsheet, and a tag has been added to the DICOM file containing a boolean value of either true or false, true for containing a lesion, false for not containing a lesion. Reading, modifying, and saving DICOM files was accomplished using the pydicom library [52]. Then, the modified DICOMs were saved on a new path, containing information about which set and view it belongs to, with a file name
3.3 Dataset and Data Pre-processing
that includes whether the DICOM contains a lesion, the patient ID, DICOM series number, and its slice position (in the z-axis).
Figure 3.2: An example of a used file path to an image.
Further Pre-processing
When loading the images into memory for use in an AE, empty lists are created for the pixel arrays for each dataset (train, validation, test) and for the true labels of the pixel arrays. For each DICOM file in our custom directory structure, the DICOM file is read, and information such as the pixel array and the previously added tag of whether the image contains a lesion or not, are extracted.3
The previously added tag is retrieved and appended to the list of true labels. The pixel arrays are further pre-processed via normalization and reshaping. Normaliza- tion ensures that all pixel intensities range from 0 to 1 (see Equation 3.1). The now normalized pixel arrays are reshaped to the most prevailing image dimension for their view, 384x384 for axial and 320x320 for coronal and sagittal. Reshaping is necessary because the proposed models contain convolutional layers which require a constant dimension for its input. Finally, the reshaped normalized pixel arrays are appended to the pixel array list for their corresponding dataset (train, validation, or test).
xnorm= x−xmin
xmax−xmin (3.1)
In Equation 3.1: x is the pixel intensity,xmin and xmax are, respectively, the mini- mum and maximum pixel intensities for the current image, andxnorm is the resulting normalized pixel [15].
3Link to the Github repository: https://github.com/LinasVidziunas/Unsupervised-lesio n-detection-with-multi-view-MRI-and-autoencoders
3.3 Dataset and Data Pre-processing
3.3.2 Multi-view Pre-processing
Our proposed implementation of a multi-view CAE has an extra set of requirements for the data than the single-view implementations due to containing multiple inputs:
it requires an equal amount of axial, coronal, and sagittal images to train. Such a condition comes from the fact that models need a constant batch size to be trained on (same amount of data coming from all the different sources). As a result of this requirement, some images had to be discarded. In particular, we discard the images that appear at the start and end of the patient sequence of each view. The rationale underlying this choice is that we want to keep the images containing the whole prostate (middle of the sequence) and potentially contain more information that can be used at training time.
We assume that the order of input images for each view matters and that feeding the same patient into all of the model’s inputs will result in better detection performance.
Furthermore, we want to retrieve the images for each view for each patient in the correct order. We hypothesize that feeding the model images in sequential order of the same patient in each input, will enforce the AEs to make connections between the different views, resulting in an improved detection. Additionally, the approach has a relatively low computational cost (as well as time-wise) since the order of the images is obtained directly from the data retrieved from the MR systems and, in this case, provided directly by the challenge.
We tackle all the previous issues by implementing a heuristic in Python which allows, for each view of each patient, to retrieve a slice/part of the list of the images of equal length to the number of images in the view containing the fewest.4 Additionally, the retrieved slice contains the elements in the middle of the patient sequence, as previously introduced. See Figure 3.3 for a visual representation of how the lists are modified.
4Link to the Github repository: https://github.com/LinasVidziunas/Unsupervised-lesio
3.3 Dataset and Data Pre-processing
Figure 3.3: Illustration of the returned lists from the algorithm for retrieving slices of lists equal to the length of the shortest list, where the rectangles represent an element in a list.
3.4 Model Design
Python code 3.1: Pseudocode of the algorithm for the axial view.
1 # Number of images in the view that contains the least
2 minimum = min(len(axial), len(coronal), len(sagittal))
3
4 a = len(axial)//2-(minimum//2)
5 b = len(axial)-(ceil(len(axial)/2)-ceil(minimum/2))
6
7 axial = axial[a:b]
Using this approach results in 2119 images for each view in the train dataset, 345 for each view in the validation dataset, and 682 for each view in the test dataset. A total of 87 images containing a lesion are present in the validation set, and 155 are present in the test set out of the total number of images present in those sets. For the numbers of lesions present in each dataset, keep in mind that an actual lesion in three dimensions can be calculated up to three times, once for each view.
View Train Validation Test
Normal Abnormal Normal Abnormal
Axial 2119 319 26 636 46
Coronal 2119 312 33 624 58
Sagittal 2119 317 28 631 51
Table 3.5: Distribution of normal and abnormal (without lesion and with lesion) images in the datasets for only the multi-view architecture.
3.4 Model Design
This thesis explores two different architectures; one based on the widely-used U-Net architecture and one based on VGG16. The VAE and multi-view CAE models are extensions of the proposed CAE, meaning they use the same encoder and decoder architecture. All the AE models have been trained to reconstruct their input, using Adam as the optimizer from the Keras library [48].
3.4.1 Convolutional Autoencoder
The original paper describing VGG architecture used it to perform classification and, therefore, contains a fully connected classification layer at the end [17]. Our proposed CAE model inspired by VGG16 follows the standard AE structure: an encoder composed of the VGG16 architecture and a decoder mirroring the encoder.
3.4 Model Design
The encoder is the same as VGG16 except without the fully connected layers and the last max-pooling layer, and the decoder is the mirrored image of the encoder with upsampling layers instead of max-pooling. Detailed figures of the encoder and decoder structures used for all models including an illustration of the implemented U-Net CAE can be found in the appendix D. Onwards in this thesis, when referring to VGG16, we are referring to our implementation of VGG16 as an AE.
64
128 256 512 512 512 512 256
128 64 1
Sigmoid
Figure 3.4: The implemented autoencoder inspired by VGG16. Yellow boxes represent convolutional layers, red boxes represent max-pooling layers, and magenta boxes represent upsampling layers. The right-most layer represents a convolutional layer with sigmoid as its activation function. The numbers under the layers represent the number of filters.
The used loss function for single-view CAEs is binary cross-entropy. An explanation of binary cross-entropy is available in subsection 2.4.1.
3.4.2 Variational Autoencoder
In [12], before the bottleneck, they flatten the output from the encoder and feed it into a fully connected layer. This fully connected layer will be referred to as thepre- bottleneck layer. Similarly, after the bottleneck, they have also used a fully connected layer; this layer will be referred to as the post-bottleneck layer. See the Figure 3.5 for a visual representation of how the layers are connected.
Our VAE implementations are inspired by [12] and use both previously introduced layers,pre-bottleneck, andpost-bottleneck. Otherwise, the encoder and decoder parts of the implemented VAEs are precisely the same as their fully convolutional coun-
3.4 Model Design
Figure 3.5: Illustration of the layers in the bottleneck for our implementation of VAE.
We have used the sametrain_step as described in [12]. Thetrain_step specifies the loss as a combination of binary cross-entropy and KL loss. Explanations of both loss functions are available in section 2.4.
3.4.3 Multi-view Autoencoder
Our proposed implementation of a multi-view CAE features an encoder and a decoder for each view, all connecting to a single shared bottleneck.
The explored way to interconnect all encoders and decoders into a single shared bot- tleneck used a concatenation layer. The concatenation layer combines the previous layers into one layer.
Figure 3.6: Proposed multi-view model architecture. The arrows illustrate a concatenation operation.
Multi-view CAEs have been trained to minimize the binary cross-entropy reconstruc- tion loss for each view simultaneously.
3.5 Evaluation
3.5 Evaluation
The three evaluation methods in the following subsections have been used to evaluate the AE models. The evaluation method’s in section 3.5.1 and 3.5.3 were explained and introduced to us by our supervisor in the first and second meeting [6], [7] and also draws inspiration from this master’s thesis [15].
3.5.1 Area Under the Curve based on Reconstruction Loss
The classification performance of an autoencoder is evaluated through AUC. Area under the curve is commonly used to measure a model’s overall performance and ability to detect anomalies [7]. AUC is the area under the Receiver operating char- acteristic curve (ROC curve).
The ROC curve is a graph that shows the model’s performance of binary classification at all classification thresholds [57]. The curve plots two parameters; True Positive Rate (TPR) and False Positive Rate (FPR) (See Equation 3.2 and Equation 3.3).
T P R= T P
T P +F N (3.2)
F P R= F P
F P +T N (3.3)
When calculating the AUC based on MSE losses in this thesis, the MSE losses between the input images and the reconstructed images by the autoencoder are considered a prediction of abnormality. This is done by normalizing the MSE losses such that the value range is [0,1]. To calculate the AUC and plot the ROC curve, the normalized losses and the true labels (indicating whether the image contains a lesion or not) are given as inputs to functions belonging to the Scikit library [51].
3.5 Evaluation
3.5.2 Transfer Learning and Fine-tuning
This detection method is greatly inspired by [58]. TL simply means to transfer some learned weights from one model to another.
Applying TL to help with anomaly detection is done in the following three stages.
1. An AE model is trained on the training set (which contains exclusively healthy data).
2. The encoder and the embedding of the trained model are reused with learned weights, and a new categorical layer (dense layer with softmax [59] as activation function) is put on top of that. The encoder and the embedding weights are then frozen, and the new layer is trained to categorize normal and abnormal slices on the validation set, training it to recognize abnormal slices as the vector [1,0] and normal slices as the vector [0,1].
3. The weights of the whole new model are unfrozen, and the model is fine-tuned (training on a low learning rate and few epochs) on the validation set one more time.
Figure 3.7
When given inputs (image data), the model predicts a vector containing the proba- bilities of belonging to a class (example: [0.65, 0.35] meaning that 65% probability of being healthy and 35% probability of being unhealthy). The values that the model predicts when given an input from the test set are then normalized such that the value range is [0,1]. The normalized values and the true labels (PCa or non-PCa) are used to plot a ROC curve and calculate AUC using functions belonging to Scikit
3.5 Evaluation
library [51]. In both steps 2 and 3, the models are trained with categorical cross- entropy. In step 2, we have used the same learning rate as for the base autoencoder.
In step 3, we have used a learning rate of 1e-5, the same as used in [58]. For the CAEs, the same number of epochs was used as in [58], 20 for step 2 and 10 for step 3. For the VAEs, the number of epochs used in step 3 was extended to 100 because it is a probabilistic model.
3.5.3 Classification and Threshold Selection
Finally, we also explore a threshold-based method to classify between healthy and unhealthy images. Thresholds have been set on reconstruction losses for images in the validation dataset by considering the MSE on the difference between the input images and the predicted images. When classifying the images in the test set, all images with an MSE greater than the threshold are classified as abnormal and all images with an MSE lesser than the threshold are classified as normal.
In the paper [60], the author stated several statistical methods for outlier detection.
However, in this thesis, Tukey’s Method has been used because the method does not assume an underlying Gaussian or normal distribution. Interquartile Range (IQR) is a measure of the data’s variability about the median. IQR is given by Equation 3.4. Q1 is the first quartile of the data, representing 25% of the data lies between minimum and Q1. Q3 is the third quartile and is where 75% of the data is lying between minimum and Q3.
IQR=Q3−Q1 (3.4)
A threshold for outlier detection could then be set by using Equation 3.5 or 3.6
T hreshold=Q1−(1.5∗IQR) (3.5) T hreshold=Q3 + (1.5∗IQR) (3.6) Because the threshold is set on the reconstruction losses, the threshold given by the Equation 3.6 has been used, as we expected the abnormal images to have a higher reconstruction error [6]. The thresholds were determined from the validation set and applied to the test set.
3.5 Evaluation
We used sensitivity, specificity, F1, and accuracy metrics [61] to evaluate the detec- tion for abnormal and normal slices. The formulas are described in terms of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) (see Equation 3.7, Equation 3.8, Equation 3.9, and Equation 3.10).
Sensitivity = T P
T P +F N (3.7)
Specif icity= T P
T P +F N (3.8)
F1 = 2T P
2T P +F P +F N (3.9)
Accuracy= T P +T N
T P +T N +F P +F N (3.10)
These metrics were used to evaluate the performance with the threshold set as given by Equation 3.6. The effect of the threshold selection is documented with the metrics from equations 3.7, 3.8, 3.9 and 3.10. For all of the models, the tests have been carried out for every view (axial, sagittal and coronal) independently.
3.5 Evaluation
3.5.4 Bootstrapping with Replacement
Bootstrapping with replacement is a statistical method that mimics the effects of running the evaluation multiple times [62], which can be further used to carry out statistical testing or obtain confidence intervals, giving a measure of uncertainty around the results.
It is done by sampling n number of data points from the original data to create a new dataset. The resampling process is typically done with several repetitions; each new resampled dataset is called a bootstrapping set. Bootstrappingwith replacement means that if a data point is sampled into one set, it can also reappear in future sets. For all our tests, we have used bootstrapping with replacement for one hundred repetitions, where each bootstrapping set has been stratified such that it contains the same ratio PCa to non-PCa. Other works such as [14] use a similar approach.
In our work, we make use of theresample function imported from the Scikit library to carry out this task.
Python code 3.2: Pseudocode of how therasamplefunction is called along with the passed arguments. Detailed explanation of this function and it’s parameters can be found at [63].
1 bootstrap_data, bootstrap_labels = resample(test_data,
2 test_labels, replace=True,
3 n_samples=len(test_data), stratify=test_labels)
3.6 Learning Rate and Batch Size Selection
3.6 Learning Rate and Batch Size Selection
We experimented with different learning rates and batch sizes using both proposed CAE models. We used AUC scores (plotted in a graph every 10th epoch) on the validation set to evaluate the effect of different hyperparameters. The tested learning rates were 1e-3, 1e-4, 1e-5, 1e-6 and tested batch sizes were of 4, 8, 16, 32 samples.
All the experiments were run for 1500 epochs. The graphs can be found in appendix F and E.
Batch size Learning rate 32 samples 1×10−4
Table 3.6: Chosen hyperparameters. These hyperparameters are obtained only using the validation set.
The hyperparameters in Table 3.6 were used for all tests in the results chapter except for the multi-view CAE models because of computational limitations encountered on the batch size. For the multi-view models, a batch size of 8 was used.
3.7 Tuning
The rationale underlying the tuning of latent dimensions for each view is to adapt each model to achieve the best possible results. As mentioned in the introduction of this chapter, the reason for doing this is that lower dimensions have been shown to improve AE performance [8]. The tested dimensions differ for the different model architectures (CAE and VAE) and are presented in the following subsections. After brief experimentation, we determined to run the models for 300 epochs. AUC scores were calculated on the validation dataset at a set of epochs, including; 10, 25, 50, 100, 200, and 300, using the evaluation method outlined in section 3.5.1 to calculate the AUC score. The highest average AUC score determined the filter dimension we used for all further testing. If several dimensions resulted in the same average results, one was chosen randomly. The results from this tuning process were also used to choose the number of epochs to run on the final models. A visual illustration of the effect of tuning latent dimension is shown in appendix G.
3.7 Tuning
3.7.1 Convolutional Autoencoder
In order to tune U-Net and VGG16 for each view (axial, coronal, and sagittal), we experimented with the number of filters in thelast convolutional layer of the encoder structure. The following number of filters were tested: 128, 64, 32, 16, 8, and 4. See appendix G for tables on the calculated AUC scores throughout the set of different epochs on the validation set for the CAE models.
U-Net: Chosen number of filters
Axial 128
Coronal 32
Sagittal 16
Table 3.7: Chosen number of filters for the CAE models with U-Net architecture.
VGG16: Chosen number of filters
Axial 16
Coronal 32
Sagittal 4
Table 3.8: Chosen number of filters for the CAE models with VGG16 architecture.
3.7.2 Variational Autoencoder
This subsection does not contain the U-Net architecture because it failed to optimize the KL loss in the loss function. This is further discussed in chapter 5.
We fixed the size of the pre-bottleneck fully connected layer to the largest latent dimension we planned to test. This was done so that the pre-bottleneck fully con- nected layer does not become the main constraint in the VAE model. The largest latent dimension tested is 4096. Therefore, thepre-bottleneck layer had a dimension of 4096 units in all tests. The post-bottleneck fully connected layer is fixed to the same number of units as the latent image dimension in the encoder multiplied by 32, as 32 is the highest number of filters in the last convolutional layer of the encoder.
The number of filters in the last convolutional layer of the encoder was set to the same as the chosen number of filters presented in tables 3.7 and 3.8. The dimensions ofµ,σ, and sampling layers were tuned with the following number of units for each view: 64, 128, 256, 1024, and 4096.
For more extensive tables on tuning the VAE latent dimensions, see appendix H.
3.7 Tuning
Axial Coronal Sagittal
128 4096 4096
Table 3.9: Chosen latent dimensions for each view.
3.7.3 Selecting Epochs
The tables in appendix G were used to determine the appropriate number of epochs to run the final single- and multi-view CAE models. In addition, the figures in appendix E were used. From those graphs, it seems models with the chosen learning rate start producing diminishing results at around 150 epochs. One hundred epochs were chosen for simplicity as this seemed to be working well for all CAE models.
From the tables in appendix H it seemed the VAE models in general needed more epochs to get better results. Two hundred epochs were chosen and used on all VAE models to obtain the results in the chapter 4.
Chapter 4
Results
This chapter presents the results obtained on the test set for the previously tuned models, with parameters found in section 3.7.
The sectionUsing Reconstruction Loss, 4.1, presents results obtained by using recon- struction loss (see subsections 3.5.1 and 3.5.3) as a predictor for whether the image contains a lesion. In the section Using Transfer Learning Approach, 4.2, the AUC scores are calculated using the predictions from the fine-tuned model (see subsection 3.5.2).
The following AUC results are obtained using bootstrapping with replacement for one hundred repetitions. Measures such as F1, sensitivity, and specificity are not acquired through bootstrapping and are only present in the sectionUsing Reconstruction Loss 4.1 as they were calculated using the reconstructed losses. Again, AUC scores and other measures are acquired on the test dataset in this chapter.
The CAEs were trained for 100 epochs, and VAEs were trained for 200 epochs on the training set.
View Normal Abnormal Sum
Axial 787 58 845
Coronal 632 58 690
Sagittal 649 60 709
Table 4.1: Distribution of normal and abnormal images in the test set for the single-view models.
4.1 Using Reconstruction Loss
View Normal Abnormal Sum
Axial 636 46 682
Coronal 624 58 682
Sagittal 631 51 682
Table 4.2: Distribution of normal and abnormal images in the test set for the multi-view models.
4.1 Using Reconstruction Loss
Table 4.3 shows the mean and the standard deviation of the AUC score on the test dataset for the previously tuned models. The overall highest AUC score achieved is obtained for the coronal view using multi-view VGG16 CAE.
Model architecture View Mean AUC Std. AUC
CAE U-Net Axial 0.5406 0.0350
Coronal 0.5373 0.0352 Sagittal 0.5364 0.0366
VGG16 Axial 0.5394 0.0367
Coronal 0.5756 0.0395 Sagittal 0.5355 0.0380
VAE VGG16 Axial 0.5058 0.0334
Coronal 0.5583 0.0322 Sagittal 0.5420 0.0378 Multi-view CAE U-Net Axial 0.5090 0.0438 Coronal 0.5407 0.0441 Sagittal 0.5387 0.0481
VGG16 Axial 0.5003 0.0444
Coronal 0.5872 0.0424 Sagittal 0.5015 0.0437
Table 4.3: AUC scores based on reconstruction loss calculated as the MSE between input and output loss.
4.1 Using Reconstruction Loss
Model View Threshold TP TN FP FN Sensitivity Specificity F1 Accuracy
CAE U-Net Axial 1.9790e-05 2 770 17 56 0.0345 0.9784 0.0519 0.9136
Coronal 2.6411e-05 0 627 5 58 0.0 0.9922 0.0 0.9087
Sagittal 2.0121e-05 3 709 19 57 0.05 0.9739 0.0732 0.9036
VGG16 Axial 0.003801 3 771 16 55 0.0517 0.9797 0.0779 0.9160
Coronal 0.005697 0 632 0 58 0.0 1.0 0.0 0.9159
Sagittal 0.054743 2 717 11 58 0.0333 0.9849 0.0548 0.9124
VAE VGG16 Axial 0.024534 1 774 13 57 0.0172 0.9834 0.0278 0.9172
Coronal 0.027695 1 719 9 59 0.0167 0.9876 0.0286 0.9137
Sagittal 0.048382 2 712 16 58 0.0333 0.9780 0.0513 0.9061
mCAE U-Net Axial 2.2866e-06 1 619 17 45 0.0217 0.9733 0.0313 0.9091
Coronal 2.5542e-06 0 620 4 58 0.0 0.9936 0.0 0.9091
Sagittal 2.5843e-06 2 620 11 49 0.0392 0.9826 0.0625 0.9120
VGG16 Axial 0.003677 3 619 17 43 0.0652 0.9733 0.0909 0.9120
Coronal 0.008466 0 622 2 58 0.0 0.9968 0.0 0.9120
Sagittal 0.002426 1 629 2 50 0.0196 0.9968 0.0370 0.9238 Table 4.4: Measures of the detection performance for each model, obtained on the test
dataset without bootstrapping.
Table 4.4 shows that there are a lot of missed lesions. The models are good at detecting when an image does not contain a lesion but not as good at detecting lesions. The thresholds used are calculated using Tukey’s method (see subsection 3.5.3) and are not optimized for any specific metric. The best sensitivity score acquired is 0.0652 and is obtained on the axial view using the multi-view CAE implementation of the VGG16 model architecture. A sensitivity score of 0.0652 means that 6.52% of the lesions in the test dataset were correctly classified.
4.2 Using Transfer Learning Approach
4.2 Using Transfer Learning Approach
This section presents bootstrapped AUC score results using the transfer learning ap- proach. This method uses the same AEs presented above, in section 4.1, transferring the weights from their already pre-trained (on healthy data) encoders.
Model architecture View Mean AUC Std. AUC
CAE U-Net Axial 0.6190 0.0309
Coronal 0.5540 0.0361 Sagittal 0.5338 0.0314
VGG16 Axial 0.5842 0.0327
Coronal 0.5927 0.0377
Sagittal 0.5 0.0
VAE VGG16 Axial 0.5193 0.0384
Coronal 0.5501 0.0337 Sagittal 0.4992 0.0404 Multi-view CAE U-Net Axial 0.4831 0.0455 Coronal 0.5810 0.0434 Sagittal 0.5501 0.0340
VGG16 Axial 0.5155 0.0395
Coronal 0.6096 0.0403 Sagittal 0.5621 0.0361
Table 4.5: AUC scores obtained by using the transfer learning method for anomaly detec- tion on the test dataset.
This semi-supervised approach to detection improves the mean AUC score for most models. As a reminder, this method of detection further trains the models on the validation dataset. The validation dataset for the single-view AEs contains 399-425 images (depending on the view direction), where only a few of these contain a lesion, 31-33 (again, depending on the view direction). While the validation dataset for the multi-view AEs contains even less data for each view. See tables 3.3, 3.4, and 3.5 for the exact number of images in the validation dataset and the distribution of lesions amongst them.
4.3 Sensitivity Analysis of the Multi-view Autoencoders
4.3 Sensitivity Analysis of the Multi-view Autoencoders
For the multi-view CAEs, every view has an input and an output. This section ex- plains the processes followed to experiment with favoring one view above the others.
The hypothesis for this experiment was that by increasing the loss coefficients for one view, we would see an increased AUC score for that view. Further, we would be able to achieve a better AUC score on the best-performing view, coronal, and potentially reach an AUC score for the axial view comparable to the performance of the single-view CAE.
Favoring one view above others has been achieved by changing the defaultloss_weights parameter when calling Keras’s model compile function. Theloss_weights parame- ter accepts a list of coefficients for each output and, by default, is 1 for each output.
These loss coefficients are used to weight the loss contributions of different model outputs, where the loss values that will be minimized by the model will then be the weighted sum of all individual losses, weighted by the loss coefficients [64].
For this experiment, we have used the multi-view CAE with VGG16 model architec- ture, with the same parameters used in the in section 4. The chosen coefficients to experiment with were 1.2, 1.5, and 2. Each chosen coefficient has been tested for both the axial and the coronal view. The coronal view was chosen because it performs best for this architecture. The axial view was chosen because it performed better than the sagittal view in single-view VGG16, but underperformed in the multi-view.
This experiment consisted of 6 model runs. The presented results are the AUC scores using reconstruction loss (see section 3.5.1) for each view, and they are obtained through bootstrapping with replacement for one hundred iterations. The presented results also contain a multi-view with default loss coefficients to serve as a baseline.
Epochs Learning rate Batch size 100 1×10−4 8 samples Table 4.6: Summary of the used hyperparameters.
AUC scores shown in the Table 4.7, do not show a clear correlation with the coeffi- cients used. We attribute the variation of the AUC scores primarily to the random- ness of initialized weights and randomness of sampled points in the bootstrapping.
To further investigate how loss weights affect the separation of normal and abnormal data for abnormality detection, we recommend experimenting with a broader and
4.3 Sensitivity Analysis of the Multi-view Autoencoders
Coefficients Axial Coronal Sagittal
Axial Coronal Sagittal Mean AUC AUC std Mean AUC AUC std Mean AUC AUC std
1.0 1.0 1.0 0.5003 0.0444 0.5872 0.0424 0.5015 0.0437
1.2 1.0 1.0 0.5019 0.0445 0.5892 0.0445 0.5040 0.0439
1.5 1.0 1.0 0.5000 0.0446 0.5810 0.0443 0.5118 0.0449
2.0 1.0 1.0 0.5006 0.0445 0.5933 0.0445 0.5043 0.0444
1.0 1.2 1.0 0.5015 0.0445 0.5828 0.0420 0.5065 0.0438
1.0 1.5 1.0 0.5004 0.0440 0.5797 0.0440 0.5081 0.0450
1.0 2.0 1.0 0.5025 0.0446 0.5857 0.0437 0.5065 0.0441
Table 4.7: Comparison of AUC scores for multi-view models with modified loss coefficients.
The AUC scores are based on the MSE of the reconstruction loss for each view and are obtained through bootstrapping with replacement for hundred iterations.
Chapter 5
Discussion
This chapter discusses the methodology’s effectiveness (section 5.1) used in this thesis and reflects on what could be done in future works (section 5.2). Secondly, we evaluate the DL models used (section 5.3) and compare them to related works (section 5.4).
5.1 Effectiveness of Methodology
All patients in the PROSTATEx dataset [53] had at least one lesion. The supplemen- tary spreadsheet with information about where the lesions are located only specifies the center point of the lesion in each view. The slices (2D images) removed in the training dataset and the slices labeled as containing an abnormality in other sets only used this single point. Therefore it is not guaranteed that parts of the lesion are not present in nearby slices that were not labeled as abnormal. Preferably, the base AEs should have been trained on images from patients without PCa.
Both anomaly detection methods were able to differentiate to some degree between unhealthy and healthy data. The semi-supervised (using TL) method performed better than the fully unsupervised in almost all cases. None of the methods used in this work gave good enough results to be applicable in a clinical setting diagnosing patients.
For this thesis, we used the standard ROC-AUC functions from the Scikit package in Python and gave it all data as input, meaning we did not optimize the range of
5.2 Future Work
threshold for better AUC scores. Nevertheless, the AUC scores serve as a quantized way to compare the models. It is also important to note that we did not use a balanced dataset, which could have affected the results.
The latent dimensions of all models were tuned on the validation dataset. The rationale is based on the assumption that different views and models may need different latent dimensions to achieve an optimal latent representation. Too few tests were done to evaluate the effectiveness of this strategy, although it did reduce the amount of RAM necessary to run the models, which was critical when running the multi-view CAEs.
When attempting to build a VAE out of the U-Net architecture, it failed to opti- mize the KL loss out of the loss function; we believe the skip connections of that architecture may have caused this.
Reconstruction losses from the validation set was used to select and apply a threshold for classification on the test set. The threshold was calculated using Tukey’s method, and the threshold was likely too far out to the right of the distribution to give good results. The best sensitivity score of 0.0652 was acquired on the axial view using multi-view implementation of the VGG16 model architecture. Sensitivity measures the rate of true positives, and a score of 0.0652 means that only 6.5% of images containing lesions were correctly classified.
The multi-view models were built in such a way that they needed three inputs and gave out three outputs. The number of images in all datasets (train, validation, and test) were reduced due to the models’ requirement of equal amounts of data for each input. The reduction of data might have negatively impacted the models’ detection performance, especially for the axial view. In addition, changes to the test set might have affected the AUC score slightly.
5.2 Future Work
The data pre-processing for the multi-view CAEs in this work is not optimal mainly because some data had to be discarded. A solution that applies to our particular approach could include upsampling the data of the views containing less MRI image data. Another variation of training the model could be to mix the order of images between the views. In this work, the inputs given to the models were of the same patient. A more optimal approach to the multi-view models would be such that a lesion would appear simultaneously at each view either by using a 3 Dimensional (3D) input-output or registering the data in another way.
5.3 Evaluation of Deep Learning Models
The way that the threshold in the classification evaluation method was chosen could be improved. An alternative to Tukey’s method is to optimize based on a chosen metric on the validation dataset.
The parameters in the TL and fine-tuning were never properly tuned. The cho- sen learning rates and the number of epochs were primarily inspired by [58]. Al- though not quantified, our observations during testing of the TL approach indicated a substantial variability of AUC scores between epochs. We believe that this semi- supervised method for detection is far from being completely optimized and that a thorough tuning would significantly improve its detection ability.
We failed to build a working VAE model that functioned adequately with the U-Net architecture. One approach to solving the problem of the optimizer ignoring the KL-loss in the loss function could be to use a coefficient in front of the KL-loss that decreases with the number of epochs run.
In the future, when experimenting with the view loss coefficients for the multi-view models. We recommend experimenting with a larger and broader range of values than tested in this thesis.
5.3 Evaluation of Deep Learning Models
The U-Net models generally achieve higher AUC scores on the axial view, while VGG16 models performed better on the coronal view. Compared to the VAE, the CAE performed better overall. However, we observed a tendency for bettered per- formance for the VAE when the number of units in the bottleneck increased, but due to computational limitations, the number of units in the bottleneck could not be increased beyond 4096.
With both architectures (U-Net and VGG inspired), the multi-view AE models per- formance was better than the CAE and VAE for the coronal view with both pro- posed anomaly detection methods. The results were also better for the sagittal view with the multi-view CAEs than the single-view based (CAE and VAE) with the semi-supervised approach for detecting outliers. In addition, the multi-view CAE model with U-Net architecture performed better than the single-view CAE models for the sagittal view when considering the fully unsupervised method of anomaly detection. A plausible explanation for why the multi-view models performed worse on the axial view could be because of training data was significantly reduced in the training set from 2679 images to 2119.
5.4 Comparison to Related Work
Additional experimentation using pre-trained ImageNet weights (see appendix A) and tuning the loss coefficients for the multi-view architectures had no significant impact on the AUC scores.
Limitations
A limitation encountered for the VAE and multi-view architectures was depletion of random access memory (RAM) on the graphics processing unit (GPU). The graphics processing units used were the state-of-the-art Nvidia A100s with 40 gigabytes of RAM, so it is hard to blame the depletion of RAM on the GPU. The original plan for the VAE was to experiment with a broader and higher range of units in the bottleneck. The solution reached for the VAE was to reduce the range of tested units in the bottleneck. For multi-view models, the chosen solution to solve RAM depletion was to reduce the number of samples in a batch to 8 samples.
5.4 Comparison to Related Work
To the best of our knowledge, no existing work applies unsupervised learning with multi-view AEs to do anomaly detection on MRI image data of the prostate, and therefore, we do not have any direct comparisons.
Similar to the work in this thesis, [14] compared the performance of a multi-view approach including all image data for the task of classifying between PCa and non- PCa using the PROSTATEx dataset. [14] concluded that the use of a multi-view approach resulted in better results. The best-demonstrated results were with the multi-view approach using VGG16 architecture (0.843 AUC). However, it is worth noting that they used a fully supervised approach without autoencoders in their work. The best results obtained in this thesis with the fully unsupervised method was also with the multi-view VGG inspired CAE model (AUC 0.5872 on the coronal view, Table 4.3).