The U-Net uses three types of operations:

• Convolution

• Max Pooling

• Deconvolution

A convolution is an operation that transforms an image into a new one according to a convolution kernel. The kernel (or filter) is a matrix, in this case 3x3, which specifies the weight of each pixel of the old image in the new image. The convolution is applied pixel-wise: the center of the kernel gives the weight of the pixel itself, while the other values are the weights of its neighbors. See figures 2.6 and 2.7 for a visual explanation. For example, the following filter (known as the Sobel operator) will detect horizontal lines in an image.
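One common form of this kernel (the exact variant intended here is an assumption; this is the standard horizontal-edge Sobel kernel) is:

\[
G_y =
\begin{pmatrix}
-1 & -2 & -1 \\
 0 &  0 &  0 \\
 1 &  2 &  1
\end{pmatrix}
\]

The opposite signs above and below the center row make the response strongest where the intensity changes vertically, i.e. along horizontal lines.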

But the magic of neural networks is that we do not need to place this kernel anywhere ourselves: the network will learn this kernel, and many more, on its own. The parameters of all the kernels present in a CNN are learnable; they change when trying to improve the accuracy of the network. This means that, in theory, the network will find the optimal kernels for its specific task.

The original U-Net uses a valid padding strategy, which means that each convolution loses a one-pixel border. For example, in figure 2.6 a 4x4 input is transformed into a 2x2 output. Similarly, a 64x64 image would be reduced to 62x62.

This strategy results in many limitations, so in our model it was replaced with zero-padding, which preserves the input size. See figure 2.7 for a visual representation.

Figure 2.6: Convolution with 3x3 filter size, 4x4 input size (blue) and 2x2 output size (cyan). Source: [1]

Figure 2.7: Convolution with 3x3 filter size, 5x5 input size (blue) and 5x5 output size (cyan) using zero-padding. Source: [1]

The other two operations, max pooling and deconvolution, are used to halve and double the image resolution, respectively. Max pooling takes the maximum value of each non-overlapping 2x2 block of the image.

The input image A, represented as a 4x4 matrix, is transformed into a 2x2 matrix B. The reverse operation, where each pixel is repeated in order to fill a 2x2 area, is known as upsampling.
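As a minimal NumPy sketch (the values are illustrative, not taken from the thesis), both operations can be written as:

```python
import numpy as np

A = np.arange(16).reshape(4, 4)               # illustrative 4x4 input image

# 2x2 max pooling: the maximum of each non-overlapping 2x2 block.
B = A.reshape(2, 2, 2, 2).max(axis=(1, 3))    # B has shape (2, 2)

# Upsampling: repeat every pixel so that it fills a 2x2 area.
U = B.repeat(2, axis=0).repeat(2, axis=1)     # U has shape (4, 4)
```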

Since upsampling alone is not very effective, the U-Net uses a deconvolution to increase the resolution. A deconvolution, also known as up-convolution or transposed convolution, is equivalent to an upsampling followed by a convolution. It can be used to obtain more detailed results than a simple upsampling and, since it is a convolution after all, it will also be trained to produce the optimal result.

See figure 2.8 for an example of a deconvolution which doubles the input size.

Figure 2.8: Deconvolution with filter size 3x3. The 3x3 input image (blue) is upsampled, and then a convolution is applied to it. The resulting 6x6 image (cyan) is shown above. Source: [1]
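As an illustrative sketch in PyTorch (an assumption; the thesis implementation is not shown here), a deconvolution built exactly as described above, an upsampling followed by a convolution, doubles a 3x3 input to 6x6:

```python
import torch
import torch.nn as nn

# Deconvolution as described above: nearest-neighbour upsampling followed
# by a learnable convolution. Sizes mirror figure 2.8 (3x3 -> 6x6).
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),   # 3x3 -> 6x6
    nn.Conv2d(1, 1, kernel_size=3, padding=1),     # zero-padding keeps 6x6
)

x = torch.rand(1, 1, 3, 3)       # one single-channel 3x3 image
print(up_conv(x).shape)          # torch.Size([1, 1, 6, 6])
```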

2.4.2 3D U-Net

The modified 3D U-Net works on 3D images by replacing all the 2D operations with their 3D equivalents. It uses zero-padding, because the input images already use zero-padding. Other changes include the addition of batch normalization [9] and dropout [10] layers.

Batch normalization is the major improvement over the original U-Net. It is a layer which scales the outputs of the previous layer so that their mean is zero and their variance is one. This speeds up the training process by allowing a greater learning rate, reduces overfitting, provides a small regularization effect, and eliminates the vanishing gradient problem. In essence, it makes the layers more independent of each other.
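In formula form, following the batch normalization paper [9], each activation x is normalized using the mean µ_B and variance σ_B² of the current mini-batch, and then rescaled with two learnable parameters γ and β:

\[
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta
\]

where ε is a small constant added for numerical stability.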

Overfitting happens when the network memorizes the training data instead of learning its features. An overfitted model will perform very well on the training data, but very poorly on validation data (new data that it has not seen before).

Dropout prevents overfitting. When combined with batch normalization its effect is less noticeable, but using batch normalization and dropout together gives better results than using only batch normalization or only dropout.

The resulting architecture, when using an input size of 64x64x32, depth = 5 and num_base_features = 32, is shown in figure 2.9. This particular configuration has about 2 million trainable parameters.

The input, a grayscale 3D image, is shown on the top left. Then a series of operations, represented by the arrows, are applied to the input image. Finally, on the top right, we have the output: a segmentation map, an image with the same resolution as the input that shows, for each voxel, the probability of it being a spine.

The resolution displayed on each layer is the resolution before the max pooling, because that is the resolution of the features (channels). This means that there are 32 channels with size 64x64x32, and 512 channels with size 4x4x2. On the decode path, the channels from below are first upsampled to double resolution and then combined with the channels from the image on the left, which has the same resolution as the current image. As an example, the 256-channel 8x8x4 image is created by combining the 512-channel 4x4x2 image (upsampled to a 256-channel 8x8x4) with the 256-channel 8x8x4 from the left, creating a 512-channel 8x8x4 image. Then two convolutions are applied to this image, reducing the number of channels to 256. There are mainly 4 types of operations, explained below.

Figure 2.9: Structure of the 3D U-Net

One encode step is the sequential combination of the layers listed in table 2.1:

Table 2.1: Layers in an encode step

3x3x3 Convolution
Batch Normalization
Dropout
ReLU Activation
3x3x3 Convolution
Batch Normalization
ReLU Activation
2x2x2 Max Pooling
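A minimal PyTorch sketch of this encode step (the framework and the dropout rate are assumptions; the thesis code is not reproduced here):

```python
import torch.nn as nn

def encode_step(in_ch, out_ch, dropout=0.5):
    """One encode step following table 2.1 (dropout rate is an assumption)."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),   # 3x3x3 convolution
        nn.BatchNorm3d(out_ch),
        nn.Dropout3d(dropout),
        nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),  # 3x3x3 convolution
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=2),                          # 2x2x2 max pooling
    )
```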

The activation function used is the Rectified Linear Unit (ReLU). It is defined as:

ReLU(x) = max(0, x)

Figure 2.10 shows the graph of this function.

The last operation, max pooling, is the one that reduces the resolution by half.

One decode step is the sequential combination of the layers listed in table 2.2:

Figure 2.10: ReLU function. The negative values are set to zero.

Table 2.2: Layers in a decode step

3x3x3 Deconvolution

The first operation, a deconvolution, also known as transposed convolution, doubles the resolution of the image. The following step, a concatenation, doubles the number of features (channels) by including the images from a previous layer. This concatenation is represented by the horizontal arrows in figure 2.9. The remaining operations are similar to the ones in the encode step.
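As a sketch (again assuming PyTorch; the layers after the concatenation are assumed to mirror the encode step, and the exact deconvolution parameters are illustrative):

```python
import torch
import torch.nn as nn

class DecodeStep(nn.Module):
    """One decode step: deconvolution, concatenation with the skip
    connection, then convolutions as in the encode step (sketch)."""

    def __init__(self, in_ch):                        # e.g. in_ch = 512
        super().__init__()
        # 3x3x3 deconvolution that doubles the resolution and halves the channels.
        self.up = nn.ConvTranspose3d(in_ch, in_ch // 2, kernel_size=3,
                                     stride=2, padding=1, output_padding=1)
        self.convs = nn.Sequential(
            nn.Conv3d(in_ch, in_ch // 2, kernel_size=3, padding=1),
            nn.BatchNorm3d(in_ch // 2),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_ch // 2, in_ch // 2, kernel_size=3, padding=1),
            nn.BatchNorm3d(in_ch // 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, below, skip):
        x = self.up(below)                 # e.g. 512ch 4x4x2 -> 256ch 8x8x4
        x = torch.cat([x, skip], dim=1)    # concatenation    -> 512ch 8x8x4
        return self.convs(x)               # two convolutions -> 256ch 8x8x4
```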

The last layer of the network is a 1x1x1 convolution + sigmoid.

A 1x1x1 convolution acts as a fully connected layer. It merges all the channels of the image (32 in our case) into n channels, with n being the number of labels. Here n = 1 because we only have one label, spine; the background is implicit.

The sigmoid is another activation function, with the property that its output always lies between 0 and 1 (ReLUs do not have that property). This is useful because we represent the intensity in the images as a real number between 0 and 1:

S(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}

Figure 2.11: Sigmoid function. The output is capped to the [0, 1] interval.
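A sketch of this final layer (assuming PyTorch and the 32 feature channels mentioned above):

```python
import torch.nn as nn

# 1x1x1 convolution merging the 32 channels into n = 1 output channel
# (the "spine" label), followed by a sigmoid that maps it into [0, 1].
output_layer = nn.Sequential(
    nn.Conv3d(32, 1, kernel_size=1),
    nn.Sigmoid(),
)
```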

Besides the architecture, it is also important to mention the training process. We use the Adam optimizer with an initial learning rate of 10⁻⁴. The learning rate scheduler reduces the learning rate when the validation loss has not improved for 10 epochs.
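A minimal sketch of this training setup (PyTorch is assumed here, and so is the reduction factor of the scheduler; the placeholder model stands in for the 3D U-Net):

```python
import torch
import torch.nn as nn

model = nn.Conv3d(1, 1, kernel_size=3, padding=1)   # placeholder for the 3D U-Net

# Adam optimizer with an initial learning rate of 10^-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Reduce the learning rate when the validation loss has not improved
# for 10 epochs (the reduction factor of 0.1 is an assumption).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10)

# After each epoch, the scheduler is stepped with the validation loss:
# scheduler.step(val_loss)
```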

The batch size was chosen to be 2 images, mainly because of memory constraints.

The loss function is binary crossentropy, defined as:

H_{y'}(y) := -\sum_i \left( y'_i \log(y_i) + (1 - y'_i) \log(1 - y_i) \right)

where y_i is the predicted probability value for sample i and y'_i is the true probability for that sample. In the case of 3D images, a sample is a voxel. When the prediction equals the true value, the loss is 0; as the prediction moves away from the true value, the loss grows towards positive infinity.
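For a single voxel, the behaviour of this loss can be checked directly (a small sketch with illustrative numbers):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Binary crossentropy for a single voxel; eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(bce(1.0, 0.99))   # ~0.01: confident and correct -> loss near 0
print(bce(1.0, 0.01))   # ~4.61: confident and wrong   -> large loss
```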

This loss function was chosen over the Dice coefficient loss: even though the latter allows multi-class predictions (with multiple labels) and leads to better-defined predictions, binary crossentropy was found to perform slightly better. See figure 2.12 for a visual comparison.

Figure 2.12: Demonstration of how the loss function affects the model. From top to bottom: original image, prediction of binary crossentropy model, prediction of dice coefficient model, ground truth.

The weights are initialized from a normal distribution with µ = 0 and σ = √(2/N), which is optimal for this kind of network [11]; the same initialization is used in the original U-Net.
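A sketch of this initialization (assuming PyTorch, and taking N as the number of incoming connections of each layer, i.e. its fan-in):

```python
import math
import torch.nn as nn

def init_weights(layer):
    """Draw weights from N(0, sqrt(2/N)), with N the fan-in of the layer."""
    if isinstance(layer, (nn.Conv3d, nn.ConvTranspose3d)):
        fan_in = layer.in_channels * math.prod(layer.kernel_size)
        nn.init.normal_(layer.weight, mean=0.0, std=math.sqrt(2.0 / fan_in))
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

# model.apply(init_weights) applies this to every layer of the network.
```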

2.5 Validation

The validation consists of two independent steps:

• First, the 3D U-Net is trained and validated using the ground truth data.

• Next, the trained 3D U-Net is validated using the prediction of Model 1 as the label. In this phase the spines are extracted from the prediction, which means that there will be some false positives. The goal of Model 2 is to identify these false positives, as well as to improve the accuracy of the prediction.

The metrics used in the validation are obtained from a confusion matrix. This matrix is created by comparing the prediction with the real values and counting the number of matches for each case. Since this is a binary comparison, a voxel must be either 0 or 1; the prediction, however, is a probability, so it can be anywhere between 0 and 1. Therefore, the prediction must be binarized. The binarization method is a simple thresholding with t = 0.5. When comparing 3D images we work voxel by voxel, so the confusion matrix shows numbers of voxels.

                          Predicted value
                          0                   1
  Actual value   0        True Negatives      False Positives
                 1        False Negatives     True Positives

Figure 2.13: Confusion matrix

As shown in figure 2.13, the confusion matrix shows the correct guesses: True Negative (TN) (expected 0, predicted 0) and True Positive (TP) (expected 1, predicted 1). It also shows the errors: False Positive (FP) (expected 0, predicted 1) and False Negative (FN) (expected 1, predicted 0).
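A minimal sketch of the binarization and the voxel-wise counting (NumPy; array names are illustrative):

```python
import numpy as np

def confusion_counts(pred, truth, t=0.5):
    """Binarize the predicted probabilities and count voxels per case."""
    pred_bin = pred >= t                     # thresholding with t = 0.5
    truth = truth.astype(bool)
    tp = int(np.sum(pred_bin & truth))       # expected 1, predicted 1
    tn = int(np.sum(~pred_bin & ~truth))     # expected 0, predicted 0
    fp = int(np.sum(pred_bin & ~truth))      # expected 0, predicted 1
    fn = int(np.sum(~pred_bin & truth))      # expected 1, predicted 0
    return tp, tn, fp, fn
```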

The following metrics are used to evaluate the model:

• Sensitivity, or True Positive Rate (TPR):

TPR = \frac{TP}{TP + FN}

• Precision, or Positive Predictive Value (PPV):

PPV = \frac{TP}{TP + FP}

Both metrics should ideally be close to 1. A high sensitivity indicates that there are few false negatives, while a high precision indicates that there are few false positives.

In other words, a highly sensitive model will detect the spine and its surroundings without missing any part of the spine, but it may misidentify some dendrites as spines. A highly precise model will only detect spines as spines, but it may miss some parts of them.
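Continuing the sketch above (with illustrative arrays), both metrics follow directly from those counts:

```python
import numpy as np

rng = np.random.default_rng(0)
prediction = rng.random((64, 64, 32))             # illustrative probability map
ground_truth = rng.random((64, 64, 32)) > 0.9     # illustrative binary labels

tp, tn, fp, fn = confusion_counts(prediction, ground_truth)

sensitivity = tp / (tp + fn)   # TPR: how much of the spine was found
precision = tp / (tp + fp)     # PPV: how much of the detection is really spine
```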