
Figure: The SGDR learning rate schedule.

2.3 Deep learning for image segmentation

2.3.1 Performance metrics

There is one large problem with image segmentation as opposed to other image analysis tasks: the difficulty of dealing with class imbalance. In segmentation, we classify each pixel in an image, and the object(s) we are interested in segmenting generally occupy a small area of the image. Thus, the vast majority of the pixels are background pixels.

The reason class imbalance is a particular problem in segmentation is that it cannot be resolved with the usual sampling algorithms [77]. These algorithms rebalance a dataset by over- or under-sampling individual examples; unfortunately, this is not feasible for images, as whole images are fed into the network at once.

Thus, segmentation accuracy, that is, the fraction of correctly classified pixels, is not an ideal measure of a network's performance. To illustrate this, consider an image where 9000 pixels belong to the background class and 1000 pixels belong to the infected tissue class. In this setting, an algorithm that classifies all pixels as background will achieve an accuracy of 90% whilst completely failing at the task at hand. As a result of this, we introduce several other performance measures for comparing segmentation algorithms.
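To make the arithmetic concrete, here is a minimal sketch (hypothetical NumPy masks, not data from this project) reproducing the example above: a predictor that outputs only background reaches 90% pixel accuracy while detecting no infected tissue.

```python
import numpy as np

# Hypothetical ground truth: 9000 background pixels (0) and 1000 infected-tissue pixels (1).
ground_truth = np.zeros(10_000, dtype=int)
ground_truth[:1_000] = 1

# A degenerate "model" that predicts background everywhere.
prediction = np.zeros_like(ground_truth)

accuracy = np.mean(prediction == ground_truth)
print(f"Pixel accuracy: {accuracy:.0%}")  # 90%, despite finding no infected tissue
```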

All performance measures we introduce are designed for binary classification problems. The background class is the negative class and the class of interest is the positive class. For segmentation problems with more than one class, the performance measures can be computed for each class separately. The class of interest is then used as the positive class and all other classes are combined into the negative class.
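As an illustration of this one-vs-rest reduction, the following sketch (assuming integer-labelled NumPy masks; the helper name binarise is ours, not from any library) turns a multi-class label map into a binary mask for one chosen positive class.

```python
import numpy as np

def binarise(mask: np.ndarray, positive_class: int) -> np.ndarray:
    """One-vs-rest reduction: the chosen class becomes positive, all others negative."""
    return mask == positive_class

# Hypothetical 3-class label map (0 = background, 1 and 2 = classes of interest).
labels = np.array([[0, 0, 1],
                   [2, 1, 1],
                   [0, 2, 0]])
class_1_mask = binarise(labels, positive_class=1)  # True where class 1 is, False elsewhere
```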

There are four quantities that are integral when we design performance measures: true positives, false positives, true negatives and false negatives. See Definitions 2.3.1 - 2.3.4 for definitions of these terms.

Definition 2.3.1 (True positives). The number of true positives (TP) is the number of pixels belonging to the positive class that were correctly predicted as members of that class.

For a single image, TP is the total number of pixels belonging to the positive class that were correctly predicted as members of that class. For several images, this is summed up across all images.

Definition 2.3.2 (True negatives). The number of true negatives (TN) is the number of pixels belonging to the negative class that were correctly predicted as members of that class.

For a single image, TN is the total number of pixels belonging to the negative class that were correctly predicted as members of that class. For several images, this is summed up across all images.

Definition 2.3.3 (False negatives). The number of false negatives (FN) is the number of pixels belonging to the positive class that were incorrectly predicted as members of the negative class.

For a single image, FN is the total number of pixels belonging to the positive class that were incorrectly predicted as members of the negative class. For several images, this is summed up across all images.

Definition 2.3.4 (False positives). The number of false positives (FP) is the number of pixels belonging to the negative class that were incorrectly predicted as members of the positive class.

For a single image, FP is the total number of pixels belonging to the negative class that were incorrectly predicted as members of the positive class. For several images, this is summed up across all images.
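These four counts can be computed directly from a pair of binary masks. The sketch below is one possible implementation with NumPy (the function confusion_counts is a hypothetical helper, not part of any library); summing over a stacked batch of images handles the multi-image case.

```python
import numpy as np

def confusion_counts(prediction: np.ndarray, target: np.ndarray):
    """Return (TP, FP, TN, FN) for binary masks, summed over all pixels (and images)."""
    prediction = prediction.astype(bool)
    target = target.astype(bool)
    tp = int(np.sum(prediction & target))    # positive pixels predicted positive
    fp = int(np.sum(prediction & ~target))   # negative pixels predicted positive
    tn = int(np.sum(~prediction & ~target))  # negative pixels predicted negative
    fn = int(np.sum(~prediction & target))   # positive pixels predicted negative
    return tp, fp, tn, fn

# Works for a single mask or for a whole batch stacked along the first axis.
target = np.array([[0, 1, 1], [0, 0, 1]])
prediction = np.array([[0, 1, 0], [1, 0, 1]])
print(confusion_counts(prediction, target))  # (2, 1, 2, 1)
```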

Two very popular performance metrics are sensitivity and specificity [36]. These terms are defined in Definition 2.3.5 and Definition 2.3.6. Sensitivity measures the network's ability to correctly detect positive pixels, and specificity measures the network's ability to correctly detect negative pixels.

Definition 2.3.5 (Sensitivity). The sensitivity is the true positive rate (TPR) of a network. That is, the proportion of the positives that were correctly identified. Mathematically, this is the same as

\[
TPR = P(\text{Predicted positive} \mid \text{Positive}) = \frac{TP}{TP + FN}, \tag{2.53}
\]

where TP is the number of true positives and FN is the number of false negatives.

Another word used for sensitivity is recall.

Definition 2.3.6 (Specificity). The specificity is the true negative rate (TNR) of a network. That is, the proportion of the negatives that were correctly identified. Mathematically, this is the same as

\[
TNR = P(\text{Predicted negative} \mid \text{Negative}) = \frac{TN}{TN + FP}, \tag{2.54}
\]

where TN is the number of true negatives and FP is the number of false positives.

Another word used for specificity is selectivity.

Using only the sensitivity or only the specificity does not properly reflect model performance. However, the combination of the two carries much information about the network's performance. The reason for this is that we can get a sensitivity of one if all pixels are predicted as positive pixels. Similarly, the specificity is one if all pixels are predicted as negative pixels. Thus, simply having a network with high sensitivity or specificity is not particularly enlightening.
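To make this concrete, the following sketch (hypothetical NumPy masks again) evaluates the two degenerate predictors on an imbalanced image: predicting everything positive yields a sensitivity of one, predicting everything negative yields a specificity of one, and neither is a useful model.

```python
import numpy as np

ground_truth = np.zeros(10_000, dtype=bool)
ground_truth[:1_000] = True  # 1000 positive pixels, 9000 negative pixels

for name, prediction in [("all positive", np.ones_like(ground_truth)),
                         ("all negative", np.zeros_like(ground_truth))]:
    tp = np.sum(prediction & ground_truth)
    fn = np.sum(~prediction & ground_truth)
    tn = np.sum(~prediction & ~ground_truth)
    fp = np.sum(prediction & ~ground_truth)
    sensitivity = tp / (tp + fn)  # Eq. (2.53)
    specificity = tn / (tn + fp)  # Eq. (2.54)
    print(f"{name}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```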

Another performance metric that is important when reviewing tumour segmentation maps is the precision, or positive predictive value [78]. The definition of this metric is given in Definition 2.3.7.

Definition 2.3.7 (Positive predictive value). The positive predictive value (PPV), or precision, of a network is the probability that a positively predicted pixel in fact belongs to the positive class. Mathematically, this is the same as

\[
PPV = P(\text{Positive} \mid \text{Predicted positive}) = \frac{TP}{TP + FP}, \tag{2.55}
\]

where TP is the number of true positives and FP is the number of false positives.

The main downside of the PPV is that it depends on the dataset on which it is computed. However, if the class imbalance in our dataset is "typical", then the precision can be more descriptive than the sensitivity and specificity alone. The reason for this is that the precision can be low even when both the sensitivity and specificity are high, so long as the class imbalance is severe enough. Thus, the PPV gives a clear indication of how adept the network is at detecting objects of interest in real-world images.
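The following back-of-the-envelope sketch (hypothetical counts, not measurements from any dataset) illustrates the point: with a 1:1000 class ratio, a model with 95% sensitivity and 95% specificity still achieves a precision below 2%.

```python
# Hypothetical counts: 1,000 positive pixels and 1,000,000 negative pixels.
positives, negatives = 1_000, 1_000_000
sensitivity, specificity = 0.95, 0.95

tp = sensitivity * positives        # 950 correctly detected positives
fp = (1 - specificity) * negatives  # 50,000 false alarms
ppv = tp / (tp + fp)                # Eq. (2.55)
print(f"Precision: {ppv:.3f}")      # ~0.019, despite high sensitivity and specificity
```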

One metric that is particularly popular for assessing image segmentation algorithms is the Sørensen-Dice coefficient [79], [80]. This particular metric has many names; among others, it is known as the F1-score. The Dice score is defined in Definition 2.3.8.

Definition 2.3.8 (Dice score). The Dice score (DSC) is defined as the harmonic average of the precision (PPV) and the sensitivity (TPR). Mathematically, this is equivalent to

\[
DSC = \frac{1}{\frac{1}{2}\left(\frac{1}{TPR} + \frac{1}{PPV}\right)}, \tag{2.56}
\]

that is, the reciprocal of the mean of the reciprocals of PPV and TPR. This can be rewritten as

\[
DSC = \frac{2}{\frac{1}{TPR} + \frac{1}{PPV}} = \frac{2TP}{2TP + FN + FP}, \tag{2.57}
\]

where TP is the number of true positives, FN is the number of false negatives and FP is the number of false positives.

From Definition 2.3.8, we see that one benefit of the Dice coefficient is that it does not involve the number of true negatives. This is an advantage because, if the object of interest is small compared to the background, it is very easy to get a high number of true negatives.
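A minimal sketch of Equation (2.57) for binary NumPy masks is shown below (dice_score is a hypothetical helper); because true negatives never enter the formula, the all-background predictor from earlier scores zero rather than 90%.

```python
import numpy as np

def dice_score(prediction: np.ndarray, target: np.ndarray) -> float:
    """Dice score 2TP / (2TP + FN + FP) for binary masks, as in Eq. (2.57)."""
    prediction = prediction.astype(bool)
    target = target.astype(bool)
    tp = np.sum(prediction & target)
    fn = np.sum(~prediction & target)
    fp = np.sum(prediction & ~target)
    return 2 * tp / (2 * tp + fn + fp)

# True negatives never enter the formula, so an all-background prediction
# on an imbalanced mask scores 0, not 0.9 as plain accuracy would.
target = np.zeros(10_000, dtype=bool)
target[:1_000] = True
print(dice_score(np.zeros_like(target), target))  # 0.0
print(dice_score(target, target))                 # 1.0
```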

The Dice coefficient is an average (specifically, the harmonic mean) of the sensitivity and precision. It is, therefore, easy to generalise it to a weighted average of sensitivity and precision. Doing this gives rise to the Fβ score [81], which is defined in Definition 2.3.9. Thus, we have a method to weigh sensitivity more than precision and vice versa.

Definition 2.3.9 (Fβ score). The Fβ score is defined as the weighted harmonic average of the precision (PPV) and the sensitivity (TPR), where the weight of the sensitivity is β² and the weight of the precision is 1. Mathematically, this is equivalent to

\[
F_\beta = \frac{1}{\frac{1}{1+\beta^2}\left(\frac{\beta^2}{TPR} + \frac{1}{PPV}\right)}, \tag{2.58}
\]

which can be rewritten as

\[
F_\beta = \frac{1+\beta^2}{\frac{\beta^2}{TPR} + \frac{1}{PPV}} = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2 FN + FP}, \tag{2.59}
\]

where TP is the number of true positives, FN is the number of false negatives and FP is the number of false positives.

When we use the Fβ score, the sensitivity has β times as much influence on the score as the PPV. Thus, if we value sensitivity more than the PPV, β should be high and vice versa. Explaining why the β value must be squared is outside the scope of this project, and the interested reader is recommended to read pages 133 and 134 of Information Retrieval by Rijsbergen [81].
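As a final sketch, Equation (2.59) can be computed directly from the counts TP, FN and FP (the helper f_beta below is hypothetical); setting β = 1 recovers the Dice score, while larger β weights sensitivity more heavily.

```python
def f_beta(tp: float, fn: float, fp: float, beta: float = 1.0) -> float:
    """F-beta score (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP), Eq. (2.59)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# With beta = 1 this is the Dice score; beta > 1 weights sensitivity more heavily.
print(f_beta(tp=950, fn=50, fp=100, beta=1.0))  # ~0.927
print(f_beta(tp=950, fn=50, fp=100, beta=2.0))  # ~0.941
```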