
This section evaluates Mask R-CNN on the corrosion dataset using various data augmentation schemes. Results for Mask R-CNN follow the same structure as for PSPNet, so parts of the discussion in this section are kept brief. Predicted segmentation masks are visualized and compared in Section 5.3. Transfer learning from MS COCO [37] was used in all experiments.
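For concreteness, the sketch below shows how COCO-pretrained weights could be loaded before fine-tuning, assuming an implementation along the lines of the publicly available Matterport Mask R-CNN (Keras/TensorFlow); the configuration values, class names and file paths are illustrative and not taken from the actual experimental setup.

```python
# Sketch: initializing Mask R-CNN with MS COCO weights before fine-tuning
# on the corrosion dataset. Assumes the Matterport Mask R-CNN package
# (mrcnn); configuration values and paths below are illustrative.
from mrcnn.config import Config
from mrcnn import model as modellib

class CorrosionConfig(Config):
    NAME = "corrosion"
    NUM_CLASSES = 1 + 1        # background + corrosion
    IMAGES_PER_GPU = 1         # batch size of 1 image, as in Section 4.5
    STEPS_PER_EPOCH = 100      # illustrative value

config = CorrosionConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")

# Load COCO weights, skipping the output layers whose shapes depend on the
# number of classes (2 here vs. 81 in COCO).
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
```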

5.2.1 No Data Augmentation

Mask R-CNN was first trained without any data augmentation to establish a performance baseline. Resulting frequency IoU values are plotted in Figure 5.6. Final class-wise IoU, mean IoU and frequency IoU are listed in Table 5.7.

Table 5.7: IoU for Mask R-CNN when trained without data augmentation.

Dataset      Background IoU   Corrosion IoU   Mean IoU   Frequency IoU
Training     94.7 %           78.9 %          86.8 %     91.7 %
Validation   88.0 %           57.1 %          72.5 %     81.3 %
Test         89.5 %           54.4 %          72.0 %     83.4 %

Figure 5.6: Frequency IoU on training, validation and test set for Mask R-CNN trained with no data augmentation.

Discussion

It is evident from Figure 5.6 that the network is severely overfitting. Whereas performance on training data continuously improves for the first 60 epochs before stagnating, performance on validation and test data never really increases after the first few epochs. After 50 epochs, the weights in the network seem to have stabilized to a degree where validation and test performance no longer oscillate between epochs. At this point, frequency IoU on the training set is approximately 13 % higher than on the validation set, a large gap indicating that the network struggles to generalize.

The difference between frequency IoU on validation and test data in Figure 5.6 may be explained by the fact that more pixels contain background in the test set than in the validation set, and that background IoU is higher on the test set than on the validation set.

Additionally, small variations are to be expected when the dataset, and in particular the validation and test sets, is small. The discrepancy should therefore not be given much weight.
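To make the pixel-frequency weighting explicit, the following sketch computes class-wise, mean and frequency-weighted IoU from a confusion matrix. It is a generic reimplementation of the standard definitions, not code from the thesis, and the example counts are made up purely for illustration.

```python
import numpy as np

def iou_metrics(conf):
    """IoU metrics from a (num_classes x num_classes) confusion matrix
    where conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / (tp + fp + fn)                 # class-wise IoU
    freq = conf.sum(axis=1) / conf.sum()      # fraction of pixels per class
    return iou, iou.mean(), (freq * iou).sum()

# Hypothetical counts: background dominates, so frequency IoU is pulled
# towards background IoU, which is why frequency IoU exceeds mean IoU.
conf = np.array([[9_000_000,   400_000],      # true background
                 [  600_000, 1_000_000]])     # true corrosion
class_iou, mean_iou, freq_iou = iou_metrics(conf)
```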

As discussed for PSPNet, more images, better networks or better data augmentation are needed to reduce overfitting and hopefully increase test performance. The next section therefore tests random flipping as data augmentation.

5.2.2 Random Flipping Data Augmentation

Mask R-CNN was trained with horizontal and vertical flipping as data augmentation. Resulting frequency IoU values are plotted in Figure 5.7. Final class-wise IoU, mean IoU and frequency IoU are listed in Table 5.8.
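A minimal sketch of how such flipping could be specified is shown below, assuming the imgaug library and a Matterport-style train() call. The model and config objects are reused from the earlier sketch, dataset_train and dataset_val are assumed to be prepared dataset objects, and the flip probabilities and layer selection are illustrative choices.

```python
# Sketch: random horizontal and vertical flipping as data augmentation,
# assuming the imgaug library. Each flip is applied independently with
# probability 0.5 (illustrative choice).
import imgaug.augmenters as iaa

flip_augmentation = iaa.Sequential([
    iaa.Fliplr(0.5),   # horizontal flip
    iaa.Flipud(0.5),   # vertical flip
])

# Passed to training in a Matterport-style implementation:
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=70, layers="all",
            augmentation=flip_augmentation)
```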

Table 5.8: IoU for Mask R-CNN when trained with horizontal and vertical flipping as data augmentation.

Dataset      Background IoU   Corrosion IoU   Mean IoU   Frequency IoU
Training     92.7 %           70.4 %          81.5 %     88.5 %
Validation   87.6 %           58.6 %          73.1 %     81.3 %
Test         90.2 %           58.7 %          74.4 %     84.7 %

Discussion

We see from Table 5.8 that using flipping as data augmentation decreases overfitting compared to no augmentation (Table 5.7). The difference between training and validation frequency IoU is reduced from 10.4 to 7.2 percentage points, and the corresponding difference for the test set is reduced from 8.3 to 3.8 percentage points.

Similar to the previous section, where training was done without any data augmentation, there is a rather large discrepancy between validation and test performance. A similar explanation applies to Figure 5.7, but it is somewhat surprising that flipping does not seem to decrease the gap. In fact, the gap grows by 1.3 percentage points. This could indicate that flipping increases the network's certainty in cases where it is actually incorrect. On the other hand, as discussed earlier, variations are to be expected when the validation and test sets each contain only 50 images.

The main cause of the reduced overfitting is that training IoU is lower. Essentially, the artificially increased dataset size makes it harder for Mask R-CNN to "remember" seen images, and hence the resulting performance figures are more reliable.

Figure 5.7: Frequency IoU on training, validation and test set for Mask R-CNN trained with horizontal and vertical flipping as data augmentation.

What is particularly interesting, however, is that actual performance also increases with flipping as data augmentation. That is, whereas data augmentation was only found to reduce overfitting for PSPNet, both validation and test performance increase for Mask R-CNN when flipping is used. A possible explanation is that the real benefits of data augmentation are more prominent for larger datasets with more varied examples. In terms of number of images, the dataset used for Mask R-CNN is exactly as large as the dataset used for PSPNet, but in terms of training examples the number is vastly larger if we consider each instance a separate image.

The model still overfits more than desired, however. Further data augmentation is therefore tested in the next section.

5.2.3 Composite Data Augmentation

Mask R-CNN was trained with the heavy data augmentation scheme detailed in Section 4.2. Resulting frequency IoU values are plotted in Figure 5.8. Final class-wise IoU, mean IoU and frequency IoU are listed in Table 5.9.
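The two-stage schedule can be expressed as two consecutive training calls with different augmenters, as sketched below. The composition of the heavy augmenter shown here is hypothetical; the actual operations are those detailed in Section 4.2.

```python
# Sketch: 50 epochs with a heavy augmentation scheme followed by 20 epochs
# with light augmentation (flipping only). The heavy augmenter below is a
# hypothetical composition; the real operations are given in Section 4.2.
import imgaug.augmenters as iaa

heavy_augmentation = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Flipud(0.5),
    iaa.Sometimes(0.5, iaa.Affine(rotate=(-20, 20), scale=(0.8, 1.2))),
    iaa.Sometimes(0.3, iaa.GaussianBlur(sigma=(0.0, 2.0))),
    iaa.Sometimes(0.3, iaa.Multiply((0.8, 1.2))),   # brightness jitter
])

light_augmentation = iaa.Sequential([iaa.Fliplr(0.5), iaa.Flipud(0.5)])

# Stage 1: heavy augmentation for the first 50 epochs.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=50, layers="all", augmentation=heavy_augmentation)

# Stage 2: continue to epoch 70 with light augmentation. In the Matterport
# implementation, `epochs` is the absolute epoch number to train to, so
# this call adds 20 more epochs.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=70, layers="all", augmentation=light_augmentation)
```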

Table 5.9: IoU for Mask R-CNN trained for 50 epochs with heavy data augmentation followed by 20 epochs with light data augmentation (horizontal and vertical flipping).

Dataset      Background IoU   Corrosion IoU   Mean IoU   Frequency IoU
Training     92.0 %           68.0 %          80.0 %     87.4 %
Validation   87.9 %           60.9 %          74.4 %     82.0 %
Test         89.9 %           56.5 %          73.2 %     84.1 %

Figure 5.8: Frequency IoU on training, validation and test set for Mask R-CNN trained with heavy data augmentation for 50 epochs followed by 20 epochs with light data augmentation (horizontal and vertical flipping).

Discussion

Yet again the heavy data augmentation scheme is shown to reduce overfitting. The differences between training performance and validation/test performance in Figure 5.8 are significantly smaller than with no augmentation (Figure 5.6) or random flipping (Figure 5.7). This corresponds well with the results for PSPNet using the same data augmentation scheme. Furthermore, a performance boost similar to that observed for PSPNet occurs at the 50-epoch mark on training and test data. Validation performance remains roughly constant even after switching to light augmentation, which also corresponds well with the somewhat odd validation behavior discussed earlier.

Similar to flipping, although to a lesser extent, the heavy data augmentation scheme increases test performance compared to no augmentation. Combined with the significantly reduced overfitting, this makes a strong case for the heavy data augmentation scheme being very useful for Mask R-CNN on the corrosion dataset.

It would be interesting to further investigate how the heavy data augmentation scheme would perform on even larger or more complex datasets with more classes. Due to the limited time frame of this master's thesis, this is suggested as further work.

5.2.4 Inference Time and Memory Footprint

Peak memory usage, average inference time per image and average training time per epoch for various data augmentation schemes are listed in Table 5.10.

Table 5.10: Peak memory usage [GB], inference frame rate [img/s] and average training time per epoch [s] for Mask R-CNN on the corrosion dataset using various data augmentation schemes. Inference and training were performed on a Titan X (Pascal) GPU with a batch size of 1 image, see Section 4.5 for details.

Memory   Inference   No augmentation   Flipping   Heavy (stage 1)
10 GB    1.4 img/s   1750 s            1780 s     1720 s
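As a rough illustration, the average inference time per image could be measured as sketched below, assuming a Matterport-style detect() API and a list of preloaded test images (test_images); the weight file path is hypothetical, and peak GPU memory would in practice be read from an external tool such as nvidia-smi rather than from Python.

```python
# Sketch: measuring average inference time per image with batch size 1,
# assuming a Matterport-style model.detect() API and test images loaded
# elsewhere. Paths and variable names are illustrative.
import time

model = modellib.MaskRCNN(mode="inference", config=config, model_dir="logs/")
model.load_weights("mask_rcnn_corrosion.h5", by_name=True)  # hypothetical path

start = time.perf_counter()
for image in test_images:
    _ = model.detect([image], verbose=0)   # batch size of 1 image
elapsed = time.perf_counter() - start

print(f"Average inference time: {elapsed / len(test_images):.2f} s/img "
      f"({len(test_images) / elapsed:.2f} img/s)")
```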

Discussion

As seen in Table 5.10, Mask R-CNN is a computationally demanding network. After all, it consists of four sub-networks, each of which is relatively large. The main reason training Mask R-CNN is considerably slower than PSPNet, however, is that each instance is provided as a separate PNG file. That is, an image of spatial size H × W annotated with K instances constitutes an H × W × K input matrix, i.e. K times as large as the corresponding image for semantic segmentation.
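The size difference can be illustrated with a small NumPy example using hypothetical dimensions:

```python
# Sketch: an image of size H x W annotated with K instances is represented
# as an H x W x K boolean mask stack, K times larger than the single H x W
# label map used for semantic segmentation. Dimensions are hypothetical.
import numpy as np

H, W, K = 1024, 1024, 8
instance_masks = np.zeros((H, W, K), dtype=bool)   # one channel per instance
semantic_mask = np.zeros((H, W), dtype=np.uint8)   # one label per pixel

print(instance_masks.nbytes / semantic_mask.nbytes)  # -> 8.0, i.e. K times larger
```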

Since PSPNet is much more efficient than Mask R-CNN, yet still fairly slow, it is obvious that more research is needed to obtain good classification performance with reasonable inference speed.