elevation differences. These concepts are further described in Subsection 4.2.

A single JavaScript Object Notation (JSON) file is produced by the VIA annotator software for each batch of samples that has been annotated. JSON is a simple, human-friendly standard for saving data to a file; in its raw format, it is readable and writable by people. To simplify checking the annotations and performing the next steps in the data set annotation process, several changes have been made to the structure of the data in the files. First, the data in each batch file has been split into individual files, so that every image has a corresponding JSON file with information on its annotation.
Redundant and excess information has been stripped from the files. Each frame is left with a single JSON file that includes img_height, img_width, and regions, the latter containing every segment in the frame with its corresponding label and polygon.
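As an illustration of this restructuring step, the sketch below splits a VIA batch export into per-frame JSON files holding only the fields listed above. It is a minimal sketch, not the actual script used in this thesis: it assumes VIA 2.x's export layout (entries keyed by filename and file size, each with a regions list of polygons), a region attribute named label, and the use of Pillow to read the image dimensions, since the VIA export itself does not store them.

import json
from pathlib import Path

from PIL import Image  # assumption: Pillow is used to read image dimensions

BATCH_FILE = Path("via_batch_export.json")  # placeholder paths
IMAGE_DIR = Path("frames")
OUT_DIR = Path("annotations")
OUT_DIR.mkdir(exist_ok=True)

batch = json.loads(BATCH_FILE.read_text())

# VIA keys each entry by "<filename><filesize>"; the entry itself
# carries the filename and the list of annotated regions.
for entry in batch.values():
    filename = entry["filename"]
    width, height = Image.open(IMAGE_DIR / filename).size

    regions = []
    for region in entry["regions"]:
        shape = region["shape_attributes"]   # polygon points
        attrs = region["region_attributes"]  # per-region attributes
        regions.append({
            "label": attrs.get("label", ""),  # "label" attribute name is an assumption
            "polygon": list(zip(shape["all_points_x"],
                                shape["all_points_y"])),
        })

    # One stripped-down JSON file per frame, keeping only
    # img_height, img_width and regions.
    per_frame = {"img_height": height, "img_width": width, "regions": regions}
    (OUT_DIR / f"{Path(filename).stem}.json").write_text(
        json.dumps(per_frame, indent=2))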
5) Labelling ignored segments
The polygons in the JSON files consist of points placed by tracing the edge of each segment. As the software offers no support for using one edge of a polygon as the shared border between two segments, there will be space between segments that is not annotated. The solution used in this thesis is dilation. Temporary masks are made from the JSON files, with a corresponding RGB color code for each category. These masks are then processed with a dilation script that extends the borders of each segment equally: each unique segment grows by one pixel per iteration, and this repeats until a border is reached between two segments and all non-labelled areas in the image are labelled. Figure 3.7 illustrates the dilation of unlabelled areas.
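A minimal sketch of such an iterative dilation is given below, assuming the temporary masks have already been mapped to an integer label image in which non-annotated gap pixels carry a sentinel value. The per-label (rather than per-segment) growth and the tie-breaking when two regions reach the same gap pixel in the same pass are simplifications, and SciPy is an assumption, not necessarily the library used in the thesis.

import numpy as np
from scipy import ndimage  # assumption: SciPy for the morphological operations

UNLABELED = -1  # sentinel for non-annotated gap pixels

def fill_gaps_by_dilation(labels: np.ndarray) -> np.ndarray:
    """Grow every labelled region by one pixel per pass until no
    UNLABELED pixels remain; growth stops where regions meet."""
    labels = labels.copy()
    struct = ndimage.generate_binary_structure(2, 1)  # 4-connected, one-pixel growth
    while (labels == UNLABELED).any():
        grown = labels.copy()
        for value in np.unique(labels):
            if value == UNLABELED:
                continue
            # One-pixel ring around this label, restricted to gap pixels.
            ring = ndimage.binary_dilation(labels == value, struct)
            grown[ring & (labels == UNLABELED)] = value
        labels = grown  # assumes at least one labelled pixel exists
    return labels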
6) Producing color-coded masks
The Deeplabv3+ model requires the data set to include a color-coded (grayscale, single-channel) mask alongside the original data during training and evaluation. With grayscale color-coded masks, Deeplabv3+ restricts the number of category entries to 256 (grayscale allows for 256 values); additionally, Deeplabv3+ locks the value 0 to the category background. This forces the color-code mapping to hold five categories, while Vale v2.0 still only requires four; hence, there is an overhead of one unused category entry. An overview of the category and label correspondences can be seen in Table 3.4. Going from the temporary RGB masks of the labelling ignored segments step to grayscale masks is simply a conversion from one labelling system to the other; a sketch of this conversion is given after Figure 3.7.
Fig. 3.7: Dilation of RGB color-coded masks. (a) Original mask; (b) dilated mask.
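Since the mapping in Table 3.4 is fixed, the conversion from RGB masks to grayscale masks amounts to a per-pixel lookup from RGB triplet to grayscale value. The sketch below shows one possible implementation; the file paths are placeholders, and NumPy and Pillow are assumptions rather than the thesis' actual tooling.

import numpy as np
from PIL import Image

# RGB triplet -> grayscale class index, following Table 3.4.
COLOR_TO_INDEX = {
    (0, 0, 0):     0,  # background (unused in training and evaluation)
    (0, 255, 0):   1,  # wheeled
    (255, 255, 0): 2,  # belted
    (255, 128, 0): 3,  # legged
    (255, 0, 0):   4,  # non-traversable
}

def rgb_to_index_mask(rgb_mask: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) RGB mask to an (H, W) uint8 grayscale mask."""
    index = np.zeros(rgb_mask.shape[:2], dtype=np.uint8)
    for color, idx in COLOR_TO_INDEX.items():
        index[np.all(rgb_mask == color, axis=-1)] = idx
    return index

# Example usage with placeholder paths.
rgb = np.array(Image.open("masks_rgb/frame_0001.png").convert("RGB"))
Image.fromarray(rgb_to_index_mask(rgb)).save("masks_gray/frame_0001.png")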
3.1.5 Partitioning Vale v2.0

Vale v2.0 has been partitioned into three sets: a training set, a validation set, and a test set. The training set is used to train the model, while the validation set is used to validate the trained model and provides a means to improve the hyperparameters based on its results. Finally, the test set's purpose is to obtain a final value for the performance of the model; no further training is performed after this evaluation. The size of each set is based on its purpose. The training set requires a large size, as it is what trains the model. The validation set is also used during training, but since far fewer parameters (the hyperparameters) need tuning, this set can be smaller. The test set's purpose of valuing the result can be met with a good sample selection that covers a general state of the problem space, so its size is not the highest priority; note, however, that with a smaller or larger test set the result would of course vary. Based on the factors above, the final partition percentages were set to ~58.83% for the training set (353 of 600 frames), ~25.33% for the validation set (152 frames), and ~15.83% for the test set (95 frames). An overview of the partitioning is presented in Table 3.5.
Table 3.4: Vale v2.0 label correspondences. The categories go from three channels in the RGB labels to a single channel in the grayscale labels. *The background label is not used during the training, visualization, or evaluation process.
Category / Traversability   RGB color   RGB code      Grayscale code
Background*                 Black       (0,0,0)       0
Wheeled                     Green       (0,255,0)     1
Belted                      Yellow      (255,255,0)   2
Legged                      Orange      (255,128,0)   3
Non-traversable             Red         (255,0,0)     4
Table 3.5: Vale v2.0 training, validation and test set partitioning
Set          Frames   Percentage of data set   Origin videos
Vale v2.0    600      100%                     3, 4, 5, 6, 7, 9
Training     353      58.83%                   3, 4, 5, 6, 7
Validation   152      25.33%                   3, 4, 5, 6, 7
Test         95       15.83%                   9
Partitioning Strategy
To guarantee that the model never sees or comes in contact with the test set before the final evaluation, video 9 has been set aside exclusively for the test set. Video 9 contributes 95 unique frames to the data set (600 frames in total); see Table 3.3 for an overview of the frame sources in Vale v2.0. The division between the training and validation sets has been set to 70% and 30% of the remaining 505 frames. Not training on frames that are available in the evaluation phase is equally important in the validation phase, as validation also evaluates the trained model. Additionally, since Vale v2.0 is based on frames extracted from videos, where neighbouring frames have almost the same content, a partitioning strategy is required that avoids validating on frames that are nearly identical to those the model has been trained on. Otherwise, with a strategy of randomly picking frames, the training and validation sets could end up consisting of every other frame from the same video, and the trained model would be evaluated on almost the same frames it was trained on.

Since video 9 forms the test set, five videos are left for the training and validation sets. As mentioned, video frames close to each other contain almost the same content, which suggests partitioning on the video level rather than on the frame level. However, this is not optimal either, as the videos cover different terrain, different lighting conditions (e.g. bright sections caused by sun rays, darker sections caused by shadows cast from objects), and different varieties of the same category (i.e. stones in one video and stairs in another, both in the legged traversability category). This variation also exists within each video, which spans different areas when comparing its start to its end. Because of these factors, a strategy on the frame level of each video, a 35%-30%-35% partitioning, is used: the first and last 35% of frames from each video are reserved for the training set, and the middle 30% is set aside as the validation set. This strategy is illustrated in Figure 3.8, and Table 3.6 presents a thorough overview of the partitioning.
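A sketch of this frame-level split is given below. The thesis text does not state how the fractional frame counts in Table 3.6 were rounded; this sketch simply rounds the two cut points, which reproduces the counts for videos 4 and 7 exactly but can differ by one frame per section for the other videos.

def partition_video(frames: list[str]) -> tuple[list[str], list[str]]:
    """Split one video's time-ordered frames 35%-30%-35%: the first and
    last sections go to training, the middle section to validation."""
    n = len(frames)
    first_cut = round(n * 0.35)   # end of section #1
    second_cut = round(n * 0.65)  # end of section #2
    train = frames[:first_cut] + frames[second_cut:]
    val = frames[first_cut:second_cut]
    return train, val

# e.g. video 4 with 88 frames: 31 + 31 training frames and
# 26 validation frames, matching the rounded counts in Table 3.6.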
Fig. 3.8: Chart illustration of Table 3.6. Distribution of frames relative to video sections.
Table 3.6: The training set contains frames from 0%-35% and from 65%-100% (sections #1 and #3) of each video. The validation set contains frames in the interval 35%-65% (section #2) of each video.
Video & frames    35% (#1)      30% (#2)      35% (#3)
Vid. 3 - 19 f     6.65 ~7 f     5.7 ~6 f      6.65 ~6 f
Vid. 4 - 88 f     30.8 ~31 f    26.4 ~26 f    30.8 ~31 f
Vid. 5 - 147 f    51.45 ~51 f   44.1 ~44 f    51.45 ~52 f
Vid. 6 - 139 f    48.65 ~49 f   41.7 ~42 f    48.65 ~48 f
Vid. 7 - 112 f    39.2 ~39 f    33.6 ~34 f    39.2 ~39 f
The partitioning of the data set is saved in separate files, train.txt, val.txt, and test.txt, where the filename states the corresponding set. Each .txt file contains the filename of every frame included in the respective set, with the file extension excluded. The files cannot contain anything else; otherwise the conversion code presented in Subsection 3.1.6, TFRecord - Deeplabv3+, or the model itself will crash.
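A sketch of writing one such split file follows. The helper name and the placeholder frame names are illustrative only, and the exact format expected by the TFRecord conversion should be checked against Subsection 3.1.6.

from pathlib import Path

def write_split(name: str, frame_files: list[str], out_dir: str = ".") -> None:
    """Write <name>.txt with one frame filename per line, extension
    stripped, and nothing else."""
    stems = [Path(f).stem for f in frame_files]
    Path(out_dir, f"{name}.txt").write_text("\n".join(stems) + "\n")

# Placeholder frame names; the real lists come from the
# 35%-30%-35% partitioning described above.
write_split("test", ["frame_0001.png", "frame_0002.png"])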