
5. Discussion

5.1. Methodological considerations

Feature selection

The SEM images were not corrected for their number of pixels. This correction was deemed unnecessary because the images within each dataset had the same number of pixels. However, the pixel dimensions did differ across magnifications and colour categories. Only the SEM images of category 1 and 4 UOCs had consistent pixel dimensions across all magnifications. The variation in pixel dimensions between the other categories was small; in fact, the maximum difference between categories 3, 5 and 6 was four pixels. Large differences in dimensions make the interpretation of optimized features between datasets difficult. If, say, some features were selected for one category but not for another, the cause might be a difference in image texture between the two categories, or it might be that the images of the two categories did not have the same number of pixels in each dimension.

Mainly LBP features were selected, and hence they seemed to contain the most meaningful information for discrimination by the LDA classifier. Other feature groups may not have been selected because the classifier itself could not exploit them well, or because the parameters for their feature extraction algorithms were poorly chosen.
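As an illustration of the kind of pipeline involved, the following minimal Python sketch (with placeholder images and hypothetical LBP parameters, not the settings used in this work) extracts uniform LBP histograms and fits an LDA classifier:

```python
# Hypothetical sketch, not the exact settings used in this work:
# uniform LBP histograms as texture features, classified with LDA.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lbp_histogram(image, n_points=24, radius=3):
    """Normalised histogram of uniform LBP codes for one greyscale image."""
    lbp = local_binary_pattern(image, n_points, radius, method="uniform")
    n_bins = n_points + 2                    # uniform codes plus one "other" bin
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# Placeholder data: 20 random "images" from two classes
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(20)]
y = np.repeat([0, 1], 10)

X = np.array([lbp_histogram(img) for img in images])
clf = LinearDiscriminantAnalysis().fit(X, y)
print("training accuracy:", clf.score(X, y))
```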

The final features were chosen by visually examining the training and validation curves for each of the three runs on each dataset. The decision was based on a combination of the number of features and the trade-off between training and validation accuracy (the bias-variance trade-off). This decision approach affected model performance, as classification accuracy was observed to depend on the number of features (when relatively few features were used). The average number of selected features differed by about a handful across the datasets. A possibility would have been to devise an approach for automatically selecting the number of features from the validation curves, avoiding human intervention and speeding up the analyses.
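One possible automation, sketched below under the assumption that the mean and standard deviation of validation accuracy are available for each candidate feature count, is the one-standard-error heuristic: pick the smallest feature count whose mean validation accuracy lies within one SD of the best observed value.

```python
# Hypothetical sketch of an automatic choice from a validation curve:
# smallest feature count within one SD of the best mean validation accuracy.
import numpy as np

def pick_n_features(n_features, val_mean, val_std):
    """n_features, val_mean, val_std: 1-D arrays over candidate feature counts."""
    best = np.argmax(val_mean)
    threshold = val_mean[best] - val_std[best]
    eligible = np.flatnonzero(val_mean >= threshold)
    return int(n_features[eligible[0]])      # smallest eligible count

# Example with made-up validation-curve values
n = np.arange(1, 11)
mean = np.array([0.60, 0.68, 0.74, 0.78, 0.80, 0.81, 0.81, 0.80, 0.80, 0.79])
std = np.full_like(mean, 0.03)
print(pick_n_features(n, mean, std))         # -> 4
```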

The metric chosen to measure model performance throughout this work was accuracy, a common metric for classification tasks (Raschka & Mirjalili, Python Machine Learning, 2019). However, other performance metrics could have been used. The Receiver Operating Characteristic Area Under the Curve (ROC AUC), for example, assesses model performance based on the false positive rate and the true positive rate (i.e. recall) (Raschka & Mirjalili, Python Machine Learning, 2019).
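A brief sketch of the difference between the two metrics, using made-up binary predictions (scikit-learn assumed available): accuracy compares hard labels, whereas ROC AUC is computed from predicted probabilities or decision scores.

```python
# Made-up binary example: accuracy uses hard labels, ROC AUC uses probabilities.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.4, 0.9, 0.6, 0.3, 0.1]           # predicted P(class 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]   # hard labels at 0.5 threshold

print("accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC :", roc_auc_score(y_true, y_prob))
```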

Sample limitation

There was a limited amount of data available for each UOC category. This affects the models' ability to generalise from known data to new, unknown data, and it also makes it challenging to estimate the final model performance on such data. A fixed amount of UOC was available for each origin (class), and the powder was distributed into at most three sample holders per class. Even though it was assumed that sample holders containing the same class of UOC could be treated as independent samples, the number of available independent samples was still limited, since the images acquired from each sample holder were considered dependent.

For each sample holder, five non-overlapping images were acquired at each of the different magnifications (for the SEM images), as explained in chapter 3.2.2.1. Images originating from the same sample holder were treated as dependent and labelled with a group id. The group identification ensured that the dependency of the images was accounted for during dataset splitting, by prohibiting dependent images from being used in both training and validation folds at the same time.
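A minimal sketch of this kind of group-aware splitting, using scikit-learn's GroupKFold with placeholder arrays (six hypothetical sample holders with five images each), is shown below; images sharing a group id never end up on both sides of the same split.

```python
# Minimal sketch with placeholder arrays: six hypothetical sample holders
# (two per class), five images each; GroupKFold keeps each holder's images
# on one side of every split.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((30, 8))                       # 30 images, placeholder features
y = np.repeat([0, 0, 1, 1, 2, 2], 5)        # class label per image
groups = np.repeat(np.arange(6), 5)         # sample-holder (group) id per image

for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups):
    # no group id appears in both the training and the validation fold
    assert not set(groups[train_idx]) & set(groups[val_idx])
```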

Therefore, assuming that all sample holders were independent of each other, all results are unbiased with regard to the images used, except for the hold-out test results. For some randomly selected sample holders, one image was preserved as a hold-out test sample, while the other images from the same sample holder were used to develop and train the model. This strategy can be debated, but when comparing the hold-out test results with the corresponding unbiased validation curves, the hold-out test results were within the standard deviation of the validation results.

For the hyperspectral images, one image was acquired for each sample holder; a square area within the image was cropped and then divided into four sub-images. These four sub-images were treated as one group per sample holder and, just as for the SEM images, the dataset splitting accounted for the dependent image groups. Here, too, the hold-out test results were within the SD of the unbiased validation performance.
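The quadrant split can be sketched as follows; the image dimensions and crop size are placeholders, not the actual acquisition settings.

```python
# Sketch of the quadrant split described above (sizes are assumptions):
# a square region is cropped from the hyperspectral cube and cut into four
# equally sized sub-images that share the same group id.
import numpy as np

def split_into_quadrants(cube, size=200):
    """cube: hyperspectral image (rows, cols, bands); returns four sub-images."""
    rows, cols = cube.shape[:2]
    r0, c0 = (rows - size) // 2, (cols - size) // 2   # centred square crop
    crop = cube[r0:r0 + size, c0:c0 + size]
    h = size // 2
    return [crop[:h, :h], crop[:h, h:], crop[h:, :h], crop[h:, h:]]

sub_images = split_into_quadrants(np.zeros((512, 640, 10)), size=200)
print([s.shape for s in sub_images])   # four (100, 100, 10) sub-images
```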

A random set of unique test images was held out three times for each dataset to limit the influence of any single dataset split on the measured performance; this gives a more robust estimate of the final model performance.
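The repeated image-level hold-out can be sketched as follows, with placeholder group ids and three random seeds; the actual selection procedure may have differed in detail.

```python
# Sketch of a repeated image-level hold-out (placeholder data): for a few
# randomly chosen sample holders, one image each is set aside as a test
# sample; the procedure is repeated with three different seeds.
import numpy as np

groups = np.repeat(np.arange(6), 5)        # sample-holder id for 30 images

for seed in (0, 1, 2):                     # three repetitions
    rng = np.random.default_rng(seed)
    chosen_holders = rng.choice(np.unique(groups), size=3, replace=False)
    test_idx = [rng.choice(np.flatnonzero(groups == g)) for g in chosen_holders]
    dev_idx = np.setdiff1d(np.arange(len(groups)), test_idx)
    print(seed, sorted(int(i) for i in test_idx))   # held-out image indices
```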

The standard deviation (SD) values reported in the heatmaps, the validation curves and the variation of assigned prediction probabilities were population standard deviations. It can be argued that the sample standard deviation should have been used instead. However, these values were only included to illustrate the spread of the values; no decisions were based on any SD value.
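For reference, the two variants differ only in the divisor (n versus n − 1), as the following NumPy snippet with made-up accuracy values illustrates.

```python
# NumPy's ddof argument switches between the population SD (ddof=0, used in
# this work) and the sample SD (ddof=1); values below are made up.
import numpy as np

values = np.array([0.81, 0.84, 0.79, 0.86, 0.82])
print("population SD:", values.std(ddof=0))   # divides by n
print("sample SD    :", values.std(ddof=1))   # divides by n - 1
```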

Image acquisition technique and consistency

The image focus varies between the SEM images, and it is uncertain how the level of sharpness, or lack thereof, affects the resulting model performance. Because defocusing an image blurs the pixels, informative small-scale texture may be lost. The same applies to the hyperspectral images. For some sample holders, several images were of poor quality. The best image was visually determined and chosen for each sample; hence some images were discarded.
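Had an automated alternative to visual inspection been desired, one common option (not used in this work) is to rank the candidate images of a sample holder by the variance of their Laplacian, a simple sharpness proxy, and keep the sharpest one; the sketch below uses synthetic placeholder images.

```python
# Hedged sketch of an automated sharpness ranking (not the procedure used
# here): variance of the Laplacian as a focus proxy; placeholder images.
import numpy as np
from scipy.ndimage import laplace

def sharpness(image):
    """Variance of the Laplacian; higher values indicate sharper focus."""
    return laplace(image.astype(float)).var()

def pick_sharpest(images):
    return max(images, key=sharpness)

# Placeholder candidates: a flat image vs. one with added fine-scale detail
rng = np.random.default_rng(0)
flat = np.ones((64, 64))
detailed = flat + 0.1 * rng.standard_normal((64, 64))
best = pick_sharpest([flat, detailed])
print(best is detailed)   # True: the image with more fine-scale detail wins
```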
