
CCReq from the best CSDEMsumBright-feature (which, as mentioned, attained an expected CCReq of 68.4 %) and also from the best CSDEMsum-features (see table 7.15). The best expected CCReq is, however, not significantly better than with the geometrical features alone (see table 7.17) or in combination with the GLEM4D-features (see table 7.18); thus the Eccentricity-feature may be the feature contributing most to the good expected CCReq of 70.8 % in table 7.19.

The best expected CCR when using all patients is obtained with the negative CSDEMsumBright-feature when using the algorithm based on the watershed transform in combination with the Eccentricity-feature, which gives an expected CCR of 72.6 % with the Parzen window classifier. This is slightly better than the best expected CCR of all features based on the sum histogram of CSDEMs, which as mentioned was 72.2 %, and may even be significantly better than the best combination of the geometrical features and the GLEM4D-features, which as mentioned was 71.3 %. Because the Eccentricity-feature alone only attains an expected CCR of 69.0 % when using all patients (see table 7.17), the CSDEMsumBright-feature may be the feature contributing most to this good performance estimate.

In total, we can claim that the combination of the negative CSDEMsumBright-feature when using the algorithm based on the watershed transform and the Eccentricity-feature is the generally best performing feature set when using all patients, attaining an expected CCReq of nearly 71 % when using the NMSC and an expected CCR of nearly 73 % when using the Parzen window classifier. This is slightly or significantly better than the best performance of all other evaluated feature sets.

In particular, it is slightly better than the performance of the combination of all cell features and the best NO-features with respect to the expected CCReq, and significantly better with respect to the expected CCR.

If excluding the patients with tetraploid or polyploid histograms, we have noted that the best performance of the CSDEMsum-features is obtained when they are used alone. The performance of this feature set is still good, attaining an expected CCR of 83.9 %, which is slightly or significantly better than the best expected CCR of all other evaluated feature sets. In particular, this may be significantly better than the combination of the cell features and the best NO-features, which was 82.8 %, and is significantly better than the best combination of the geometrical features and the GLEM4D-features, which was 82.3 %. However, the best expected CCReq of the CSDEMsum-features is only 76.9 %, which is significantly worse than with the combination of the cell features and the best NO-features, and also the best combination of the geometrical features and the GLEM4D-features.

7.6 Classifier complexity and classification method

The last section concluded the investigation of new and improved classifiers, i.e. combinations of a feature set and a classification method. The rest of the chapter will be devoted to giving a better understanding of some related aspects of the classifiers that have not yet been discussed. We will begin this discussion by considering the classifier complexity and the classification methods.

When we discussed overfitting in section 6.3, we mentioned that the number of features and complexity of the classification method are essential factors for

the classifier complexity. We also found indications supporting a belief that the optimal classifier complexity may be sufficiently prominent to be reasonably estimated for a given number of learning patterns. We will therefore attempt to estimate this optimal complexity for our datasets. It should however be repeated that because the optimal classifier complexity is a trade-off between the decreased performance caused by more estimation and the increased performance caused by the added complexity, the optimal classifier complexity is not completely determined for a given learning dataset. In particular, the true distribution of the conditional pdfs and the effectiveness of the features are relevant.

The previous sections reveal that the best classification method is typically different with respect to the CCReq and to the CCR. Before we attempt to find the optimal classifier complexity, we must therefore agree upon which measure to use in order to detect the peak in classification performance. As previously mentioned, we are equally interested in classifying patients with either prognosis, thus the CCReq seems the most interesting. When studying the relative performance of different classification methods, it is however more relevant to make this comparison with respect to the performance quantity the methods attempt to optimise, if this quantity is equal for the compared methods. For all parametric classification methods and the Parzen window classifier, this quantity is the CCReq when we use an evened bootstrap method, but would have been the CCR if not. This is because they are all based on the Bayes’ classifier, which chooses the class that corresponds to the maximum a posteriori probability, and estimates the a priori probabilities using the corresponding class proportions in the learning dataset, thus weighting the two classes equally when we use an evened bootstrap method. The kNN classifier also attempts to optimise the CCReq when we use an evened bootstrap method. This is because this classifier indirectly weights each class according to its frequency in the learning dataset.

We will therefore in the following base the comparison on the CCReq, both because this is the most interesting quantity and because it is this quantity all used classification methods attempt to optimise.
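The prior-weighting argument above can be sketched numerically. The following is a hypothetical minimal illustration (not the thesis code) using a 1-D Gaussian stand-in for the actual class-conditional pdfs: a MAP rule with priors estimated as the class proportions of the learning set, which become 0.5/0.5 under an evened bootstrap.

```python
import numpy as np

def map_classify(x, means, variances, priors):
    """Toy Bayes/MAP rule with 1-D Gaussian class-conditional pdfs:
    choose the class that maximises prior * likelihood."""
    likelihoods = np.array([
        np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
        for m, v in zip(means, variances)
    ])
    return int(np.argmax(np.asarray(priors) * likelihoods))

# The a priori probabilities are estimated as the class proportions of
# the learning dataset. An evened bootstrap draws equally many patterns
# per class (here 28 and 28), so the estimated priors become 0.5/0.5
# and the two classes are weighted equally, favouring the CCReq.
labels_evened = np.array([0] * 28 + [1] * 28)
priors = np.bincount(labels_evened) / labels_evened.size
```

With unequal class counts in the learning set, the same rule would instead weight the majority class more heavily and tend towards optimising the CCR.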

The previous sections show that the best classification methods are typically parametric, more precisely, often the NMSC or the LDC (with respect to the CCReq). The complexity of any parametric classifier is, using our definition in section 6.1, the number of independent parameters in the classification method.

As mentioned in section 6.2.1, this number is cd + 1, 0.5d(2c + d + 1) and 0.5cd(d + 3) for the NMSC, LDC and QDC, respectively. For our case of two classes, this reduces to respectively 2d + 1, 0.5d(d + 5) and d(d + 3). We may attempt to use these formulae to estimate the optimal number of independent parameters.
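These formulae can be checked with a small helper function (a sketch; the counts follow directly from c class mean vectors plus one common variance, one common covariance matrix, or c class covariance matrices, respectively):

```python
def n_params(method, c, d):
    """Number of independent parameters for c classes and d features,
    following the formulae quoted from section 6.2.1."""
    if method == "NMSC":  # c mean vectors (d each) + 1 common variance
        return c * d + 1
    if method == "LDC":   # c mean vectors + one common covariance matrix
        return c * d + d * (d + 1) // 2
    if method == "QDC":   # c mean vectors + c class covariance matrices
        return c * d + c * d * (d + 1) // 2
    raise ValueError(f"unknown method: {method}")

# Two-class examples from the text: the LDC has 25 parameters with the
# five cell features and 42 when the two NO-features are added, while
# the NMSC then has 2 * 7 + 1 = 15 parameters.
```

For c = 2 these reduce to the quoted 2d + 1, 0.5d(d + 5) and d(d + 3); e.g. the LDC on two features has 7 parameters and the QDC has 10, matching the discussion below.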

For all adaptive texture features in the sections 7.2-7.4, the best classification method with respect to the expected CCReq was the LDC, with a single unimportant exception. This indicates that the optimal classifier contains at least 7 independent parameters. As we expect that the ‘allowed’ number of independent parameters is larger than this, it may seem strange that the QDC with its 10 independent parameters does not perform better than the LDC. This may be explained in light of the scatter plots in all previous sections, which indicate that the patterns of good prognosis typically cluster much more than the patterns of bad prognosis. This will make the estimated variances of the QDC classifiers very different: the variance corresponding to the good prognosis class

will be much smaller than the variance corresponding to the bad prognosis class, which in turn will make the decision region of the good prognosis class relatively larger than when assuming a common variance, thus resulting in a higher CCR, but a lower CCReq.

When using all five cell features, we see from section 7.1 that the LDC still performs best with respect to the CCReq. What we do not see is that the difference between this classification method and the NMSC is now much smaller than for the adaptive texture features, which indicates that even the LDC is starting to become too complex a classification method. Indeed, when also including the NO-features, thus increasing the number of independent parameters of the LDC from 25 to 42, the NMSC is the best performing classification method with its 15 parameters.

In total, we see that the simple LDC classification method is generally recommendable for our dataset when using about five features or less. If more features are used, then the NMSC is the appropriate choice. Roughly speaking, about 30 independent parameters may be estimated by the classifier before it becomes overfitted. This approximate value of course depends on our datasets, but also on the features used, in particular their conditional pdfs and effectiveness.

Table 7.20 shows the complete classification results when using the combination of the cell features and the NO-features which attained the best expected CCReq. Notice how the classification performance significantly decreases with

Table 7.20: The classification results of the cell features and the NO-features when using the algorithm based on the watershed transformation without the edge removal step and evaluating on all 134 patients.

              NMSC                        ParzenC
CCReq         70.4 % [58.0 %, 82.2 %]     67.0 % [53.8 %, 78.8 %]
CCR           70.4 % [59.0 %, 78.2 %]     68.7 % [52.6 %, 76.9 %]
Specificity   70.3 % [56.1 %, 80.3 %]     69.5 % [51.5 %, 80.3 %]
Sensitivity   70.4 % [41.7 %, 91.7 %]     64.6 % [41.7 %, 91.7 %]

              LDC                         kNNC
CCReq         67.5 % [56.1 %, 78.0 %]     66.9 % [53.0 %, 79.9 %]
CCR           67.3 % [57.7 %, 75.6 %]     69.5 % [51.3 %, 79.5 %]
Specificity   67.3 % [54.5 %, 78.8 %]     70.7 % [48.5 %, 83.3 %]
Sensitivity   67.8 % [41.7 %, 91.7 %]     63.2 % [33.3 %, 91.7 %]

              QDC                         NNC
CCReq         64.7 % [51.9 %, 76.9 %]     59.4 % [46.2 %, 72.0 %]
CCR           62.9 % [50.0 %, 73.1 %]     58.0 % [47.4 %, 67.9 %]
Specificity   62.1 % [47.0 %, 77.3 %]     57.3 % [45.5 %, 69.7 %]
Sensitivity   67.4 % [41.7 %, 91.7 %]     61.4 % [33.3 %, 91.7 %]

Using 28 learning patterns in each prognosis class.


Figure 7.19: The ROC point cloud of the cell features and the NO-features when evaluating on all 134 patients and using the NNC classification method. The NO-features are computed using our watershed segmentation method without the step which removes bright edge objects.

the complexity of the classification method. When the classifier is sufficiently complex, the performance approaches randomness, as indicated by the PIs of the NN classifier and its ROC point cloud in figure 7.19. Notice also that the PIs of the CCR of the nonparametric classification methods are much larger than the corresponding PIs of the parametric methods, even after correcting for the difference in estimated expectation. This also indicates a too large classifier complexity for the nonparametric classification methods, as this uncertainty can be seen as a result of overfitting, either because of a too complex classification method (NNC) or because of the adaptation of a relevant parameter (ParzenC and kNNC, respectively the window width and the number of neighbours).

It may surprise some that the Parzen window classifier and the kNN classifier perform so respectably, at least with respect to the estimated expectation, even in this case where the feature space is so sparse and simple classification methods like the LDC result in overfitting. The reason for this is the adaptive choice of window width and number of neighbours, respectively. As the feature space becomes sparser, the typical estimate of both these quantities increases significantly to allow optimal classification of the learning dataset (using the leave-one-out cross-validation method, which was our choice for optimising these parameters). The increase of these quantities results in simpler decision regions and thus also a lower classifier complexity. Therefore, these nonparametric classifiers can be said to attempt to adapt their complexity according to the optimal complexity, but, of course, this adaptation is generally suboptimal.
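The leave-one-out adaptation of the number of neighbours can be sketched as follows. This is a hypothetical minimal implementation, not the thesis code: for each candidate k, every learning pattern is classified from its k nearest other patterns, and the k with the highest leave-one-out accuracy is kept.

```python
import numpy as np

def loo_choose_k(X, y, candidate_ks):
    """Choose the number of neighbours k by leave-one-out
    cross-validation on the learning set (X: n x d feature matrix,
    y: binary class labels 0/1)."""
    # Pairwise squared Euclidean distances between learning patterns.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)      # leave the pattern itself out
    order = np.argsort(d2, axis=1)    # neighbours, nearest first
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        votes = y[order[:, :k]]       # labels of the k nearest neighbours
        pred = (votes.mean(axis=1) > 0.5).astype(int)  # majority vote
        acc = float((pred == y).mean())
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```

In a sparse feature space many candidate values of k classify the learning set about equally well, so the selected k tends to drift upwards, which is exactly the complexity-reducing behaviour described above.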

Figure 7.20 illustrates that these estimates are indeed typically large when using the same features as in table 7.20, which we have seen is many features


Figure 7.20: Histograms of the frequency of the chosen: left) window width when using ParzenC, right) number of neighbours when using kNNC over the 500 bootstraps when using the cell features and the NO-features and evaluating on all 134 patients. The NO-features are computed using our watershed segmentation method without the step which removes bright edge objects.

with respect to our datasets. From the left histogram we see that the chosen window width is essentially always 1. Because we have standardised the variance of each feature of the learning patterns to 1, this means that the variance of the interpolation function will in each direction be as large as the variance of the features; thus the classification of a pattern is based on essentially all learning patterns (at least when using a normal window function, as we do).
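This effect can be illustrated numerically. The sketch below uses simulated standardised data, and the 5 % cut-off for what counts as a "contributing" pattern is an arbitrary choice for the illustration:

```python
import numpy as np

# With every feature standardised to unit variance, a normal window
# function with window width h = 1 is as wide as the feature
# distribution itself, so nearly all learning patterns receive a
# non-negligible kernel weight when classifying a pattern.
rng = np.random.default_rng(0)
X = rng.standard_normal(56)   # a standardised 1-D learning set (2 * 28)
x0 = 0.5                      # a validation pattern
h = 1.0                       # the window width chosen by the adaptation
weights = np.exp(-((X - x0) ** 2) / (2.0 * h ** 2))
# Fraction of learning patterns whose weight is at least 5 % of the
# largest weight; with h = 1 this is close to the whole learning set.
frac_contributing = float((weights > 0.05 * weights.max()).mean())
```

A much smaller window width (say h = 0.1) would concentrate the weight on a handful of nearby patterns, giving a far more complex, and here overfitted, decision region.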

Similarly, from the right histogram we see that all numbers of neighbours up to over 40 are chosen relatively frequently, with slightly more occurrences from about 25 to 40. As we only have 28 learning patterns in each class, this indicates that relatively many learning patterns are included to determine the class of a validation pattern; thus the classification is obviously very coarse and therefore the classifier complexity is low.

If we instead use only a few features, say only the difference features when using the same segmentation method, which was the CSDEMsum-features that obtained the best expected CCReq, we obtain the classification results in table 7.21. From these results we see that the classifier complexity does not seem to be a problem anymore. The simplest classifier, the NMSC, now performs significantly worse than the more complex classifiers. We note, however, that it is the LDC, and not e.g. the slightly more complex QDC, which performs best. With respect to the CCReq, this is, as mentioned, the typical behaviour and can be seen as a result of the difference in variance between the classes. It is however not typical that the QDC attains a lower CCR, though not significantly, nor is it common that the lower limit of its PI is so low relative to the same limit when using the other parametric classification methods (though this limit is typically some percent lower). We will however not dwell on what causes this for this precise choice of segmentation method (and features).

We note that the Parzen window classifier and the kNN classifier still perform reasonably, but significantly worse than the best classifier. Much of the deficit in the lower limits of the PIs of the CCR of these classifiers, in comparison with the corresponding PIs of the parametric classifiers, is also gone, but the limits

Table 7.21: The classification results of the difference CSDEMsum-features when using the algorithm based on the watershed transformation without the edge removal step and evaluating on all 134 patients.

              NMSC                        ParzenC
CCReq         65.5 % [53.4 %, 77.7 %]     65.7 % [52.3 %, 78.0 %]
CCR           68.1 % [60.3 %, 75.6 %]     68.3 % [55.1 %, 75.6 %]
Specificity   69.3 % [59.1 %, 78.8 %]     69.4 % [51.5 %, 80.3 %]
Sensitivity   61.7 % [33.3 %, 91.7 %]     61.9 % [33.3 %, 83.3 %]

              LDC                         kNNC
CCReq         69.2 % [56.1 %, 79.2 %]     65.2 % [50.4 %, 77.7 %]
CCR           70.6 % [61.5 %, 78.2 %]     66.8 % [51.3 %, 76.9 %]
Specificity   71.2 % [60.6 %, 80.3 %]     67.6 % [47.0 %, 81.8 %]
Sensitivity   67.2 % [41.7 %, 91.7 %]     62.7 % [33.3 %, 91.7 %]

              QDC                         NNC
CCReq         67.2 % [53.8 %, 79.2 %]     60.6 % [46.6 %, 74.2 %]
CCR           70.1 % [52.6 %, 78.2 %]     59.7 % [47.4 %, 69.2 %]
Specificity   71.4 % [50.0 %, 81.8 %]     59.4 % [45.5 %, 71.2 %]
Sensitivity   63.0 % [33.3 %, 83.3 %]     61.9 % [33.3 %, 91.7 %]

Using 28 learning patterns in each prognosis class.

still seem to be significantly smaller than the corresponding limits when using the NMSC and the LDC. The likely reason is still overfitting because of the adaptation of the relevant parameter, but the improved relation may be seen as a reduced risk of overfitting. This suspicion is reinforced by the histograms of the estimated parameters for these classifiers when using the same features as in table 7.21, see figure 7.21. As expected, the typically chosen parameter results in the use of far fewer learning patterns than was the case when using the seven features, see figure 7.20. However, the suspicion was also correct, as there is still a significant proportion of unnaturally small choices, see for instance the peak at k = 1 in the right histogram, which are estimates that are likely too small to result in classifiers that generalise well.

In conclusion, if the CCReq is the most interesting quantity, then the LDC is the recommended classification method for our dataset when using five features or less; otherwise the NMSC is the recommended choice. The Parzen window classifier and the kNN classifier perform reasonably, also, or maybe even especially, when using many features, but both methods perform significantly worse than the best parametric method, at least with respect to the CCReq and for few features. With respect to the CCR, the QDC is the best classification method if the number of features is low; otherwise the two mentioned nonparametric methods perform as well as or maybe even better than their competitor, the NMSC. The NN classifier always performs badly; it is simply too complex to be meaningful for our dataset. This may be seen in light of the challenges with our