
6.6 Evaluation

6.6.1 Partitioning the dataset

We have indicated in section 6.3 and above that both the number of learning patterns and the number of validation patterns are essential for obtaining the best possible classifier and the most reliable performance estimate, respectively. We have also argued that it should be a ground rule in image analysis that the learning and validation datasets share no patterns. Because we in practice typically have only a limited number of patterns available, following this rule results in a trade-off between obtaining a decent classifier and obtaining a reliable estimate of its performance. In this section, we will discuss how to reasonably partition the entire dataset with respect to this trade-off.

It has been shown [56, p.256] that, under the assumption of two classes with normal conditional pdfs, the increase in expected PMC due to a finite number of learning patterns is:

\[
\Delta^{\alpha}_{n_L} := E\!\left(P^{\alpha}_{n_L}\right) - P^{\alpha}
= \frac{f_Z(\delta/2)}{n_L\,\delta}\sum_{i\in C_\alpha}\theta_i
\tag{6.57}
\]

Table 6.1: Contribution caused by the estimation of particular groups of independent parameters. [56, p.257]

 i | θ_i                                | # param.  | Independent parameters to be estimated
---+------------------------------------+-----------+---------------------------------------
 1 | 1                                  | 1         | The a priori probabilities of the two classes
 2 | δ²/2 + d                           | 2         | The means, µ_1 and µ_2
 5 | (δ⁴/8 + dδ²/4) / (1 − d/n_L)       | d(d+1)/2  | The common covariance matrix, Σ
 6 | (δ⁴/8 + d(d+δ²)/4) / (1 − 2d/n_L)  | d(d+1)    | The covariance matrices, Σ_1 and Σ_2

where α denotes the classification method, E(P^α_{n_L}) is the expected PMC when n_L learning patterns are used, P^α is the asymptotic PMC of the method, δ is the Mahalanobis distance between the classes (see equation (3.4)), Z ∼ N(0,1) is a standard normally distributed random variable with pdf f_Z, C_α is a set of indices specifying the groups of independent parameters that must be estimated, and θ_i is the contribution to this difference caused by the estimation of a particular group of independent parameters; see table 6.1. This formula is interesting in several ways. In particular, it clearly shows the dependency on the number of learning patterns, as well as the effect of increased classifier complexity and of the combined discrimination value of the features. We note that the formula depends neither on the values of the a priori probabilities nor on the relative discrimination value of each feature.
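To make the dependence on n_L, d and δ in equation (6.57) concrete, the following minimal Python sketch evaluates the increase for the case of known covariance matrices, i.e. with C_α containing only the first two groups in table 6.1. The function name and the example values are our own and are only meant for illustration.

```python
from math import exp, pi, sqrt

def expected_pmc_increase(n_L, d, delta):
    """Equation (6.57) for known covariance matrices: only the a priori
    probabilities (theta_1 = 1) and the two class means
    (theta_2 = delta**2/2 + d) are estimated; see table 6.1."""
    f_Z = exp(-(delta / 2) ** 2 / 2) / sqrt(2 * pi)  # standard normal pdf at delta/2
    return f_Z / (n_L * delta) * (1 + delta ** 2 / 2 + d)

# Illustrative values only: 4 features, Mahalanobis distance 1, 50 learning patterns.
print(expected_pmc_increase(n_L=50, d=4, delta=1.0))
```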

In equation (6.55), we reproduced a formula for the variance of any estimator of a PMC. This formula can thus provide us with an estimate of the variance of the expected PMC estimator, either by performing classifications iteratively until the optimal partitioning ratio converges, i.e. by inserting each classifier's estimated expected PMC into equation (6.56) and finding an updated estimate of the optimal partitioning ratio, or by simply assuming the true value of the expected PMC.
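As can be read off the second term of equation (6.58) below, this variance takes the binomial form p(1 − p)/n_V for a hold-out estimator based on n_V validation patterns, where p plays the role of the expected PMC. A minimal sketch, with a function name of our own choosing:

```python
def pmc_estimator_variance(p, n_V):
    """Binomial variance of a hold-out PMC estimator when the expected
    PMC is p and n_V validation patterns are used; cf. the second term
    of equation (6.58)."""
    return p * (1 - p) / n_V

# Illustrative values only: expected PMC 0.3, 51 validation patterns.
print(pmc_estimator_variance(0.3, 51))
```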

By combining a variance estimate of the expected PMC estimator with the increase in expected PMC caused by a limited number of learning patterns, we obtain a criterion function that we can use to estimate a reasonable partitioning of the dataset. In accordance with Nielsen et al. [45, p.136], we will assume that the cost associated with increased expected PMC is equal to the cost associated with increased variance of the estimator. We then obtain the following criterion function:

\[
J(r) := \Delta^{\alpha}_{n_L} + \operatorname{Var}\!\big(\widehat{P}_{\eta}\big)
= \frac{f_Z(\delta/2)}{\delta n r}\sum_{i\in C_\alpha}\theta_i
+ \frac{E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big(1-E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big)}{n(1-r)}
\tag{6.58}
\]

which we wish to minimise. The ratio of the number of learning patterns to the total number of patterns, r := n_L/n, is here chosen as the free variable, but either n_L or n_V could have been used instead.
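Before minimising analytically, it may be helpful to see that equation (6.58) can also be minimised numerically. The sketch below does so by a simple grid search over r for the known-covariance case treated next (θ_1 and θ_2 from table 6.1), with the expected PMC treated as a fixed, assumed value; the function names and example values are ours.

```python
from math import exp, pi, sqrt

def criterion(r, n, d, delta, expected_pmc):
    """Equation (6.58) for known covariance matrices (theta_1 + theta_2
    from table 6.1), with the expected PMC taken as a given constant."""
    f_Z = exp(-(delta / 2) ** 2 / 2) / sqrt(2 * pi)  # standard normal pdf at delta/2
    bias_term = f_Z / (delta * n * r) * (1 + delta ** 2 / 2 + d)
    variance_term = expected_pmc * (1 - expected_pmc) / (n * (1 - r))
    return bias_term + variance_term

def optimal_ratio(n, d, delta, expected_pmc, steps=999):
    """Grid search for the r in (0, 1) that minimises J(r)."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(candidates, key=lambda r: criterion(r, n, d, delta, expected_pmc))

# Illustrative values only: 102 patterns, 4 features, delta = 1, expected PMC 0.3.
print(optimal_ratio(n=102, d=4, delta=1.0, expected_pmc=0.3))
```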

Let us analytically minimise the criterion function in equation (6.58) for the simple case of normally distributed conditional pdfs with known covariance matrices. This is similar to the assumption of independent features with equal variances (case 1), but such a distribution class is not assumed here because the article presenting the terms in table 6.1 unfortunately did not include the term associated with a single common variance [56, p.257]; however, the resulting partitioning with known covariance matrices can be expected to be representative for the case of a single estimated variance too. In any case, we will under this assumption have three independent parameters to estimate: one from the a priori probabilities and two from the class means. Inserting the corresponding terms from table 6.1 into equation (6.58), we obtain the following criterion function:

\[
J(r) = \frac{f_Z(\delta/2)}{\delta n r}\left(1+\frac{\delta^{2}}{2}+d\right)
+ \frac{E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big(1-E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big)}{n(1-r)}
\tag{6.59}
\]

By differentiating and setting the derivative equal to zero, we obtain:

\[
J'(r) = -\frac{f_Z(\delta/2)}{\delta n r^{2}}\left(1+\frac{\delta^{2}}{2}+d\right)
- \frac{E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big(1-E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big)}{n(1-r)^{2}}\,(-1) = 0
\]
\[
\Downarrow
\]
\[
\frac{E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big(1-E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big)}{(1-r)^{2}}
= \frac{f_Z(\delta/2)}{\delta r^{2}}\left(1+\frac{\delta^{2}}{2}+d\right)
\]
\[
\Downarrow
\]
\[
E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big(1-E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big)\,\delta\,r^{2}
= f_Z(\delta/2)\left(1+\frac{\delta^{2}}{2}+d\right)(1-r)^{2}
\]
\[
\Updownarrow \quad \text{$r\in[0,1]$, so both square root terms are non-negative}
\]
\[
r\sqrt{E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big(1-E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big)\,\delta}
= (1-r)\sqrt{f_Z(\delta/2)\left(1+\frac{\delta^{2}}{2}+d\right)}
\]
\[
\Updownarrow \quad \text{$f_Z(\delta/2)>0$ when $\delta$ is finite}
\]
\[
r = \frac{\sqrt{f_Z(\delta/2)\left(1+\frac{\delta^{2}}{2}+d\right)}}
{\sqrt{f_Z(\delta/2)\left(1+\frac{\delta^{2}}{2}+d\right)}
+ \sqrt{E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big(1-E\!\big(\widehat{P}^{\alpha}_{nr,\eta}\big)\Big)\,\delta}}
\tag{6.60}
\]

We can make several interesting comments about this result. Firstly, we note that the ratio is independent of the particular number of patterns (when ignoring its indirect effect on the expected PMC estimator). Secondly, we note that as the number of features increases, the optimal partitioning (with respect to the used criterion function) is eventually to use the entire dataset as the learning dataset. As we can expect the variance of the expected PMC estimator to be in general extremely high when only a microscopic proportion of the patterns is used for evaluation, this indicates the general negative impact of classifier complexity on the expected PMC, and that this impact is relevant even for the simplest classification methods.
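As a check on equation (6.60), the closed-form ratio can be evaluated directly; the following sketch does so for example values of our own choosing, and its result should agree with a numerical minimisation of equation (6.59) up to the resolution of such a search. Note that n does not appear, in line with the first observation above.

```python
from math import exp, pi, sqrt

def optimal_ratio_closed_form(d, delta, expected_pmc):
    """Equation (6.60): the learning ratio r that minimises (6.59),
    for d features, Mahalanobis distance delta and an assumed expected PMC."""
    f_Z = exp(-(delta / 2) ** 2 / 2) / sqrt(2 * pi)  # standard normal pdf at delta/2
    a = sqrt(f_Z * (1 + delta ** 2 / 2 + d))
    b = sqrt(expected_pmc * (1 - expected_pmc) * delta)
    return a / (a + b)

# Illustrative values only; the total number of patterns n is not needed.
print(optimal_ratio_closed_form(d=4, delta=1.0, expected_pmc=0.3))
```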

Figure 6.7: The relation between the Mahalanobis distance between the classes and the asymptotic PMC of Bayes' classifier for the case of two equally probable classes with normal conditional pdfs with covariance matrices equal to I_d/(2π). In this case, the PMC is independent of the number of features.

Equation (6.60) also reveals that the Mahalanobis distance between the classes is highly relevant for the optimal partitioning (with respect to the used criterion function); the optimal partitioning approaches r = 0, i.e. to use all patterns for evaluation, as the Mahalanobis distance between the classes increases.

However, as the Mahalanobis distance between the classes is a measure of the combined discrimination value of all the features, this is only natural: it is relatively easy to construct a good decision rule when the classes are well separated, so we should instead be more concerned with obtaining a reliable estimate of the expected PMC. Figure 6.7 shows the relation between the Mahalanobis distance between the classes and the asymptotic PMC of Bayes' classifier for the case of two equally probable classes with normal conditional pdfs (as equation (6.57) also assumes) with covariance matrices equal to I_d/(2π).
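For two equally probable classes with normal conditional pdfs and a common covariance matrix, the relation plotted in figure 6.7 is simply Φ(−δ/2), where Φ is the standard normal cdf; the short sketch below tabulates it for a few values of δ (the chosen values are ours).

```python
from math import erf, sqrt

def asymptotic_bayes_pmc(delta):
    """Asymptotic PMC of Bayes' classifier for two equally probable classes
    with normal conditional pdfs and a common covariance matrix:
    Phi(-delta/2), where delta is the Mahalanobis distance."""
    return 0.5 * (1 + erf(-delta / (2 * sqrt(2))))

# Tabulate the relation shown in figure 6.7 for a few illustrative distances.
for delta in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"delta = {delta:4.1f}  ->  asymptotic PMC = {asymptotic_bayes_pmc(delta):.2e}")
```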

The result in equation (6.60) and the results of applying equation (6.58) when not assuming known covariance matrices are illustrated in figures 6.8 and 6.9. These illustrations reinforce the already stated asymptotic behaviours of the optimal partitioning when increasing the number of features and the Mahalanobis distance between the classes. They do, however, also provide more information.

In light of the relation in figure 6.7, we see from figure 6.8 that the classes must be very well separated before the optimal partitioning decreases below 0.5. We also note the importance of the complexity of the classification methods, even in this case with only four features; higher complexity makes it more important to have many learning patterns because there are relatively many independent parameters that must be estimated. A lower expected PMC has the same effect, but this must be viewed in light of a constant Mahalanobis distance between the classes; it thus indicates that the difference between the expected PMC and the asymptotic PMC becomes more significant as the expected PMC decreases, which results in an increased need for learning patterns. We also note that if we increase the number of patterns, most of the new patterns should be assigned as validation patterns, and the optimal partitioning ratio thus decreases.

Finally, it is interesting to see that the optimal partitioning ratio converges with respect to the Mahalanobis distance between the classes, though this does not happen before the asymptotic PMC of Bayes' classifier is nearly 10⁻¹⁰; at this point, it is far more important to reliably estimate the expected PMC than to use many learning patterns to reliably estimate the independent parameters.


Figure 6.8: Results of minimising the criterion function in equation (6.58) for different values of the Mahalanobis distance between the classes when there are four features. All conditional pdfs are assumed normally distributed with: top row) known covariance matrices, middle row) a common covariance matrix, and bottom row) arbitrary covariance matrices. In the left column, the expected PMC is 0.3 and the total number of patterns is 10 (magenta curve), 30 (cyan curve), 60 (black curve), 102 (red curve), 134 (green curve) and 1000000 (blue curve). The values 102 and 134 are included because these are the numbers of patients in our dataset when excluding and including the tetraploid and polyploid cases, respectively. In the right column, the total number of patterns is 102 and the expected PMC is 0.001 (cyan curve), 0.01 (black curve), 0.1 (red curve), 0.3 (green curve) and 0.5 (blue curve).


Figure 6.9: Results of minimising the criterion function in equation (6.58) for different numbers of features when the Mahalanobis distance between the classes is one. All conditional pdfs are assumed normally distributed with: top row) known covariance matrices, middle row) a common covariance matrix, and bottom row) arbitrary covariance matrices. In the left column, the expected PMC is 0.3 and the total number of patterns is 10 (magenta curve), 30 (cyan curve), 60 (black curve), 102 (red curve), 134 (green curve) and 1000000 (blue curve). In the right column, the total number of patterns is 102 and the expected PMC is 0.001 (cyan curve), 0.01 (black curve), 0.1 (red curve), 0.3 (green curve) and 0.5 (blue curve).


Figure 6.9 reinforces the comments made on the expected PMC and the number of patterns. It also better illustrates the problem of overfitting, with respect to both the number of features and a high complexity of the classification method, as this is indicated by the great need for many learning patterns. As a Mahalanobis distance between the classes of one, which these plots assume, is representative of many of our feature sets, we note that the optimal partitioning in our case will be to assign most patterns as learning patterns.
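To connect this overfitting comment to equation (6.60), the sketch below sweeps the number of features for a fixed Mahalanobis distance of one and an assumed expected PMC of 0.3, using the known-covariance closed form only (corresponding to the top rows of figures 6.8 and 6.9); the helper function and the chosen values are ours.

```python
from math import exp, pi, sqrt

def optimal_ratio_closed_form(d, delta, expected_pmc):
    """Equation (6.60), as in the earlier sketch."""
    f_Z = exp(-(delta / 2) ** 2 / 2) / sqrt(2 * pi)
    a = sqrt(f_Z * (1 + delta ** 2 / 2 + d))
    b = sqrt(expected_pmc * (1 - expected_pmc) * delta)
    return a / (a + b)

# Sweep the number of features for delta = 1 and an assumed expected PMC of 0.3.
for d in (1, 2, 4, 8, 16, 32):
    print(f"d = {d:2d}  ->  optimal learning ratio r = "
          f"{optimal_ratio_closed_form(d, delta=1.0, expected_pmc=0.3):.2f}")
```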

Our illustrations only partially correspond to the relation stated in [45, p.136] that a 50/50 split is optimal if the product of the number of features and the Mahalanobis distance between the classes is approximately 30. While this is a very good approximation in view of figure 6.8, it is far off in view of figure 6.9.