
The following list contains six choices for the property array. For each of these choices, we will always evaluate all four adaptive texture features in the set described in section 3.2.3. Typically, only the best of the four classification results will be provided, along with a specification of which adaptive texture feature this result corresponds to. All these features also include a scaling of the cell images to fewer grey levels and the computation of an entropy. As mentioned, we will in this study always use linear scaling and the Shannon entropy with binary base.
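Since linear scaling and the binary-base Shannon entropy are common to all six choices below, a minimal sketch of these two shared steps may be helpful. It assumes 8-bit input images; the function names are illustrative and not taken from the thesis.

```python
import numpy as np

def linear_scale(image, new_levels, old_levels=256):
    """Linearly rescale integer grey levels from [0, old_levels) to [0, new_levels)."""
    return (image.astype(np.int64) * new_levels) // old_levels

def shannon_entropy(values):
    """Shannon entropy with binary base (in bits) of the empirical grey level distribution."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Example: scale a (stand-in) cell image to 64 grey levels and compute its entropy.
rng = np.random.default_rng(0)
cell_image = rng.integers(0, 256, size=(100, 100))
scaled = linear_scale(cell_image, new_levels=64)
print(shannon_entropy(scaled.ravel()))
```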

The six choices for the property arrays, which result in 24 adaptive texture features, are:

• GLEM: The average of the GLEM computed from the cell images of a specific patient (see definition in section 3.2.5). The number of grey levels is set to 64, the window size is set to 9 and the number of quantification levels per integer entropy value is set to 10. The cell images are grouped according to the cell area and only the cell area groups [2000, 2999], [3000, 3999] and [4000, 4999] are used, which is in correspondence with the findings in [48, p.94]. All these parameter choices¹ are the same as in the study by Nielsen et al. [46]. It is worth mentioning that the same study also shows that the choice of the number of grey levels and the window size is insignificant for our dataset (evaluated choices were GS = {16, 32, 64, . . . , 1024} and w = {3, 5, 7, . . . , 31}).

• GLEM4D: The average of the 4D-GLEM computed from the cell images of a specific patient (see definition in section 3.2.5). The number of grey levels is set to 64, the window sizes are set to {3, 5, 7, 9, 11, 13}, the cell area groups are set to {[1, 999], [1000, 1999], . . . , [9000, 9999], [10000, ∞)} and the number of quantification levels per integer entropy value is set to 10.

• CSDEMdark and CSDEMbright: The average of the CSDEM of the object size of the dark and bright primitive type, respectively, computed from the cell images of a specific patient (see definition in section 4.1). The number of grey levels is set to 64 and the number of quantification levels per integer entropy value is set to 5 for both the grey level entropy and the spatial entropy. The set containing the CSDEMdark- and the CSDEMbright-feature will be called the CSDEM-features.

• CSDEMsumDark and CSDEMsumBright: The average of the CSDEM sum histogram of the object size of the dark and bright primitive type, respectively, computed from the cell images of a specific patient. The sum histogram, which was proposed by Unser [68], is formed by summing the two coordinates of the matrix, here the two entropy values of the CSDEM. The number of grey levels is set to 64 and the number of quantification levels per integer entropy value is set to 5 for both the grey level entropy and the spatial entropy. The set containing the CSDEMsumDark- and the CSDEMsumBright-feature will be called the CSDEMsum-features.

¹The chosen quantification was selected using a different approach in [46], but resulted in approximately the same quantification.

As the CSDEM- and the CSDEMsum-features depend on a segmentation, we will specify the segmentation used whenever these features are applied.
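All six property arrays are per-patient averages of a matrix computed from each cell image. A minimal sketch of that averaging step is given below, assuming every cell image yields a matrix of a common shape; compute_matrix is a hypothetical stand-in for the GLEM and CSDEM computations defined in sections 3.2.5 and 4.1.

```python
import numpy as np

def patient_average(cell_images, compute_matrix):
    """Average a per-cell-image matrix (e.g. a GLEM or CSDEM) over all cell
    images of one patient. All matrices must share a common shape."""
    matrices = [compute_matrix(img) for img in cell_images]
    return np.mean(matrices, axis=0)
```

The four adaptive texture features of section 3.2.3 would then be evaluated on this averaged property array.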


Chapter 6

Classification and evaluation

In image analysis, the learning goal is either supervised, unsupervised or reinforcement learning [13, pp.16–17]. In supervised learning, the membership of each pattern is a priori known: each pattern belongs to one of a specific set of categories called classes.

The natural goal is then to correctly classify each new, unknown pattern to one of the specified classes, or, in some cases, conclude that this particular pattern could not be classified.

In unsupervised learning, which is sometimes called clustering, the patterns are not labelled with their category membership. Typically, the number of categories (clusters) is also unknown and will depend on the particular application.

As an example, consider the scatter plot of the set of two-dimensional feature vectors in the unsupervised learning problem in figure 6.1. Even under the assumption that the feature vectors represent the underlying patterns well, it is not clear whether these patterns should be clustered into two or three clusters, and the desired number of clusters may depend on the application.
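A small sketch of this ambiguity, assuming scikit-learn is available; the synthetic points below are only a stand-in for the data in figure 6.1.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three groups of two-dimensional feature vectors; two of them nearly
# overlap, so both k = 2 and k = 3 give defensible partitions.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0.8, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(4.0, 3.0), scale=0.3, size=(50, 2)),
])

for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, np.bincount(labels))  # both partitions look reasonable
```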

Lastly, in reinforcement learning or learning with a critic, we do not a priori know the membership of each pattern, but each tentative categorisation receives a response, typically either correct or incorrect for image analysis problems [13, p.17]. Such problems may be viewed as a hybrid between supervised and unsupervised learning because the category of each pattern is not known, though it does in a sense exist.

As we for the problem under study in this thesis a priori know the category of each pattern, namely the outcome of each patient, we will in the following only study the case of supervised learning. The classification rule will thus be designed using both the class memberships and the values of a set of features that have been extracted from each pattern. To estimate the performance of any designed classifier, i.e. its ability to classify novel patterns, we need to apply the classification rule to a set of patterns. More precisely, for each pattern in a set of patterns, we will extract the values of the same set of features as those that were used to design the classification rule, estimate the class of this pattern by feeding its feature values into the classification rule and finally compare this estimate with the true class of this pattern. To make the estimated performance reliable, we should at least have an empty intersection between the patterns used to design and those used to evaluate the classifier; see section 6.6 for a more precise discussion. The dataset used to design the classification rule is called the learning dataset or the training dataset, while the dataset used to evaluate the performance of the classifier is called the validation dataset or the test dataset.

Figure 6.1: Scatter plot of the learning dataset in an unsupervised classification problem where the true number of clusters depends on the particular application. (Axes: feature 1 and feature 2.)
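The evaluation procedure just described may be sketched as follows, assuming scikit-learn; the classifier, the features and the split ratio are placeholders, not the methods used in this thesis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder feature matrix (n patterns x d features) and true class labels.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Disjoint learning (training) and validation (test) datasets.
X_learn, X_val, y_learn, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Design the classification rule on the learning dataset only.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_learn, y_learn)

# Estimate the class of each validation pattern and compare with the truth.
error_rate = np.mean(clf.predict(X_val) != y_val)
print(error_rate)
```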

The chapter begins in section 6.1 with some general definitions of fundamental terminology and quantities that will be applied in the following discussions. Then the set of classifiers most frequently used in supervised learning problems is discussed in a general context in section 6.2. Other classifiers, e.g. those based on optimisation, are not discussed in order to limit the extent of this thesis. We instead continue with a general discussion of the problems with complex classifiers and many feature candidates in section 6.3, both of which may cause overfitting. Two commonly used techniques for reducing the risk of overfitting, dimension reduction and feature selection, follow in sections 6.4 and 6.5, respectively. Section 6.6 contains a discussion of the methods and some challenges associated with the evaluation of a classifier, including how the total number of patterns should be appropriately divided into a learning and a validation dataset. We will conclude the chapter in section 6.7 with a description of the classification procedures applied in this thesis.

Most of the content of this chapter is based on the textbook by Duda et al. [13] and the paper by Raudys and Jain [56], in particular sections 6.1, 6.2, 6.4 and 6.6. The discussion found in this chapter is more detailed than the one found in these two sources and also draws on several other sources.


6.1 Definitions

In order to provide a precise discussion of the methods and challenges of classification and evaluation, we will start off by defining some fundamental terminology and quantities.

Define Ω : {ωi | i = 1, . . . , c} → [0, 1] as the discrete random variable giving the true a priori probability for a novel pattern to belong to each of the c possible classes, ω1, . . . , ωc, e.g. each of the two prognosis classes. Also, uniquely index each element in the set of features with an integer in the interval [1, . . . , d], where d is the cardinality of the set of features. Let the set of feature values for a single pattern, e.g. for a single patient, be denoted as the d-dimensional column vector ~x = (x1, . . . , xd), ordered according to the indexing of the feature set, and be known as the feature vector of that pattern. The set of all possible feature vectors is known as the feature space and is denoted as D. Moreover, let nL be the number of d-dimensional vectors in the learning dataset, nV the corresponding number in the validation dataset and n := nL + nV the total number of patterns. Lastly, define ni as the number of feature vectors in class ωi of the learning dataset, ~xi,1, . . . , ~xi,ni as the feature vectors of these patterns, Xi := {~xi,1, . . . , ~xi,ni}, i = 1, . . . , c, as the set of these feature vectors and X := X1 ∪ · · · ∪ Xc as the collection of all feature vectors in the learning dataset.
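For concreteness, this notation might be mirrored in code roughly as follows (illustrative shapes and names only):

```python
import numpy as np

# Feature vectors of the learning dataset grouped by class: X_by_class[i]
# holds the n_i d-dimensional feature vectors x_{i,1}, ..., x_{i,n_i} of
# class omega_i (here c = 2 classes and d = 4 features).
rng = np.random.default_rng(3)
d = 4
X_by_class = [rng.normal(size=(30, d)), rng.normal(size=(25, d))]

n_i = [Xi.shape[0] for Xi in X_by_class]  # class sizes n_1, ..., n_c
n_L = sum(n_i)                            # size of the learning dataset
X = np.vstack(X_by_class)                 # X = X_1 u ... u X_c
```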

Every classifier has a decision rule, which is the concrete rule the classifier bases its classification (or decision) on. The effect of the decision rule is to divide the feature space into c decision regions, R1, . . . , Rc, where each decision region Ri corresponds to the set of possible feature vectors for which the classifier in question classifies the corresponding pattern into class ωi. The decision boundary of class ωi is the piecewise continuous boundary which separates Ri from all other decision regions. Note that there exists only a single decision boundary in the case of two classes. The complexity of a classifier is here defined as the number of independent parameters the classifier estimates. If a classifier's decision boundaries are of a specific polynomial degree, the complexity of the classifier is positively correlated with the product of this polynomial degree, the number of features and the number of classes.

One of the most useful ways of representing the classifier is in terms of a set of discriminant functions, gi(~x) for i = 1, . . . , c [13, p.29]. These functions may be viewed as the relative evidence that a specific pattern belongs to each possible class with respect to the corresponding classifier. A general decision rule can be expressed in terms of these functions as follows:

Decision rule 1 (General discriminant functions). For all ~x ∈ D, classify the pattern to the class ωi for which gi(~x) ≥ gj(~x) for all j = 1, . . . , c (if there are multiple choices for ωi, any will do).
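Decision rule 1 amounts to an argmax over the discriminant functions, as in the following minimal sketch; the two linear discriminants are arbitrary stand-ins for g1 and g2.

```python
import numpy as np

def classify(x, discriminants):
    """Decision rule 1: choose a class omega_i maximising g_i(x). Ties are
    broken by taking the first maximal index, which the rule allows."""
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))  # 0-based class index

g = [lambda x: x @ np.array([1.0, -0.5]),   # stand-in for g_1
     lambda x: x @ np.array([-0.3, 0.8])]   # stand-in for g_2
print(classify(np.array([1.0, 2.0]), g))
```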

The decision boundaries of a classifier represented using discriminant functions are the set of feature vectors for which the classifier is satisfied with multiple choices for ωi. More precisely, the decision boundaries are the set of feature vectors for which there exist at least two unique indexes, i and j with i ≠ j, such that gi(~x) = gj(~x) ≥ gk(~x) for all k = 1, . . . , c.

If the number of classes is only two, as is the case for cancer prognosis, the formulation can be simplified by defining g(~x) := g1(~x) − g2(~x). The classifier, which is sometimes called a dichotomiser in the case of only two classes [13, p.30], can now be expressed in terms of g(~x) as follows:

Decision rule 2 (Dichotomiser using discriminant functions). For all ~x ∈ D, classify the pattern to ω1 if (and only if) g(~x) > 0, otherwise classify the pattern to ω2.

It could be noted that any class could have been chosen at the decision boundary, which can now be simply defined as the set of all feature vectors for which g(~x) = 0, but we have in the rule above made the decision that these boundary points should be classified to ω2.
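A minimal sketch of decision rule 2, with the convention above that boundary points are classified to ω2; the linear g is an arbitrary stand-in.

```python
import numpy as np

def dichotomise(x, g):
    """Decision rule 2: class omega_1 iff g(x) > 0; boundary points with
    g(x) == 0 are assigned to omega_2 by convention."""
    return 1 if g(x) > 0 else 2

g = lambda x: x @ np.array([1.0, -1.0])       # stand-in for g_1(x) - g_2(x)
print(dichotomise(np.array([2.0, 1.0]), g))   # g > 0, classified to omega_1
print(dichotomise(np.array([1.0, 1.0]), g))   # on the boundary, classified to omega_2
```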

While many classifiers can be represented in terms of discriminant functions, the representation is not unique in the sense that there will always exist multiple sets of discriminant functions that classify all patterns identically. To relate sets of discriminant functions which are equivalent with respect to classification, we will define an equivalence relation (see [22, pp.194,214]), denoted ≡, on the sets of discriminant functions: two sets are equivalent if, for all ~x ∈ D, they have either the same unique maximising index or the same set of indexes attaining the common maximum (giving equal classification on the decision boundaries). It should be noted that this definition indirectly requires both equal feature spaces and an equal number of discriminant functions (which is equal to the number of classes).

It is useful to note the following general equivalences:

Theorem 1 (General equivalences of discriminant functions). A set of discriminant functions is equivalent to the set obtained by adding, subtracting or multiplying all discriminant functions in the set by the same positive function that is independent of the class index i (but may depend on, e.g., ~x). Moreover, the set gi(~x), i = 1, . . . , c, is equivalent to the set f(gi(~x)), i = 1, . . . , c, when f : R → R is an arbitrary strictly monotonically increasing function. [13, p.30]
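A quick numerical check of theorem 1: adding the same class-independent constant to every gi, or applying a strictly monotonically increasing f (here exp), leaves the classification of decision rule 1 unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
scores = rng.normal(size=(1000, 3))  # g_i(x) for 1000 patterns and c = 3 classes

original = np.argmax(scores, axis=1)
shifted = np.argmax(scores + 5.0, axis=1)        # add a class-independent constant
transformed = np.argmax(np.exp(scores), axis=1)  # strictly increasing f

assert np.array_equal(original, shifted)
assert np.array_equal(original, transformed)
```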

When studying the performance of a set of features or a specific classifier, the most interesting quantities are those that give a probability of misclassification (PMC). Indeed, if the learning dataset were of unlimited size, there would only be two quantities worth mentioning [56, p.253]:

• Asymptotic PMC: Pα – the PMC of a specific classifier α.

• Optimal PMC: P := minα Pα – the best PMC of all (known and unknown) classifiers when using a specific set of features, i.e. the theoretically best achievable PMC for the given set of features (when extracted from the specific data source under study).
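These two quantities can be illustrated on a synthetic problem where the optimal PMC is known in closed form: two equally likely classes with univariate densities N(0, 1) and N(2, 1), for which the Bayes-optimal dichotomiser thresholds at x = 1, while any other threshold is a specific classifier α with a strictly larger asymptotic PMC Pα. The example assumes SciPy and is purely illustrative, unrelated to the thesis data.

```python
from scipy.stats import norm

def pmc_of_threshold(t):
    """Asymptotic PMC of the rule 'classify to omega_1 iff x <= t' for two
    equally likely classes with densities N(0, 1) and N(2, 1)."""
    return 0.5 * (1.0 - norm.cdf(t)) + 0.5 * norm.cdf(t - 2.0)

print(pmc_of_threshold(1.0))  # optimal PMC P (Bayes threshold): ~0.159
print(pmc_of_threshold(0.3))  # a specific suboptimal classifier: larger PMC
```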

When using an unlimited number of learning patterns, the feature values and the corresponding true classes can in theory be used to find the true values of the parameters that the classifier depends on. Since we in practice have a limited number of learning patterns, the true parameters must typically be replaced by more or less accurate estimates, which may (and probably will) result in a designed classifier different from the one designed with an unlimited number of patterns. Moreover, since different datasets provide different feature values, which in general result in different parameter estimates, the PMC of a designed classifier with a given number of patterns is a random variable. We therefore have the following two interesting PMCs in addition to the asymptotic and optimal PMC: