
6.2 Bayesian decision theory

6.2.1 Parametric classification

In parametric classification, we assume that all conditional pdfs belong to the same distribution class, but possibly with different true values of their parameters.

The problem of estimating each conditional pdf is then reduced to the problem of estimating the parameters of each distribution. The method of estimation is in principle arbitrary, but the common practice is to use the maximum likelihood estimator (MLE). The MLE is recommended by most statisticians, at least for large datasets, due to some attractive properties. In particular, the invariance principle of the MLE states that given the MLEs of a number of parameters, the MLE of any function of these parameters is obtained simply by replacing the parameters with their MLEs in the function expression. In general, a composed estimator must be computed from scratch, but this property makes such computations trivial when using MLEs. The most important theoretical property is, however, related to the large-sample behaviour of the MLE, which roughly states that all MLEs are approximately unbiased estimators attaining (nearly) the minimum possible variance of all unbiased estimators. [11, pp.346,351–352]
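As a small illustration of the invariance principle (a sketch added here for concreteness, using made-up data rather than anything from the text), the MLE of the standard deviation or of the coefficient of variation of a normal sample is obtained simply by plugging the MLEs of the mean and variance into the corresponding expressions:

```python
import numpy as np

# Hypothetical one-dimensional sample; any dataset would do.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

# MLEs of the mean and variance of a normal distribution
# (the variance MLE divides by n, not n - 1).
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()

# Invariance principle: the MLE of any function of the parameters is that
# function evaluated at the MLEs, e.g. the standard deviation sqrt(var)
# or the coefficient of variation sqrt(var) / mu.
sigma_hat = np.sqrt(var_hat)
cv_hat = sigma_hat / mu_hat
print(mu_hat, var_hat, sigma_hat, cv_hat)
```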

In practice, the assumed distribution of the conditional pdfs is almost without exception the normal distribution (when parametric classification is used).

Moreover, the covariance matrix of the normal distribution is assumed to be invertible, but this is only a minor assumption, as a singular covariance matrix indicates a redundancy among the features [13, p.34].
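Whether this assumption is plausible can be checked directly on an estimated covariance matrix. The following is a minimal NumPy sketch (the data are hypothetical) that flags near-zero eigenvalues, which indicate a (near-)singular covariance matrix and hence redundant features:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X = np.column_stack([X, X[:, 0] + X[:, 1]])  # redundant fourth feature

cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)

# Eigenvalues close to zero indicate a (near-)singular covariance matrix,
# i.e. some features are (almost) linear combinations of the others.
print(eigvals)
print("singular:", np.any(eigvals < 1e-10 * eigvals.max()))
```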

There are several arguments supporting the assumption of normality of the conditional pdfs, or at least supporting that this assumption is in general more appropriate than assuming any other distribution class for all conditional pdfs.

An essential argument follows from the central limit theorem, in particular that the sum (or average) of many small, independent random sources of noise converges to a normal distribution as the number of sources increases. This makes the normal distribution the appropriate model if the conditional pdf of $X|\Omega = \omega_i$ may be viewed as the result of a typical or prototype vector $\vec{\mu}_i$ corrupted by continuously valued, random noise [13, pp.31,33]. Another interesting property is that the normal distribution has the maximum entropy of all distributions having the same mean and variance, i.e. the normal distribution has the theoretically maximum informational content and uncertainty with respect to a given mean and variance [13, pp.32–33]. Finally, the classifiers produced when assuming normality of the conditional pdfs are to some degree simplistic and therefore likely to generalise acceptably. In fact, we will later see that under additional assumptions the classifier includes the best possible linear separation with respect to an intuitively reasonable criterion function, see section 6.4.1.

The assumption of normality can be stated as $X|\Omega = \omega_i \sim N(\vec{\mu}_i, \Sigma_i)$ for $i = 1, \dots, c$, where $\vec{\mu}_i \in \mathbb{R}^d$ is the expectation and $\Sigma_i \in \mathbb{R}^{d,d}$ is the covariance matrix.

The Bayesian decision rule can then be obtained by using the discriminant functions:

\begin{align}
g_i(\vec{x}) &\overset{(6.4)}{\equiv} \ln\big(f_{X|\Omega=\omega_i}(\vec{x}\,|\,\Omega=\omega_i)\big) + \ln(p(\omega_i)) \nonumber\\
&\overset{\text{normality}}{=} \ln\left(\frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i)}\right) + \ln(p(\omega_i)) \nonumber\\
&= -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i) - \frac{d\ln(2\pi)}{2} - \frac{\ln|\Sigma_i|}{2} + \ln(p(\omega_i)) \nonumber\\
&\equiv -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i) - \frac{\ln|\Sigma_i|}{2} + \ln(p(\omega_i)) \tag{6.6}\\
&= -\frac{1}{2}\vec{x}^T\Sigma_i^{-1}\vec{x} + \vec{\mu}_i^T\Sigma_i^{-1}\vec{x} - \frac{1}{2}\vec{\mu}_i^T\Sigma_i^{-1}\vec{\mu}_i - \frac{\ln|\Sigma_i|}{2} + \ln(p(\omega_i)) \tag{6.7}
\end{align}

where the last equality follows from the fact that the covariance matrix is symmetric by definition.
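To make the discriminant functions concrete, the following NumPy sketch (not part of the original derivation; all parameter values are made up) evaluates equation (6.6) for each class and decides the class with the largest value; using equation (6.7) instead would yield the same decision:

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """Discriminant function of equation (6.6) for one class."""
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Hypothetical two-class problem in two dimensions.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.array([[1.0, 0.3], [0.3, 1.0]]),
          np.array([[2.0, -0.4], [-0.4, 0.5]])]
priors = [0.6, 0.4]

x = np.array([1.0, 0.5])
scores = [g(x, mu, S, p) for mu, S, p in zip(mus, Sigmas, priors)]
print("decide class", int(np.argmax(scores)) + 1)
```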

Since the covariance matrix is symmetric, it contains ‘only’ $0.5d(d+1)$ unique parameters, and the expectation contains $d$ values, so a total of $0.5d(d+3)$ independent parameters are available for each class. If the number of features is high, then the number of independent parameters will be large. This can in turn make the designed classifier overfitted to the learning dataset, a phenomenon which will receive more attention in section 6.3; for now it is sufficient to know that it may be beneficial to restrict the number of independent parameters. In the case of normal conditional pdfs, this can be done by assuming values of or relationships among the $0.5d(d+3)$ parameters. There are two such assumptions which are commonly mentioned, both restricting the number of independent covariances.

Case 1: Independent features with equal variances, $\Sigma_i = \sigma^2 I$

In this simple case, all features are assumed to be statistically independent and to have the same common variance $\sigma^2$. There is thus only a single independent covariance parameter, making the total number of independent parameters $cd + 1$. The discriminant functions can in this case be written as:

\begin{align}
g_i(\vec{x}) &\overset{(6.6)}{=} -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i) - \frac{\ln|\Sigma_i|}{2} + \ln(p(\omega_i)) \nonumber\\
&= -\frac{\|\vec{x}-\vec{\mu}_i\|_2^2}{2\sigma^2} - \frac{d\ln\sigma^2}{2} + \ln(p(\omega_i)) \nonumber\\
&\equiv -\frac{\|\vec{x}-\vec{\mu}_i\|_2^2}{2\sigma^2} + \ln(p(\omega_i)) \tag{6.8}
\end{align}

In the case of equal a priori probabilities, it is easy to see that a classifier based on the discriminant functions in equation (6.8) simply classifies each feature vector to the class whose expectation is nearest, since the discriminant function of that class attains the least possible distance $\|\vec{x}-\vec{\mu}_i\|_2$, $i = 1, \dots, c$, for the specific feature vector and therefore provides the largest value among the discriminant functions. Such a classifier, with the distance measured using the Euclidean norm, is known as the minimum distance classifier.
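A minimal NumPy sketch of such a minimum distance classifier (with hypothetical expectations) could look as follows:

```python
import numpy as np

def minimum_distance_classify(x, mus):
    """Decide the class whose expectation is closest in Euclidean norm."""
    dists = [np.linalg.norm(x - mu) for mu in mus]
    return int(np.argmin(dists)) + 1  # classes numbered 1..c

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([-1.0, 2.0])]
print(minimum_distance_classify(np.array([2.0, 1.5]), mus))  # -> 2
```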

In the case of a dichotomiser, and by applying theorem 1, it is easy to further simplify the discriminant functions in equation (6.8) to $g(\vec{x}) = \vec{w}^T(\vec{x}-\vec{x}_0)$ where:

\begin{align}
\vec{w} &= \vec{\mu}_1 - \vec{\mu}_2 \tag{6.9}\\
\vec{x}_0 &= \frac{\vec{\mu}_1 + \vec{\mu}_2}{2} - \frac{\sigma^2(\vec{\mu}_1 - \vec{\mu}_2)}{\|\vec{\mu}_1 - \vec{\mu}_2\|_2^2}\,\ln\frac{p(\omega_1)}{p(\omega_2)} \tag{6.10}
\end{align}

The resulting decision boundary is thus a hyperplane orthogonal to the line between the expectations, as given in equation (6.9). Furthermore, equation (6.10) shows that the location of this hyperplane is precisely the middle point of the expectations if the a priori probabilities are equal, an observation that can also be derived independently since we know the classifier is equivalent to the minimum distance classifier in this case. If the a priori probabilities differ, then the second term in equation (6.10) is a nonzero constant times the difference vector between the expectations, so the hyperplane is translated along the line between the expectations. These observations are visualised in figures 6.2 and 6.3 for the case of one and two features, respectively.
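The following NumPy sketch (hypothetical parameter values) computes $\vec{w}$ and $\vec{x}_0$ from equations (6.9) and (6.10) and evaluates the resulting dichotomiser $g(\vec{x}) = \vec{w}^T(\vec{x} - \vec{x}_0)$:

```python
import numpy as np

def case1_hyperplane(mu1, mu2, sigma2, p1, p2):
    """w and x0 of equations (6.9) and (6.10) for Sigma_i = sigma^2 I."""
    w = mu1 - mu2
    x0 = (0.5 * (mu1 + mu2)
          - sigma2 * (mu1 - mu2) / np.sum((mu1 - mu2) ** 2) * np.log(p1 / p2))
    return w, x0

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
w, x0 = case1_hyperplane(mu1, mu2, sigma2=0.5, p1=0.8, p2=0.2)

x = np.array([0.2, 0.4])
g = w @ (x - x0)
print("decide", "omega_1" if g > 0 else "omega_2")
```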

Case 2: Equal covariance matrices, $\Sigma_i = \Sigma$

A slightly more complex case is when all classes are assumed to have the same common covariance matrix $\Sigma$. As a general covariance matrix contains $0.5d(d+1)$ independent parameters, the total number of independent parameters is $cd + 0.5d(d+1) = 0.5d(2c+d+1)$ in this case. The discriminant functions can now be written as:

\begin{align}
g_i(\vec{x}) &\overset{(6.6)}{=} -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i) - \frac{\ln|\Sigma_i|}{2} + \ln(p(\omega_i)) \nonumber\\
&\equiv -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T\Sigma^{-1}(\vec{x}-\vec{\mu}_i) + \ln(p(\omega_i)) \tag{6.11}\\
&\equiv \vec{\mu}_i^T\Sigma^{-1}\vec{x} - \frac{1}{2}\vec{\mu}_i^T\Sigma^{-1}\vec{\mu}_i + \ln(p(\omega_i)) \tag{6.12}
\end{align}


Figure 6.2: Two normal distributions with equal variances. Under the assumption of normality and equal variances, this is an idealised case when the number of features is one and the number of classes is two. The dashed, black line is the decision boundary of the Bayes decision rule under the assumption of (left) equal a priori probabilities or (right) $p(\omega_1) = 0.8$ and $p(\omega_2) = 0.2$, and the areas of $A_1$ and $A_2$ are the probabilities of misclassification when the true class is $\omega_1$ and $\omega_2$, respectively.

Figure 6.3: Two bivariate normal distributions with independent components with equal variances. Under the assumption of normality and independent features with equal variances, this is an idealised case when the number of features is two and the number of classes is two. The dashed, black line is the decision boundary of the Bayes decision rule under the assumption of equal a priori probabilities. The solid, black vector is $\vec{w} = \vec{\mu}_1 - \vec{\mu}_2$, which is also the normal vector of the decision boundary.

The formulation in equation (6.11) strongly resembles the formulation in equation (6.8), a set of discriminant functions that under the assumption of equal a priori probabilities defines a minimum distance classifier. In fact, under the assumption of equal a priori probabilities, equation (6.11) also defines a classifier that decides the class with minimum distance between its expectation and the feature vector, just as the minimum distance classifier does, only that the distance is now measured in terms of the norm defined by the matrix $\Sigma^{-1}$.
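As an illustration of this point, the following NumPy sketch (with made-up parameters) decides the class with the smallest squared distance $(\vec{x}-\vec{\mu}_i)^T\Sigma^{-1}(\vec{x}-\vec{\mu}_i)$, which under equal a priori probabilities is equivalent to using equation (6.11):

```python
import numpy as np

def mahalanobis_classify(x, mus, Sigma):
    """Decide the class with minimum distance in the norm defined by Sigma^{-1}."""
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]
    return int(np.argmin(d2)) + 1

Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
print(mahalanobis_classify(np.array([1.0, 0.5]), mus, Sigma))
```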

In the case of a dichotomiser, it is easy to show that $g(\vec{x}) = \vec{w}^T(\vec{x}-\vec{x}_0)$


Figure 6.4: Two bivariate normal distributions with equal covariances. Under the assumption of normality and equal covariances, this is an idealised case when the number of features is two and the number of classes is two. The dashed, black line is the decision boundary of the Bayes decision rule under the assumption of equal a priori probabilities. Though the decision boundary passes through the average of the expectations, the difference vector $\vec{\mu}_1 - \vec{\mu}_2$ is not the normal vector of the decision boundary, as indicated by the solid, black vectors.

where:

\begin{align}
\vec{w} &= \Sigma^{-1}(\vec{\mu}_1 - \vec{\mu}_2) \tag{6.13}\\
\vec{x}_0 &= \frac{\vec{\mu}_1 + \vec{\mu}_2}{2} - \frac{\vec{\mu}_1 - \vec{\mu}_2}{(\vec{\mu}_1 - \vec{\mu}_2)^T\Sigma^{-1}(\vec{\mu}_1 - \vec{\mu}_2)}\,\ln\frac{p(\omega_1)}{p(\omega_2)} \tag{6.14}
\end{align}

The resulting decision boundary is thus again a hyperplane, but the normal vector of this hyperplane, as given in equation (6.13), is generally not oriented along the difference vector between the expectations (as it was in case 1). The middle point of the expectations is, however, still a part of the hyperplane if we assume equal a priori probabilities, as we see from equation (6.14), which also shows that the translation caused by unequal a priori probabilities is along the line between the expectations, as in the previous case. These observations are visualised in figure 6.4 for the case of two features.
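The corresponding computation of $\vec{w}$ and $\vec{x}_0$ from equations (6.13) and (6.14) is sketched below (NumPy, with hypothetical parameter values):

```python
import numpy as np

def case2_hyperplane(mu1, mu2, Sigma, p1, p2):
    """w and x0 of equations (6.13) and (6.14) for a common covariance matrix."""
    Sigma_inv = np.linalg.inv(Sigma)
    diff = mu1 - mu2
    w = Sigma_inv @ diff
    x0 = 0.5 * (mu1 + mu2) - diff / (diff @ Sigma_inv @ diff) * np.log(p1 / p2)
    return w, x0

Sigma = np.array([[1.0, 0.5], [0.5, 1.5]])
mu1, mu2 = np.array([1.0, 2.0]), np.array([-1.0, 0.0])
w, x0 = case2_hyperplane(mu1, mu2, Sigma, p1=0.5, p2=0.5)
print(w, x0)  # with equal priors, x0 is the midpoint of the expectations
```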

Case 3: No further assumptions, $\Sigma_i$ is arbitrary

In this last case we do not assume any values of or relationships among the parameters. We have already mentioned that we then have $0.5d(d+3)$ independent parameters for each class, thus a total of $0.5cd(d+3)$ independent parameters.
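For concreteness, the following sketch (illustrative values of $c$ and $d$ only) tabulates the number of independent parameters under the three assumptions discussed in this section, i.e. $cd+1$, $0.5d(2c+d+1)$ and $0.5cd(d+3)$:

```python
def n_parameters(c, d):
    """Independent parameters of the three covariance assumptions."""
    case1 = c * d + 1                 # Sigma_i = sigma^2 I
    case2 = c * d + d * (d + 1) // 2  # Sigma_i = Sigma
    case3 = c * d * (d + 3) // 2      # arbitrary Sigma_i
    return case1, case2, case3

for c, d in [(2, 2), (2, 10), (5, 50)]:
    print(c, d, n_parameters(c, d))
```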

The discriminant functions of this case were given in equation (6.7). The resulting decision boundaries are general piecewise hyperquadrics, or a single general hyperquadric for dichotomisers. In fact, given any hyperquadric, there always exist two normal distributions whose decision boundary under the Bayes decision rule is that hyperquadric [13, p.42]. Some of the many possible forms of the decision boundaries are visualised in figure 6.5 for the case of two features.
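That the boundary is a hyperquadric can also be read off equation (6.7): the difference $g_1(\vec{x}) - g_2(\vec{x})$ of a dichotomiser has the quadratic term $-\frac{1}{2}\vec{x}^T(\Sigma_1^{-1}-\Sigma_2^{-1})\vec{x}$. The short sketch below (hypothetical covariance matrices) extracts this quadratic-form matrix; it vanishes exactly when $\Sigma_1 = \Sigma_2$, recovering the linear boundary of case 2:

```python
import numpy as np

Sigma1 = np.array([[1.0, 0.2], [0.2, 0.5]])
Sigma2 = np.array([[2.0, -0.3], [-0.3, 1.0]])

# Quadratic term of g_1(x) - g_2(x) from equation (6.7):
# -1/2 x^T (Sigma_1^{-1} - Sigma_2^{-1}) x
Q = -0.5 * (np.linalg.inv(Sigma1) - np.linalg.inv(Sigma2))
print(Q)  # nonzero => the decision boundary is a (hyper)quadric
```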


Figure 6.5: Two bivariate normal distributions. Under the assumption of normality, this is an idealised case when the number of features is two and the number of classes is two. The dashed, black curves are the decision boundaries of the Bayes decision rule under the assumption of equal a priori probabilities, which are quadrics in the case of two features.

Discussion

While all the derived classifiers attain the optimal PMC under the stated assumptions and with knowledge of the true values of the parameters, problems arise when any of these conditions is not met. The problem of not knowing the true values of the parameters, which is the typical case in practice, is related to a general phenomenon known as overfitting, a problem that will receive more attention in section 6.3. For now we shall only note that the problem increases with the number of independent parameters. A violation of the stated assumptions, in particular an incorrect assumption about the conditional pdfs (the assumption about the feature vectors' origin is mild if the patterns can be said to be 'independent'), may further decrease the performance of the classifier.

It has been noted by many authors that the linear discriminant functions obtained in case 2 above are robust to departures from the normality assumption [56, p.253]. This can be partly explained by their strong correlation with Fisher's linear discriminant, a method described in section 6.4.1 that optimises the linear separation between the classes with respect to an intuitively reasonable criterion function. As this method does not rely on the normality assumption, nor on any other assumptions about the parametric form of the conditional pdfs for that matter, it seems reasonable that the importance of the normality assumption in a classifier of case 2 above is minor.

In contrast, the hyperquadratic decision boundary obtained when using arbitrary covariance matrices suffers significantly from a violation of the normality assumption. Another critique directed toward this classifier is that its performance degenerates when the number of patterns in each class in the learning dataset differs; in fact, a learning dataset with an equal number of patterns of each class may have a lower expected PMC than a similar learning dataset where the number of patterns of only one class has been increased [56, p.253]. A generalisation accommodating this negative effect has been proposed, see [56, pp.253–254] for details.