
3.5 A short cascade of classifiers with adaptive detection thresholds

The results presented so far show that, by adapting the detection threshold to the image content using the a contrario approach, it is possible to dramatically improve the performance of a single strong classifier, achieving detection rates similar to those of a full cascade.

Nevertheless, the performance of a face detector should be measured not only in terms of its detection rate but also in terms of its computational efficiency. A single classifier with 200 features must check all of these features on every possible image subwindow in order to produce its result. A cascade of classifiers, on the other hand, rejects most of the false positives in its early stages, which are composed of a small number of features. Therefore, even if the total number of features in the full cascade is large (up to 6000 features in the original 38-stage Viola–Jones cascade), the average number of features tested per subwindow is relatively small.

In this Section we show how to combine the cascade-of-classifiers idea with the adaptive single-classifier method to produce a very short cascade (just four classifiers) whose performance, in terms of detection rates, is comparable to that of a long cascade.

Moreover, since its total number of features will be much smaller, it will be computationally more efficient and much faster to train.

3.5.1 The proposed short cascade.

We propose a cascade of four classifiers with 5, 10, 80, and 200 features, respectively. As with any cascade, the goal is to reject most of the subwindows in the initial stages (in our case, the first three) and to apply the 200-features classifier only to the small remaining fraction of subwindows.

We seek to preserve the detection rates of the single 200-features classifier presented in the previous Sections, at a much lower computational cost. We have trained the cascade as proposed, setting fixed detection thresholds that allow a given percentage of subwindows through each of the first classifiers. For the 5-features classifier we reject all the subwindows whose detection value is below the 80th percentile (only the top 20% of subwindows are let through), while for the 10-features classifier all the subwindows below the 95th percentile are rejected (only the top 5% are let through). For the 80-features classifier we set the threshold as in Section 3.3, using a fixed value $\mathrm{NFP}_{\max}^{80\,\mathrm{feat}} = 100$. These parameter values are quite permissive, the goal being to preserve as many true positives as possible, which should then be correctly classified by the last stage of the cascade. We have tested different sets of values (e.g., 60%, 90%, and 200) but found little difference in the final results, so we have fixed the set (80%, 95%, and 100) for all our experiments. The only tunable parameter of the cascade is therefore the maximum number of false positives allowed for the 200-features classifier in the last stage ($\mathrm{NFP}_{\max}^{200\,\mathrm{feat}}$). In sub-Section 3.5.2 we further elaborate on the design criteria used to construct the cascade.
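To make the control flow concrete, here is a minimal Python sketch of the four-stage filtering just described. It is only an illustration of the stage structure: the function and variable names are ours, and the per-stage score functions and the adaptive a contrario thresholds of Sections 3.2–3.3 are assumed to be given.

```python
def run_short_cascade(windows, scores, thresholds):
    """Control flow of the proposed 5 + 10 + 80 + 200 cascade (a sketch;
    the names and data layout here are illustrative, not the thesis code).

    scores:     dict mapping a stage's feature count to a function that
                returns the detection value of that classifier on a window.
    thresholds: dict mapping the feature count to the stage's threshold:
                fixed percentiles of the detection values for the 5- and
                10-features stages, adaptive a contrario thresholds
                (Sections 3.2-3.3) for the 80- and 200-features stages.
    """
    survivors = windows
    for n_features in (5, 10, 80, 200):
        score, t = scores[n_features], thresholds[n_features]
        # Keep only the subwindows whose detection value reaches the threshold.
        survivors = [w for w in survivors if score(w) >= t]
    return survivors  # detections reported by the last stage
```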

3.5.1.1 Sampling the set of subwindows.

In order to characterize the Gaussian functions used to model the distributions of detection values described in Section 3.2 we need to compute their mean and variance. We have assumed so far that these distributions are computed using the entire set of possible image subwindows⁹ for each of the classifiers. However, this would imply checking 295 (= 5 + 10 + 80 + 200) features on each subwindow. In order to reduce the number of computations we check all the subwindows only for the 5-features classifier and then subsample the set of subwindows to obtain the remaining distributions. If N denotes the total number of subwindows, we use pN of them to obtain the detection histograms for the 10-, 80-, and 200-features classifiers, with p a fixed parameter which we have set to p = 0.01 = 1%.

⁹ As already noted, in our experiments we consider all nonflat subwindows of sizes ranging from 20 × 20 to 220 × 220 pixels.
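A minimal sketch of this subsampling step, assuming NumPy and a hypothetical `score` callable that evaluates one classifier on one subwindow (the thesis does not prescribe this interface):

```python
import numpy as np

def estimate_gaussian_params(windows, score, p=0.01, rng=None):
    """Estimate the mean and variance of a classifier's distribution of
    detection values from a random p-fraction of the subwindows."""
    rng = rng or np.random.default_rng(0)
    n = max(1, int(p * len(windows)))            # sample size pN
    idx = rng.choice(len(windows), size=n, replace=False)
    values = np.array([score(windows[i]) for i in idx])
    return values.mean(), values.var()           # Gaussian model parameters
```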

This choice of the p parameter is justified by basic arguments of inferential statistics [2].

This theory states that the sample mean and variance of a set of values drawn from a Gaussian population are random variables that concentrate around the corresponding population parameters, with a variance inversely proportional to the sample size and directly proportional to the population variance. In our case (see Section 3.2, Figures 3.2 and 3.4), N is typically of the order of millions, the population variance is always smaller than 5, and the population mean is always above 1; therefore, by using 0.01N samples we are confident that the estimated values will not be far from the real ones. We have also tested p = 0.1, finding very little difference in the final results.
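To make this concrete, under the figures just quoted ($\sigma^2 < 5$, $N \approx 10^6$, $p = 0.01$), the standard error of the sample mean is bounded by

$$\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{pN} < \frac{5}{0.01 \cdot 10^{6}} = 5\times 10^{-4}, \qquad \operatorname{sd}(\bar{X}) < 0.023,$$

that is, roughly 2% of a population mean that is always above 1; the sample variance concentrates at a comparable rate.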

For the set of fixed parameters described above it is possible to estimate the computational efficiency of the proposed cascade. Let N be the total number of image subwindows; then the average number of checked features per subwindow is computed as

$$\mathrm{Avg} = \frac{T_f}{N}, \qquad (3.13)$$

where $T_f$ denotes the total number of checked features:

$$
\begin{aligned}
T_f ={}& 5N && \text{(all the subwindows go through the 5-features classifier)} \\
&+ (10 + 80 + 200)\cdot 0.01N && \text{(a subset of $0.01N$ subwindows is used to estimate the distributions of detection values for the 10-, 80-, and 200-features classifiers)} \\
&+ 10\cdot 0.2N && \text{(20\% of the subwindows go through the 10-features classifier)} \\
&+ 80\cdot 0.05N && \text{(5\% of the subwindows go through the 80-features classifier; an upper bound, since some subwindows may be rejected by the previous classifier)} \\
&+ 200\cdot p'N && \text{(an unknown, but very small, percentage $p'$ of the subwindows goes through the 200-features classifier)}
\end{aligned}
$$

By substituting this expression into (3.13), and taking into account that the unknown value $p'$ is very small, we obtain an upper bound for the average number of checked features: $\mathrm{Avg} \approx 13.9$.
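Spelling out the arithmetic behind this bound:

$$\mathrm{Avg} = \frac{T_f}{N} = 5 + 290\cdot 0.01 + 10\cdot 0.2 + 80\cdot 0.05 + 200\,p' = 13.9 + 200\,p' \approx 13.9 \quad \text{for } p' \approx 0.$$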

Table 3.7 in Section 3.6 compares, for different datasets, the actual value of Avg obtained by our short cascade with the values obtained by the 31-stage cascade described in [102].

3.5.2 On the design criteria of the short cascade.

One might object that the decision to use four classifiers in the short cascade presented in this Section is arbitrary, and that the thresholds used at each stage seem equally arbitrary. In this paragraph we justify these choices. The following criteria have guided the design of the cascade:

1. We want to build a cascade whose last stage is the 200-features classifier presented in the previous Sections. The reason is that this classifier exhibits good detection performance and it uses an adaptive detection threshold with statistical meaning.

2. In order to preserve the detection rates of the 200-features classifier, no true positives should be rejected by the previous stages of the cascade. This implies the use of quite permissive detection thresholds in these stages.

3. In order to accelerate computations the initial classifier(s) of the cascade must be composed of a small number of features.


Concerning the detection thresholds used by the classifiers, note that only classifiers with a large enough number of features display a distribution of detection values regular enough to admit a statistical analysis (see Figure 3.2). For this reason, when using classifiers with fewer than 80 features we are led to use fixed, not adaptive, thresholds. We found that, in general, no true positives were rejected (requirement 2) by a 5-features classifier when at least 20% of the subwindows were admitted through the classifier. In the case of a 10-features classifier the minimum admissible percentage was 5%. With other percentages (we tested 40% and 10%, respectively) the results were very similar but the computational cost (requirement 3) was higher.

Similarly, for an 80-features classifier, all the true positives are in general preserved when allowing at least 100 false positives at the output.

Several different short cascades could have been designed using these criteria. For example, a 2-step cascade (e.g., 5 + 200 or 10 + 200 features) could have been used, but in both cases the computational efficiency of the cascade (measured by the average number Avg of checked features per subwindow; see the previous Section) would be low. Indeed, using the 20% and 5% thresholds proposed above and estimating Avg as in the previous Section, we would get $\mathrm{Avg}_{5+200} = 47$ and $\mathrm{Avg}_{10+200} = 22$. The use of more features in the first stage would imply that more features must be tested on all the image subwindows, thus increasing the computational cost (e.g., for a 20 + 200 cascade, $\mathrm{Avg}_{20+200} > 20$). The use of a 3-step cascade whose two initial stages are composed of a small number of features permits us to reduce the overall computational cost (e.g., $\mathrm{Avg}_{5+10+200} = 19.1$). Another possibility is a 3-step cascade with a very discriminative next-to-last classifier, which considerably reduces the number of subwindows reaching the 200-features classifier (e.g., $\mathrm{Avg}_{5+80+200} \approx 23.8$, $\mathrm{Avg}_{10+80+200} \approx 16.8$). We opted for a 4-step cascade which combines the advantages of the two types of 3-step cascades discussed above: two initial stages with a small number of features, followed by a very discriminative classifier. Our final proposal is thus a 5 + 10 + 80 + 200 cascade, for which $\mathrm{Avg}_{5+10+80+200} \approx 13.9$. Other choices for the 4-step cascade were considered. For example, with a 5 + 10 + 20 + 200 cascade we would get $\mathrm{Avg}_{5+10+20+200} = 12.3$ (assuming a fixed detection threshold of 1% for the 20-features stage), but at the cost of adding a new fixed threshold to the method. We preferred to use the 80-features classifier, for which an adaptive threshold can be estimated using the methodology described in Section 3.3. Of course, other choices could have been made (e.g., 4 + 15 + 100 + 200, or the use of more stages in the cascade), but the final results, in terms of detection performance (which ultimately depends on the 200-features classifier of the last stage) and computation time (which mainly depends on the initial stages of the cascade), would not be very different from the ones obtained with the proposed implementation.