
7.1 Research Questions

7.1.3 RQ3

RQ3 asked the question: How can gradient boosting ensembles with per-tree hyperparameters be effectively optimized? The motivation behind this question was to establish a standard for how per-tree hyperparameters should be optimized, as well as to gain insights into aspects that could benefit such processes. The answer to RQ3 is based primarily on the results of Experiment 2, detailed in Section 6.3, with additional insights derived from the results of Experiments 1, 4, and 5, detailed in Sections 6.2, 6.5, and 6.6, respectively.

Experiment 2 compared the effectiveness of the Holistic and Incremental approaches to flexible ensemble structure optimization. Specifically, we compared the prediction performance of flexible ensemble structures thoroughly optimized with the Incremental approach to that of structures thoroughly optimized with the Holistic approach.
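To make the distinction concrete, below is a minimal, runnable sketch of the two approaches on a toy squared-error boosting loop, optimizing only per-tree learning_rate values with Hyperopt. The boost() helper, the tree count, and the use of training MSE as the objective are simplifying assumptions for illustration; they are not the implementation used in our experiments.

    import numpy as np
    from hyperopt import fmin, hp, tpe
    from sklearn.tree import DecisionTreeRegressor

    N_TREES = 10  # illustrative ensemble size

    def boost(X, y, per_tree_lrs):
        """Fit one depth-3 tree per learning_rate; return the training MSE."""
        pred = np.full(len(y), np.mean(y))
        for lr in per_tree_lrs:
            tree = DecisionTreeRegressor(max_depth=3).fit(X, y - pred)
            pred = pred + lr * tree.predict(X)  # shrink this tree's step
        return float(np.mean((y - pred) ** 2))

    def holistic(X, y, max_evals=500):
        # Holistic: all per-tree learning rates form one joint search space,
        # so the optimizer can trade early trees off against later ones.
        space = [hp.uniform(f"lr_{i}", 0.0, 1.0) for i in range(N_TREES)]
        return fmin(lambda lrs: boost(X, y, lrs), space,
                    algo=tpe.suggest, max_evals=max_evals)

    def incremental(X, y, evals_per_tree=50):
        # Incremental: greedily fix one tree's learning rate at a time,
        # never revisiting earlier trees once they are set.
        fixed = []
        for i in range(N_TREES):
            best = fmin(lambda lr: boost(X, y, fixed + [lr]),
                        hp.uniform(f"lr_{i}", 0.0, 1.0),
                        algo=tpe.suggest, max_evals=evals_per_tree)
            fixed.append(best[f"lr_{i}"])
        return fixed

The greediness discussed below is visible directly in incremental(): each tree's value is frozen before the optimizer has seen any of the trees that follow it.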

From the results of this experiment, we found that the Incremental approach achieved considerably worse prediction performance than the Holistic approach on all 7 explored datasets. In fact, the Incremental approach achieved worse results than the traditional approach on 6 of the 7 explored datasets. This is a clear indication that the Incremental approach is not only worse than the Holistic approach, but seems detrimental to prediction performance. We theorize that this is because the Incremental approach is too greedy in the optimization of each ensemble tree, as it does not consider how subsequent trees are affected.

Experiment 1 was primarily focused on providing insights into RQ1 by comparing the prediction performance of flexible and traditional ensemble structures. However, we saw an opportunity to also obtain insights relevant to RQ3. Specifically, we investigated which hyperparameter combinations seemed best for prediction performance, by observing how frequently the different hyperparameter scenarios achieved the best performing flexible structure; we investigated whether hyperparameters influenced each other's behavior when optimized together, by comparing configuration characteristics between different hyperparameter scenarios; and we attempted to obtain insight into recurring and exploitable characteristics of configurations, by comparing configurations grouped by hyperparameter scenario.

Based on the results of Experiment 1, we found that flexible structures of Scenario 4 most frequently achieved the best prediction performance. This scenario achieved the best prediction performance for 4 of the 7 datasets. Scenarios 1 and 3 shared second place, achieving the best prediction performance for one dataset each. The last dataset, Car Evaluation, was excluded from this particular point of investigation because its best prediction performance was not achieved by a flexible ensemble structure. Thus, the only scenario that did not achieve the best prediction performance for any dataset was Scenario 2. These observations indicate that most datasets benefit from more regularization, considering that Scenarios 3 and 4 optimize combinations of multiple hyperparameters, while Scenarios 1 and 2 optimize hyperparameters in isolation.

They also indicate that learning_rate is an important hyperparameter to optimize, even in isolation, considering that this lone hyperparameter obtained the best prediction performance for one dataset. Max_depth, on the other hand, does not seem to benefit from being optimized in isolation. This implies that optimizing flexible ensemble structures based on multiple hyperparameters should probably be the standard.

Based on comparing the configuration characteristics between different hyperparameter scenarios, we found that these differed considerably from one hyperparameter scenario to another. This indicates that the hyperparameters influenced each other's optimal values when optimized together. This is not inherently surprising, considering that this type of behavior has been observed since the dawn of gradient boosting, for instance with learning_rate and n_estimators [23]. The implication is that optimizing the hyperparameters in a pipeline process is most likely not going to be effective, because such processes often assume that the hyperparameters are relatively independent.

We also found several recurring aspects in the configurations. For instance, we found that learning_rate values often seemed to rise with later trees. This was specifically observed in 5 datasets with Scenario 1, in 4 datasets with Scenario 3, and in 1 dataset with Scenario 4, totaling 10 of the 21 relevant scenarios with this hyperparameter.

Learning_rate values were additionally never observed below 0.15. Max_depth values were often observed at the higher end of their value ranges, while subsample values were mostly observed above 0.75. These findings imply that a bias towards rising learning_rate values and higher-end max_depth values could be beneficial to search processes, and also that the search ranges can potentially be narrowed. The value ranges implied by these particular results are 0.15 to 1 for learning_rate and 0.75 to 1 for subsample. However, we should not assume that these biases and value ranges can be used without further investigation, as we know that the hyperparameters influence each other. These ranges could thus become obsolete with larger ensembles.

To gain further insights into these types of aspects and recurring behaviors, we used the results of Experiment 2 to additionally analyze the characteristics of flexible structure configurations optimized with the Incremental approach. From this analysis, we observed that learning_rate values frequently appeared to decrease with later trees. Specifically, this was observed in 4 of the 7 explored datasets. This is the opposite of what was observed with the Holistically optimized flexible structures from Experiment 1.

From these observations, we can infer that such learning_rate behaviors are indeed tied to prediction performance, and can likely be exploited in search processes.
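One hedged way a search process could exploit this is to reparameterize the learning_rate schedule as a base rate plus non-negative per-tree increments, so that every sampled schedule rises monotonically. This reparameterization is our own illustration, not a method evaluated in the experiments.

    import numpy as np
    from hyperopt import hp

    N_TREES = 10  # illustrative ensemble size

    space = {
        "base": hp.uniform("base", 0.15, 0.5),  # starting learning_rate
        "steps": [hp.uniform(f"step_{i}", 0.0, 0.1)  # non-negative rises
                  for i in range(N_TREES - 1)],
    }

    def to_schedule(params):
        """Cumulative sums turn the increments into a non-decreasing schedule."""
        rises = params["base"] + np.cumsum(params["steps"])
        return np.clip(np.concatenate(([params["base"]], rises)), 0.0, 1.0)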

From Experiment 4, which compared flexible ensemble structure configurations (learning_rate) of good and bad prediction performance, we found indications that well-performing areas of search landscapes are relatively lenient in size, and can be differentiated from the areas of bad prediction performance. This was based on the observation that the configurations grouped by best prediction performance were generally quite similar in conceptualized landscape location, yet still clearly distinguishable in characteristics.

Similar observations were made with the worst values, though in a different area of the landscape and with less distinguishable characteristics. Similarity was specifically measured based on learning_rate standard deviations grouped by tree, which were generally below 0.3 for the best configurations and below 0.2 for the worst. These observations imply that the areas of good performance can be discovered and focused on in search processes, while areas of bad performance can safely be avoided.
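For concreteness, the similarity measure described above amounts to the following small computation: the standard deviation of learning_rate values at each tree position, taken across a group of configurations. The toy values below are illustrative, not data from the experiments.

    import numpy as np

    def per_tree_lr_std(configs):
        """One row per configuration, one column per tree position."""
        return np.std(np.asarray(configs), axis=0)

    best_group = [[0.30, 0.50, 0.70],
                  [0.35, 0.55, 0.75],
                  [0.25, 0.45, 0.80]]
    print(per_tree_lr_std(best_group))  # tight per-tree spread, well below 0.3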

Experiment 5 investigated the possibility of rounding hyperparameter values to reduce search complexity. More specifically, we compared the prediction performance obtained with the quniform and uniform value selection methods, native to Hyperopt, in three optimization processes of 500, 1000, and 2000 iterations. All processes optimized learning_rate in isolation. From this experiment, we found that the quniform value selection method obtained configurations of better prediction performance than the regular uniform value selection method. This was specifically observed for 5 of the 7 explored datasets. However, for most of the datasets, quniform still required 2000 iterations to find the best performing configurations. This indicates that reducing search complexity by restricting hyperparameter values could be a reasonable method for achieving better prediction performance in fewer search iterations, though it is unclear whether the benefits of this method are substantial.

At the very least, we can say that restricting the detail of hyperparameter values is not inherently detrimental to searches.
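A minimal sketch of the comparison from Experiment 5 is given below: optimizing a single learning_rate with the continuous uniform method versus the rounded quniform method in Hyperopt. The toy objective and the 0.05 rounding step are illustrative assumptions, not the setup used in the experiment.

    from hyperopt import Trials, fmin, hp, tpe

    def objective(lr):
        # Stand-in for training the ensemble and returning validation loss.
        return (lr - 0.6) ** 2

    for iters in (500, 1000, 2000):
        cont = fmin(objective, hp.uniform("lr", 0.0, 1.0),
                    algo=tpe.suggest, max_evals=iters, trials=Trials())
        # quniform rounds each candidate to a multiple of q, shrinking the
        # effective search space to a coarse grid of discrete values.
        disc = fmin(objective, hp.quniform("lr", 0.0, 1.0, 0.05),
                    algo=tpe.suggest, max_evals=iters, trials=Trials())
        print(iters, cont, disc)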

In summary, we answer RQ3 by stating that per-tree hyperparameters of gradient boosting ensembles should, as a standard, be optimized holistically and together in the same process. We also found many indications of aspects that could make optimization processes more effective. Such aspects include: optimally performing combinations of optimized hyperparameters; common hyperparameter value ranges and value developments between trees; common locations and a discovered leniency of the area of best performance in hyperparameter search landscapes; and the result that reducing search complexity by restricting the detail of hyperparameter values can be employed without major repercussions to prediction performance.