
3.2 Decision tree models

3.2.3 Random forests

Random forests is a tree-based ensemble learning method which performs well on binary classification tasks. Similar to the boosting algorithm shown in Section 3.2.2, it trains several trees and aggregates the results. However, instead of changing the weighting of the training data between trees, random forests uses bootstrap sampling to generate new training data from the same distribution for each tree. In random forests, a random subset of the available variables is used when growing each tree; this randomization is repeated each time a split in the tree is considered. The reason for this is to reduce the correlation between the different trees produced by the model. If there exists one very strong predictor in the data set, along with several other somewhat strong predictors, then it could very easily be the case that the first few splits chosen for most of the trees would be splits on the same variables. Consequently, the trees would all look similar, even though they are built using different bootstrap samples. In such an event, the predictions from the trees would be highly correlated, and we would not get the desired reduction in variance.

A normal subset size of variables to choose from for classification at each split is $m = \sqrt{p}$, where $p$ is the total number of variables in the training set. By using a small number of variables for each split, we will have many splits where the strongest variable is not available, so the other predictors will have a better chance of being chosen for the split. This process decorrelates the trees and makes the average of the resulting trees less variable and therefore more reliable. This builds on the concept of weak learners mentioned previously, since we are often building decision trees that exclude the most important variables. The predictive accuracy of these individual trees will likely not be very good, but their results can be aggregated to form a stronger learner.
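The two sources of randomness can be illustrated with a few lines of base R; the data frame train and its column layout are assumptions made for this sketch, not part of the thesis data.

    # The two sources of randomness in random forests (illustrative only;
    # assumes a data frame `train` whose last column is the response).
    set.seed(1)
    p <- ncol(train) - 1                   # number of predictor variables
    m <- floor(sqrt(p))                    # default number of split candidates

    boot_rows <- sample(nrow(train), replace = TRUE)   # bootstrap sample of observations
    boot_data <- train[boot_rows, ]

    # Each time a split is considered, only m randomly drawn predictors compete:
    candidate_vars <- sample(names(train)[1:p], m)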

Historical background

Tin Kam Ho introduced the method of random decision forests in 1995 as a way to overcome the fundamental limitation on the complexity of tree classifiers [Ho, 1995]. If trees are grown too complex, they will overfit the training data, which leads to low bias and high variance. This in turn leads to poor generalization on unseen data. Ho showed that producing several trees with random restrictions on the variable selection for each tree could improve accuracy while avoiding overfitting. An extension of the algorithm was later proposed by Breiman, in which he describes random forests as it is used today [Breiman, 2001a]. He describes building a forest of uncorrelated trees using randomized variable selection for each split, combined with bootstrap sampling.

We thus get a random subset of variables and observations for each tree, which greatly reduces the correlation between trees. This leads to models with slightly higher bias but reduced variance compared to the original random decision forests proposed by Ho.

Theoretical implementation

Given a training set of size $N \times p$, where $N$ is the number of observations and $p$ is the number of variables, the random forests algorithm can be implemented in the following way:

1. For $b = 1$ to $B$:

(a) Draw a bootstrap sample $Z_b$ of size $N$ from the training data.

(b) Grow a random forest tree $T_b$ on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{min}$ is reached.

i. Select m variables at random from the p available variables.

ii. Pick the best variable/split-point among the m.

iii. Split the node into two daughter nodes.

2. Output the ensemble of trees $\{T_b\}_1^B$.

To make a prediction at a new point $x$:

Regression: $\hat{f}_{rf}^B(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$

Classification: let $\hat{C}_b(x)$ be the class prediction of the $b$th random forest tree. Then $\hat{C}_{rf}^B(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_1^B$.
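As a rough illustration of the steps above (not the implementation used in this thesis), the sketch below grows $B$ bagged classification trees with the rpart package and combines them by majority vote. To keep it short, the random subset of m variables is drawn once per tree rather than at every split, so it only approximates step (b)i; the data frame train with factor response y and the test set newdata are assumed.

    # Simplified sketch of the algorithm above (not the thesis implementation):
    # the m-variable draw happens once per tree instead of at every split.
    # Assumes a data frame `train` with factor response `y`, and a test set `newdata`.
    library(rpart)

    B     <- 100                           # number of trees in the forest
    p     <- ncol(train) - 1               # number of predictor variables
    m     <- floor(sqrt(p))                # variables drawn per tree
    n_min <- 10                            # minimum terminal node size
    trees <- vector("list", B)

    set.seed(1)
    for (b in seq_len(B)) {
      Zb   <- train[sample(nrow(train), replace = TRUE), ]    # step (a): bootstrap sample
      vars <- sample(setdiff(names(train), "y"), m)           # step (b)i, simplified to per tree
      trees[[b]] <- rpart(reformulate(vars, response = "y"),  # steps (b)ii-iii: grow the tree
                          data = Zb, method = "class",
                          control = rpart.control(minbucket = n_min))
    }

    # Step 2: aggregate the ensemble by majority vote at the new points.
    votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
    pred  <- apply(votes, 1, function(v) names(which.max(table(v))))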

To find the optimal split point and variable in step (b)ii above, we use a measure of node impurity to grade the different variables and split points. In a node $m$, representing a region $R_m$ with $N_m$ observations, let

$$\hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} I(y_i = k)$$

be the proportion of class $k$ in node $m$. Three common measures of node impurity are:

Classification error: $1 - \max_k \hat{p}_{mk}$

Gini index: $\sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$

Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
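As a check on these definitions, the short helper below computes all three measures from the class labels falling in a node; the function name and the example labels are illustrative and not taken from the thesis code.

    # Compute the three node impurity measures for the labels in one node.
    # (Illustrative helper; name and example labels are not from the thesis.)
    node_impurity <- function(y) {
      p_mk <- table(y) / length(y)                       # class proportions in the node
      c(classification_error = 1 - max(p_mk),
        gini                 = sum(p_mk * (1 - p_mk)),
        cross_entropy        = -sum(p_mk * log(p_mk + 1e-15)))   # guard against log(0)
    }

    node_impurity(c("A", "A", "B", "B", "B"))   # impurity of a node with 2 of class A, 3 of class B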

The implementation of the random forests algorithm in this thesis is done using the Ranger package in R. It has been shown to be less computationally taxing and to have shorter training times than other implementations.
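A minimal fit with ranger might look as follows; the data frame train, the test set test, the response name y, and the parameter values are placeholders rather than the settings used in this thesis.

    # Minimal ranger fit (placeholder data and parameter values).
    library(ranger)

    rf <- ranger(
      dependent.variable.name = "y",       # factor response in `train` => classification
      data            = train,
      num.trees       = 500,
      mtry            = floor(sqrt(ncol(train) - 1)),
      min.node.size   = 1,
      replace         = TRUE,
      sample.fraction = 1,
      seed            = 1
    )

    rf$prediction.error                    # out-of-bag misclassification rate
    pred <- predict(rf, data = test)$predictions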

The random forests algorithm involves several hyperparameters controlling the structure of each tree, the structure of the forest, and the level of randomness in the trees and the forest. In Table 14, the parameters tuned for in this thesis are listed.

Table 14: Tunable parameters for the random forests model

    Parameter         Description
    Num trees         Number of trees grown in the forest
    mtry              Number of randomly drawn candidate variables at each split
    Replace           Whether observations are sampled with or without replacement
    Sample fraction   Fraction of the dataset sampled for each tree
    Min node size     Minimum required size of a terminal node

Num trees controls the number of trees that will be grown for the forest. It should be set to a sufficiently high number so that every observation is sampled at least a few times.

The mtry hyperparameter controls the number of randomly drawn candidate variables that are available for selection at each split when growing a tree. As mentioned above, the default setting is to use $\sqrt{p}$, but the optimal value depends on the underlying dataset. In general, lower values of mtry lead to more diverse, less correlated trees, which can lead to better stability when aggregating. A low value of mtry can also lead to better exploitation of variables with a moderate effect on the dependent variable, since the more powerful independent variables are more likely to be excluded when the number of candidate variables is low. On the other hand, low values of mtry lead to worse-performing trees, since they are built on suboptimal variables. This defines the trade-off between stability and accuracy that has to be made when selecting the size of mtry.
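This trade-off can be inspected empirically, for example by comparing the out-of-bag error over a few values of mtry; the values and the data frame train with response y below are placeholders.

    # Out-of-bag error for a few values of mtry (placeholder values and data;
    # assumes a data frame `train` with factor response `y`).
    library(ranger)

    mtry_grid <- c(2, 4, 8, 16)
    oob <- sapply(mtry_grid, function(m) {
      ranger(dependent.variable.name = "y", data = train,
             num.trees = 500, mtry = m, seed = 1)$prediction.error
    })
    data.frame(mtry = mtry_grid, oob_error = oob)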

With the replace hyperparameter we decide whether observations are sampled with or without replacement. The sample fraction hyperparameter sets the fraction of the dataset that is randomly sampled for each tree.

Another complexity constraint is the min node size hyperparameter. It sets the minimum required size of a terminal node. If this is set to a large number, shorter and smaller trees are grown. This reduces the complexity of the trees and can help prevent overfitting. It also speeds up the growing of the trees.
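These hyperparameters are typically tuned jointly; one simple way to do so is a grid search scored by out-of-bag error, as sketched below. The grid values and the data frame train are placeholders and not the tuning grid used in this thesis.

    # Joint grid search over the hyperparameters above, scored by out-of-bag error.
    # (Placeholder grid and data names, not the tuning grid used in the thesis.)
    library(ranger)

    grid <- expand.grid(
      mtry            = c(2, 4, 8),
      min.node.size   = c(1, 10, 50),
      replace         = c(TRUE, FALSE),
      sample.fraction = c(0.632, 0.8, 1.0)
    )

    grid$oob_error <- sapply(seq_len(nrow(grid)), function(i) {
      ranger(dependent.variable.name = "y", data = train,
             num.trees       = 500,
             mtry            = grid$mtry[i],
             min.node.size   = grid$min.node.size[i],
             replace         = grid$replace[i],
             sample.fraction = grid$sample.fraction[i],
             seed            = 1)$prediction.error
    })

    grid[which.min(grid$oob_error), ]      # best combination by OOB error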

Advantages and disadvantages

Random forests is a very flexible method that can handle a variety of input data. It is able to produce fairly good results even without tuning of the hyperparameters. Similar to xgboost, with random forests we can calculate a variable importance measure, which tells us which variables are best able to separate the target classes. The random forests technique is also resistant to overfitting, provided the number of trees in the model is sufficiently high. A disadvantage of the model is that it requires a lot of computational power to run, and the cost grows with the number of trees in the model.
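As an illustration of the variable importance measure mentioned above, ranger can return impurity-based importances when the forest is grown with importance = "impurity"; the data frame train with response y is again a placeholder.

    # Impurity-based variable importance with ranger (placeholder data names).
    library(ranger)

    rf <- ranger(dependent.variable.name = "y", data = train,
                 num.trees = 500, importance = "impurity", seed = 1)

    sort(ranger::importance(rf), decreasing = TRUE)   # higher = better class separation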