
2.3 Machine learning

2.3.10 Decision tree learning

A decision tree is a hierarchical structure composed of several nodes branching out from a root node and ending in leaves. In a binary decision tree, which is the most common form in machine learning applications, each node has two branches. Each node evaluates some piece of information as true or false, and a branch is followed according to the result of the evaluation. The next node is then evaluated, and this continues until a leaf is reached.

The leaves contain the decisions of the decision tree. The maximum number of decisions, i.e. the number of nodes in the longest branch, is the depth of the tree. A decision tree is shown in figure 2.12. As is apparent in the decision tree shown, the features evaluated can be both continuous (temperature) and Boolean (presence of rain).

Figure 2.12: An illustration of a decision tree deciding if a person should go outside. Shown in the figure are the root node (a), branches (b), and leaves (c).
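To make the traversal concrete, the sketch below walks through a hard-coded version of the tree in figure 2.12. The 15 °C threshold and the node order are illustrative assumptions; only the two features, temperature and rain, are taken from the example.

def go_outside(temperature_c: float, is_raining: bool) -> bool:
    # Root node (a): evaluate the Boolean feature (presence of rain).
    if is_raining:
        return False              # leaf (c): stay inside
    # Next node: evaluate the continuous feature against a threshold.
    if temperature_c > 15.0:      # branch (b) taken when the evaluation is true
        return True               # leaf (c): go outside
    return False                  # leaf (c): stay inside

print(go_outside(temperature_c=20.0, is_raining=False))   # True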

In decision tree learning, an optimal decision tree is fitted to the training data. This is done by evaluating the predictive power of each feature using an impurity measure and selecting the feature with the lowest impurity as a node. There are several impurity measures (information gain, gain ratio, Gini index, variance reduction, etc.). If the decision tree is allowed to grow very large, it can make perfect predictions on the training data, as there is effectively a branch for every sample. Such a tree generalises poorly, since it struggles to classify new samples; it is over-fitted. To combat over-fitting, random forests are introduced.
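As a minimal sketch of how a split feature is chosen, the snippet below computes the Gini index, one of the impurity measures listed above, for a candidate split. The helper names and the toy labels are illustrative assumptions; in practice the feature whose best split gives the lowest impurity would be selected as the node.

from collections import Counter

def gini(labels):
    # Gini impurity of a set of class labels: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left_labels, right_labels):
    # Impurity of a candidate split, weighted by the size of each branch.
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# Toy labels (1 = went outside, 0 = stayed in) split on the Boolean rain feature.
rain_true, rain_false = [0, 0, 0], [1, 1, 0, 1]
print(split_impurity(rain_true, rain_false))   # lower impurity indicates a better split feature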

Random forest

In a random forest, each tree is grown like a normal decision tree, but bagging is used to combine several trees generated from bootstrapped data sets into a forest. To further improve the generalisation capability of the forest, each tree is trained on a randomly selected subset of the features. This creates a forest of full-size trees that is less prone to over-fitting.

Each tree has an equal vote in the final classification. To further improve performance, boosting is introduced.
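The sketch below illustrates the bagging, random feature subsets, and equal voting described above, using scikit-learn's RandomForestClassifier under the assumption that the library is available; the data set and parameter values are illustrative only. Note that scikit-learn draws the feature subset per split rather than per tree.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data set standing in for the training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of full-size trees in the forest
    max_features="sqrt",    # random subset of the features considered at each split
    bootstrap=True,         # each tree is trained on a bootstrapped data set
    oob_score=True,         # keep out-of-bag samples for evaluation
    random_state=0,
).fit(X, y)

# Each tree gets an equal vote; predict() returns the majority class.
print(forest.oob_score_, forest.predict(X[:5]))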

Random forests can also be used for feature selection. Each tree, t, in the random forest assigns a feature importance to each of the features in its feature subset. It does this by assessing the misclassification rate on its out-of-bag (OOB) samples, OOB_t. These samples are the ones that were not included in the bootstrapped data set and hence have not been used to construct the tree. This misclassification rate is the baseline for the tree's performance, denoted errOOB_t. The tree then takes the column of feature j, designated X_j, in the OOB samples and randomly permutes the values in the column. The misclassification rate is then assessed anew on this randomly permuted OOB sample set, OOB_t^j. The new misclassification rate is denoted errOOB_t^j. If errOOB_t^j is unchanged after the feature was scrambled, feature j is deemed to be less important. If it deteriorates, on the other hand, the feature is important. This is repeated for every feature, X_j, for every tree, t, in the forest of T trees. The feature importance of each feature, VI(X_j), is the average increase in misclassification rate across the entire forest, as shown in (2.24) [24].

VI(X_j) = \frac{1}{T} \sum_{t=1}^{T} \left( errOOB_t^j - errOOB_t \right)    (2.24)
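A direct translation of this procedure is sketched below. The helper names, the NumPy dependency, and the assumption that each tree exposes a predict method are illustrative; only the permutation scheme and the averaging in (2.24) follow the text.

import numpy as np

def oob_feature_importance(trees, oob_indices, X, y, rng=None):
    # trees[t]: a fitted tree with a predict method; oob_indices[t]: indices of its OOB samples.
    rng = rng or np.random.default_rng(0)
    T, n_features = len(trees), X.shape[1]
    vi = np.zeros(n_features)
    for t, tree in enumerate(trees):
        X_oob, y_oob = X[oob_indices[t]], y[oob_indices[t]]
        err_oob = np.mean(tree.predict(X_oob) != y_oob)            # errOOB_t
        for j in range(n_features):
            X_perm = X_oob.copy()
            X_perm[:, j] = X_oob[rng.permutation(len(y_oob)), j]   # permute feature j
            err_perm = np.mean(tree.predict(X_perm) != y_oob)      # errOOB_t^j
            vi[j] += err_perm - err_oob
    return vi / T                                                  # VI(X_j) as in (2.24)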

Boosted decision tree models are very strong classifiers. The first practical boosting algorithm, AdaBoost, was created in 1995 and is still useful today. AdaBoost, short for adaptive boosting, is a forest of stumps [25]. A stump is a tree with just a root node and two leaves. The stumps are made sequentially and weighted according to their predictive performance. The samples are also given a weight, initially equal, that is increased if the samples are misclassified by the last stump that was generated. The weights of the stumps are calculated as given in (2.25), where the total error is the sum of the weights of the misclassified samples. The weight of each sample is then adjusted as shown in (2.26). The scalar a is either 1 or -1 depending on whether the sample was misclassified or correctly classified by the stump, respectively. When every sample has been adjusted, all the sample weights are normalised.

NewSampleWeight = OldSampleWeight \cdot e^{a \cdot StumpWeight}    (2.26)

A bootstrapped data set of equal size to the original is made, where the probability that each sample is drawn is proportional to its weight. The weights in the new data set are set to be equal. This new data set is used to train a new stump. The hard-to-classify samples are thus given extra emphasis by being more numerous in the new data set. The process is continued until the forest is complete. AdaBoost has since been superseded by modern alternatives, as boosted trees outperform boosted stumps [26]. Popular modern boosting algorithms are XGBoost, CatBoost, and LightGBM. They use what is known as gradient boosting.
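The weight bookkeeping of one boosting round is sketched below. The sample update follows (2.26); the stump-weight expression is the standard AdaBoost formula and is assumed here to correspond to (2.25). The toy weights are illustrative only.

import numpy as np

def boost_round_weights(sample_weights, misclassified):
    # misclassified: Boolean array, True where the latest stump got the sample wrong.
    total_error = sample_weights[misclassified].sum()
    stump_weight = 0.5 * np.log((1.0 - total_error) / total_error)   # assumed form of (2.25)
    a = np.where(misclassified, 1.0, -1.0)        # +1 if misclassified, -1 if correctly classified
    new_weights = sample_weights * np.exp(a * stump_weight)          # eq. (2.26)
    return new_weights / new_weights.sum(), stump_weight             # normalise the sample weights

weights = np.full(4, 0.25)                        # initially equal sample weights
weights, alpha = boost_round_weights(weights, np.array([True, False, False, False]))
print(weights, alpha)   # the misclassified sample gains weight before resampling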

Gradient boosted decision trees

XGBoost is extremely popular and does very well in Kaggle competitions; over half of the winning implementations in 2015 made use of XGBoost [27], [28]. Kaggle is an international online machine learning platform where machine learning experts compete to implement the best machine learning algorithms on a given data set. XGBoost provides the option to increase tree depth and is heavily optimised through parallel processing.

It is a quick and powerful algorithm [28]. CatBoost and LightGBM are still considered frontier algorithms, but can outperform XGBoost in some respects. LightGBM is an extremely optimised boosting algorithm that is especially suited to very large data sets, as it can achieve a 20 times speed increase with nearly the same accuracy [29]. CatBoost is an algorithm that seeks to reduce target leakage and thereby increase generalisation in the models; it was shown that CatBoost could outperform XGBoost and LightGBM on several popular machine learning tasks [28].
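As a usage sketch, the snippet below trains a gradient boosted model with XGBoost's scikit-learn style interface, assuming the xgboost and scikit-learn packages are installed; the data set and parameter values are illustrative only, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy data set standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,    # number of boosted trees
    max_depth=4,         # full trees rather than stumps, as discussed above
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
).fit(X_train, y_train)

print(model.score(X_test, y_test))   # accuracy on held-out data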