
2.3 Machine learning algorithms

2.3.2 Classifiers

The classifier, or the machine learning algorithm, as seen in figure 2.10, is the component that produces the inferred function h = f(x). There is a vast number of classifiers, and the next three paragraphs summarize the classifiers intended to be used in this thesis, their strengths and weaknesses, and why they have been chosen.

Decision trees The decision tree classifier is one of the simplest classifiers and hence the first to be introduced. It was first presented by J. R. Quinlan in 1986 [51]. The learning phase involves constructing a tree consisting of nodes and edges using the dataset. A node represents a test on one of the features, and the edges represent the outcomes of the test. Leaf nodes represent the outcome class. Predicting then consists of following a path from the root node to a leaf node. An example of a decision tree is seen in figure 2.12.

An important aspect of decision trees is how to create one, as different trees give rise to different classifiers. MATLAB's method (used in this project) for creating a classification tree is known as top-down induction of decision trees (TDIDT) and is an example of a greedy algorithm. It works recursively in the following manner (a code sketch follows the list):

1. Start with all the input data and examine all possible binary splits on every feature.

2. Select the split with the best optimization criterion. Information about the split criterion is given in appendix A.

3. Impose the split.

4. Repeat recursively for the two child nodes.

Figure 2.12: An example of a decision tree that, based on the outlook, humidity and windy attributes, classifies a day as belonging to either the P or the N class [51].
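As a rough illustration of the procedure above, the following is a minimal sketch of how such a tree might be grown in MATLAB, assuming the Statistics and Machine Learning Toolbox function fitctree; the variables X, y and Xnew are placeholders for this project's feature matrix, class labels and new data, not taken from the thesis itself.

% Minimal sketch: growing and inspecting a classification tree in MATLAB.
% X is an N-by-d matrix of feature vectors, y an N-by-1 vector of class
% labels and Xnew a matrix of unseen feature vectors (all placeholders).
tree = fitctree(X, y, ...
    'SplitCriterion', 'gdi', ... % optimization criterion for each binary split (see appendix A)
    'MinLeafSize', 5);           % stop splitting nodes that hold only a few examples
view(tree, 'Mode', 'graph');     % white-box model: the learned splits can be inspected
labels = predict(tree, Xnew);    % prediction follows a path from the root node to a leaf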

There are several reasons for the choice of testing decision trees in this project. They are conceptually simple and deal well with noise [43] (data that is not representative of the true function). A decision tree is a white-box model, meaning that the decisions that led to a specific prediction can be inspected. This is desirable if knowledge about the dataset is sought. One of the disadvantages stems from the greedy strategy: finding the right root of the tree can greatly impact the final result, i.e., small changes early on can have big effects later. Furthermore, if the true underlying function is truly linear, decision trees usually do not work well. Lastly, decision trees do not function well when there is a large number of uncorrelated features, since they work by finding the interactions between variables.

Ensemble learning Ensemble learning is the process of combining multiple classifiers into one in order to obtain better performance. The assumptions behind combining classifiers are: (i) if any one classifier is wrong on some unique subset of the dataset, the combined model will still predict correctly, since the rest of the classifiers are correct on that same subset, and (ii) this holds for every subset. The question then arises of how to combine the classifiers. The most widely used way of combining classifiers is through a technique called boosting. In 1989, Robert E. Schapire introduced the first boosting algorithm [59] in the paper "The strength of weak learnability".

The general idea is to collect a set of poor classifiers, each of which performs only slightly better than chance; by combining them, the ensemble can perform arbitrarily well. An improvement of boosting was introduced by Yoav Freund and the same Schapire in a paper from 1999 [20], where AdaBoost (Adaptive Boosting) was presented.

The fundamental difference was to assign weights to each feature vector according to how difficult previous classifiers have found it to classify correctly. In short, AdaBoost works iteratively until some number of iterations is reached or the error rate falls beneath some threshold. At each iteration, a new classifier is trained on the weighted training set. Initially, all weights are set equally, but in each round, the weights of incorrectly classified feature vectors are increased. The next classifier then has to focus on getting these feature vectors correct. The change of the weights is usually set to decrease with a fixed number as the number of iterations grows, because the uncertainty increases with the number of iterations. This fixed number is known as the learning rate. The general idea can be seen in figure 2.13.

Figure 2.13: Two iterations of AdaBoost, shown as panels 1, 2 and 3. In panel 1, a point belonging to the triangle class is misclassified. The point gets a bigger weight, indicated by its bigger size, and is thus correctly classified in panel 2. This in turn leads to a misclassification of a point in the star class. That point gets a bigger weight, resulting in a perfect classification in panel 3.
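To make the reweighting concrete, the following is a minimal sketch of a single iteration under the common discrete AdaBoost update rule; it is not taken from the thesis. Here w is the weight vector over the N training examples, h is the weak classifier trained in this round, and the labels in y and the predictions are assumed to be coded as -1 and +1. How the learning rate enters varies between implementations; in this sketch it simply shrinks the classifier weight alpha.

% One AdaBoost iteration (sketch); X, y, w, h and learnRate are placeholders.
pred  = predict(h, X);                          % predictions of this round's weak classifier
err   = sum(w .* (pred ~= y)) / sum(w);         % weighted training error
alpha = learnRate * 0.5 * log((1 - err) / err); % weight of this classifier in the final vote
w     = w .* exp(-alpha * y .* pred);           % misclassified examples get larger weights
w     = w / sum(w);                             % renormalize so the weights form a distribution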

AdaBoost has been shown to work on small datasets, which is the main reason it was chosen to be tested [45]. AdaBoost uses the aforementioned decision trees as its weak learners. In practice, the trees will favor splitting on feature vectors with high weights. Consequently, noisy feature vectors will be given higher weights. Hence, AdaBoost will not work well when the number of outliers is considered high. Additionally, training can be time-consuming when the dataset is large, and convergence can be slow for complex problems [60]. This weakness is alleviated since the dataset at hand is relatively sparse.
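For reference, boosted decision trees of this kind could be trained in MATLAB roughly as follows. This is a sketch rather than the thesis's actual code; it assumes a two-class problem (AdaBoostM1), the Statistics and Machine Learning Toolbox, and the same placeholder variables X, y and Xnew as before.

weak  = templateTree('MaxNumSplits', 10); % shallow decision trees as weak learners
model = fitcensemble(X, y, ...
    'Method', 'AdaBoostM1', ...           % AdaBoost for two-class problems
    'Learners', weak, ...
    'NumLearningCycles', 100, ...         % maximum number of boosting iterations
    'LearnRate', 0.1);                    % the learning rate discussed above
labels = predict(model, Xnew);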

Support vector machines (SVM) The original SVM algorithm was first presented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963, while the standard version used today was introduced by Corinna Cortes and Vladimir Vapnik in 1995 [16]. It is considered to be a complex classifier; hence, the presentation of it will be limited to the general idea.


Figure 2.14: The optimal margin for a two-dimensional two-class classification problem [16]. The circles and crosses are feature vectors with values according to the x- and y-axes, and the circle and cross symbols indicate their respective classes. The examples surrounded by a grey square are the support vectors. The dotted line is the decision boundary (hyperplane) created by the SVM.

Consider the two-class classification problem in figure 2.14, where the classes are represented as crosses and circles and the approximation h = f(x), i.e. the decision boundary, is represented as a dotted line. The prediction of new data consists of investigating which side of the decision boundary it lies on.

If it lies above the dotted line, it is classified as belonging to the cross class, and if it lies below, it is classified as the circle class. If the predicted data lies close to the decision boundary, one can see that if the decision boundary were moved only a small amount, there would be a risk of misclassifying that data.

SVMs, however, take this into account and try to find the best decision boundary. Formally, the best decision boundary is the one that lies furthest away from any of the training examples. That decision boundary is said to maximize the margin. Hence, the training process consists of creating a decision boundary which maximizes the margin. Cortes et al. [16] showed that the process of maximizing the margin with respect to the input vectors depends only on their dot products. This is a key observation, as an apparent weakness of SVMs is that they try to create a straight line between the classes in the two-dimensional case, or a hyperplane in the high-dimensional case. Consequently, SVMs appear to be feasible only if the classes are linearly separable, as is the problem in figure 2.14. In order to overcome this issue, one can transform the data to a higher-dimensional space where the data is linearly separable, as seen in figure 2.15. Since the decision boundary depends only on the dot products of the feature vectors, the transformation itself is not needed, but only a function that implicitly computes the high-dimensional dot product. This function is referred to as a kernel function and is a vital component of the classifier.
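For reference, this dependence on dot products can be seen in the standard soft-margin dual formulation (a textbook form, stated here for illustration rather than quoted from [16]); the training data enters only through the pairwise dot products, which the kernel function replaces:

\[
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0,
\]

where the labels y_i take the values -1 and +1, C is a regularization constant, and the kernel trick amounts to substituting k(x_i, x_j) for the dot product.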

Figure 2.15: The transformation, often denoted φ, that transforms the data into a linearly separable space. The solid line is the decision boundary, and the circles are data from the training set [6].

The linear, polynomial and Gaussian kernels are provided by MATLAB. A more detailed description of them is given in appendix B. The choice of which kernel to use depends on the dataset, but the RBF (Gaussian) kernel may be the most used kernel [12]; hence, it is the only kernel to be tested in this project. This is done to ease the computational complexity.
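As a rough illustration, an RBF-kernel SVM could be trained in MATLAB along the following lines. This is a sketch under the assumption that fitcsvm from the Statistics and Machine Learning Toolbox is used, with X, y and Xnew once again as placeholders.

model = fitcsvm(X, y, ...
    'KernelFunction', 'rbf', ... % the kernel implicitly computes the high-dimensional dot product
    'KernelScale', 'auto', ...   % heuristic choice of the kernel width
    'Standardize', true);        % rescale the features before training
labels = predict(model, Xnew);   % classify according to which side of the boundary Xnew lies on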

The reason for including SVM as one of the three classifiers is that it often provides significantly better classification performance than other classifiers on reasonably sized datasets [45]. It has strong theoretical foundations and excellent empirical successes [37]. Furthermore, SVM is very robust against ill-behaved distributions, high dimensionalities and a low ratio of training examples to the dimensionality of the input data [10]. One drawback is the lack of transparency in the results, as the computation might be done in a very high-dimensional space. SVM is also computationally expensive and thus runs slowly on large datasets [40]. As with AdaBoost, this weakness is alleviated since the dataset at hand is relatively sparse.