
Solving huge systems of differential equations is a difficult and time-consuming task for humans, while it is relatively simple for computers. If a task is well defined with mathematical formulas and equations, there is almost no limit to how complex the problem can be, as long as the computer has enough computational power. However, when a task cannot be well defined with formulas, it is hard to write a computer program to solve it. An example of such a task is the recognition of objects or speech, which humans do automatically from a young age. This process happens unconsciously, without any clear rules for how it is done, and so it is hard to express with formulas. On the other hand, if large amounts of data representing the task are available, the data can be fed to an algorithm which extracts knowledge from it and transforms that knowledge into a model that performs the task.

Machine learning is useful when a task is complex enough that it is difficult to formulate as a computer program. In addition, a sufficient amount of data must be available. An algorithm needs data to train a model, validate it and test it. The test data cannot be the same as the training data. The validation data is used either after training or during training, for cross-validation. Once the validation data has been used, it has effectively become part of the training data and can thus not be used to confirm the performance of an algorithm. Based on the training data, an algorithm sets the parameters of a model. The test data is necessary to check that the resulting model also performs well on data other than the training data.
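The three-way division described above can be sketched as a simple shuffle-and-split routine. This is a minimal illustration, not a specific implementation used in this work; the function name and the 60/20/20 split fractions are chosen for the example.

```python
import random

def split_data(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and split them into disjoint training,
    validation and test sets, so the test set never overlaps the
    data the model was trained on."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_data(list(range(100)))
# The splits are disjoint: no test example appears in the training set.
assert not set(train) & set(test)
```

Because the split is done once, up front, the test set stays untouched until the final performance check.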

2.1.1 Unsupervised learning

A problem that only contains input data, without any corresponding output (or target) data, is called an unsupervised learning problem. In [7] this type of problem is expressed as D = {xᵢ}ᵢ₌₁ᴺ, where D is a training set, N is the number of training examples and x is the input data, often called the features. A feature can be the eye colour or gender of a person, an image, a time series or a sentence. An unsupervised learning algorithm receives no labels with the input data, but learns to differentiate between the inputs by recognising patterns. Some methods for doing this are clustering, density estimation and visualisation.
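As a sketch of clustering on unlabelled data, the following is a minimal one-dimensional k-means loop: points are assigned to the nearest centre, and each centre then moves to the mean of its points. This toy example is illustrative only; the function name and the starting centres are assumptions for the example, not part of the cited method.

```python
def kmeans_1d(xs, centres, iters=10):
    """Minimal k-means on scalar features: repeatedly assign each point
    to the nearest centre, then move each centre to the mean of its
    assigned points."""
    clusters = [[] for _ in centres]
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for x in xs:
            nearest = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            clusters[nearest].append(x)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# Six unlabelled measurements form two natural groups, around 1 and 5;
# the algorithm discovers the groups without ever seeing a label.
centres, clusters = kmeans_1d([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], centres=[0.0, 6.0])
```

The algorithm is told only how many groups to look for, never what the groups mean — exactly the setting described above.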

An example of a clustering task is identifying whether an image depicts a dog or a wolf, given a data sample of 1000 images. A training set of images is fed to the algorithm, which does not know what the images depict, but tries to cluster images of similar types together in groups. One cluster could, in this case, be the images of dogs.

If an image is of a wolf, it counts as a no-dog image and is therefore not part of the dog cluster. If there were more than two cases, this process would be run through for all of them. When the algorithm has run through all the training data and formed clusters of images with similar features, a set of test data is fed to the algorithm. The performance of the algorithm can be measured as accuracy or as the error rate [8]. The accuracy is determined by the fraction of images that the algorithm manages to identify correctly, i.e. the algorithm identifies an image as a dog when the image actually depicts a dog. The fraction of images that are wrongly identified by the algorithm is called the error rate or test error.
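The two performance measures above are straightforward to compute. A minimal sketch, with made-up predictions for illustration:

```python
def accuracy(predicted, actual):
    """Fraction of examples the model identifies correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted = ["dog", "dog", "wolf", "dog"]
actual = ["dog", "wolf", "wolf", "dog"]
acc = accuracy(predicted, actual)  # 0.75: three of four images identified correctly
error_rate = 1 - acc               # 0.25: the fraction wrongly identified
```

Accuracy and error rate always sum to one, so reporting either conveys the same information.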

2.1.2 Supervised learning

A problem that contains both input data and corresponding output data is called a supervised learning problem. In [7] this type of problem is expressed as D = {(xᵢ, yᵢ)}ᵢ₌₁ᴺ, where D, N and x are the same as for unsupervised learning, while y is the output data, the target value, which in this case is also known. When using a supervised learning algorithm, the data is labelled, and the algorithm generates a model that best represents the relation between the inputs and the outputs. Some methods for doing this are classification and regression.

A classification algorithm could also solve a task similar to the one described in chapter 2.1.1: differentiating images of dogs and wolves. In this case, all the images would already be labelled as either dog or wolf before being fed to the algorithm. The inputs are the images and the outputs are the labels. There is no need for the algorithm to group the images, because the output, which decides the group, is already given. Instead, the algorithm needs to learn the parameters that are common to all the images in the group of dogs and in the group of wolves, so that when a new image is run through the model, it can recognise these parameters and give the new image the right label. Classification algorithms work for problems that have a finite number of categories.
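One of the simplest supervised classifiers is nearest-neighbour: a new example gets the label of the closest labelled training example. The sketch below is a generic illustration of labelled learning, not a method used in this work; the two-number features are hypothetical.

```python
def nearest_neighbour(train, query):
    """train is a list of (features, label) pairs; return the label of
    the training example whose features are closest to the query."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    _, label = min(train, key=lambda pair: sq_dist(pair[0], query))
    return label

# Hypothetical two-number features per image, e.g. (ear roundness, fur softness).
train = [((0.1, 0.9), "husky"), ((0.2, 0.8), "husky"), ((0.8, 0.2), "wolf")]
nearest_neighbour(train, (0.15, 0.85))  # "husky"
```

Unlike the clustering example, the labels are supplied with the training data, so the algorithm only has to learn how features map to labels.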

If the output data instead consists of one or more continuous variables, the problem is a regression problem. Regression problems are further discussed in chapter 2.2.

2.1.3 Reinforcement learning

A reinforcement learning algorithm learns by trial and error. The goal of reinforcement learning is to find the optimal actions for a given situation, so that the performance of the algorithm is as good as possible. For this type of algorithm to give good results, it is necessary to have a balanced amount of exploration and exploitation [9]. Exploration means that the algorithm tries out new actions to see how they affect performance, while exploitation means that the algorithm uses actions that have already proven to give good performance in a given situation.


2.1.4 Data representation and variation in data sets

An algorithm's ability to perform well on new data depends on the representation and variation of data in the training set and the test set. A well-known example of this is a classification task with images of dogs and wolves, where the machine learning algorithm labels an image as a wolf or a husky. Example inputs and outputs for this algorithm are shown in figure 2.2. Looking at the dogs and wolves in the images, there is no obvious reason why one of the wolves gets labelled as a husky. Looking at the background, however, it is clear that all the images of wolves labelled as wolves are in snowy landscapes, as is the husky labelled as a wolf, while all the huskies labelled as huskies are not in snowy landscapes. This is probably the result of a training set in which most of the images of wolves were in snowy landscapes and the images of huskies were not, so the algorithm learned that snow was a feature of a wolf. This feature has nothing to do with whether the image contains a wolf or a dog; it is the result of a machine learning algorithm learning to do a task on its own, based on the information it is given, without anyone telling the machine what features to look for. To avoid such problems, it is important that the training, validation and test sets all contain a wide variety of data. As many different cases as possible should be present in the data sets.

Figure 2.2: The predicted response from a machine learning algorithm compared to the true response [10].

A machine learning algorithm can construct objective features to differentiate between cases, but the features are based on the data fed to the algorithm, and this data might not be objective. An example of this is when Amazon made an algorithm to pick out good job applications [11]. The algorithm turned out to be biased against women. In the past, technical jobs had mainly been held by men, so when the algorithm learned what good traits looked like from previous job applications, being a man was favoured. When the task to be solved is labelling images as wolf or dog, it is easy to say whether the algorithm has performed the task successfully. But not all tasks are easy to check. As a task and its data become more complex, it becomes harder to check whether the results given by the algorithm are actually good, as in the task of finding good job applications.


Another problem is deciding how the algorithm should weight mistakes during the training phase. Should a big mistake be weighted as worse than several smaller mistakes, or should it be the other way around? The answer depends on the problem one is trying to solve, and it is important to be aware of the effect this choice has on the performance of an algorithm.
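This choice is often made implicitly through the loss function. A minimal comparison of two common losses, on made-up error values: mean squared error penalises one large mistake much harder than several small ones, while mean absolute error treats them the same.

```python
def mse(errors):
    """Mean squared error: large errors dominate because they are squared."""
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(e) for e in errors) / len(errors)

one_big_mistake = [4, 0, 0, 0]   # a single large error
small_mistakes = [1, 1, 1, 1]    # several small errors

mse(one_big_mistake), mse(small_mistakes)  # 4.0 vs 1.0: the big mistake is weighted harder
mae(one_big_mistake), mae(small_mistakes)  # 1.0 vs 1.0: both cases look equally bad
```

A model trained with squared error will therefore work harder to avoid occasional large mistakes, which may or may not be what the problem calls for.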

The way data is represented is also important for the performance of an algorithm. In figure 2.3, the same data is represented in both Cartesian and polar coordinates. If an algorithm divides the data with a straight line, it is essential that the data is presented in polar coordinates, where the two groups become separable by such a line.

Figure 2.3: Data represented in Cartesian coordinates and polar coordinates [8].
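The effect in figure 2.3 can be reproduced with a simple coordinate conversion. The sketch below, with made-up points on two rings, assumes nothing from the source beyond the Cartesian-to-polar idea itself:

```python
import math

def to_polar(x, y):
    """Convert a Cartesian point (x, y) to polar coordinates (r, theta)."""
    return math.hypot(x, y), math.atan2(y, x)

# Two rings of points cannot be separated by a straight line in (x, y),
# but in polar coordinates the single line r = 2 separates them perfectly.
inner = [to_polar(math.cos(t), math.sin(t)) for t in (0.0, 1.5, 3.0, 4.5)]
outer = [to_polar(3 * math.cos(t), 3 * math.sin(t)) for t in (0.0, 1.5, 3.0, 4.5)]
assert all(r < 2 for r, _ in inner) and all(r > 2 for r, _ in outer)
```

The data itself is unchanged; only its representation differs, yet a linear decision boundary succeeds in one representation and fails in the other.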