
6.2 Experimental Setup

6.2.1 Grid search

In the grid search, we first split the training data into a separate training and validation set. The training set consists of the data from 01/06/2019 to 31/12/2019, while the validation set consists of the data from 01/01/2020 to 31/01/2020.
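To make the split concrete, the sketch below shows one way to slice a date-indexed dataset into these two periods. It assumes the preprocessed data lives in a pandas DataFrame with a DatetimeIndex; the frame name and helper are our own illustration, not the thesis code.

```python
import pandas as pd

# Hypothetical date-based split; `data` is assumed to be a DataFrame indexed by timestamp.
def split_train_validation(data: pd.DataFrame):
    train = data.loc["2019-06-01":"2019-12-31"]       # 01/06/2019 - 31/12/2019
    validation = data.loc["2020-01-01":"2020-01-31"]  # 01/01/2020 - 31/01/2020
    return train, validation
```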

Then, for each hyperparameter combination, we train our models for 100 epochs for the FFNN and 70 epochs for the GRU and LSTM. The recurrent models are trained for fewer epochs because they overfit more quickly due to their higher model complexity, i.e., a richer set of model parameters. For each epoch, we track the MAE on the validation set, and the lowest MAE obtained during training is defined as the score of the hyperparameter combination.
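The scoring procedure can be summarized in a few lines. The sketch below is a minimal illustration, assuming hypothetical helpers build_model, train_one_epoch, and validation_mae; it is not the thesis implementation.

```python
# Score of one hyperparameter combination = lowest validation MAE over all epochs.
def score_combination(params, train_set, val_set, n_epochs):
    model = build_model(params)                 # hypothetical model factory
    best_mae = float("inf")
    for epoch in range(n_epochs):               # 100 for the FFNN, 70 for GRU/LSTM
        train_one_epoch(model, train_set, params)
        mae = validation_mae(model, val_set)    # MAE tracked after every epoch
        best_mae = min(best_mae, mae)
    return best_mae
```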

This concept is visualized in Figure 6.2. By evaluating the different combinations and comparing their scores against each other, we gain useful insights into which hyperparameters perform better on the chosen evaluation metric.

Minimizing MAE is chosen as the optimization criterion because, from trial and error, it appears to be more robust than the alternatives.

Table 6.1 presents the hyperparameters included when grid searching the FFNN, while Table 6.2 presents the hyperparameters tested in the GRU and LSTM grid search. Since the GRU and LSTM have similar architectures, they are tested with the same values. For the $RNN_{init}$ hyperparameter, we test three initialization methods, all described in Table 6.3.

² We train with a Tesla P100-PCIE-16GB GPU.

Figure 6.2: Example of how hyperparameters are evaluated. In this case, the model obtained its lowest MAE at epoch 40, with a value of 7.67, which is then set as the score of the hyperparameter combination.

Batch size:                     32, 64, 128
Loss function:                  SmoothL1Loss, L1Loss, MSE
Learning rate:                  0.001, 0.0001
Dropout rate d:                 0, 0.25, 0.50
Dense layer sizes SD1;...;SDn:  16, 64, 128, 16;16, 64;16, 64;64, 128;128, 16;16;16, 64;64;64, 12;64;16, 128;128;128

Table 6.1: Tested hyperparameters in the FFNN grid search.
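For illustration, the Table 6.1 search space could be enumerated as follows; the dictionary keys are our own naming, and the loss functions are given as strings rather than PyTorch objects.

```python
from itertools import product

ffnn_grid = {
    "batch_size": [32, 64, 128],
    "loss_function": ["SmoothL1Loss", "L1Loss", "MSE"],
    "learning_rate": [0.001, 0.0001],
    "dropout": [0.0, 0.25, 0.50],
    "dense_layers": [[16], [64], [128], [16, 16], [64, 16], [64, 64], [128, 128],
                     [16, 16, 16], [64, 64, 64], [12, 64, 16], [128, 128, 128]],
}

# All combinations evaluated by the grid search.
combinations = [dict(zip(ffnn_grid, values)) for values in product(*ffnn_grid.values())]
```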

Table 6.2: Tested hyperparameters in the GRU/LSTM grid search.

Initialization method   Description
zero                    Parameters in the hidden/cell state are initialized to zero.
random                  Parameters in the hidden/cell state are initialized to random numbers sampled from the normal distribution.
learn                   Parameters in the hidden/cell state are first Xavier initialized, then updated during backpropagation.

Table 6.3: Initialization methods in the GRU/LSTM.
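A minimal PyTorch sketch of the three initialization methods is given below; the function and its arguments are our own illustration, not the thesis implementation (for the LSTM, the same scheme would be applied to both the hidden and the cell state).

```python
import torch
import torch.nn as nn

def init_hidden(method: str, num_layers: int, batch_size: int, hidden_size: int):
    shape = (num_layers, batch_size, hidden_size)
    if method == "zero":
        return torch.zeros(shape)
    if method == "random":
        return torch.randn(shape)            # sampled from the normal distribution
    if method == "learn":
        h0 = torch.empty(shape)
        nn.init.xavier_normal_(h0)           # Xavier initialization
        return nn.Parameter(h0)              # registered as a parameter -> updated by backprop
    raise ValueError(f"Unknown initialization method: {method}")
```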

Findings from the grid search

Overall, it was not trivial to distinguish between the results in the grid search. The differences were small, owing to the fact that the dataset is undeniably noisy and complex. However, there were some indications that certain hyperparameters produced more robust results than others.

For the FFNN, the effect of dropout is hard to evaluate solely by looking at the MAE scores. Models without dropout produced results just as good as the others, but they showed more unstable learning behavior throughout the epochs. In Figure 6.3 (a) and (b), we plot the training and validation learning curves of two models with the same hyperparameters, except for the dropout rate. We can see that the learning becomes more robust when dropout is added. Regarding the structure of the dense layers, one-layer networks with a size of either 64 or 128 tended to yield better results. A fixed budget of 100 epochs was not enough for networks trained with a learning rate of 0.0001, but since the simpler models performed best and are fast to train, a low learning rate still seemed advantageous. From what we could see, no loss function or batch size produced results significantly different from the others.

Figure 6.3: Dropout effect in the FFNN. Both models are trained with the same hyperparameters, except for the dropout rate: blue is trained with no dropout and red with a dropout rate of 0.5. We can clearly see that dropout yields more robust learning.
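As a reference for the architecture family being searched over, the following is a minimal sketch of an FFNN with configurable dense layer sizes and dropout rate; the single regression output and the ReLU activations are assumptions on our part.

```python
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, n_features: int, dense_sizes: list, dropout: float):
        super().__init__()
        layers, in_size = [], n_features
        for size in dense_sizes:                          # e.g. [64] or [128, 128]
            layers += [nn.Linear(in_size, size), nn.ReLU(), nn.Dropout(dropout)]
            in_size = size
        layers.append(nn.Linear(in_size, 1))              # single regression output
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```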

It was also difficult to reach a conclusion on which hyperparameters worked best for the GRU and LSTM. These models are more complex than the FFNN, with many more knobs to turn and tweak. This was noticeable as the models, regardless of the hyperparameters, started to overfit the training set from an early epoch. Figure 6.4 visualizes a typical learning curve for these models.

Fortunately, there are hints of learning in the first epochs, and that might be enough to produce good results if we can stop the training at the right time. Due to the overfitting issues of the LSTM and GRU, a low learning rate combined with a high dropout rate seemed advantageous. For the other hyperparameters, it was again difficult to distinguish between their relative performances.

Figure 6.4: Typical GRU/LSTM learning curves in the grid search. The figure is taken from the GRU grid search for stage 1.
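One possible way to exploit the early hints of learning is early stopping on the validation MAE. The sketch below is only an assumption about how this could be done, reusing hypothetical helpers analogous to those shown earlier; the patience value is arbitrary.

```python
def train_with_early_stopping(model, train_set, val_set, max_epochs=70, patience=5):
    best_mae, best_state, since_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)                 # hypothetical training helper
        mae = validation_mae(model, val_set)              # hypothetical evaluation helper
        if mae < best_mae:
            best_mae = mae
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:             # stop once validation MAE stalls
                break
    model.load_state_dict(best_state)                     # roll back to the best epoch
    return model, best_mae
```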

Based on these findings, a smaller hyperparameter space was explored further in the forward-chained validation. Since using a specific loss function or batch size did not seem to improve performance, we set the batch size to 32 and chose SmoothL1Loss as the loss function. In PyTorch, SmoothL1Loss is formulated as:

\[
\text{SmoothL1Loss}(\hat{y}, y) = \frac{1}{n} \sum_{i} z_i, \qquad
z_i =
\begin{cases}
0.5\,(\hat{y}_i - y_i)^2 & \text{if } |\hat{y}_i - y_i| < 1, \\
|\hat{y}_i - y_i| - 0.5 & \text{otherwise.}
\end{cases}
\]

This loss function uses a squared term when the absolute error is below 1 to get more out of the gradients when the prediction is relatively close to the target.

This can make sense for a target series that is closely distributed around a zero mean. To prevent the gradients from being dominated by outliers, the error is not squared when the absolute error is above 1. This behavior can be preferable to avoid overfitting to extreme observations.
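A quick numerical check illustrates the two regimes of PyTorch's nn.SmoothL1Loss (with its default threshold of 1):

```python
import torch
import torch.nn as nn

loss_fn = nn.SmoothL1Loss()
target = torch.tensor([0.0])

print(loss_fn(torch.tensor([0.5]), target))  # |error| = 0.5 < 1  -> 0.5 * 0.5**2 = 0.125 (squared term)
print(loss_fn(torch.tensor([3.0]), target))  # |error| = 3.0 >= 1 -> 3.0 - 0.5 = 2.5   (linear term)
```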