
4.3 Predicting the BVP amplitude

4.3.3 Training and Results

We trained one CNN per participant. The reason for training each model on a single participant is that we want the model to approximate the emotional response of a distinct person, rather than a mix of different, possibly interfering, responses from many people at once. Mixing, aggregating, or prioritizing the emotional responses of multiple people is, however, an interesting avenue for further work.

The models were trained with a callback method that monitors the mean absolute error on the validation set and stops the training after ten consecutive epochs with no improvement in this metric. The model from the best-performing epoch was saved to be used for the reinforcement learning experiments. The models trained for between 11 and 49 epochs.
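As a concrete illustration, the snippet below sketches how such an early-stopping setup could look. It assumes a Keras/TensorFlow training loop, and the file and metric names are illustrative; the exact framework and code used here are not reproduced, so this is a sketch rather than the actual training script.

import tensorflow as tf

# Illustrative early-stopping setup (assumed Keras/TensorFlow; the metric name
# 'val_mae' assumes the model was compiled with the 'mae' metric).
callbacks = [
    # stop after ten consecutive epochs without improvement in validation MAE
    tf.keras.callbacks.EarlyStopping(monitor='val_mae', patience=10),
    # keep the weights from the best-performing epoch for the RL experiments
    tf.keras.callbacks.ModelCheckpoint('cnn_participant_0.h5',
                                       monitor='val_mae',
                                       save_best_only=True),
]

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=callbacks)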

For a simple evaluation of the CNN models, we also calculated the error obtained when simply predicting the mean of the response variable in the training set. This measurement is also called ZeroR. Comparing the ZeroR with our results gives some indication of how well the models perform at predicting the BVP amplitudes. The performance from the best training epoch of each model, and the corresponding ZeroR, can be seen in Table 4.2.
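The ZeroR baseline itself is simple to compute; the sketch below shows one way to do it, with illustrative variable names (this is not the thesis code).

import numpy as np

def zeror_baseline(y_train, y_test):
    # ZeroR: always predict the mean of the training targets
    pred = np.full_like(y_test, y_train.mean())
    mae = np.abs(y_test - pred).mean()
    rmse = np.sqrt(((y_test - pred) ** 2).mean())
    return mae, rmse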

From these experiments we can see that most of the trained models, eight out of ten, slightly outperform the ZeroR on the mean absolute error metric, while only a small minority, two out of ten, outperform the ZeroR on the root mean squared error.

ID      CNN MAE   CNN RMSE   ZeroR MAE   ZeroR RMSE
0       0.090     0.116      0.075       0.099
1       0.098     0.135      0.103       0.131
2       0.073     0.109      0.075       0.100
3       0.037     0.059      0.050       0.070
4       0.052     0.084      0.069       0.094
5       0.090     0.128      0.094       0.121
6       0.088     0.124      0.090       0.116
7       0.104     0.146      0.109       0.139
8       0.064     0.103      0.060       0.090
9       0.099     0.135      0.109       0.134

Table 4.2: This table shows the results of the experiments on predicting the BVP amplitude from video game frames, together with the corresponding ZeroR baseline for each participant.

Although a more significant improvement would have been desirable, due to restrictions on time these models will have to suffice as providers of the emotional signal for our experiments.

Although the models cannot be assumed to reliably reproduce a truly plausible BVP amplitude, they can at least be said to produce a non-random signal based on human emotional arousal data.

To gain a better understanding of what these models actually produce, we collected one thousand test images from different levels of the play sessions in the dataset, and had the CNN models predict values based on this common set of frames. Table 4.3 shows the ranges of values, as well as the mean values, predicted for each model.
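The statistics in Table 4.3 could be gathered along the lines of the sketch below, assuming the trained models were saved as Keras models and the 1000 test frames are available as a preprocessed array; the file names are illustrative, not the actual thesis code.

import numpy as np
from tensorflow import keras

frames = np.load('test_frames.npy')  # 1000 preprocessed game frames

for participant_id in range(10):
    model = keras.models.load_model(f'cnn_participant_{participant_id}.h5')
    preds = model.predict(frames).ravel()
    # lowest, highest, range size and mean of the predictions (cf. Table 4.3)
    print(participant_id, preds.min(), preds.max(),
          preds.max() - preds.min(), preds.mean())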

We can observe that some of the models show large variation here, producing ranges of values up to about 0.9 wide at the largest, while several others only produce predictions in a much narrower range.

Furthermore, we can see that all the models produce somewhat similar mean values, in the range of 0.149 to 0.289. One likely reason for the narrow ranges could be that these CNN models are converging towards predicting values close to the mean. Another contributing factor could be the existence of outlier peak values in the BVP data, which may result in a large majority of the amplitude values used for training lying in a narrower range than the true range of the set, for some of the models.

To begin to analyse the incentives produced by the models, i.e., what types of game situations would produce what kinds of values, we manually reviewed the predictions given to the lower and higher scoring groups of test frames for the model trained on participant 0. From this cursory look we could see that much of the predictive variation could still be explained by level- or world-specific identifiers, such as the differences in backgrounds among levels. For instance, the top 2% of the test images, meaning those with the highest associated prediction values, and likely much of the top range, were mostly dominated by frames from levels using a dark background.

Values predicted by CNN:

CNN ID   Low     High    Range size   Mean
0        0.226   0.374   0.148        0.281
1        0.127   0.653   0.526        0.221
2        0.140   0.371   0.231        0.194
3        0.087   0.565   0.478        0.149
4        0.108   0.556   0.448        0.196
5        0.065   0.968   0.903        0.229
6        0.140   0.442   0.302        0.240
7        0.095   0.925   0.830        0.289
8        0.119   0.512   0.393        0.195
9        0.104   0.702   0.598        0.266

Table 4.3: This table shows the lowest value, highest value, range size, and mean of the values predicted by the trained CNN models, based on 1000 test images collected from the play sessions. The CNN ID is the same as the ID of the participant used to train the model.

Figure 4.3: Some examples of images with predicted values in the bottom 2% of the test image set (1000 frames), using the CNN based on participant 0.

These are either castle levels, or certain other late levels that use different background graphics from the rest of the levels. Some examples of such frames can be seen in Figure 4.4.

On the opposite end of the spectrum, frames taken from water levels, which offer their own distinct set of game mechanics, populated most of the bottom 2%. This might indicate that the nature of the arousal response changes throughout the game session, or is associated with specific levels or features of longer stretches of game play. We would have liked to see whether the models show a reaction to specific situations; however, this limited investigation revealed no such connections. Whether there indeed exist relative differences in the predictions based on more specific situations remains an open question based on this analysis.
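For reference, one way of selecting the extreme frames for such a manual review is a simple sort over the predictions, as sketched below (the array and file names are illustrative, not the thesis code).

import numpy as np

preds = np.load('preds_participant_0.npy')  # predictions for the 1000 test frames
order = np.argsort(preds)
k = int(0.02 * len(preds))                  # 2% of 1000 frames = 20 frames

bottom_indices = order[:k]   # frames with the lowest predicted values
top_indices = order[-k:]     # frames with the highest predicted values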

Ideally, we would have performed some more extensive experiments to test the behavioural incentives produced by the CNN models at this point, before moving on to implementing them into a reinforcement learning framework.

Figure 4.4: Some examples of images with predicted values in the top 2% of the test image set (1000 frames), using the CNN based on participant 0.

For instance, by collecting sets of test frames that represent distinct game situations, one could begin to measure the typical values predicted for these, and thereby gain a better understanding of which situations are likely to produce which rewards for which model. However, as we must maintain a narrower scope for this thesis, we limit ourselves to this reduced, and somewhat superficial, investigation of the behaviour of this single model. This approach might be limited in terms of offering actionable insights, but may still serve as a rudimentary example of a way to begin such an investigative process.

Chapter 5

Building a Double Deep Q-learning Network to play Super Mario Bros

In order to explore how introducing an emotional component into a reinforcement learning agent would affect the learning process, we first needed a more standard reinforcement learning algorithm that performs decently in the SMB environment, and which could serve as a baseline for comparisons. The model we built for this purpose was a type of Q-learning network, or more specifically a Double Deep Q-learning Network (DQN). As detailed in section 2.4.3, the DQN is a model-free reinforcement learning algorithm, implementing the experience replay method and using two separate deep neural networks, one for choosing the next action and one for predicting future values.
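The defining step of the Double DQN is how the training target is formed: the online network selects the next action, while the target network evaluates it. The sketch below illustrates this target computation, assuming a PyTorch implementation with illustrative names; it is not the exact code used in this thesis.

import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    # rewards and dones are float tensors of shape (batch,)
    with torch.no_grad():
        # the online network chooses the best next action ...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... while the target network estimates that action's value
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # no future value is added for terminal transitions
        return rewards + gamma * (1.0 - dones) * next_q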

5.1 Environment and Wrapper Functions

The environment that the DQN agent is trained in is the SMB game, as given by the gym-super-mario-bros [37] reinforcement learning framework, which is the same one used for collecting the dataset. Due to limitations on time, as well as other resources, the model was not trained on the full version of the game, but was constrained to learning on a single course, in order to reduce the scope of the training experiments. Several wrapper functions were also applied to the environment to further customize it for efficient learning. We look at some of these in this section.
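As a reference point, setting up a single-course version of the environment with a reduced action space could look like the sketch below; it follows the documented gym-super-mario-bros and nes-py interfaces, but the chosen course and action set here are illustrative assumptions.

import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# restrict training to a single course (here: World 1-1) to reduce scope
env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0')
# reduce the action space to a small set of common button combinations
env = JoypadSpace(env, SIMPLE_MOVEMENT)

state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())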