
4.5 Exploring the Data

4.5.2 The Elbas Volume-Weighted Average Price

Price development

Figure 10 shows the development of the near Elbas VWP over the dataset timeline, split into the training, validation and test sets, and Table 2 provides summary statistics for the near Elbas VWP in each of the three sets. Both the highest and the lowest VWP over the full period occurred around the beginning of 2012, at 259.55 EUR/MWh and -75.68 EUR/MWh respectively. Hence, the training set contains the most extreme values observed in the period. In the validation set, the VWP peaked at the beginning of 2016 and dropped below zero (to -22.84 EUR/MWh) once, at the end of 2016. The test set period generally exhibits less extreme fluctuations, with peaks just above 100 EUR/MWh and a lowest point just below zero at the end of 2017.

In the training set, the IQR spans 25.63 EUR/MWh to 38.83 EUR/MWh, with a median of 32.09 EUR/MWh. The values are somewhat lower for the validation set, where the IQR is 21.91–34.16 EUR/MWh and the median is 27.58 EUR/MWh. For the test set, the 1st quartile (26.26 EUR/MWh) is similar to that of the training set, but the 3rd quartile (34.36 EUR/MWh) is slightly lower, as is the median of 29.97 EUR/MWh.

Figure 10: Development in the near Elbas volume-weighted average price over the period November 2, 2011 to December 31, 2017. The training set period (dark blue), validation set period ("middle" blue) and test set period (light blue) are also indicated.

Dataset           Minimum   1st Qu.   Median    Mean   3rd Qu.   Maximum   Stdev
Training set       -75.68     25.63    32.09   32.83     38.83    259.55   13.07
Validation set     -22.84     21.91    27.58   29.19     34.16    196.74   12.07
Test set           -6.324     26.26    29.97   30.95     34.36    113.56    9.07

Table 2: Summary statistics for the near Elbas VWP in the training, validation and test sets. All values are in EUR/MWh.
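
The per-split statistics in Table 2 can be reproduced with a few lines of pandas. The sketch below is illustrative rather than the code actually used in this thesis; it assumes a DataFrame with a column near_vwp holding the near Elbas VWP in EUR/MWh and a column split labelling each observation as belonging to the training, validation or test set (both column names are placeholders).

```python
import pandas as pd

def q1(s: pd.Series) -> float:
    return s.quantile(0.25)

def q3(s: pd.Series) -> float:
    return s.quantile(0.75)

def split_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summary statistics of the near VWP (EUR/MWh) per data split.

    Assumes columns `near_vwp` (price) and `split` (e.g. "train"/"val"/"test");
    both names are illustrative assumptions, not the thesis' actual schema.
    """
    return (
        df.groupby("split")["near_vwp"]
          .agg(["min", q1, "median", "mean", q3, "max", "std"])
          .round(2)
    )
```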

Near VWP across delivery hours

The level and volatility of electricity prices may vary depending on delivery time (Hagfors et al., 2016), which is also the case in our dataset, as shown in Figure 11. The near VWP is generally higher and more volatile, measured by the standard deviation, during delivery hours 7–20, with notable peaks around hours 8–9 and 18–19. As such, gauging how performance varies across delivery hours may be relevant when contrasting the different models and benchmarks.

Figure 11: (a) Distribution and (b) volatility of the near VWP across delivery hours. Data is from the test set period.
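
The per-hour view behind Figure 11 can be obtained in the same illustrative style, grouping the test set observations by delivery hour and summarising the price level and its standard deviation (test_df, near_vwp and delivery_hour are assumed names, not the thesis' actual code).

```python
import pandas as pd

def hourly_profile(test_df: pd.DataFrame) -> pd.DataFrame:
    """Level and volatility of the near VWP per delivery hour (test set).

    Assumes columns `near_vwp` (EUR/MWh) and `delivery_hour` (1-24);
    both names are illustrative placeholders.
    """
    grouped = test_df.groupby("delivery_hour")["near_vwp"]
    return pd.DataFrame({
        "median": grouped.median(),  # central price level per delivery hour
        "std": grouped.std(),        # volatility per delivery hour
    })
```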

5 Methodology

Per Section 1.1, we aim to address the research questions:

1. To what extent can deep learning predict the volume-weighted average Elbas price six hours ahead of a given hour of power delivery, and how reliable are the forecasts in practice?

2. What does this suggest about the potential of deep learning in wider applications in the Elbas market, and what are the salient hurdles to implementing such AI agents?

We therefore need a set of metrics to evaluate and compare how various models perform on our prediction problem. This allows us both to assess in absolute terms how well various implementations of deep learning perform, and to compare them against benchmarks to weigh the relative value added against the added complexity of these implementations. To this end, we define and develop a set of simple heuristics, as well as baselines from more traditional techniques, to serve as benchmarks. With this empirical harness in place, we experiment extensively with a range of deep learning implementations before selecting a final set of model designs with which to thoroughly evaluate the effectiveness and pragmatic value of deep learning in intraday electricity markets. In the interest of brevity, the focus is on these final implementations rather than on all the underlying experiments that led up to them.

The following sections outline the choice of metrics (i.e. how we measure success), the simple domain-based heuristic baselines, the more traditional benchmark models used, and finally the design and implementation of the various deep learning networks developed. We assume the reader has some knowledge of the more traditional techniques and therefore keep their descriptions to a minimum, but we provide brief summaries in Appendix E. Since deep learning is the focus of this thesis, we briefly describe how the various types of deep learning function, but concentrate on the reasoning behind our design choices and on providing an overview detailed enough for reproducibility. Appendix A is intended as a self-contained introduction to how various forms of neural networks work from a technical, albeit accessible, perspective, and includes more detailed explanations of established concepts used in our implementations. The specific sections of the appendix are referenced in footnotes where relevant.

5.1 Model Evaluation

For a given problem we assume there is a true, unknown relationship f between the observed outcome y and the input predictors x, such that y = f(x) + ε, where ε is an independent random error term with an assumed mean of zero (Hastie et al., 2009). We estimate this relationship to make predictions that are as close to the true outputs as possible: ŷ = f̂(x) ≈ y, where ε averages to zero and is therefore excluded. Our main concern is obtaining accurate predictions rather than knowing the exact functional form of f. This accuracy depends on a reducible and an irreducible error. The former can be mitigated by better statistical learning methods or better ways to estimate f, while the latter stems from the ε component of the true y, which by definition is random and cannot be predicted using x (Hastie et al., 2009). Most supervised⁵⁸ machine learning algorithms tune a set of parameters to estimate f with the goal of minimising some loss function, i.e. some chosen measure of how well the model performs. For regression problems, it is common to use either the Mean Absolute Error (MAE) or the Mean Squared Error (MSE), or alternatively the Root Mean Squared Error (RMSE), although it is possible to use tailor-made loss functions to reward or penalise specific behaviours (Chollet, 2018).
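
For reference, the reducible and irreducible components correspond to the standard decomposition of the expected squared prediction error (see, e.g., Hastie et al., 2009). For a fixed input x and a fixed estimate f̂, and with E[ε] = 0,

E\big[(y - \hat{y})^2\big] = \underbrace{\big[f(x) - \hat{f}(x)\big]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}},

so no model can push the expected error below Var(ε), regardless of how well f is estimated.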

We choose MAE as the loss function for two reasons. First, we want models that can be relied upon in typical day-to-day situations. Second, we believe such models will better complement analysts and decision-makers, or even specialised models, that are better suited to spotting exceptional situations such as price spikes. As Equation 2 shows, MAE is calculated as the mean of the absolute deviation between the predicted value ŷ and the true value y over the samples i = 1, ..., N. Hence, MAE penalises all errors proportionately, which improves performance in typical conditions, but at the cost of larger peak errors. In other words, being better across the board means that the model will miss by more when it does miss.

⁵⁸ Supervised problems are those where the output is known, so that the algorithm can iteratively improve its performance relative to a defined "correct" answer.

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \qquad (2)
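
As an illustrative sketch of how this choice is typically expressed in code, in a Keras workflow (Chollet, 2018) MAE can be set directly as the training loss while RMSE is tracked as an additional metric. The network below is a hypothetical placeholder: the layer sizes, optimiser and n_features are assumptions, not the architectures developed in this thesis.

```python
from tensorflow import keras

n_features = 32  # placeholder for the number of input predictors

# Hypothetical feed-forward network; only the loss/metric choice mirrors the text.
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),  # single-output forecast of the near VWP
])

model.compile(
    optimizer="adam",
    loss="mae",  # Mean Absolute Error, Equation (2), used as the training loss
    metrics=[keras.metrics.RootMeanSquaredError()],  # RMSE (Equation (3) below) reported alongside
)
```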

Although we use MAE as the loss function to train the models, RMSE is also reported as an additional performance metric to inform model choice and design. RMSE is calculated as the square root of the MSE, as shown in Equation 3. Because the deviations between predicted and target values are squared, RMSE penalises larger deviations more severely. Hence, reporting both MAE and RMSE allows for a deeper understanding and evaluation of the models' merits and limitations.

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2} \qquad (3)
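
As a direct translation of Equations 2 and 3, the two metrics can be computed as follows. These are illustrative NumPy helpers with made-up example values, not the evaluation code used in the thesis.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error, Equation (2)."""
    return float(np.mean(np.abs(y_pred - y_true)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root Mean Squared Error, Equation (3)."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Tiny made-up example (values in EUR/MWh, for illustration only):
y_true = np.array([30.0, 32.5, 28.0])
y_pred = np.array([29.0, 35.0, 27.5])
print(mae(y_true, y_pred))   # ≈ 1.3333
print(rmse(y_true, y_pred))  # ≈ 1.5811
```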

The MAE and RMSE metrics are reported on both the validation and test sets, as the latter is truly unseen data and a better representation of how our models would perform in practice.