Neural Networks Used for Time Series Prediction of Power Consumption

Zuitao Ma

Thesis submitted for the degree of

Master in Informatics: Programming and Networks 60 credits

Department of Informatics

Faculty of mathematics and natural sciences UNIVERSITY OF OSLO

Spring 2019

Neural Networks Used for Time Series Prediction of Power Consumption http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

Abstract

Predicting power consumption is of significant importance for all participants in the energy sector. With the rise of the smart grid and the widespread installation of smart meters, predicting power consumption in residential houses has become more feasible and is attracting attention.

In this research, neural network methods are used to build prediction models for power consumption in residential houses, using previous consumption in the form of time series data. The chosen methods are Multilayer Perceptron, Convolutional Neural Network and Long Short-Term Memory. Prediction models are built at the scale of both individual houses and a group of houses.

Despite a relatively small training dataset, the neural network models give reasonable performance in predicting power consumption in residential houses. It is also found that extracting explicit time features from time series data can be a countermeasure against a small training dataset when using neural networks.

Acknowledgment

First of all, I must thank my supervisor Yan Zhang, who has provided many valuable suggestions for my thesis work, not to mention all the guidance and help I received from him during my years at the University of Oslo.

I also want to thank all the wonderful people at Hafslund Nett. I really appreciate their support for my master's studies, and it has been a great pleasure to work with them.

Finally, I would like to thank my parents, Magnus and all my friends for their care and encouragement.

List of Figures

4.1 a single neuron

4.2 a simple ANN scheme

4.3 Forward and Back Propagation

5.1 CSV data file

5.2 missing value

5.3 total consumption trend

5.4 detrended data

5.5 typical consuming patterns

5.6 data file of collective consumption

5.7 collective consumption pattern

5.8 extracted time features

6.1 Underfitting and Overfitting

6.2 K-fold Cross-Validation

6.3 Walk Forward Validation

6.4 Grid Search vs Random Search

7.1 MLP architecture

7.2 Linear Function

7.3 ReLu Function

7.4 Sigmoid Function

7.5 Tanh Function

7.6 Dropout

7.7 Weight Optimization

7.8 Learning Rate

7.9 Batch Size

7.10 Epochs

7.11 prediction result by MLP for individual house

7.12 prediction result by MLP for house group

8.1 CNN Architecture

8.2 Filter

8.3 Max Pooling

8.4 Flattening

8.5 prediction result by CNN for individual house

8.6 prediction result by CNN for house group

9.1 comparison between RNN and FFNN

9.2 Vanishing Problem

9.3 Memory Cell in LSTM

9.4 Hard Sigmoid

9.5 prediction result by LSTM for individual house

9.6 prediction result by LSTM for house group

Chapter 1 Introduction

1.1 Motivation

Prediction of power consumption has always been of primary importance in the energy industry. It plays an essential part for power suppliers, distributors and other participants in the generation, transmission, distribution and trading of power. Statistical methods are traditionally used for power demand forecasting. Over the last decades, soft computing techniques such as machine learning methods have also gained popularity in this field.

Power consumption predictions in the industry usually focus on the high-voltage level, where the results are considered more meaningful for monitoring the power grid and keeping it in balance. In contrast, prediction of power consumption in residential houses has not received as much attention. Consumption data from residential houses was seen as hard to retrieve and of little use.

However, with the emergence of the smart grid and the development of digital technologies, predicting power consumption in residential houses is gaining significance. Effective demand response and user-side energy management in a smart grid depend closely on the prediction of power consumption at the scale of individual homes. Meanwhile, the widespread installation of smart meters makes household-level consumption data easily accessible at fine granularity.

1.2 Goal

The intention of this thesis is to employ neural network methods to predict power consumption one day ahead at the scale of residential houses, both for an individual house and for a group of houses.

Power consumption at the residential level is related to many factors, such as weather, temperature, lifestyle, daily schedules and consumption preferences. All these factors are variable, which means there is a certain degree of changeability in them. However, power consumption also correlates with another, more stable factor: time. Therefore, the prediction of power demand in this research will be conducted using only historical consumption data with timestamps.

Neural networks are the main method in this research, and three variants of neural networks are chosen to build prediction models. Neural networks usually perform well on big datasets, whereas in this research they are used on a relatively small dataset. It is therefore of interest to find out how neural networks perform on a small time series dataset.

1.3 Thesis Structure

In chapter 1, an introduction is given about the motivation and the goal in this research.

In chapter 2, the main background concepts in this research are presented, including smart grid, smart meter, time series data mining and machine learning.

In chapter 3, relevant studies on neural networks are reviewed, including the use of neural networks on time series data, on power consumption, and on small datasets.

In chapter 4, the artificial neural network is introduced briefly. This includes its basic components, its learning process and the three variants picked for this research. The introduction in this chapter provides a theoretical basis for the model building in chapters 7, 8 and 9.

In chapter 5, the dataset used in this research is presented. Moreover, the preprocessing of the dataset and the choice of data features for the prediction models are illustrated.

In chapter 6, the main tools and techniques for building models are introduced, including the programming tools, validation technique, evaluation metrics and hyperparameter tuning method.

In chapter 7, prediction models are built by using multilayer perceptron neural networks. The architecture of the multilayer perceptron is first stated. Then prediction models are built, with an explanation of the major parameters. Finally, the models’ performance and predicted results are evaluated and discussed.

In chapter 8, prediction models are built by using a convolutional neural network. As in the previous chapter, the architecture and particular parameters of the convolutional neural network are explained. Then the predicted results are evaluated and discussed, and the models’ performance is compared with the models in chapter 7.

In chapter 9, prediction models are built by using long short-term memory neural network. The architecture and particular parameters in long short-term memory are stated. Then the prediction results are evaluated, and the performance of models is compared among the three used neural network methods.

In chapter 10, the conclusion of this research is drawn, and potential applications and future work are proposed.

Chapter 2 Background

2.1 Smart Meters

A smart grid is an intelligent network that supplies power to consumers via two-way communication. The E.U. defines a smart grid as an electricity network that integrates the actions of all actors in the network, including generators, distributors and consumers, with the aim of efficiently delivering sustainable, economic and secure power supplies [1].

Smart meters are the core devices in a smart grid. A smart meter is an advanced energy meter that acquires information from the consumers’ load devices, measures their power consumption, and sends the information to utility operators.

Smart meters have been deployed around the world during the past decades. In Norway, all households will have smart meters installed by February 2019 [2]. Across Europe, 80% of households are expected to have smart meters installed by 2020. The penetration rate in Asia is predicted to reach 70% by 2022, driven primarily by China [3].

Unlike old-fashioned meters, from which consumption data has to be read manually, smart meters provide automatic metering of power consumption. Automatic metering usually takes place on an hourly basis, with the possibility of increasing the frequency to every 15 minutes or even higher. Besides consumption data, smart meters can be configured to record other variables such as voltage and current. They are also able to register abnormal events in the grid, such as power outages and earth faults.

Smart meters provide consumers with more correct and precise information about their consumption, almost in real time. This gives consumers a better understanding of their consumption patterns. With the display on a smart meter, users can directly relate their consumption behavior to their power bills.

On a macro level, the deployment of smart meters improves the control and monitoring of the power grid and supports the construction of a well-functioning power market. As power suppliers and grid operators gain deeper knowledge about consumers, more innovations are being brought into the energy sector, in areas such as demand response, grid tariffs and security of supply.

2.2 Time Series Data Mining

Time series data is defined as an ordered sequence of values of a variable at equally spaced time intervals [4]. Clearly, the power consumption data generated by smart meters is time series data. Large size, high dimensionality and continuous updating are described as the features of time series data. Because of these features, time series data has prompted a good deal of research in the field of data mining.

Data mining is the discovery of interesting, unexpected, or valuable structures in large datasets [5]. Time series data mining is a field in which data mining techniques and methods are adapted to the temporal nature of time series data. Time series data is often used in two phases: to gain an understanding of the implicit structure and trend underlying the observations, and to apply monitoring, control and prediction to the data sources.

The main tasks carried out in time series data mining include: 1) clustering, in which data is grouped into homogeneous clusters that are as distinct as possible from each other; 2) classification with predefined labels, by which each series in the dataset is assigned to one of the labels; 3) prediction, in which models are built to forecast the future values of a series; 4) anomaly detection, with the objective of detecting abnormal sub-sequences in a series; 5) pattern discovery, which aims to find sub-sequences that appear recurrently in a long time series.

In real-life applications, time series data is usually generated from sensors or raw observations, and the quality of the data is often below what data mining requires. So before heading to the actual data mining, it is crucial to preprocess the raw dataset. Typical preprocessing activities include data selection, data reorganization, data cleansing, data exploration and data transformation. These activities can vary with the objectives and requirements of the research. Because raw data is rarely clear at first sight, visualization tools are of great help during data preprocessing.

2.3 Machine Learning

Statistical methods like the autoregressive integrated moving average (ARIMA) are often used for time series prediction. In recent decades, machine learning methods, including neural network algorithms, have been proposed and used as alternatives to statistical methods for time series prediction.

Machine learning is an application of artificial intelligence that provides systems the ability to learn and improve from experience automatically, without being explicitly programmed [6]. As a branch of machine learning, deep learning concerns algorithms that are inspired by the structure and function of neural networks in human brains.

Machine learning is mainly classified into two categories: supervised learning and unsupervised learning. In supervised learning, input samples are provided together with labels as their desired output. The objective is to learn patterns so as to assign labels to new unlabeled data.

In unsupervised learning, input samples come without labels, and the algorithm is left to learn commonalities among the input data samples.

To some extent, machine learning methods overlap with data mining methods in time series prediction. Both provide solutions based on experience or historical data. The difference is that data mining emphasizes the discovery of unknown properties of the data, while machine learning focuses on predicting new information by training on existing data.

Chapter 3 Related Works

3.1 Neural Networks for Time Series Data

Neural networks have been widely used as a promising method for time series prediction. Research has moved from comparing neural networks with statistical methods to proposing dedicated neural network models for time series data.

In the paper Neural Network Models for Time Series Forecasts [7], Hill et al. compared time series forecasts produced by neural networks with forecasts from six statistical methods generated in a major forecasting competition. Across monthly and quarterly time series, the neural networks did significantly better than traditional methods. They also found that the neural networks were particularly effective for discontinuous time series.

Oleg Ostashchuk [8] carried out a survey in his master thesis on existing time series forecasting methods, including the ARIMA method, the artificial neural network method and the double exponential smoothing method. Three real-life datasets from different areas were used in the experiments. The experiments demonstrated good results for ANN models and ARIMA models. Both of them are suitable for forecasting time series of various complexity.

In the paper Machine Learning Strategies for Time Series Forecasting by Bontempi et al. [9], an overview of machine learning techniques in time series prediction is presented by focusing on three aspects: the formalization of one-step prediction problems as supervised learning tasks, the discussion of local learning techniques as an effective tool for dealing with temporal data, and the role of the prediction strategy when moving from one-step to multiple-step prediction.

Selvin et al. [10] proposed a deep learning based formalization for stock price prediction. Three deep learning architectures, RNN, LSTM and CNN, were used for the price prediction of NSE listed companies, and their performances were compared. A sliding window approach was applied for predicting future values on a short-term basis. The performance of the models was quantified using percentage error. The results show that the CNN architecture is capable of identifying changes in price trends.

Sharat C Prasad and Piyush Prasad [11] discussed the suitability of Recurrent Neural Networks for time series prediction in their study. They proposed an architecture for Recurrent Neural Networks that grows along three dimensions: the number of units in layers, the number of hidden layers, and the number of time instants over which back-propagation extends. They show analytically that additional hidden layers improve the approximation error rate.

Iffat A. Gheyas and Leslie S. Smith [12] proposed a simple approach for forecasting univariate time series, an ensemble learning technique that combines the advice from several Generalized Regression Neural Networks (GRNN). They compared this GRNN ensemble with the existing algorithms ARIMA-GARCH, MLP, GRNN with a single predictor and GRNN with multiple predictors on forty datasets. The one-step process was iterated to obtain predictions ten steps ahead. The results obtained from the experiments showed that the GRNN ensemble was superior to the existing algorithms.

Liu et al. [13] proposed a hybrid deep learning model for the prediction of wind speed. The model combined the empirical wavelet transformation with two kinds of recurrent neural network, the long short-term memory neural network and the Elman neural network. Although the wind speed signal is stochastic and intermittent, their models give satisfactory performance in high-precision wind speed prediction.

3.2 Neural Networks for Power Consumption

As power consumption data becomes more easily accessible, neural networks have been used in the power sector for different purposes, such as demand prediction, anomaly detection and consumer clustering.

Krzysztof Gajowniczek and Tomasz Zabkowski [14] proposed a data-mining scheme for demand modeling through peak detection, by representing it as a two-stage pattern recognition problem. They utilized a set of machine learning algorithms to benefit from both accurate detection of the peaks and precise forecasts, as applied to the Polish power system. It is concluded that artificial neural networks, through their ability to approximate complex nonlinear functions, as well as their generalization capability, seem to be very effective tools for capturing hidden trends in the load data and delivering stable short-term forecasting.

For household load prediction, Shi et al. [15] proposed a pooling-based deep recurrent neural network (PDRNN), which batches a group of customers’ load profiles into a pool of inputs. This method was implemented on the Tensorflow deep learning platform and tested on 920 smart-metered customers from Ireland. Compared with state-of-the-art techniques in household load forecasting, the proposed method outperforms ARIMA by 19.5%, SVR by 13.1% and classical deep RNN by 6.5% in terms of RMSE.

In the paper Electric Power System Anomaly Detection Using Neural Networks [16], an approach was proposed to monitor and protect an electric power system by learning normal system behaviour. The task was addressed by using auto-associative neural networks that read substation measurements. The results indicated that neural networks are suitable for learning the parameters underlying the system behaviour, and that their output can be processed to detect anomalies due to hijacking of measurements, changes in the power network topology or unexpected power demand trends.

Wang et al. [17] investigated how characteristics of consumers can be acquired from fine-grained smart meter data. CNN was first used to automatically extract features from massive load profiles, and then SVM was used to identify the characteristics. The case study on an Irish dataset indicated the effectiveness of the proposed deep CNN-based method, achieving higher accuracy in identifying the socio-demographic information about the consumers.

Kim et al. [18] proposed a hybrid model to predict power demand for an n-day profile by combining the benefits of LSTMs and CNNs. They preprocessed a dataset by pairing a power demand value, as key, with a context value that incorporated different types of contextual information such as temperature, humidity and season. This produced bivariate sequences that efficiently carry important context information into the training of the hybrid neural networks. The proposed model showed 77% lower prediction error than ARIMA.

3.3 Neural Networks for Small Dataset

Although neural networks are often associated with big data analysis, there are also studies that apply them to smaller datasets.

Feng et al. [19] attempted to predict solidification defects by deep neural network regression with a small dataset that contains 487 data points. The result showed that a pre-trained and fine-tuned deep neural network gave better generalization performance than a shallow neural network, a support vector machine, and a deep neural network trained by conventional methods. They concluded that a deep neural network with small datasets and pre-training can be a reasonable choice when big datasets are unavailable.

In their research Modern Neural Networks Generalize on Small Data Sets [20], Olson et al. applied large neural networks to a collection of 116 real-world data sets from the UCI Machine Learning Repository. The collection of data sets contained a much smaller number of training examples than the types of image classification tasks generally studied in the deep learning literature, as well as non-trivial label noise. It is shown that even in this setting deep neural nets are capable of achieving superior classification accuracy without overfitting.

Croda et al. [21] employed neural networks in a case where a company reported a short time series because of changes in its warehouse structure. The ANN did not capture the time series behavior as fully as it did during the validation phase. But it was pointed out that the prediction error was not greater than 5%, which counts as accurate in sales forecasting. The small dataset also did not increase the convergence time, even when they imposed a high learning rate.

Torgyn Shaikhina and Natalia A. Khovanova [22] developed a framework for the application of regression NNs to medical datasets in order to mitigate the small dataset problem. The limitations of small datasets were overcome in their work by a novel framework that comprised a multiple-runs strategy, to monitor the performance measures collectively across a large set of NNs, and surrogate data analysis for model validation.

Chapter 4 Artificial Neural Network

Artificial neural networks (ANNs) are among the most powerful and popular algorithms in machine learning, and they form the basis of deep learning. In many reported cases, neural networks have outperformed other algorithms in accuracy and speed.

4.1 Basic Components

As its name indicates, ANN was inspired by the neural architecture in human brain, where a neuron is the basic building block. Likewise, a neuron in ANN (often called a node or unit) is the basic unit of computation. The functionality of a neuron in ANN is similar to a human neuron: taking input and firing output.

Figure 4.1: a single neuron

As shown by the example in Figure 4.1, this neuron takes inputs $x_1$ and $x_2$, which are associated with weights $w_1$ and $w_2$ respectively. In addition, there is a bias node with input value 1 and weight $b$. The main purpose of the bias is to provide every node with a trainable constant value.

Whether a neuron should fire or be activated to give a certain output y, is decided by the activation function inside it (as f in Figure 4.1).

Activation functions are very important in ANNs, because they enable the network to learn complicated mappings between the input and the response variables. With the help of activation functions, neural networks are able to learn from complex data such as images, video, audio and speech. Without an activation function, a neural network would just be a simple linear regression model, which has very limited power.

There are many activation functions, including linear, sigmoid, tanh, ReLU and softmax. The choice of activation function depends on the specific task and problem a neural network model is built for.
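As a minimal sketch of the neuron in Figure 4.1, the output can be written as y = f(w1*x1 + w2*x2 + b) and evaluated with different activation functions f. The input values, weights and bias below are made-up numbers chosen only for illustration; they are not taken from the thesis models.

```python
import math

# Activation functions named in the text (plain-Python versions).
def linear(z):
    return z

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical example values for x1, x2, w1, w2 and b.
x = [0.5, -1.0]
w = [0.8, 0.2]
b = 0.1

for name, f in [("linear", linear), ("relu", relu),
                ("sigmoid", sigmoid), ("tanh", math.tanh)]:
    print(name, neuron(x, w, b, f))
```

The same weighted sum gives a different output under each activation function, which is why the choice of function depends on the task.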

Neurons and activation functions together are the basic building blocks of any neural network. An ANN is typically organized into three kinds of layers of neuron nodes: an input layer, one or more hidden layers, and an output layer. A simplified structure of an ANN is shown in Figure 4.2.

Figure 4.2: a simple ANN scheme [23]

4.2 Learning Process

The goal of a neural network is to learn a mapping function from inputs to outputs, which is realized by learning and adapting the weights between connected neurons. The learning happens as an iterative process of an outgoing pass and a return pass. The outgoing pass is the forward propagation of information, and the return pass is the back propagation of information.

• Forward Propagation

In the phase of forward propagation, the training data is sent into the network and passes through the entire neural network. During this pass, each neuron receives information from the neurons in the previous layer, applies its calculation to the information, and then sends the result on to the neurons in the next layer. When the data has passed through all the layers and reached the output layer, a prediction result is calculated for the input instances.

• Back Propagation

Unsurprisingly, the predicted result from the first forward pass has a deviation from the target value. This is due to the fact that the connecting weights between neurons are not tuned yet.

In the phase of back propagation, the error between the predicted value and the target value is first calculated and wrapped in a loss function. The loss is a quantified measure of how bad the error is. Then the loss is fed backward into the network in order to update the weights. That means this pass begins at the output layer, and the loss information is propagated backwards to all the neurons in the hidden layers.

The feed-forward calculation uses the activation functions, and the backward calculation uses the partial derivatives of these activation functions. Each hidden neuron receives only a fraction of the total loss signal, based on how much of the loss in the output that neuron is responsible for. The weights of neurons that are accountable for more of the loss are adjusted more strongly, in the direction that reduces the loss.

Figure 4.3: Forward and Back Propagation

The learning process is iterative in the way that forward and back propagation run in a loop (Figure 4.3). Input data is passed into the network to calculate a prediction, and then the error between the prediction and the target value is sent back into the network to improve its parameters. These two passes will have to repeat many times before a satisfying prediction is obtained.
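The loop of forward and back propagation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: one hidden layer with tanh activation, a linear output neuron, a squared-error loss and plain stochastic gradient descent. The toy data, the number of hidden neurons `H` and the learning rate `lr` are hypothetical choices, not the settings used for the models in this thesis.

```python
import math
import random

random.seed(0)  # reproducible initial weights

# Toy data (assumed for illustration): learn y = x^2 on a few points.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x * x for x in xs]

H = 3      # number of hidden neurons (hypothetical choice)
lr = 0.02  # learning rate (hypothetical choice)
w1 = [random.uniform(-1, 1) for _ in range(H)]  # input -> hidden weights
b1 = [0.0] * H                                  # hidden biases
w2 = [random.uniform(-1, 1) for _ in range(H)]  # hidden -> output weights
b2 = 0.0                                        # output bias

def forward(x):
    """Forward propagation: input -> hidden (tanh) -> output (linear)."""
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    y_hat = sum(w2[j] * h[j] for j in range(H)) + b2
    return h, y_hat

def mean_loss():
    """Mean squared error over the toy dataset."""
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_loss = mean_loss()
for epoch in range(1000):
    for x, y in zip(xs, ys):
        h, y_hat = forward(x)       # forward pass
        d_out = 2.0 * (y_hat - y)   # derivative of the squared error
        for j in range(H):
            # Loss signal reaching hidden neuron j, scaled by the
            # derivative of tanh; computed before w2[j] is updated.
            d_hidden = d_out * w2[j] * (1.0 - h[j] ** 2)
            w2[j] -= lr * d_out * h[j]
            w1[j] -= lr * d_hidden * x
            b1[j] -= lr * d_hidden
        b2 -= lr * d_out
final_loss = mean_loss()

print(initial_loss, final_loss)
```

Each hidden neuron's update is proportional to `d_hidden`, its share of the output error, which corresponds to the "fraction of the total signal of the loss" described above.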

4.3 Algorithm Variants

There are many variants of neural network algorithms, each with its strengths in a particular field of analysis. Three variants of the artificial neural network are chosen for use in this research.

• Multilayer Perceptron

The multilayer perceptron (MLP) neural network is a fully connected feed-forward network. It is the simplest form of ANN, where the input data travels in one direction only. MLP is suitable for classification problems where the input data is labeled. It is also often used for regression, where a real value is predicted from a set of input values. MLP is very flexible and is good at learning a mapping from inputs to outputs. This flexibility allows it to be applied to problems with different types of data, such as text data, image data and time series data.

• Convolutional Neural Network

Convolutional Neural Network (CNN) is a partially connected network. CNN’s main applications have been in signal and image processing. It has a strong ability to develop an internal representation of two dimensional images, where it learns image features by using small squares of input data. More generally, CNN works well with data that implies a spatial relationship.

The traditional two-dimensional input can be replaced by a one-dimensional sequence. This allows CNN to be used on prediction problems for other types of data with an ordered relationship, such as documents of words and time series data.

• Recurrent Neural Network

Recurrent Neural Network (RNN) has feedback loops in its network. It works on the principle of remembering the state of a layer and feeding it back to itself together with the input, in order to predict the outcome for the next state. RNN was designed to work on prediction problems with sequence data, thanks to its ability to model temporal aspects of data. But RNN is traditionally considered difficult to train, and Long Short-Term Memory (LSTM) extends it with an improved memory mechanism. Both RNN and LSTM are widely used for text, video, and time series data, especially in natural language processing.

Chapter 5 Data Preparation

5.1 Data overview

The dataset used in this research is provided by Hafslund Nett, the biggest grid company in Norway. The dataset contains power consumption from 375 households within a certain district in Oslo. Due to privacy concerns, the addresses of these households will not be revealed. Consumption data for each household comes in a separate CSV file. The timespan in the files stretches from June 8, 2018 to July 14, 2018 at one-hour intervals, which means that in theory there are 852 data samples in each file.

Figure 5.1: CSV data file: house 1 as an example

All the files have the same data structure as the sample in Figure 5.1. The first column is the anonymized ID of the smart meter in a household, and here it is used to refer to the number of the household for simplicity. The second column is the timestamp at which the smart meter takes a record, and these records are taken every hour. The third column is the total amount of power a household has consumed up to the current hour.

5.2 Data preprocessing

5.2.1 Missing values

Some values are missing in our dataset of power consumption, which is collected by smart meters. This can be caused by equipment malfunction, communication failure, etc.

Figure 5.2: missing value

A typical situation is that the total consumption data at a certain hour is missing, while the data before and after that hour is available, as shown in Figure 5.2.

Looking at the characteristics of our dataset, the total consumption increases relatively slowly over time, and there are few occurrences of consecutive missing values.

Therefore, the total consumption at a missing time point $m$, $\mathrm{total}(m)$, is filled in with the mean of the total consumption at its previous and next time points:

\[ \mathrm{total}(m) = \frac{\mathrm{total}(m-1) + \mathrm{total}(m+1)}{2} \]
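The neighbour-mean rule can be sketched as follows. This is a minimal illustration, assuming the cumulative readings are held in a Python list with `None` marking a missing hour; the function name `fill_isolated_gaps` and the sample readings are hypothetical. Gaps at the ends of the series or runs of consecutive missing values are left untouched here, since the text does not specify how those cases are handled.

```python
def fill_isolated_gaps(total):
    """Replace each isolated missing value by the mean of its two neighbours."""
    filled = list(total)
    for m in range(1, len(total) - 1):
        prev_value, next_value = total[m - 1], total[m + 1]
        if total[m] is None and prev_value is not None and next_value is not None:
            filled[m] = (prev_value + next_value) / 2
    return filled

# Hypothetical cumulative readings with one missing hour.
readings = [100.0, 101.5, None, 104.5, 106.0]
filled = fill_isolated_gaps(readings)
print(filled)  # the missing reading becomes 103.0
```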

5.2.2 Detrending

A trend is a continuous increase or decrease in the series over time.

Our dataset clearly contains a trend, since total consumption always grows as time goes on (Figure 5.3). A time series is said to be stationary if its statistical properties, such as the mean, do not change over time, so a series with a trend is not stationary. Stationarity is the basic underlying assumption in the practical application of neural network algorithms [24]. Therefore, the trend needs to be removed from the dataset.

Figure 5.3: total consumption trend

The method used in this research to detrend the dataset is differencing, which constructs a new time series of consumption. The consumption value at each time point is replaced by the difference between the total consumption at this time point and the total consumption at its previous time point.

$$\mathit{consumption}(t) = \mathit{total}(t) - \mathit{total}(t-1)$$

In the new data files (Figure 5.4), the value at each time point is the amount of power a household has consumed since its previous time point, rather than the total consumption up to this time point. As no difference value can be calculated for the first time point, the new dataset has one record less.

Figure 5.4: detrended data: house 1 as example
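As a minimal sketch of the differencing step, assuming the cumulative readings of one household are held in a plain list (the function name `difference` and the sample values are illustrative only):

```python
def difference(totals):
    """Turn cumulative readings into per-interval consumption."""
    # consumption(t) = total(t) - total(t-1); the first point has no predecessor.
    return [totals[t] - totals[t - 1] for t in range(1, len(totals))]


totals = [100.0, 101.5, 103.0, 104.5]  # made-up cumulative readings
consumption = difference(totals)
print(consumption, len(totals) - len(consumption))
```

The result is one element shorter than the input, which matches the missing first record described above.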

5.2.3 Visualization

To obtain a more intuitive impression of the dataset and learn more about the consumption patterns in the households, the consumption data of each household is plotted. It is found that the consumption pattern varies a lot from house to house. This can be attributed to the diversity of electrical appliances, the consumers' living habits, usage preferences, daily schedules and so on.

Four typical consumption patterns are presented in Figure 5.5. Many of the houses have a consumption curve similar to that of house 97 and house 33. There is always observable power consumption in these houses, and the consumption fluctuates to different degrees. There are also situations like those in house 09 and house 10, where consumption is constantly low, or where only a few time points show clear consumption. For houses like these, it can be assumed that the house is not inhabited during this period.

(a) house 97 (b) house 33

(c) house 09 (d) house 10

Figure 5.5: typical consuming patterns

There are 375 households in the dataset, and it is not practical to build a prediction model for each of them in this research. Therefore, only households with constantly observable consumption are chosen for building prediction models. House 97 is picked as a representative of these single houses to illustrate model building in the following chapters.


5.2.4 Standardization and Normalization

In neural networks, it is always a good idea to standardize and normalize data before training. As pointed out in [25], data standardization and normalization are crucial to obtain good results as well as to significantly speed up the computation.

Standardization refers to the process of rescaling data features to a given mean value and standard deviation. The z-score is a widely used method for data standardization: the standardized data has mean 0 and standard deviation 1, like a standard Gaussian distribution. The z-score of a data sample x is calculated as:

$$z = \frac{x - \mu}{\sigma}$$

where µ and σ are the mean and the standard deviation of the original data.

Standardization is essential when input data has multiple features with different units. When features are on different scales, certain weights may update quicker than others in the training process. This problem can be solved by standardizing features. In addition, as the numerical condition of the optimization problem is improved, the training process tends to be well-behaved.

Normalization is the process of rescaling input data by shrinking the data range. Min-Max scaling is the most typical normalization method, which constrains the value of data between 0 and 1. Given the maximum value $x_{max}$ and the minimum value $x_{min}$ in the dataset, the normalized value of a data sample x is calculated as:

$$x_n = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Normalization makes the training of a model less sensitive to the scale of input features, with all the values being in the range [0, 1]. Outliers are not removed and remain visible in the normalized data. What's more, normalization may improve the convergence of the learning by limiting the spread of the input values, which makes the optimization problem easier to solve.
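Both rescaling formulas can be illustrated with the Python standard library alone. The sketch below is not the preprocessing code of this research, which relies on Scikit-learn; the function names `z_score` and `min_max` and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    # Note: a constant series has sigma == 0 and cannot be standardized.
    return [(x - mu) / sigma for x in values]


def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]


hourly = [2.0, 4.0, 6.0]  # made-up hourly consumption
print(z_score(hourly))
print(min_max(hourly))
```

In practice the mean, standard deviation, minimum and maximum would be computed on the training data only and then reused for new data, so that no information from the test period leaks into the scaling.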


5.2.5 Data Aggregation

In addition to the prediction of power consumption in a single household, the prospect of prediction for the whole group of houses in our dataset is also interesting in the context of microgrids.

Therefore, another dataset is created by summing up the power consumption of all 375 households in this research. The new dataset contains consumption data for the whole district on an hourly basis in the period from June 8, 2018 to July 14, 2018 (Figure 5.6).

Figure 5.6: data file of collective consumption
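The aggregation amounts to summing the hourly consumption of all households per timestamp. A minimal sketch, assuming each household is represented as a dictionary from timestamp string to consumption (the data layout, the name `aggregate` and the values are hypothetical):

```python
from collections import defaultdict


def aggregate(households):
    """Sum consumption over all households for each timestamp."""
    total = defaultdict(float)
    for series in households.values():
        for timestamp, consumption in series.items():
            total[timestamp] += consumption
    # Sort by timestamp so the result is again a time-ordered series.
    return dict(sorted(total.items()))


households = {
    "house_1": {"2018-06-08 01:00": 0.5, "2018-06-08 02:00": 0.25},
    "house_2": {"2018-06-08 01:00": 1.0, "2018-06-08 02:00": 0.75},
}
print(aggregate(households))
```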

This dataset is also plotted to learn the consumption trend in this group of houses. As Figure 5.7 shows, the collective consumption in the house group is much more regular compared with a single household. There is less random fluctuation in the consumption, and it is also very clear that there are two peak hours every day, one in the morning and the other in the evening.

Figure 5.7: collective consumption pattern

5.3 Dimensionality Reduction

In the dataset we have so far, there are two data sequences: timestamp and power consumption. To predict consumption in the future, each previous consumption value can be seen as a dimension with which future consumption is correlated. In order to construct applicable input and output data for training models, data features are selected and extracted from the dataset.

5.3.1 Feature Selection

The method used in this research to select input and output features for model training is called the sliding window [26]. The sliding window is the standard neural network method of performing time series prediction. It produces sample pairs of input and output, where the input is a vector with a fixed number of values and the output is a single value. The size of the window decides the number of values in the input vector. In order to create as many training samples as possible, the window is slid forward one step at a time to generate a new input and output pair.

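$$
\begin{aligned}
\left[p_1\ p_2\ p_3\ \ldots\ p_n\right] &\rightarrow p_{n+1} \\
\left[p_2\ p_3\ p_4\ \ldots\ p_{n+1}\right] &\rightarrow p_{n+2} \\
&\ \ \vdots \\
\left[p_m\ p_{m+1}\ p_{m+2}\ \ldots\ p_{n+m-1}\right] &\rightarrow p_{n+m}
\end{aligned}
$$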

Given a consumption sequence p and window size n, the input in the first sample pair is the vector of consumption from the 1st time point to the n-th time point, and the output is the consumption at the (n+1)-th time point. Then the window is moved one time point forward to create the second sample pair, where the input is the vector of consumption from the 2nd time point to the (n+1)-th time point, and the output is the consumption at the (n+2)-th time point. The window keeps moving in this way to create more samples until the whole sequence p is covered.

The window size, i.e. the size of the input vector, is a parameter which impacts the prediction performance. It limits the number of past consumption time points to which the time point to be predicted is related. Following the study in [27], the window sizes tested in this research are 12, 24 and 48.
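The construction of sample pairs with a sliding window can be sketched as follows. The function name `sliding_window` and the toy sequence are illustrative assumptions; the window sizes 12, 24 and 48 mentioned above would be passed as `n`.

```python
def sliding_window(p, n):
    """Build (input vector, output value) pairs from sequence p with window size n."""
    # Sample i uses p[i], ..., p[i+n-1] as input and p[i+n] as output.
    return [(p[i:i + n], p[i + n]) for i in range(len(p) - n)]


sequence = [1, 2, 3, 4, 5]
for inputs, output in sliding_window(sequence, 3):
    print(inputs, "->", output)
```

A sequence of length N yields N - n sample pairs, so a larger window leaves slightly fewer samples for training.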

5.3.2 Feature Extraction

Another data sequence we have in the dataset is the timestamp at which each consumption record is taken. In the paper Factor Affecting Short Term Load Forecasting [28], it is stated that time is the most important factor in short-term load forecasting, and that the time factors include hour of the day, day of the week, week of the month, and month of the season.

Since the dataset in this research has a short timespan, the hour of the day and the day of the week are chosen to be explicitly extracted from the timestamp. After the extraction, the input contains three sequences: power consumption, day of the week and hour of the day, as in Figure 5.8.

Figure 5.8: extracted time features

Prediction models will be built by taking both univariate and multivariate input sequences. The performance of the two models will be compared to see whether explicitly extracted time features contribute to better prediction performance.

The sliding window is used here as well to construct sample pairs of input and output. In contrast to using only previous consumption, the new input at each time point contains a vector of values from the power consumption sequence p, the weekday sequence w and the hour sequence h.

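$$
\begin{aligned}
\left[\,[p_1\ w_1\ h_1]\ [p_2\ w_2\ h_2]\ \ldots\ [p_n\ w_n\ h_n]\,\right] &\rightarrow p_{n+1} \\
\left[\,[p_2\ w_2\ h_2]\ [p_3\ w_3\ h_3]\ \ldots\ [p_{n+1}\ w_{n+1}\ h_{n+1}]\,\right] &\rightarrow p_{n+2} \\
&\ \ \vdots \\
\left[\,[p_m\ w_m\ h_m]\ [p_{m+1}\ w_{m+1}\ h_{m+1}]\ \ldots\ [p_{n+m-1}\ w_{n+m-1}\ h_{n+m-1}]\,\right] &\rightarrow p_{n+m}
\end{aligned}
$$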
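A possible way to extract the two time features and build the multivariate sample pairs is sketched below with the standard `datetime` module. The weekday encoding (Monday = 0, as returned by `datetime.weekday()`), the function names and the sample values are assumptions; the encoding actually used in this research may differ.

```python
from datetime import datetime, timedelta


def time_features(timestamp):
    """Return (day of the week, hour of the day); Monday is 0."""
    return timestamp.weekday(), timestamp.hour


def multivariate_windows(timestamps, p, n):
    """Build ([[p, w, h], ...], next consumption) pairs with window size n."""
    rows = [[value, *time_features(ts)] for ts, value in zip(timestamps, p)]
    return [(rows[i:i + n], p[i + n]) for i in range(len(rows) - n)]


start = datetime(2018, 6, 8, 0)  # first day of the dataset, a Friday
timestamps = [start + timedelta(hours=h) for h in range(4)]
consumption = [0.5, 0.7, 0.6, 0.9]  # made-up hourly consumption
for inputs, output in multivariate_windows(timestamps, consumption, 2):
    print(inputs, "->", output)
```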


Chapter 6 Modelling Tools and Techniques

6.1 Programming Tools

6.1.1 Python

Python [29] is the programming language used in this research. Python is a simple language with high-level data structures, which contribute to its readability and low complexity.

Python is one of the most popular programming languages in the field of machine learning. The essence of machine learning is to recognize patterns in a dataset, and Python is very well suited to data processing. The raw dataset for analysis is usually large, incomplete and unstructured, and a large portion of the time used in machine learning goes to data processing before the actual model training. Therefore, using a powerful tool for data processing is extremely important.

The key for Python to tackle this heavy work is its packages. There is a very large number of Python packages, which can be easily used and extended.

6.1.2 Python Packages

The following Python packages and libraries are used in this research for data processing.

Keras [30] is one of the most powerful and easy-to-use libraries in Python for developing and evaluating deep learning models. It runs on top of TensorFlow, Theano and CNTK, and provides a clean and convenient way to create deep learning models. Keras is the core tool in this research, and all the models are built by using Keras.

Scikit-learn [31] is also a popular library in Python aiming at machine learning. It is a powerful tool for data analysis and data mining. In addition, it offers an extensive range of built-in machine learning algorithms. Scikit-learn is not mainly used to build models in this research, but it is used to pre-process data, like standardization and normalization, as well as result evaluation.

Pandas [32] is another often used library in Python for data science. Data structures provided by Pandas are flexible and expressive, and the main Pandas object used in this research is DataFrame. DataFrame is a two-dimensional labeled data structure, with columns of different types. A DataFrame can be created directly by reading in CSV files, which is the type of source files in this research. Pandas DataFrame makes manipulating data simple and easy, from selecting or replacing indices to reshaping the whole data structure.

NumPy [33] is a library consisting of multidimensional array objects and a collection of routines for processing these arrays. NumPy arrays are much faster and more convenient than Python's built-in lists. When dealing with homogeneous arrays of numerical data, mathematical and logical operations can be performed directly on whole NumPy arrays instead of on each single element.

Matplotlib [34] is a library for data visualization in Python. Visualization is a very important part of both working with data and presenting data. Matplotlib is used in this research to obtain a visual interpretation of the dataset, which contributes to finding data patterns and preparing for building models. Moreover, it is used to present the predicted results and the models' performance.

6.2 Validation Technique

The main objective of machine learning is to find a computational model with high generalization ability. A model is fitted on the input data samples during the training process. After that, it is usual to test the model’s ability of prediction on data that is not seen in the training process, which is known as generalization.

A fitted model's generalization ability is not always satisfactory. A poor prediction result on test data can be classified as underfitting or overfitting. Underfitting means that the trained model does not capture the underlying trend in the data. This can be put down to a model that is too simple, too few features being used and so on. Overfitting, on the other hand, means that the model is fitted so closely to the training data that it does not apply to new data. When a model learns too much of the noise and random fluctuation in the training data, its ability to generalize is harmed. The main reasons for overfitting can be a model that is too complex, too much noisy data, or limited training data.

Figure 6.1: Underfitting and Overfitting [35]

6.2.1 Limitation of Cross-Validation

One common technique used to estimate the learning ability of a model is cross-validation. Cross-validation divides a dataset into two parts: a training set and a validation set. The training set is used to fit the model, while the validation set is used to evaluate the model's performance. This evaluation on the validation set is considered an indication of the model's performance on unseen data.

K-fold cross-validation is the best-known cross-validation variant. The whole sample set is randomly partitioned into K subsets of equal size, and each of these subsets is used in turn as the validation set while the others are used as the training set.

Figure 6.2: K-fold Cross-Validation [36]

However, the randomness of partitioning in cross-validation raises a problem when it is applied to time series data. Time series data is a temporal sequence, and the aim of this research is to use past observations to predict values in the future. When the sample dataset is partitioned randomly, samples in the validation set may be strongly correlated with some of the samples in the training set. This happens because the model may have already seen the data in a validation set during a previous iteration, where the same data appeared in a training set. The indication of performance is then no longer reliable.

6.2.2 Walk Forward Validation

In order to avoid the randomness of partitioning, there are different methods that can be used to back-test trained models, such as train-validation split with respect to temporal order in the observations.

Walk forward validation is chosen for this research on account of our dataset and the intended application. Our goal is to predict power consumption 24 hours into the future, and the prediction model is designed to output one value at each step. Power consumption data is updated in near real time and becomes available as soon as it is measured. This gives the trained model the best opportunity to make a good prediction at each time step by utilizing the updated input.

Walk forward validation solves the problem of randomness in the normal train-test split. The sample set is again split into a training set and a validation set, but the split respects the temporal order and the validation set contains only one data instance. A model is fitted on the training set and makes a prediction on the validation data. For the next iteration, the measured value in the previous validation set becomes available and is appended to the training set. The new training set is used to fit the model and to make a prediction on the next validation set. This process continues until the last validation data is reached.

Figure 6.3: Walk Forward Validation

When dealing with sophisticated problems, it is more accurate to train the model again each time new data is added to the training dataset. But this comes at a growing computational cost, since the number of models being created increases as well. In this research, we are more interested in a model's ability in multi-step prediction. Therefore, the model used for prediction remains the same at each step, although the input data is updated.
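The variant described here, where the model stays fixed and only the input is updated with measured values, can be sketched as follows. The function `predict` stands in for a trained network; the persistence rule used in the example (predict the last observed value) is only a placeholder assumption, as are the names and numbers.

```python
def walk_forward(history, test, window, predict):
    """Predict each test value in turn, then reveal the measured value."""
    history = list(history)
    predictions = []
    for actual in test:
        # Predict one step ahead from the most recent `window` observations.
        predictions.append(predict(history[-window:]))
        # The measured value becomes available and extends the input history.
        history.append(actual)
    return predictions


def persistence(window_values):
    """Placeholder model: the next value equals the last observed value."""
    return window_values[-1]


train = [1.0, 2.0, 3.0]
test = [4.0, 5.0, 6.0]
print(walk_forward(train, test, window=2, predict=persistence))
```

Retraining at every step would correspond to refitting the model inside the loop after each `history.append(actual)`, which is the more expensive alternative mentioned above.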

6.3 Evaluation Metrics

A model's learning ability needs to be evaluated, and the choice of evaluation metrics decides how the performance of a model is measured and compared. There is a wide range of evaluation metrics for different types of problems. Given that this research deals with a regression problem in the domain of power consumption, the following metrics are chosen.

6.3.1 Root Mean Squared Error

Root Mean Squared Error (RMSE) is one of the most popular evaluation metrics for regression problems. It is the standard deviation of the residuals, i.e. the prediction errors. For each prediction, the squared difference between the predicted value and the target value is calculated, and the square root is then taken of the mean of these values:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_{p,i} - y_{t,i}\right)^2}$$

Here $y_{p,i}$ is the i-th predicted value, $y_{t,i}$ is the corresponding target value, and n is the number of predictions.

RMSE is a negatively oriented measurement, which means lower values are better. It is easy to understand and practical. RMSE is used especially in problems where large errors are particularly undesirable, because it is sensitive to large errors and penalizes them harder than smaller ones.

However, the output of RMSE is just the value of an error function. It does not provide a direct indication of how good or bad a result is. Therefore, RMSE is often used together with other measurements to strengthen the evaluation.
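The RMSE calculation can be written out directly with the standard library. The function name `rmse` and the sample values are illustrative; the evaluation in this research is done with Scikit-learn.

```python
import math


def rmse(predicted, target):
    """Root mean squared error between predicted and target values."""
    n = len(target)
    squared_errors = [(yp - yt) ** 2 for yp, yt in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / n)


# One large error dominates the result because errors are squared.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))
```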

6.3.2 R²

R², also called the coefficient of determination, is a goodness-of-fit measure for regression models. It measures how well a target model explains the dependent variable, relative to a simple model that is used as a baseline.

$$R^2 = 1 - \frac{MSE(model)}{MSE(baseline)}$$

R² is very closely related to the mean squared error (MSE). The MSE of the baseline is the error produced by the simplest possible model, which always predicts the average value $\bar{y}$ of all samples:

$$MSE(baseline) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2$$

The output of R² is scale-free, in the sense that its value is always between −∞ and 1, no matter the scale of the prediction. An R² value close to 1 indicates a target model with small errors, while an R² value close to 0 means that the target model performs about as well as the baseline model. R² can also be negative, in which case the target model is doing worse than predicting the mean value of the samples.

R² thus compares the error of the target model with the error of the naive mean model, and it gives an indication of how well the target model performs.
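The three cases described above (perfect model, baseline-like model, model worse than the baseline) can be reproduced with a small sketch. The function names `mse` and `r_squared` and the toy values are assumptions for illustration.

```python
def mse(predicted, target):
    """Mean squared error between predicted and target values."""
    return sum((yp - yt) ** 2 for yp, yt in zip(predicted, target)) / len(target)


def r_squared(predicted, target):
    """R^2 = 1 - MSE(model) / MSE(baseline), baseline = mean of the targets."""
    mean_target = sum(target) / len(target)
    baseline = [mean_target] * len(target)
    return 1.0 - mse(predicted, target) / mse(baseline, target)


target = [1.0, 2.0, 3.0]
print(r_squared([1.0, 2.0, 3.0], target))  # perfect prediction
print(r_squared([2.0, 2.0, 2.0], target))  # same as predicting the mean
print(r_squared([3.0, 2.0, 1.0], target))  # worse than the mean
```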


6.3.3 Demand Peaks

Demand peak, or peak load, is used to describe the time point when power consumption is significantly higher than the average consumption level within a given period. As shown in the figures of consumption patterns in Section 5.2, there are usually two peak hours of power consumption in a daily cycle. Load peaks are a major challenge for power operators, since the power grid is required to support any level of load in order to avoid power outages and grid damage. On the micro level, power consumers are also affected by their own demand peaks, for example because the power price may become higher during peak hours.

It is generally difficult to predict demand peaks due to their random occurrence, and such research is usually conducted with an analysis of user behavior and demand response [37]. Accurately predicting peak hours and peak demand is not the major focus of this research, but a qualitative comparison will be made between the predicted values at peak hours and the target values at peak hours.

In order to define peak hours, the method proposed by Jie Liu and Enrico Zio [38] is used. A window with certain size is defined around the peaks in the target data sequence, and the time-slots within the window range can all be seen as peak hours. Both the occurrence of peak hours and the values of peak load will be taken into comparison.

6.4 Hyperparameter Tuning

There exists a number of parameters when building neural network models. There are parameters concerning the structure of the models, such as the number of layers, and the number of hidden nodes; there are also parameters regarding the learning process, such as learning rate, dropout rate and so on. Optimizing hyperparameters is considered to be the trickiest part in building neural network models. It is seldom that the initial parameters produce the best result. To find the optimized hyperparameters is an iterative process, and it can be both time consuming and resource consuming.

There are several techniques for hyperparameter tuning, and the most often used ones include manual search, grid search and random search.


• Manual Search

Manual search is straightforward and easy to understand: hyperparameters are selected and tuned by hand. It is extremely laborious, especially for complex models. Therefore, manual search is usually used together with other search techniques.

• Grid Search

In grid search, a range of values is defined for each parameter to be tuned, and all the combinations of values from these ranges are tried. By doing this, it is guaranteed that the best combination among the predefined values is found. However, this strategy is not feasible for models with many parameters to be tuned. The more parameters there are and the wider their ranges are, the faster the number of combinations, and with it the search time, explodes. Usually, grid search is not preferred when there are more than 4 dimensions.

• Random Search

To solve the problem of time complexity in grid search, random search is proposed as its alternative [39]. The construction of random search is similar to grid search, with predefined ranges for parameters. But instead of trying all possible combinations from value ranges, combinations are generated randomly to find the best parameters for the models.

Figure 6.4: Grid Search vs Random Search

Despite its randomness, random search has been shown to often produce better results than grid search for the same computational budget. Since values are chosen at random for each combination, it is highly likely that the whole search space is covered within a short time. Achieving the same coverage with grid search would take much longer.

A combination of manual tuning and random search will be applied in this research. The range for parameters to be tuned will be decided by manual tuning with domain concern. And random search is used to find the optimized hyperparameters from the defined ranges.
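The difference between the two automated strategies can be sketched with a toy search space. Everything in the example is hypothetical: the parameter names, the candidate values and the function `toy_score`, which stands in for training a model and measuring its validation performance (higher is better here).

```python
import itertools
import random


def grid_search(space, score):
    """Try every combination of the predefined values."""
    names = list(space)
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*space.values())]
    return max(candidates, key=score)


def random_search(space, score, n_trials, seed=0):
    """Try n_trials randomly drawn combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in space.items()}
                  for _ in range(n_trials)]
    return max(candidates, key=score)


def toy_score(params):
    """Stand-in for model evaluation; peaks at 50 nodes and dropout 0.2."""
    return -abs(params["hidden_nodes"] - 50) - 100 * abs(params["dropout"] - 0.2)


space = {"hidden_nodes": [10, 50, 100], "dropout": [0.0, 0.2, 0.5]}
print(grid_search(space, toy_score))             # evaluates all 9 combinations
print(random_search(space, toy_score, n_trials=4))  # evaluates only 4
```

Grid search evaluates every one of the 9 combinations, while random search is limited to the given number of trials and may therefore miss the best combination; the trade-off becomes favorable for random search as the number of parameters grows.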


Chapter 7 Multilayer Perceptron Model

The Multilayer Perceptron (MLP) neural network is arguably the most basic and simplest form of neural network. The general structure and the essential components of an MLP are shared by other types of neural networks. Therefore, MLP is chosen to build the first prediction model in this research.

7.1 Architecture in MLP

Generally speaking, an MLP network is composed of three types of layers: an input layer, one or more hidden layers, and an output layer.

Figure 7.1: MLP architecture [40]

The input layer is the first layer in an MLP model. The number of nodes in this layer is determined by the training data, and it equals the number of elements in the input vector. The input layer distributes all the elements in the input vector to all the nodes in the next layer.

In neural networks, such a fully connected layer is called a dense layer. It means that all the neuron nodes in a layer are connected to all the neuron nodes in the following layer. An MLP neural network consists exclusively of dense layers, that is to say, every layer is fully connected to its next layer.

The output layer is the last layer in an MLP model, where a classification or prediction decision is made based on the input data. The number of nodes in an output layer equals the number of results desired from the model. When the model is a classifier, the output layer may have one single node which returns a class label, or it may have one node for each class label. When the model is a regressor, there is usually just one single node in the output layer, which returns a value.

In between the input layer and the output layer, there are hidden layers. Hidden layers are the core of a neural network, and they work as the computational engine of a model. As described in Section 4.2, the goal of a neural network is to learn a function that maps input data to the target output. The learning process happens in these hidden layers: each neuron node takes in a set of weighted inputs and produces an output through an activation function. These outputs are in turn used as inputs to the next hidden layer or the output layer for further computation.

The number of hidden layers in a neural network and the number of neuron nodes in each hidden layer are parameters that need to be set when defining an MLP model, and the optimal numbers always depend on the particular problem. The typical MLP architecture in Figure 7.1 has two hidden layers, but a network with only one hidden layer is also considered to be an MLP. Usually, the more hidden layers there are and the more nodes they contain, the more complex a model can be trained to learn complex problems. However, models with more hidden layers are not always better: more complex models often take longer to train, and they overfit more easily. According to the universal approximation theorem [41], an MLP with a single hidden layer containing a finite number of neurons is capable of approximating any continuous function on a compact domain to arbitrary accuracy.
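To make the role of the dense layers concrete, the forward pass of an MLP with one hidden layer can be sketched in plain Python. The weights, biases and function names are made up for illustration; in this research the layers are provided by Keras and the weights are learned during training.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every node sees every input."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input -> sigmoid hidden layer -> single linear output node."""
    hidden = dense(inputs, hidden_w, hidden_b, sigmoid)
    return dense(hidden, out_w, out_b, lambda v: v)[0]


# Two inputs, two hidden nodes, one output node (all weights are made up).
prediction = mlp_forward(
    inputs=[0.3, 0.7],
    hidden_w=[[0.5, -0.2], [0.1, 0.4]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.0, -1.0]],
    out_b=[0.2],
)
print(prediction)
```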

7.2 Model Building

Neural network models can be built in Keras by using the Sequential class, which is a linear stack of layers. A common way to do this is to first create a Sequential model (Code snippet 7.1) and then add the desired layers in the order of the computation.

    from keras.models import Sequential

    model = Sequential()
    model.add(...)
    ...
    model.add(...)
    model.compile(...)
    model.fit(...)

Code snippet 7.1: sequential model

7.2.1 Hidden Layers and Nodes

The building of a MLP model begins with deciding the number of hidden layers. It is suggested by Gaurang et al.[42] that one hidden layer is sufficient for nearly all problems. Since the training dataset in our research is relatively small, our model is initialized with one hidden layer.

The next step is to decide the number of neuron nodes in the hidden layer. Both too many and too few hidden nodes will lead to unstable output. Jinchuan and Xinzhe [43] proposed a formula for calculating the number of hidden nodes, which was tested on 40 cases:

$$N_h = \frac{N_{in} + \sqrt{N_p}}{L}$$

$N_h$ is the number of nodes in the hidden layer, $N_{in}$ is the number of input neurons, $N_p$ is the number of input samples, and $L$ is the number of hidden layers.

By using this formula to initialize our model, the number of hidden nodes is about 50 when the univariate sequence (consumption) is used as input, and about 100 when all three sequences (consumption, weekday and hour) are used as input.
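As a worked example of the formula, the sketch below assumes a window size of 24 and roughly 800 training samples. Both numbers are illustrative assumptions chosen to be consistent with the dataset size described in Chapter 5, not values reported in this research.

```python
import math


def hidden_nodes(n_in, n_samples, n_layers=1):
    """N_h = (N_in + sqrt(N_p)) / L."""
    return (n_in + math.sqrt(n_samples)) / n_layers


window = 24        # assumed window size
n_samples = 800    # assumed number of training samples

# Univariate input: one value per time point in the window.
print(round(hidden_nodes(window, n_samples)))
# Multivariate input: consumption, weekday and hour per time point.
print(round(hidden_nodes(3 * window, n_samples)))
```

Under these assumptions the formula gives values close to 50 and 100, in line with the initial numbers of hidden nodes stated above.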

7.2.2 Activation Function

After defining hidden layer and hidden nodes in MLP, it comes to the activation function for neuron nodes in both the hidden layer and the output layer.

Many different activation functions are used in neural network models. In this research, the following functions are chosen and tested.

• Linear Function

Figure 7.2: Linear Function

As shown in Figure 7.2, the linear function is a straight line. Its equation is f(x) = x, so the output of this function is proportional to the input.

As the goal of activation functions in the hidden layers is to introduce nonlinearity into the model, linear function appears less in the hidden layers. However, with the characteristic of unconfined output range (-∞, +∞), linear function is often used in the output layer to loosen up the range of output values.

• ReLU Function

The ReLU function has the equation f(x) = max(0, x). At first glance at Figure 7.3, the ReLU function seems similar to the linear function, but it is nonlinear in nature. The output range of ReLU is [0, +∞): the output is zero when the input is less than zero, and it equals the input when the input is greater than or equal to zero.

Figure 7.3: ReLU Function

• Sigmoid Function

Figure 7.4: Sigmoid Function

Sigmoid is one of the most used activation functions in neural networks. It is calculated by the equation $f(x) = \frac{1}{1 + e^{-x}}$, and this function gives an S-shaped curve. It maps any input to an output between 0 and 1, which can be interpreted as a probability.

• Tanh Function

Figure 7.5: Tanh Function

The mathematical equation of the Tanh function is $f(x) = \frac{2}{1 + e^{-2x}} - 1$. The Tanh function is a scaled Sigmoid function, which can be expressed as $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$.

Like the Sigmoid function, the Tanh function gives an S-shaped curve (Figure 7.5), but its output range is (−1, 1) rather than (0, 1). The difference in the Tanh graph is that negative inputs are mapped to strongly negative outputs and inputs near zero are mapped to outputs near zero.

In the initial model, the Sigmoid function is used in the hidden layer, and the linear function is used in the output layer.
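The four activation functions discussed in this section can be written down directly from their equations. The sketch below uses only the standard library and also checks numerically that Tanh is a scaled Sigmoid; the function names are illustrative.

```python
import math


def linear(x):
    return x


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


for x in (-2.0, 0.0, 2.0):
    print(x, linear(x), relu(x), sigmoid(x), tanh(x))

# tanh(x) = 2 * sigmoid(2x) - 1, up to floating point rounding.
print(abs(tanh(0.7) - (2.0 * sigmoid(1.4) - 1.0)) < 1e-12)
```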

7.2.3 Regularization

Neural networks tend to overfit on small data sets quickly, which will consequently lead to bad generalization and unsatisfying prediction. To tackle this problem, the strategy of regularization is used in this research.

Regularization is a method to tune and adjust the complexity level of a neural network model so that the model gets better at generalization. Put in other words, regularization suppresses slightly the learning of a more complex model in order to avoid the risk of overfitting.

There are different techniques for implementing regularization, such as lasso regularization, ridge regularization and dropout. Among them, dropout is probably the simplest and most effective one. As shown in Figure 7.6, dropout ignores randomly selected neuron nodes during the training process. These nodes are dropped out at random in the sense that they are not taken into consideration during a particular forward pass or backward pass. By doing this, the model becomes less sensitive to individual nodes, and it is forced to learn more robust features with reduced capacity.

Figure 7.6: Dropout [44]
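The effect of dropout on the outputs of one layer during training can be sketched as follows. The sketch uses the so-called inverted dropout variant, in which the surviving activations are scaled by 1/(1 − rate) so that their expected sum stays unchanged; the function name and the values are illustrative, and at prediction time no nodes are dropped.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden_output = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
print(dropout(hidden_output, rate=0.5, rng=rng))
```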
