
7.5 Temporal Convolutional Neural Networks

A temporal convolutional network (TCN) is a variant of convolutional neural networks, first presented by Lea et al. (2017), that combines the main concepts of CNNs and RNNs into one model: TCNs are able not only to identify and learn meaningful repeating patterns within the input data, as CNNs do, but also to capture temporal dependencies within the sequence and map input sequences to output sequences of equal length, as RNNs do. TCNs are mainly used for sequential data, where the convolutional kernel is designed in such a way that important local patterns within the sequence are extracted and learned. CNNs are, as stated in Section 7.4, mostly used for input data in the form of arrays, e.g., in image analysis, and the convolutions are therefore spatial. The convolutions in TCNs are, in contrast, computed across time, oftentimes with a causal constraint and increasing dilation factors, both explained in Section 7.5.1 below. There is no implicit requirement for the convolutions in a TCN to be causal, but non-causal convolutions make TCNs unsuitable for real-time applications.
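As a concrete illustration of how a convolution computed across time can be made causal, the sketch below pads the input sequence on the left only, so that no future time steps influence the output at time t; the class name CausalConv1d and the layer sizes are illustrative choices of this text, not taken from the cited works.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution made causal by padding only on the left,
    so the output at time t never sees inputs after time t."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Left padding needed to keep the output length equal to the
        # input length while remaining causal.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))  # pad the time axis on the left only
        return self.conv(x)

# A univariate sequence of length 100: the output length is also 100.
x = torch.randn(1, 1, 100)
y = CausalConv1d(1, 8, kernel_size=3, dilation=2)(x)
print(y.shape)  # torch.Size([1, 8, 100])
```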

The receptive field of networks containing convolutional layers increases with dilation factor, kernel size, and network depth, and the use of increasingly dilated causal convolutions results in large receptive fields without the need for very deep networks and large filters. In time series forecasting, large receptive fields are desired, since only information extracted from observations contained within the network's receptive field is considered when constructing the predictions. To aid network training as the size and depth of TCNs increase, residual blocks are commonly utilized, taking the place of the ordinary convolutional layers and containing at least one convolutional layer.
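A minimal sketch of such a residual block follows, assuming the common design of two dilated causal convolutions with a 1×1 convolution on the skip path when the channel counts differ; normalization and dropout, which full TCN implementations typically add, are omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified TCN residual block: two dilated causal convolutions
    with ReLU activations, plus a 1x1 convolution on the skip path
    when the channel counts differ."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding for causality
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                                # x: (batch, channels, time)
        h = F.relu(self.conv1(F.pad(x, (self.pad, 0))))
        h = F.relu(self.conv2(F.pad(h, (self.pad, 0))))
        return F.relu(h + self.skip(x))                  # residual connection

block = ResidualBlock(in_ch=1, out_ch=16, kernel_size=3, dilation=4)
print(block(torch.randn(1, 1, 64)).shape)                # torch.Size([1, 16, 64])
```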

The characteristic pooling layers in CNNs are optional in TCNs, since the use of dilated convolutional kernels achieves a similar effect to the pooling operations: increasing dilation factors result in the convolutional layers operating on a smaller number of time steps, sampled at time intervals that increase throughout the layers, ensuring that important fine details and repetitive patterns are captured while the number of considered time steps is kept moderately small (Lea et al., 2017). Additionally, increasing the dilation factor with each layer increases the receptive field of the network exponentially, while the number of trainable parameters grows linearly. Dilated convolutions, together with the nature of convolutional layers, i.e., local connectivity and parameter sharing, result in temporal convolutional networks having a relatively small number of trainable network parameters, which is favorable.
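The exponential-versus-linear claim can be made concrete with a small calculation: for a stack of dilated causal convolutions with kernel size K and per-layer dilations d_i, one output time step sees 1 + (K−1)·Σ d_i input time steps. A short sketch, assuming kernel size 3 and dilations doubling per layer:

```python
def receptive_field(kernel_size, dilations):
    """Number of input time steps visible to one output time step
    of a stack of dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

# With the dilation doubling per layer, the receptive field grows exponentially
# with depth, while the number of kernel weights grows only linearly.
for depth in range(1, 7):
    dilations = [2 ** i for i in range(depth)]   # 1, 2, 4, ...
    print(depth, receptive_field(3, dilations))
# At depth 6 the receptive field already covers 1 + 2 * 63 = 127 time steps.
```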

From the information presented above, it is clear that TCNs are well suited for the problem of time series forecasting. However, notable disadvantages exist (Bai, Kolter, & Koltun, 2018). Contrary to RNNs, where each internal state represents a "summary" of the entire preceding history of the input sequence in the form of previous internal states, TCNs have access only to the information contained within their receptive field. For a TCN to include more observations within its receptive field, the size of the window that covers the historical observations must increase, either by increasing the kernel size, the dilation factors, or the number of convolutional layers. The choice of kernel size and dilation factors is generally highly dependent on the length of the input sequence and the variable being predicted, and these are additional hyperparameters that must be tuned specifically for the problem at hand.

7.5.1 Dilated Causal Convolutions

A convolution is said to be causal if the output at time t only depends on inputs from the current and previous time steps t, t−1, t−2, . . ., meaning that no information from future time steps is available when computing the internal representations. Dilated convolutions are convolutions where the kernel is designed in such a way that it has a field of view larger than its actual size. A dilated kernel increases its field of view by skipping a specified number of steps between each input value it considers. For example, a dilated convolution kernel with size 2 and a dilation factor of 4 will only include input values at every fourth step, and is analogous to a kernel with size 5 where the weights for the steps between the first and the last entry are zero. A kernel with a dilation factor of 1 is equivalent to an ordinary convolution. Figure 7 illustrates causal and non-causal dilated convolutions.
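The equivalence in the example above can be checked numerically; a small sketch, with arbitrary kernel weights and input values:

```python
import numpy as np

x = np.arange(12, dtype=float)       # arbitrary input sequence
w = np.array([0.5, -1.0])            # kernel of size 2, used with dilation factor 4

t = 10
dilated = w[0] * x[t] + w[1] * x[t - 4]           # dilated kernel: looks at t and t - 4

w_dense = np.array([w[0], 0.0, 0.0, 0.0, w[1]])   # equivalent size-5 kernel with zeros
dense = sum(w_dense[k] * x[t - k] for k in range(5))

print(dilated, dense)                 # both give the same value
print(np.isclose(dilated, dense))     # True
```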

For a univariate time series, the output s(t) at time t of a causal convolution with a kernel of size K and dilation factor d can be expressed as (Chen et al., 2020):

$$ s(t) = (x \ast_d w)(t) = \sum_{k=0}^{K-1} w(k)\, x(t - d \cdot k) \tag{18} $$

where ∗_d denotes the dilated convolution operator with dilation factor d, w is the convolution kernel, and x is the input time series.
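Equation (18) translates directly into code; a minimal sketch in which the variable names mirror the symbols of the equation (note that, without padding, the output only covers the time steps where the full dilated kernel fits inside the sequence, whereas a TCN layer would pad on the left to preserve the input length):

```python
import numpy as np

def dilated_causal_conv(x, w, d):
    """s(t) = sum_{k=0}^{K-1} w(k) * x(t - d*k), computed for every t
    where the full dilated kernel fits inside the sequence."""
    K = len(w)
    start = d * (K - 1)                       # first t with enough history
    return np.array([sum(w[k] * x[t - d * k] for k in range(K))
                     for t in range(start, len(x))])

x = np.arange(16, dtype=float)                # univariate time series
w = np.array([0.25, 0.5, 0.25])               # kernel of size K = 3
print(dilated_causal_conv(x, w, d=2))         # dilation factor d = 2
```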

Dilated convolutions differ from strided⁶ convolutions, since the input resolution is preserved throughout the convolution, meaning that the input and output dimensions are equal. Stacking dilated convolutions significantly increases the receptive field of a network with a small number of layers (Oord et al., 2016), enabling shallow networks to have very large receptive fields while still preserving the input resolution. In time series forecasting, the use of dilated convolutions in the convolutional layers of TCNs reduces the volume of considered input data for each time step, while still including a wide selection of historical observations when predicting future values of the time series. Stacking multiple dilated convolutional layers ensures that only important patterns within the input are extracted and unwanted noise is discarded. However, some fine details and fast dynamics within the time series might be overlooked.
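To make the preserved input resolution concrete, the sketch below stacks dilated causal convolutions with dilations doubling per layer and checks that the output has the same temporal length as the input; the channel width and depth are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedStack(nn.Module):
    """Stack of dilated causal convolutions with the dilation doubling per layer.
    Left-only padding keeps the temporal resolution identical to the input."""
    def __init__(self, channels, kernel_size, num_layers):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilations = [2 ** i for i in range(num_layers)]
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d)
            for d in self.dilations)

    def forward(self, x):                         # x: (batch, channels, time)
        for conv, d in zip(self.convs, self.dilations):
            pad = (self.kernel_size - 1) * d      # causal (left-only) padding
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

x = torch.randn(1, 4, 200)
y = DilatedStack(channels=4, kernel_size=3, num_layers=6)(x)
print(x.shape[-1], y.shape[-1])                   # 200 200 -- resolution preserved
```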

⁶ Stride is the distance between spatial locations where the convolution kernel is applied (Gonzalez & Woods, 2018).

Figure 7: Illustration of dilated causal convolutions. (a) Causal convolution. (b) Dilated causal convolution. (c) Dilated non-causal convolution.