
2.5 Sequence Learning

In document Backpropagating to the Future (pages 40-45)

Modelling data sequences is a classical problem in statistics and machine learning. There exist various methods within machine learning that model sequences of data. Recurrent neural networks (RNNs) are one example of such methods, used for time series regression and classification.

2.5.1 Recurrent Neural Networks

While deep feed-forward networks may be considered universal function approximators, recurrent neural networks are universal approximators of dynamical systems (Schäfer & Zimmermann, 2006). RNNs differ from fully connected networks in that they share parameters across different parts of the model, a property similar to the one found in CNNs (section 2.3.2). This sharing of parameters allows an RNN to learn from and generalise across sequences of arbitrary lengths (Goodfellow et al., 2016, p. 363).

Consider, for example, a traffic environment, in which the signs, regulations, and structuring of lanes are approximately the same across various parts of a sequence. A model trying to learn such an environment may benefit from sharing a limited set of parameters that contain information about these general rules, applicable to all steps in the sequence. Equation 2.16 represents a simple recurrent function

x_{t+1} = f_θ(x_t)    (2.16)

where f_θ(x_t) is a function of an input x at step t, with a set of parameters θ shared over every time step. A recurrent neural network extends the above function and works by receiving as input not only the current input example but also the internal states from when it processed previous examples. These internal states, normally called hidden states, represent information from all previous steps, making RNNs well suited to modelling rather long sequences. The general form of an RNN may be expressed as follows

h_t = f_θ(h_{t-1}, x_t)    (2.17)

where h_t is the hidden state at the current step, h_{t-1} is the previous hidden state, and x_t is the current input. As a result, recurrent networks can recognise, predict, and generate dynamical patterns, and are commonly used in tasks where data occurs as time-series events, such as in natural language processing (Vinyals et al., 2015) or videos (Srivastava et al., 2015).
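The parameter sharing behind equations 2.16 and 2.17 can be illustrated with a minimal sketch: one function f_θ, with a single fixed set of weights, is applied at every step of the sequence. The dimensions and random weights below are illustrative choices for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3-dimensional inputs, 4-dimensional hidden state,
# a sequence of 5 steps.
input_dim, hidden_dim, seq_len = 3, 4, 5

# One shared set of parameters theta, reused at every time step.
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

def f_theta(h_prev, x_t):
    """One recurrent step: h_t = f_theta(h_{t-1}, x_t), equation 2.17."""
    return np.tanh(W_hh @ h_prev + W_hx @ x_t + b)

h = np.zeros(hidden_dim)                 # initial hidden state h_0
xs = rng.normal(size=(seq_len, input_dim))
for x_t in xs:                           # the same f_theta at every step
    h = f_theta(h, x_t)

print(h.shape)  # (4,): the final hidden state summarises the sequence
```

Because the same parameters process every step, the loop works unchanged for a sequence of any length, which is exactly the generalisation property described above.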

Figure 2.9: The structure of a recurrent neural network which maps an input x_t to a hidden state h_t. All units share the same set of parameters. Figure from Colah's Blog, by Christopher Olah, August 27 2015, retrieved from https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

The hidden states h may be used for subsequent tasks directly, or transformed to outputs y by an output layer, or weight matrix W_hy. A traditional recurrent neural network operating on multi-dimensional data may be more precisely expressed as follows

h_t = tanh(W_hh h_{t-1} + W_hx x_t + b)
y_t = W_hy h_t    (2.18)

x_t: input vector
h_t: hidden state vector
y_t: output vector
W: weight matrices
b: bias

where W_hh, W_hx and W_hy are the weight matrices used to transform the previous hidden state h_{t-1}, transform the input x_t, and obtain the output y_t, respectively. The hyperbolic tangent function applies a nonlinear transformation to the RNN and scales the hidden states within the value range [-1, 1].
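A forward pass of equation 2.18 can be sketched directly in NumPy. The dimensions and random weight initialisation below are illustrative assumptions for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions, chosen only for this sketch.
input_dim, hidden_dim, output_dim = 3, 4, 2

W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
W_hx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden-to-output
b = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Run equation 2.18 over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(hidden_dim)
    ys = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_hx @ x_t + b)  # h_t, scaled to [-1, 1]
        ys.append(W_hy @ h)                     # y_t = W_hy h_t
    return np.array(ys), h

xs = rng.normal(size=(5, input_dim))
ys, h_final = rnn_forward(xs)
print(ys.shape)  # (5, 2): one output vector per time step
```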

The main challenge with regular RNNs is that they struggle to preserve long-range dependencies. The cause of this problem is related to what are called exploding gradients and vanishing gradients. These effects appear when the number of steps in a sequence increases: the RNN's gradient values progressively amplify or decrease when backpropagating through time. The consequence can be that early time steps yield gradients that either ruin or do not contribute to learning. It is therefore said that RNNs suffer from short-term memory and can only learn sequences of limited length, which is why there exist various sub-classes of RNNs designed specifically to deal with these issues. The most common sub-classes are the long short-term memory network (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU) (Cho et al., 2014).
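The vanishing-gradient effect can be illustrated numerically. In a tanh RNN, the gradient of a late hidden state with respect to an early one is a product of per-step Jacobians, diag(1 - h_t²) W_hh; when the recurrent weights are small, this product shrinks geometrically. The weight matrix and dimensions below are deliberately simple illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_dim = 4

# Small recurrent weights (spectral norm 0.5) to provoke vanishing gradients.
W_hh = 0.5 * np.eye(hidden_dim)

h = rng.normal(size=hidden_dim)
grad = np.eye(hidden_dim)  # accumulates d h_t / d h_0, step by step
norms = []
for t in range(30):
    h = np.tanh(W_hh @ h)
    # Jacobian of one tanh recurrence step: diag(1 - h^2) @ W_hh
    grad = np.diag(1.0 - h**2) @ W_hh @ grad
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the gradient norm collapses toward zero
```

With recurrent weights whose spectral norm exceeds 1 the same product can instead grow without bound, which is the exploding-gradient counterpart.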

2.5.2 Long Short-Term Memory Networks

One of the most effective types of sequence models used in practical applications are called gated RNNs (Goodfellow et al., 2016, p. 397). Long short-term memory networks are such a type of RNNs, explicitly designed to deal with the challenges related to learning long-term dependencies. Hochreiter and Schmidhuber introduced the LSTM in 1997, and it has since been improved and popularised in subsequent work. LSTM networks are shown to work well on a large variety of problems, such as handwriting recognition (Graves et al., 2009), machine translation (Sutskever, Vinyals, & Le, 2014) and image captioning (Vinyals et al., 2015). LSTM units differ from traditional RNNs in that they contain cells that control the flow of gradients, which leads to faster learning and more successful runs (Hochreiter & Schmidhuber, 1997). Each cell has an internal recurrence in addition to the outer recurrence of the RNN (Goodfellow et al., 2016, p. 399).

Figure 2.10: The structure of an LSTM network. Figure from Colah's Blog, by Christopher Olah, August 27 2015, retrieved from https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

The core component of the cell is the cell state, which is controlled by three gating units. By adding and removing information to and from this cell state, the LSTM network may learn which aspects of the data are essential to remember in order to preserve long-range dependencies. First, it decides what information to discard from the cell state, using a forget gate. Following this, an input gate creates candidate values for an updated cell state. Finally, an output gate uses the cell state to output a new hidden state. Through this process of four steps, the LSTM network determines which parts of past events are useful to remember.

Forget gate: Decides which parts and how much of the cell state to forget, f_t

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (2.19)

Input gate: Decides which parts and how much of the cell state to update, i_t, and creates candidate values for the new cell state, Ĉ_t

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Ĉ_t = tanh(W_C · [h_{t-1}, x_t] + b_C)    (2.20)

Cell state update: Applies f_t to the cell state, making it forget certain information, and updates the cell state with i_t and the candidate values Ĉ_t

C_t = f_t · C_{t-1} + i_t · Ĉ_t    (2.21)

Output gate: Decides what information the new hidden state h_t will contain

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t · tanh(C_t)    (2.22)

Figure 2.11: The different gating functions in an LSTM unit. Adapted from Colah's Blog, by Christopher Olah, August 27 2015, retrieved from https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

where the W's denote the weight matrices for the forget gate, input gate, candidate cell values and output gate, respectively. The b's are the gating functions' biases, and h_t and x_t are the hidden state and the input at time step t. The functions σ and tanh are the sigmoid and hyperbolic tangent activation functions (section 2.3.4). Due to the LSTM's increased complexity compared to regular RNNs, it possesses a greater number of learnable parameters, meaning it is somewhat more computationally expensive.
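The four steps above (equations 2.19-2.22) can be collected into a single LSTM step function. The sketch below follows the equations directly, using the common formulation where the gates act on the concatenation [h_{t-1}, x_t]; the dimensions and random weights are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
input_dim, hidden_dim = 3, 4          # illustrative sizes
concat_dim = hidden_dim + input_dim   # gates act on [h_{t-1}, x_t]

# One weight matrix and bias per gating function (equations 2.19-2.22).
W_f, b_f = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)
W_i, b_i = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)
W_C, b_C = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)
W_o, b_o = rng.normal(scale=0.1, size=(hidden_dim, concat_dim)), np.zeros(hidden_dim)

def lstm_step(h_prev, C_prev, x_t):
    """One LSTM step, following equations 2.19-2.22."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate            (2.19)
    i_t = sigmoid(W_i @ z + b_i)        # input gate             (2.20)
    C_hat = np.tanh(W_C @ z + b_C)      # candidate cell state   (2.20)
    C_t = f_t * C_prev + i_t * C_hat    # cell state update      (2.21)
    o_t = sigmoid(W_o @ z + b_o)        # output gate            (2.22)
    h_t = o_t * np.tanh(C_t)            # new hidden state       (2.22)
    return h_t, C_t

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # process a 5-step sequence
    h, C = lstm_step(h, C, x_t)
print(h.shape, C.shape)  # (4,) (4,)
```

Note how the cell state update (2.21) is purely additive and elementwise: when f_t is close to 1 and i_t close to 0, C_t passes through nearly unchanged, which is what lets gradients survive many time steps.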
