
Proof. Inspired by the work of [3], we start by considering the Lyapunov function candidate in eq. (A.2). This Lyapunov candidate enjoys some useful properties, among them that $\|x_k\|_1 = \|c_k\|_1 + \|h_k\|_1$ is true, which will come in handy in the analysis.

\[
V(x) = \|x\|_1 \tag{A.2}
\]
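To make this property concrete, the short derivation below sketches why the 1-norm splits as claimed; it assumes (consistent with the state-space form in eq. (3.18)) that the state $x_k$ stacks the cell state $c_k$ and the hidden state $h_k$.

\[
x_k = \begin{bmatrix} c_k \\ h_k \end{bmatrix}
\quad\Longrightarrow\quad
V(x_k) = \|x_k\|_1
= \sum_{i=1}^{H} \bigl|c_k^{(i)}\bigr| + \sum_{i=1}^{H} \bigl|h_k^{(i)}\bigr|
= \|c_k\|_1 + \|h_k\|_1 .
\]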

In order for us to conclude anything with regard to input-to-state stability, the two conditions given in Definition 2.5.4 ought to be satisfied. The proof is as such split into two parts, in which part 1 treats condition 1 and part 2 treats condition 2.

Satisfying condition 1 (eq. (2.60)) of Definition 2.5.4

The Lyapunov candidate given in eq. (A.2) must first and foremost satisfy the condition given in eq. (2.60).

Condition 1, given in eq. (2.60), is satisfied by the equivalence of the 1-norm and 2-norm: for an arbitrary vector $w \in \mathbb{R}^n$, we have $\|w\|_2 \le \|w\|_1 \le \sqrt{n}\,\|w\|_2$ [25, Chapter 2.2]. We thus have specifically that $\|x\|_2 \le V(x) \le \sqrt{n}\,\|x\|_2$, which provides the two comparison functions $\alpha_1$ and $\alpha_2$ required by eq. (2.60); one explicit choice is illustrated after the list below. By Definition 2.5.1, we may show that both comparison functions belong to class $\mathcal{K}_\infty$ by reason of:

• $\alpha_1(0) = \alpha_2(0) = 0$

• $\frac{\partial \alpha_1}{\partial r} > 0$

• $\frac{\partial \alpha_2}{\partial r} > 0$

• $\alpha_1(r) \to \infty$ as $r \to \infty$

• $\alpha_2(r) \to \infty$ as $r \to \infty$
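As an illustration, one admissible explicit choice satisfying all of the above (not necessarily the exact pair used in the main text) is the linear pair

\[
\alpha_1(r) = r, \qquad \alpha_2(r) = \sqrt{n}\, r,
\qquad\text{so that}\qquad
\alpha_1\bigl(\|x\|_2\bigr) \le V(x) = \|x\|_1 \le \alpha_2\bigl(\|x\|_2\bigr),
\]

where $n$ is the dimension of $x$.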

Satisfying condition 2 (eq. (2.61)) of Definition 2.5.4

We proceed with investigating the condition given in eq. (2.61). Note that from now on, $\|\cdot\|$ denotes the $L_1$-norm (1-norm); in cases where a different norm is used, it will be specifically denoted. By the definition in eq. (2.61) and common norm function properties (in particular the properties in Definition 2.1.1), we may derive the following bound.

The difference $V(x_{k+1}) - V(x_k) = \|x_{k+1}\|_1 - \|x_k\|_1$ is bounded from above using common norm properties, particularly the triangle inequality $\|x + y\| \le \|x\| + \|y\|$; the resulting bound is given in eq. (A.4). Next, we want to get rid of the Hadamard product. In accordance with the relation given in eq. (2.12), we define the diagonal matrices given in (A.9)-(A.12), e.g. $\bar{f}_k = \mathrm{diag}\bigl(f_k^{(1)}, \dots, f_k^{(H)}\bigr)$, recalling from Section 2.2.4 that $H$ denotes the hidden state dimension. The $\mathrm{diag}$-operator is defined in Section 2.1.3.
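The mechanism behind this step is the elementwise-product identity below (presumably what the relation in eq. (2.12) expresses); the forget-gate example is only illustrative.

\[
a \odot b = \mathrm{diag}(a)\, b,
\qquad\text{so that, e.g.,}\qquad
f_k \odot c_{k-1} = \bar{f}_k\, c_{k-1},
\quad \bar{f}_k = \mathrm{diag}\bigl(f_k^{(1)}, \dots, f_k^{(H)}\bigr).
\]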

With these terms defined, we may abstract away the Hadamard product and work on eq. (A.4) to achieve eq. (A.13)-(A.20),

\[
V(x_{k+1}) - V(x_k) \le \cdots \tag{A.13}
\]

where Lemma 3.2.1 is applied in the transition between ineq. (A.13)-(A.14) and ineq. (A.14)-(A.15). Moreover, the state vector components are retrieved by utilising the fact that the $L_1$-norm is by definition the sum of the absolute values of the vector components; as such, $\|x_k\|_1 = \|c_k\|_1 + \|h_k\|_1$ holds by equality. We now simplify the notation by defining the two terms $a$ and $b$ in eq. (A.21), and insert for $\bar{f}_{g_k}$ (given in eq. (3.25)) in order to recover the inputs. These changes are reflected in eq. (A.22)-(A.23), and the resulting chain of expressions continues through eq. (A.29); the expression in (A.28) is included due to the ISS-Lyapunov definition, given in Definition 2.5.4, in which the state-related comparison function is defined as $-\alpha(\|x\|_2)$. The expressions appearing in eq. (A.28)-(A.29) may be lower-bounded as shown in eq. (A.30)-(A.34), where the first inequality of (A.30) is satisfied if and only if eq. (A.35) and eq. (A.36) are satisfied. The expressions in eq. (A.30)-(A.34) result in the definition of the comparison functions given in eq. (A.37)-(A.41).

These are based on the equivalence of the 1-norm and 2-norm: for an arbitrary vector $w \in \mathbb{R}^n$, we have $\|w\|_2 \le \|w\|_1 \le \sqrt{n}\,\|w\|_2$ [25, Chapter 2.2]. This conversion is done to get the comparison functions in the same form as Definition 2.5.4. Lastly, the bias vector is augmented to the input space, following the work of [3]. This is done since the inputs ought to be bounded by class $\mathcal{K}$ functions according to Definition 2.5.3. For a class $\mathcal{K}$ function $f$, it must be true that $f(0) = 0$. As we see, this cannot be ensured for any of the functions defined in eq. (A.38)-(A.41) if the norm of the bias vector is simply added as a constant term. Alternatively, one may restrict the bias vector of the candidate gate to 0 during training. The first-mentioned option is, however, likely to be less restrictive on the network during training.
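To illustrate the augmentation, suppose one of the input bounds has the affine form $c\,\|u_k\|_1 + c_b\,\|b_{g_k}\|_1$ with constants $c, c_b > 0$ (hypothetical coefficients, not taken from eq. (A.38)-(A.41)). Collecting the bias into an augmented input $\tilde{u}_k = [\,u_k^{\top}\ b_{g_k}^{\top}\,]^{\top}$ yields a bound that vanishes at the origin:

\[
c\,\|u_k\|_1 + c_b\,\|b_{g_k}\|_1
\;\le\; (c + c_b)\bigl(\|u_k\|_1 + \|b_{g_k}\|_1\bigr)
= (c + c_b)\,\|\tilde{u}_k\|_1
=: \rho\bigl(\|\tilde{u}_k\|_1\bigr),
\]

with $\rho(r) = (c + c_b)\,r$ satisfying $\rho(0) = 0$ and being strictly increasing, i.e. a class $\mathcal{K}$ function of the augmented input norm.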

Inserting these expressions into eq. (A.22)-(A.29) yields eq. (A.42),

\[
V(x_{k+1}) - V(x_k) \le -\alpha_3\bigl(\|x_k\|_2\bigr) + \cdots \tag{A.42}
\]

For $\alpha_3$, we have that $\alpha_3(0) = 0$. Moreover, we need it to be strictly increasing; this is true if the conditions given in eq. (A.35) and eq. (A.36) are satisfied. Lastly, in order to conclude that $\alpha_3$ is indeed of class $\mathcal{K}$, we have to ensure that $\alpha_3(r) \to \infty$ as $r \to \infty$. We observe that this is indeed the case for $\alpha_3$. By Definition 2.5.1, $\alpha_3$ is a class $\mathcal{K}$ function.

We now establish conditions for the functions given in eq. (A.38)-(A.41) to be class $\mathcal{K}$ functions. The definition of this class of functions is given in the first part of Definition 2.5.1.

We see by inspection that $\rho_n(0) = 0$ for $n \in \{u_k, du_k, dw_k, b_{g_k}\}$. The next condition that ought to be satisfied is that the functions are strictly increasing. Since $b \ge 0$, $D > 0$ and $H > 0$, this is achieved by the definition of a norm function. As such, combining that $\rho_n(0) = 0$ for $n \in \{u_k, du_k, dw_k, b_{g_k}\}$ with the fact that the functions are strictly increasing, the functions given in eq. (A.38)-(A.41) are indeed class $\mathcal{K}$ functions by Definition 2.5.1.

Finally, by Definition 2.5.4, we may conclude that the Lyapunov function, given in eq. (A.2), for the state-space model, given in eq. (3.18), is indeed an ISS-Lyapunov function.

By Theorem 2.5.1, we may conclude that the state-space model describing the persistently exciting LSTM network is input-to-state stable with respect to the defined inputs when the conditions in eq. (A.35)-(A.36) are satisfied.
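For reference, the input-to-state stability property concluded here has the standard discrete-time form below, stated with generic comparison functions; the precise formulation used in Section 2.5 may differ in notation, and $u_j$ collects the defined (augmented) inputs.

\[
\|x_k\|_2 \;\le\; \beta\bigl(\|x_0\|_2,\, k\bigr)
+ \gamma\Bigl(\sup_{0 \le j \le k} \|u_j\|_2\Bigr),
\qquad \beta \in \mathcal{KL},\ \gamma \in \mathcal{K},
\]

for all initial states $x_0$ and all bounded input sequences $\{u_j\}$.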

Appendix B

Hyperparameters

B.1 Hyperparameters: Experiment 1

Table B.1: Experiment 1 hyperparameter selection for models trained to predict the liquid height of tank 1 ($h_1$). Note that for the hidden layers hyperparameter, the array-like structure represents how many artificial neurons (nodes) reside in each hidden layer, starting with the first hidden layer.

Hyperparameter          LSTM $\ell_2$    PE LSTM Opt-1    PE LSTM Opt-2
Optimiser               Adam             Adam             Adam
Learning rate           0.005            0.005            0.003
Batch size              16               24               24
Epochs                  40               40               40
Hidden layers           [4, 1]           [4, 1]           [4, 1]
$\ell_2$-penalty        0.01             -                -
Perturbation epoch      -                4                4
Perturbation radius     -                0.02             0.02
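For concreteness, the sketch below shows how the LSTM $\ell_2$ column of Table B.1 could be instantiated. It is a minimal illustration, not the code used in the thesis: it assumes a PyTorch implementation, an input dimension of 2 and a single regression output (none of which are specified by the table), approximates the $\ell_2$-penalty via Adam's weight decay, and omits the perturbation-based training used by the PE LSTM models.

import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    """Stacked LSTM matching the hidden-layers entry [4, 1] of Table B.1."""

    def __init__(self, n_inputs: int, hidden_sizes=(4, 1)):
        super().__init__()
        sizes = [n_inputs, *hidden_sizes]
        # One LSTM layer per entry in the hidden-layers hyperparameter.
        self.layers = nn.ModuleList(
            nn.LSTM(sizes[i], sizes[i + 1], batch_first=True)
            for i in range(len(hidden_sizes))
        )
        # Single regression output: the predicted liquid height h1.
        self.head = nn.Linear(hidden_sizes[-1], 1)

    def forward(self, x):
        # x has shape (batch, time steps, n_inputs).
        for lstm in self.layers:
            x, _ = lstm(x)
        return self.head(x[:, -1, :])  # predict from the last time step

model = StackedLSTM(n_inputs=2)  # input dimension is an assumption, not from the table
# Adam with learning rate 0.005; weight_decay stands in for the l2-penalty of 0.01.
optimiser = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=0.01)
# Training would then run for 40 epochs with batch size 16, as listed in Table B.1.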


Table B.2: Experiment 1 hyperparameter selection for models trained to predict the liquid height of tank 2 ($h_2$). Note that for the hidden layers hyperparameter, the array-like structure represents how many artificial neurons (nodes) reside in each hidden layer, starting with the first hidden layer.

Hyperparameter          LSTM $\ell_2$    PE LSTM Opt-2    PE LSTM Opt-1
Optimiser               Adam             Adam             Adam
Learning rate           0.0007           0.0007           0.0007
Batch size              32               32               32
Epochs                  40               40               40
Hidden layers           [2, 2, 2]        [2, 2, 2]        [2, 2, 2]
Time steps              15               15               15
$\ell_2$-penalty        0.008            -                -
Perturbation epoch      -                4                4
Perturbation radius     -                0.02             0.02

B.2 Hyperparameters: Experiment 2, Experiment 3 and