
I Data simulation

One way of assessing the method is to simulate a process with a known relationship between input and output and see whether the fitted model is similar. In order for the simulation to be relevant, it must mimic the actual process. In the ElMet project we are interested in how well the model describes a furnace. Hence, the simulated data should be similar to furnace variables.

First we will need to decide the distribution of the input variables. Let the input sequence $x_j$ be a realization of a multivariate random variable $X_j = (X_{j,1}, \dots, X_{j,n})^T$ governed by one of three mechanisms: the normal distribution, the uniform distribution or an ARMA process. These data generating processes are chosen because they frequently arise in nature. The idea is to test the performance of the method for standard distributions with a wide range of shape and scale variations. The parameters in each mechanism are chosen at random for each $X_j$, but are constant in time. The input sequence $X_j$ is generated as either $V$, $W$ or $Z$ with equal probability, where

\[
V_t \sim N(\mu, \sigma_v^2), \qquad W_t \sim U(-1, 1),
\]
\[
Z_t = \phi_1 Z_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t,
\]
\[
\epsilon_t \sim N(0, 1), \qquad \mu, \sigma_v^2, \phi_1, \theta_1 \sim U(-1, 1).
\]

The value of $(p, q)$ is either $(1, 0)$, $(0, 1)$ or $(1, 1)$ with equal probability, corresponding to AR(1), MA(1) and ARMA(1,1) processes. The simulation approach above is used to draw the input sequences $x_1, x_2, \dots, x_p$. The input sequences are then standardized. A sketch of this step is given below.
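To make the mechanism concrete, the following Python sketch draws one input sequence under these rules. The helper name simulate_input is ours, and since the text draws $\sigma_v^2$ from $U(-1, 1)$, the sketch takes its absolute value so the normal variance stays valid; that adjustment is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_input(n, rng):
    """Draw one standardized input sequence from N(mu, sigma_v^2),
    U(-1, 1) or an ARMA process, each with probability 1/3."""
    mechanism = rng.integers(3)
    if mechanism == 0:                          # V_t ~ N(mu, sigma_v^2)
        mu, sigma2_v = rng.uniform(-1, 1, size=2)
        # abs() is our assumption: the text draws sigma_v^2 from U(-1, 1)
        x = rng.normal(mu, np.sqrt(abs(sigma2_v)), size=n)
    elif mechanism == 1:                        # W_t ~ U(-1, 1)
        x = rng.uniform(-1, 1, size=n)
    else:                                       # Z_t: AR(1), MA(1) or ARMA(1,1)
        p_ar, q_ma = [(1, 0), (0, 1), (1, 1)][rng.integers(3)]
        phi1 = rng.uniform(-1, 1) if p_ar else 0.0
        theta1 = rng.uniform(-1, 1) if q_ma else 0.0
        eps = rng.normal(0, 1, size=n + 1)
        x = np.empty(n)
        prev = 0.0
        for t in range(n):
            # Z_t = phi_1 Z_{t-1} + theta_1 eps_{t-1} + eps_t
            prev = phi1 * prev + theta1 * eps[t] + eps[t + 1]
            x[t] = prev
    return (x - x.mean()) / x.std()             # standardize

# For example, five input sequences of length 1000:
X = np.column_stack([simulate_input(1000, rng) for _ in range(5)])
```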

Next, we must decide the lags between input and output. We assume that $y$ depends on $x_j$ at the consecutive lags $m_j, m_j + 1, \dots, M_j$. The minimum and maximum lags are integers drawn randomly from discrete uniform distributions. Specifically,

\[
m_j \sim U\{0, 7\}, \qquad M_j \sim U\{m_j, m_j + 3\}.
\]

The distribution of the lag interval is chosen by consideration of a real furnace process. The maximum lag is $K = \max_j M_j$. Furthermore, the coefficients $\beta_{j,k}$ corresponding to the regressors are drawn uniformly from the interval $[-3, -1] \cup [1, 3]$. Note that coefficients close to zero are excluded. The smallest coefficients are not of interest with standardized variables. Furthermore, standardization means that only the relative sizes of the coefficients are of interest. The sketch below illustrates how the lags and coefficients can be drawn.
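A minimal sketch of this step, assuming NumPy; drawing a magnitude in $[1, 3]$ and a random sign yields exactly the stated uniform distribution on $[-3, -1] \cup [1, 3]$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5  # number of input variables (illustrative)

# m_j ~ U{0, 7} and M_j ~ U{m_j, m_j + 3}; rng.integers excludes the upper bound.
m = rng.integers(0, 8, size=p)
M = m + rng.integers(0, 4, size=p)
K = M.max()  # maximum lag over all inputs

# beta_{j,k} uniform on [-3, -1] union [1, 3]: magnitude in [1, 3] times a sign.
beta = {}
for j in range(p):
    for k in range(m[j], M[j] + 1):
        beta[(j, k)] = rng.choice([-1, 1]) * rng.uniform(1, 3)
```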

Finally, we express the output as a linear combination of the regressors plus added i.i.d. normal noise $\epsilon_t \sim N(0, \sigma^2)$, i.e.

\[
y_t = \sum_{j=1}^{p} \sum_{k=m_j}^{M_j} \beta_{j,k} \, x_{j,t-k} + \epsilon_t, \qquad t = K+1, \dots, n. \tag{5.1}
\]
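Continuing the sketches above, the output of equation (5.1) can be generated as follows. The helper simulate_output is our illustrative name; the slice X[K - k : n - k, j] picks out $x_{j,t-k}$ for $t = K+1, \dots, n$ in 0-indexed arrays.

```python
import numpy as np

def simulate_output(X, beta, K, sigma2, rng):
    """Generate y_t from equation (5.1): a lagged linear combination of
    the standardized inputs plus N(0, sigma^2) noise, for t = K+1..n,
    then standardize y the same way as the inputs."""
    n, p = X.shape
    y = rng.normal(0.0, np.sqrt(sigma2), size=n)   # the eps_t noise term
    for (j, k), b in beta.items():
        y[K:] += b * X[K - k : n - k, j]           # adds beta_{j,k} x_{j,t-k}
    y = y[K:]                                      # keep t = K+1, ..., n
    return (y - y.mean()) / y.std()
```

With the earlier snippets, a call such as `y = simulate_output(X, beta, K, 0.5, rng)` completes one simulated data set.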

The output variable is standardized the same way as the input. The generated data now consist of the input variables $x_1, x_2, \dots, x_p$ and the output variable $y$. Next, we will discuss how well the methods described in chapter 4 identify the relationship between the simulated variables.

II Performance

In this section we will describe the performance of CSE as defined in section III of chapter 4.

We will fit the simulated data by the model in (4.11), where the coefficients are estimated by least squares. Recall that the algorithm first chooses a large set of regressors, which is then subsetted.

The number of regressors in the large set is denoted $r$. Recall that we discussed two approaches for finding the optimal subset $\hat I$: an exhaustive search is used for $r \le 25$ and an iterative search otherwise. The final estimates of the regression coefficients are the OLS estimates of the optimal subset. An illustrative version of the exhaustive search is sketched below.
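The actual selection criterion is defined in chapter 4; as a stand-in, this sketch scores each subset by the AIC of its OLS fit (Gaussian errors assumed), so it illustrates the shape of an exhaustive search rather than the exact CSE procedure.

```python
import numpy as np
from itertools import combinations

def exhaustive_subset(X, y):
    """Exhaustive best-subset search over the r candidate regressors,
    scoring each subset by the AIC of its OLS fit (a stand-in for the
    criterion of chapter 4)."""
    n, r = X.shape
    best_aic, best_idx = np.inf, None
    for size in range(1, r + 1):
        for idx in combinations(range(r), size):
            Xs = X[:, idx]
            coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = float(np.sum((y - Xs @ coef) ** 2))
            aic = n * np.log(rss / n) + 2 * size   # up to an additive constant
            if aic < best_aic:
                best_aic, best_idx = aic, idx
    return best_idx
```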

If we want to know the accuracy of CSE, we must first define a measure. It is reasonable to compare the true coefficients in (5.1) with the least squares estimates of (4.11). Let $\hat\beta_{j,k}$ denote the estimates. The two models do not necessarily include the same terms. However, we can simply consider regressors not included to have a coefficient equal to zero. Let $b$ be the number of pairs $(j, k)$ such that either $\hat\beta_{j,k}$ or $\beta_{j,k}$ is non-zero. The distance between the true coefficients and the estimated ones may be defined as the mean square difference, i.e.

\[
\delta = \frac{1}{b} \sum_{j=1}^{p} \sum_{k \ge 0} \left( \beta_{j,k} - \hat\beta_{j,k} \right)^2. \tag{5.2}
\]
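With the coefficient dictionaries from the earlier sketches, (5.2) is straightforward to compute; estimation_error is our illustrative name.

```python
def estimation_error(beta_true, beta_hat):
    """Mean square difference (5.2) between true and estimated
    coefficients; a term absent from either model counts as zero."""
    keys = set(beta_true) | set(beta_hat)   # pairs (j, k) where either is non-zero
    b = len(keys)
    return sum((beta_true.get(jk, 0.0) - beta_hat.get(jk, 0.0)) ** 2
               for jk in keys) / b
```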

When a simulation has a small $\delta$, the estimation was successful. Consider the simulation procedure in section I. There are a few parameters to be decided for each simulation. We must choose

the number $p$ of input variables, the length $n$ of the sequences and the variance $\sigma^2$ of the noise in the output defined in (5.1). The values of these parameters may affect the error $\delta$. This will become clear if we run multiple simulations with various parameter values. We will discuss an example where we have a few values to choose from for each parameter. Then, for every combination of parameter values we compute $\delta$. However, as the simulations are stochastic, one thousand replications of each experiment are performed, and the average $\bar\delta$ is computed. The levels of $n$ are 500, 1000 and 5000. For $p$ the levels are 2, 5, 10 and 30. The levels of the noise variance $\sigma^2$ are 0.1, 0.5, 1, 1.5 and 2. We can number the sixty different combinations by $s = 1, 2, \dots, 60$. The total number of experiments is 60 000. A driver for this experiment design is sketched below.
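A minimal driver for the $3 \times 4 \times 5 = 60$ combinations, assuming a simulate function wrapping the data generation of section I and a fit function producing the CSE estimates of chapter 4 are supplied by the caller; both names are placeholders, and estimation_error is the sketch above.

```python
from itertools import product

def run_experiments(simulate, fit, reps=1000):
    """Average the error delta over the 60 parameter combinations, with
    `reps` replications each (60 000 runs in total). `simulate` and
    `fit` are caller-supplied placeholders for section I and chapter 4."""
    levels = product([500, 1000, 5000],           # n
                     [2, 5, 10, 30],              # p
                     [0.1, 0.5, 1.0, 1.5, 2.0])   # sigma^2
    delta_bar = {}
    for s, (n, p, sigma2) in enumerate(levels, start=1):
        errors = []
        for _ in range(reps):
            X, y, beta_true = simulate(n, p, sigma2)
            beta_hat = fit(X, y)
            errors.append(estimation_error(beta_true, beta_hat))
        delta_bar[s] = sum(errors) / reps         # the average delta-bar
    return delta_bar
```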

Furthermore, we can do regressions with $\bar\delta$ as output to see which parameters are important for the performance of CSE. Initially, the included regressors are $p$, $n$ and $\sigma^2$, their pairwise interactions, squares and square roots. Then, backward elimination is performed with the Akaike information criterion (AIC) as the model fit criterion (Venables and Ripley, 2002). For each model, let $\hat L$ be the maximized value of the log-likelihood function. The model fit criterion is then defined as $\mathrm{AIC} = 2(p - \hat L)$, where $p$ here denotes the number of parameters in the model. The backward elimination removes regressors that are not important for explaining $\bar\delta$.
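Venables and Ripley's stepAIC in R is the standard implementation of this procedure; a minimal Python analogue, assuming statsmodels and a pandas DataFrame X holding the candidate regressors, might look like this.

```python
import statsmodels.api as sm

def backward_eliminate(X, y):
    """Backward elimination on an OLS model: repeatedly drop the single
    regressor whose removal gives the lowest AIC, and stop when no
    removal improves on the current AIC. X is a pandas DataFrame of
    candidate regressors (here p, n, sigma^2, interactions, squares
    and square roots)."""
    cols = list(X.columns)
    current_aic = sm.OLS(y, sm.add_constant(X[cols])).fit().aic
    while len(cols) > 1:
        trial = {c: sm.OLS(y, sm.add_constant(X[cols].drop(columns=c))).fit().aic
                 for c in cols}
        drop, aic = min(trial.items(), key=lambda kv: kv[1])
        if aic >= current_aic:   # no single removal lowers the AIC
            break
        cols.remove(drop)
        current_aic = aic
    return cols
```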

The resulting model for a specific parameter combination $s$ is

\[
\bar\delta_s = \beta_0 + \beta_1 n_s + \beta_2 p_s + \beta_3 \sigma_s^2 + \beta_4 n_s p_s + \beta_5 n_s^2 + \beta_6 \sqrt{p_s} + e_s,
\]

where the $\beta_i$ are regression coefficients and $e_s$ is a zero-mean, normally distributed regression error.

The OLS estimates of the regression coefficients are given in Table 5.1. The right column states the $p$-values of a two-sided $t$-test with the null hypothesis that the coefficient is equal to zero. All regressors are significant at the 5 percent level.

Table 5.1: The effect of a parameter on the performance of CSE. The table is a summary of a regression with the estimation error $\bar\delta$ as the output. The left column names the regressors. The center column displays the OLS coefficient estimates, and the right column states the $p$-values of a two-sided $t$-test for the coefficient being zero.

Param.    β̂                p-value
1         6.1 × 10^−1      < 2.0 × 10^−16
n         −6.0 × 10^−5     7.0 × 10^−5
p         1.1 × 10^−2      1.0 × 10^−11
σ²        6.6 × 10^−3      3.7 × 10^−2
np        −8.5 × 10^−7     4.6 × 10^−12
n²        1.1 × 10^−8      1.8 × 10^−5
√p        −5.6 × 10^−2     3.7 × 10^−8

Table 5.1 provides useful information about when CSE is reliable. The results concur with expectations. The error $\bar\delta$ is low for high $n$, as we then have more information. When the number of variables increases, the error increases because the model is more complex. Noise also reduces the accuracy of the estimation. The error is at its highest when $p$ is close to $n$. This is the general case for any regression model. The significance of $n^2$ and $\sqrt{p}$ suggests that the error is not linear in $n$ and $p$.

It is clear that the performance of CSE is heavily influenced by the parameters in the observed data set. We should keep this in mind in applications. Next, we will consider CSE and MCE applied to an example.