
I Data simulation

One way of assessing the method is to simulate a process with a known relationship between input and output and see whether the fitted model is similar. In order for the simulation to be relevant, it must mimic the actual process. In the ElMet project we are interested in how well the model describes a furnace. Hence, the simulated data should be similar to furnace variables.

First we will need to decide the distribution of the input variables. Let the input sequence $x_j$ be a realization of a multivariate random variable $X_j = (X_{j,1}, \dots, X_{j,n})^T$ governed by one of three mechanisms: the normal distribution, the uniform distribution or an ARMA process. These data generating processes are chosen because they frequently arise in nature. The idea is to test the performance of the method for standard distributions with a wide range of shape and scale variations. The parameters in each mechanism are chosen at random for each $X_j$, but are constant in time. The input sequence $X_j$ is generated as either $V$, $W$ or $Z$ with equal probability, where

\[
V_t \sim N(\mu, \sigma_v^2), \qquad W_t \sim U(-1, 1),
\]
\[
Z_t = \phi_1 Z_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t,
\]
\[
\epsilon_t \sim N(0, 1), \qquad \mu, \sigma_v^2, \phi_1, \theta_1 \sim U(-1, 1).
\]

The value of $(p, q)$ is either $(1, 0)$, $(0, 1)$ or $(1, 1)$ with equal probability, corresponding to AR(1), MA(1) and ARMA(1,1) processes. The simulation approach above is used to draw the input sequences $x_1, x_2, \dots, x_p$. The input sequences are then standardized. A sketch of this step is given below.
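To make the mechanism concrete, the following Python sketch draws one input sequence under these rules. The helper name simulate_input is ours, and since the text draws $\sigma_v^2$ from $U(-1, 1)$, the sketch takes its absolute value so the normal variance stays valid; that adjustment is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_input(n, rng):
    """Draw one standardized input sequence from N(mu, sigma_v^2),
    U(-1, 1) or an ARMA process, each with probability 1/3."""
    mechanism = rng.integers(3)
    if mechanism == 0:                          # V_t ~ N(mu, sigma_v^2)
        mu, sigma2_v = rng.uniform(-1, 1, size=2)
        # abs() is our assumption: the text draws sigma_v^2 from U(-1, 1)
        x = rng.normal(mu, np.sqrt(abs(sigma2_v)), size=n)
    elif mechanism == 1:                        # W_t ~ U(-1, 1)
        x = rng.uniform(-1, 1, size=n)
    else:                                       # Z_t: AR(1), MA(1) or ARMA(1,1)
        p_ar, q_ma = [(1, 0), (0, 1), (1, 1)][rng.integers(3)]
        phi1 = rng.uniform(-1, 1) if p_ar else 0.0
        theta1 = rng.uniform(-1, 1) if q_ma else 0.0
        eps = rng.normal(0, 1, size=n + 1)
        x = np.empty(n)
        prev = 0.0
        for t in range(n):
            # Z_t = phi_1 Z_{t-1} + theta_1 eps_{t-1} + eps_t
            prev = phi1 * prev + theta1 * eps[t] + eps[t + 1]
            x[t] = prev
    return (x - x.mean()) / x.std()             # standardize

# For example, five input sequences of length 1000:
X = np.column_stack([simulate_input(1000, rng) for _ in range(5)])
```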

Next, we must decide the lags between input and output. We assume that $y$ depends on $x_j$ at the consecutive lags $m_j, m_j + 1, \dots, M_j$. The minimum and maximum lags are integers drawn randomly from discrete uniform distributions. Specifically,

\[
m_j \sim U\{0, 7\}, \qquad M_j \sim U\{m_j, m_j + 3\}.
\]

The distribution of the lag interval is chosen by consideration of a real furnace process. The maximum lag is $K = \max_j M_j$. Furthermore, the coefficients $\beta_{j,k}$ corresponding to the regressors are drawn uniformly from the interval $[-3, -1] \cup [1, 3]$. Note that coefficients close to zero are excluded. The smallest coefficients are not of interest with standardized variables. Furthermore, standardization means that only the relative sizes of the coefficients are of interest. The sketch below illustrates how the lags and coefficients can be drawn.
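A minimal sketch of this step, assuming NumPy; drawing a magnitude in $[1, 3]$ and a random sign yields exactly the stated uniform distribution on $[-3, -1] \cup [1, 3]$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5  # number of input variables (illustrative)

# m_j ~ U{0, 7} and M_j ~ U{m_j, m_j + 3}; rng.integers excludes the upper bound.
m = rng.integers(0, 8, size=p)
M = m + rng.integers(0, 4, size=p)
K = M.max()  # maximum lag over all inputs

# beta_{j,k} uniform on [-3, -1] union [1, 3]: magnitude in [1, 3] times a sign.
beta = {}
for j in range(p):
    for k in range(m[j], M[j] + 1):
        beta[(j, k)] = rng.choice([-1, 1]) * rng.uniform(1, 3)
```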

Finally, we express the output as a linear combination of the regressors plus added i.i.d. normal noise $\epsilon_t \sim N(0, \sigma^2)$, i.e.

\[
y_t = \sum_{j=1}^{p} \sum_{k=m_j}^{M_j} \beta_{j,k} \, x_{j,t-k} + \epsilon_t, \qquad t = K+1, \dots, n. \tag{5.1}
\]
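Continuing the sketches above, the output of equation (5.1) can be generated as follows. The helper simulate_output is our illustrative name; the slice X[K - k : n - k, j] picks out $x_{j,t-k}$ for $t = K+1, \dots, n$ in 0-indexed arrays.

```python
import numpy as np

def simulate_output(X, beta, K, sigma2, rng):
    """Generate y_t from equation (5.1): a lagged linear combination of
    the standardized inputs plus N(0, sigma^2) noise, for t = K+1..n,
    then standardize y the same way as the inputs."""
    n, p = X.shape
    y = rng.normal(0.0, np.sqrt(sigma2), size=n)   # the eps_t noise term
    for (j, k), b in beta.items():
        y[K:] += b * X[K - k : n - k, j]           # adds beta_{j,k} x_{j,t-k}
    y = y[K:]                                      # keep t = K+1, ..., n
    return (y - y.mean()) / y.std()
```

With the earlier snippets, a call such as `y = simulate_output(X, beta, K, 0.5, rng)` completes one simulated data set.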

The output variable is standardized the same way as the input. The generated data now consist of the input variables $x_1, x_2, \dots, x_p$ and the output variable $y$. Next, we will discuss how well the methods described in chapter 4 identify the relationship between the simulated variables.

II Performance

In this section we will describe the performance of CSE as defined in section III of chapter 4.

We will fit the simulated data by the model in (4.11), where the coefficients are estimated by least squares. Recall that the algorithm first chooses a large set of regressors, which is then subsetted.

The number of regressors in the large set is denoted $r$. Recall that we discussed two approaches for finding the optimal subset $\hat I$: an exhaustive search is used for $r \le 25$ and an iterative search otherwise. The final estimates of the regression coefficients are the OLS estimates of the optimal subset. An illustrative version of the exhaustive search is sketched below.
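The actual selection criterion is defined in chapter 4; as a stand-in, this sketch scores each subset by the AIC of its OLS fit (Gaussian errors assumed), so it illustrates the shape of an exhaustive search rather than the exact CSE procedure.

```python
import numpy as np
from itertools import combinations

def exhaustive_subset(X, y):
    """Exhaustive best-subset search over the r candidate regressors,
    scoring each subset by the AIC of its OLS fit (a stand-in for the
    criterion of chapter 4)."""
    n, r = X.shape
    best_aic, best_idx = np.inf, None
    for size in range(1, r + 1):
        for idx in combinations(range(r), size):
            Xs = X[:, idx]
            coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = float(np.sum((y - Xs @ coef) ** 2))
            aic = n * np.log(rss / n) + 2 * size   # up to an additive constant
            if aic < best_aic:
                best_aic, best_idx = aic, idx
    return best_idx
```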

If we want to know the accuracy of CSE, we must first define a measure. It is reasonable to compare the true coefficients in (5.1) with the least squares estimates of (4.11). Let $\hat\beta_{j,k}$ denote the estimates. The two models do not necessarily include the same terms. However, we can simply consider regressors not included to have a coefficient equal to zero. Let $b$ be the number of pairs $(j, k)$ such that either $\hat\beta_{j,k}$ or $\beta_{j,k}$ is non-zero. The distance between the true coefficients and the estimated ones may be defined as the mean square difference, i.e.

\[
\delta = \frac{1}{b} \sum_{j=1}^{p} \sum_{k \ge 0} \left( \beta_{j,k} - \hat\beta_{j,k} \right)^2. \tag{5.2}
\]
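With the coefficient dictionaries from the earlier sketches, (5.2) is straightforward to compute; estimation_error is our illustrative name.

```python
def estimation_error(beta_true, beta_hat):
    """Mean square difference (5.2) between true and estimated
    coefficients; a term absent from either model counts as zero."""
    keys = set(beta_true) | set(beta_hat)   # pairs (j, k) where either is non-zero
    b = len(keys)
    return sum((beta_true.get(jk, 0.0) - beta_hat.get(jk, 0.0)) ** 2
               for jk in keys) / b
```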

When a simulation has a small $\delta$, the estimation was successful. Consider the simulation procedure in section I. There are a few parameters to be decided for each simulation. We must choose

the number $p$ of input variables, the length $n$ of the sequences and the variance $\sigma^2$ of the noise in the output defined in (5.1). The values of these parameters may affect the error $\delta$. This will become clear if we run multiple simulations with various parameter values. We will discuss an example where we have a few values to choose from for each parameter. Then, for every combination of parameter values we compute $\delta$. However, as the simulations are stochastic, one thousand replications of each experiment are performed, and the average $\bar\delta$ is computed. The levels of $n$ are 500, 1000 and 5000. For $p$ the levels are 2, 5, 10 and 30. The levels of the noise variance $\sigma^2$ are 0.1, 0.5, 1, 1.5 and 2. We can number the sixty different combinations by $s = 1, 2, \dots, 60$. The total number of experiments is 60 000. A driver for this experiment design is sketched below.
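A minimal driver for the $3 \times 4 \times 5 = 60$ combinations, assuming a simulate function wrapping the data generation of section I and a fit function producing the CSE estimates of chapter 4 are supplied by the caller; both names are placeholders, and estimation_error is the sketch above.

```python
from itertools import product

def run_experiments(simulate, fit, reps=1000):
    """Average the error delta over the 60 parameter combinations, with
    `reps` replications each (60 000 runs in total). `simulate` and
    `fit` are caller-supplied placeholders for section I and chapter 4."""
    levels = product([500, 1000, 5000],           # n
                     [2, 5, 10, 30],              # p
                     [0.1, 0.5, 1.0, 1.5, 2.0])   # sigma^2
    delta_bar = {}
    for s, (n, p, sigma2) in enumerate(levels, start=1):
        errors = []
        for _ in range(reps):
            X, y, beta_true = simulate(n, p, sigma2)
            beta_hat = fit(X, y)
            errors.append(estimation_error(beta_true, beta_hat))
        delta_bar[s] = sum(errors) / reps         # the average delta-bar
    return delta_bar
```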

Furthermore, we can do regressions with $\bar\delta$ as output to see which parameters are important for the performance of CSE. Initially, the included regressors are $p$, $n$ and $\sigma^2$, their pairwise interactions, squares and square roots. Then, backward elimination is performed with the Akaike information criterion (AIC) as the model fit criterion (Venables and Ripley, 2002). For each model, let $\hat L$ be the maximized value of the log-likelihood function. The model fit criterion is then defined as $\mathrm{AIC} = 2(p - \hat L)$, where $p$ here denotes the number of parameters in the model. The backward elimination removes regressors that are not important for explaining $\bar\delta$.
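Venables and Ripley's stepAIC in R is the standard implementation of this procedure; a minimal Python analogue, assuming statsmodels and a pandas DataFrame X holding the candidate regressors, might look like this.

```python
import statsmodels.api as sm

def backward_eliminate(X, y):
    """Backward elimination on an OLS model: repeatedly drop the single
    regressor whose removal gives the lowest AIC, and stop when no
    removal improves on the current AIC. X is a pandas DataFrame of
    candidate regressors (here p, n, sigma^2, interactions, squares
    and square roots)."""
    cols = list(X.columns)
    current_aic = sm.OLS(y, sm.add_constant(X[cols])).fit().aic
    while len(cols) > 1:
        trial = {c: sm.OLS(y, sm.add_constant(X[cols].drop(columns=c))).fit().aic
                 for c in cols}
        drop, aic = min(trial.items(), key=lambda kv: kv[1])
        if aic >= current_aic:   # no single removal lowers the AIC
            break
        cols.remove(drop)
        current_aic = aic
    return cols
```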

The resulting model for a specific parameter combination $s$ is

\[
\bar\delta_s = \beta_0 + \beta_1 n_s + \beta_2 p_s + \beta_3 \sigma_s^2 + \beta_4 n_s p_s + \beta_5 n_s^2 + \beta_6 \sqrt{p_s} + e_s,
\]

where the $\beta_i$ are regression coefficients and $e_s$ is a zero-mean, normally distributed regression error.

The OLS estimates of the regression coefficients are given in Table 5.1. The right column states the $p$-values of a two-sided $t$-test with the null hypothesis that the coefficient is equal to zero. All regressors are significant at the 5 percent level.

Table 5.1: The effect of a parameter on the performance of CSE. The table is a summary of a regression with the estimation error $\bar\delta$ as the output. The left column names the regressors. The center column displays the OLS coefficient estimates, and the right column states the $p$-values of a two-sided $t$-test for the coefficient being zero.

Param.    β̂                p-value
1         6.1 × 10^−1      < 2.0 × 10^−16
n         −6.0 × 10^−5     7.0 × 10^−5
p         1.1 × 10^−2      1.0 × 10^−11
σ²        6.6 × 10^−3      3.7 × 10^−2
np        −8.5 × 10^−7     4.6 × 10^−12
n²        1.1 × 10^−8      1.8 × 10^−5
√p        −5.6 × 10^−2     3.7 × 10^−8

Table 5.1 provides useful information about when CSE is reliable. The results concur with expectations. The error $\bar\delta$ is low for high $n$, as we then have more information. When the number of variables increases, the error increases because the model is more complex. Noise also reduces the accuracy of the estimation. The error is at its highest when $p$ is close to $n$. This is the general case for any regression model. The significance of $n^2$ and $\sqrt{p}$ suggests that the error is not linear in $n$ and $p$.

It is clear that the performance of CSE is heavily influenced by the parameters in the observed data set. We should keep this in mind in applications. Next, we will consider CSE and MCE applied to an example.