Bayesian Calibration and Inference for Multiple Machines

(1)

Bayesian Calibration and Inference for Multiple Machines

July 2019

Master's thesis

Hong-Tan Lam

2019Hong-Tan Lam NTNU Norwegian University of Science and Technology Faculty of Information Technology and Electrical Engineering Department of Mathematical Sciences

(2)

(3)

Bayesian Calibration and Inference for Multiple Machines

Hong-Tan Lam

Applied Physics and Mathematics Submission date: July 2019

Supervisor: Ingelin Steinsland

Norwegian University of Science and Technology Department of Mathematical Sciences

(4)

(5)

”If you ever think that you can only be one thing in life, think again! A prediction point can be both an interpolation point and an extrapolation point at the same time,

depending on the machine you consider. Perspective is truly everything.”

(6)

(7)

Abstract

In this thesis, simulation models, also called simulators, are used to perform predictions on machines. One of the challenges with performing predictions using simulation models, is that they do not fully describe the true process of the machine. Often, there exists a model discrepancy: the difference between the simulator output and the true process.

The working hypothesis is that, by using multiple machines with some parameters and hyperparameters in common, it is possible that the model learns from all the machines, and perform better predictions.

There are three different models used to perform inference. The first model, is a model from Brynjarsd´ottir and O’Hagan (2014), slightly modified, with individual parameters and an individual discrepancy term. It is referred to as the individual model. Two new model frameworks are introduced in this thesis, referred to as Model 1 and Model 2. Model 1 assumes common parameters for the machines and has an individual discrepancy term.

Model 2 assumes common parameters for the machines, and has a common discrepancy term and an individual discrepancy term. All three models are Latent Gaussian models, with their discrepancy terms following Gaussian fields with Mat´ern covariance functions.

The last two models evaluates the multiple machines simultaneously, and are used to create observations, resulting in two new types of machines, referred to as the Ideal machines 1 and the Ideal machines 2 respectively. These ideal machines are created to learn more about the predictive perfomances of Model 1 and Model 2. Additionally, the three models are applied on multiple simple machines, the same machine described by Brynjarsd´ottir and O’Hagan (2014).

There are four types of prediction points that are tested: interpolation points, extrapolation points, pseudo extrapolation points (prediction points that are extrapolation points for some machines and interpolation points for others) and prediction points that are located far from the observations. Different designs of where the observations created are located, are used to test the models predictive performance on the four types of prediction points.

The prediction is performed using the R-package INLA, which is less computationally expensive than using Markov chain Monte Carlo algorithms.

For all four types of prediction points, evaluating the machines simultaneously gives better results than evaluating the machines individually. Model 2 has the best predictive performance for all three machines and all four types of prediction points. Model 1 has equally good predictive performance for the Ideal machines 1 as Model 2, and in general better predictive performance than the individual model. There is not one type of prediction point for any of the multiple machines, where the individual model has the best performance.

The results are found using only synthetical machines and only when the calibration parameter has the same value for all machines. Suggestions for further research, are to use real machines or to vary the value of the calibration parameter for all machines.

(8)

Sammendrag

I denne oppgaven, blir simuleringsmodeller, ogs˚a kalt simulatorer, brukt til utføre prediksjon p˚a maskiner. En av utfordringene med ˚a gjøre prediksjon med simuleringsmodeller, er at de ikke fullt beskriver den sanne prosessen til maskinen. Ofte eksisterer det et modellavvik: forskjellen mellom simulatoren og den sanne prosessen. Hypotesen, er at ved ˚a bruke flere maskiner med noen parametere og hyperparametere til felles, er det mulig at modellen lærer fra alle maskinene, og utfører bedre prediksjon.

Det er tre forskjellige modeller brukt til inferens. Den første modellen, er en modell fra Brynjarsd´ottir and O’Hagan (2014), litt modifisert, med individuelle parametere og et individuelt avviksledd. Den er referert til som den individuelle modellen. To nye modell- rammeverk er introdusert i denne oppgaven, referert til som Modell 1 og Modell 2. Modell 1 antar felles parametere for maskinene, og har et individuelt avviksledd. Modell 2 antar felles parametere for maskinene, og har et felles avviksledd og et individuelt avviksledd.

Alle tre modellene er latente Gaussiske modeller, med avviksledd som følger Gaussiske felt med Mat´ern kovariansfunksjoner. De siste to modellene evaluerer maskinene samtidig, og er brukt til ˚a lage observasjoner, noe som resulterer i to nye typer maskiner, som henholdsvis er referert til som de Idelle maskiner 1 og de Idelle maskiner 2. Disse idelle maskinene er lagd for ˚a lære mer om prestasjonene p˚a prediksjonene til Modell 1 og Mod- ell 2. I tillegg, er de tre modellene brukt p˚a de multiple enkle maskinene, den samme maskinen beskrevet av Brynjarsd´ottir and O’Hagan (2014).

Det er fire typer prediskjonspunkter som er testet: interpolasjonspunkter, ekstrapolasjonspunkter, pseudo-ekstrapolasjonspunkter (prediksjonspunkter som er ekstrapolasjonspunkter for noen maskiner og interpolasjonspunkter for andre) og prediksjonspunkter som lig- ger langt fra observasjonene. Forskjellige designer for hvor observasjonspunktene skal ligge, er brukt til ˚a teste modellenes prestasjoner p˚a prediksjonene p˚a alle fire typer prediksjonspunkter. Prediksjonene er utført ved ˚a bruke R-pakken INLA, som er mindre bereg- ningskrevende enn ˚a bruke Markov-kjede-Monte-Carlo-algoritmer.

For alle fire typer prediksjonspunkter, gir evalueringene av maskinene samtidig bedre re- sultater enn ˚a evaluere maskinene individuelt. Modell 2 har best prestasjon p˚a prediksjonene for alle tre maskiner og alle fire typer prediksjonspunkter. Modell 1 har like god prestasjon p˚a prediksjonene for de Idelle maskiner 1 og de Idelle maskiner 2, og generelt bedre prestasjon p˚a prediksjonene enn den individuelle modellen. Det finnes ikke en eneste type prediksjonspunkt for noen av de multiple maskinene hvor den individuelle modellen har best prestasjon. Resultatene er funnet bare ved ˚a bruke syntetiske maskiner og bare n˚ar kalibreringsparameteren har den samme verdien for alle maskinene. Forslag til videre forskning, er ˚a bruke ekte maskiner eller ˚a variere verdien til kalibreringsparameteren for alle maskinene.

(9)

Preface

I remember the first three months working on my Master’s thesis. It was mostly getting to know how INLA works, which was difficult, as it is a package with little documentation, and getting errors is very easy. I am not going to lie and say that it was all a walk in the park. At one point, I was stuck on the same bug for a month. In those couple of weeks, I was just forcing myself to sit at campus, watching the time pass by, as there was no progression in my work. I was really hating my thesis at this point, and my goal was just to deliver something acceptable, as I had already managed to get a job, and I did not bother too much with the grade.

However, there was this one day when everything changed. It was at the end of April, two months before the deadline. I decided to take a closer look at my code, trying to read the documentation for everything I had written, hoping that the documentation existed. And there, after a month, I finally found my bug. From there on, it kind of was like a walk in the park. That is just how good I am. No, but seriously, that is the first big lesson I have learned from my Master’s thesis: that patience matters a lot. To stop and take a closer look, whether it is in a piece of code or if it is in life in general, is one of the things that is easy to neglect in our daily lives. And sometimes, a closer look, is all you need to resolve some of the most difficult problems you may face.

The second thing I have learned from my Master’s thesis, is more on how research is done.

For the last two years I have been preparing myself for a data analytics job, but I have been very curious about how it is to do research. That my Master’s thesis gave me insight on how research is done, is something I have been very grateful for. It was this appreciation combined with the patience I gained, that finally made me enjoy working on my thesis.

There are several people I want to thank. Thank you to Ingelin Steinsland, who have been my supervisor. She has told me which experiments are interesting to perform, a little about how different designs can help produce informative results, as well as helping me improve my writing and the structure of this thesis. I also want to thank my friends and family.

They are truly everything to me.

Hong-Tan Lam, Trondheim, July 2019.

(10)

(11)

Chapter 1 Introduction

Many physical sciences use simulation models to describe physical processes. According to Winsberg (2003), a simulation model is a model of the underlying physics, developed on a computer, and by using computationally intensive methods, it is possible to learn about the physical process. The applications of simulation models keeps growing. One of the reason for this, is that a simulation model can be a virtual model of what an Internet of Things (IoT) device is tracking, and the number of IoT devices installed worldwide is increasing (Statista, 2016). Such devices, can for example, track the performance of certain processes on a boat, the lifetime of a bridge or the blood pressure in a human.

However, although simulation models are widely used, there can occur many uncertainties with simulation models. To illustrate what uncertainties might occur, the example of a simple machine is used.

The simple machine is a theoretical machine created by Brynjarsd´ottir and O’Hagan (2014).

It is a machine producing work as a function of effort and efficiency. Although the simple machine is a theoretical machine, and thus, the relation between work, effort and efficiency is known, the machine is pretended to be a real machine, where this relation is unknown.

This relation is only used to create the observations, and is referred to as the true process of the machine. After the observations are created using the true process, parts of the true process is pretended to be unknown. Work is considered as the output, the quantity to be predicted, effort is considered as the control variable, a quantity controlling the output, and efficiency is considered as the calibration parameter, a parameter with unknown value.

In this case, the simulation model is a computer model with incomplete information about the true process, and is referred to as simulator for the rest of the thesis. The simulator tries to learn about the true process by using (synthetical) observations, a process called calibration.

According to Kennedy and O’Hagan (2001), there are six types of uncertainties that can occur. For this thesis, only the three most relevant uncertainties are described:

(14)

Chapter 1. Introduction

• Model discrepancy: difference between the true process’ mean value of output, and the simulator output with true input values. In the case of the simple machine, using the true value for efficiency, this is the difference between the prediction of work from the simulator and what the work really is (true process).

• Observation error: difference between the observations measured and the true output value. In the case of the simple machine, this is the difference between the work observed and the true work (true process). Since the true process is used to generate the observations for the simple machine, normally distributed errors are added to create synthetical observation errors.

• Parametric variability: uncertainty in the simulator output due to lack of information about some parameters. For the simple machine, this uncertainty occurs due to letting the efficiency follow a distribution with little to no information (e.g. by letting it follow a non-informative prior).

The main goal of the simulator, is to predict work, but this is difficult if the simulator can- not account for the abovementioned uncertainties. Kennedy and O’Hagan (2001) suggests a model they claim takes account of all the abovementioned uncertainties, as well as the three types of uncertainties not described in this thesis. The model suggests that work is the sum of three terms: the work predicted by the simulator (also called simulator output), the model discrepancy and the observation error. It uses a Bayesian framework, by setting priors for the parameteres and hyperparameters. Then, observations are used for calibration, i.e. to learn more about the unknown parameters and hyperparameters.

Brynjarsd´ottir and O’Hagan (2014) uses the model of Kennedy and O’Hagan (2001) to perform prediction for the simple machine. The model they use that is of interest in this thesis, succeeds at interpolation. For extrapolation however, the posterior distribution does not center around the true value.

In this thesis, the main focus is to introduce Bayesian calibration model frameworks that considers multiple machines simultaneously, apply them to study predictions of multiple machines, and to study the differences between the models’ predictive performances. The multiple machines studied in this thesis, all have the same value for the calibration parameter, while the true processes are different. Although the true processes for the machines are different, the multiple machines’ parameters are still the same. If a parameter is different for the multiple machines, then the distribution generating this parameter, have the same hyperparameters.

Three models are used in this thesis. The first model is similar to the model Brynjarsd´ottir and O’Hagan (2014) use, referred to as the individual model. The second model is a similar model, as it assumes that the machines have individual discrepancy. However, it also assumes that the machines have the same values for the same parameters and hyperparameters. It is referred to as Model 1. The third model, is similar to Model 1, but in addition to assuming that the machines have the same values for the parameters and hyperparameters and have an individual discrepancy term, it assumes the machines have a common discrepancy term that is identical for all the machines. It is referred to as Model 2. All of these models are latent Gaussian models, with the model discrepancy term(s) following

(15)

zero-mean Gaussian fields with Mat´ern covariance functions. For such models, it is possible to use the R-package INLA by Rue et al. (2009b). This package is used for this thesis, as it has a shorter computation time for these models, compared to Markov chain Monte Carlo algorithms.

Model 1 and Model 2 are two new model frameworks introduced in this thesis. They evaluate the machines simultaneously, unlike the individual model, which is fitted to the observations for each machine separately. To learn more about Model 1 and Model 2 and how well they perform predictions, the models are tested on observations generated by the models themselves. The theoretical machines generating these observations, are referred to as the Ideal machine 1 and the Ideal machine 2. Thus, there are three models applied on three types of machines in this thesis: the Ideal machines 1, the Ideal machines 2 and the multiple simple machines. The prediction points are divided into four different categories:

interpolation points, extrapolation points, pseudo extrapolation points and points that are located far from the observations (at least far from the observations on one side). Pseudo extrapolation points is a term used by the author as prediction points that are extrapolation points for some machines, and interpolation points for other machines, all of which are evaluated simultaneously by Model 1 and Model 2 for prediction.

To evaluate the predictive performance of the models on these four types of prediction points, and in which case predictive performance is better for a specific type of prediction point, different designs deciding where the observations are located are needed. Hence, three designs are created. They are referred to as Design 1, Design 2 and Mixed design.

The aim of this thesis, is to introduce Bayesian calibration model frameworks that consider multiple machines simultaneously to perform predictions. These models can be seen as an extension of the framework introduced by Kennedy and O’Hagan (2001), and it is studied how and when it is beneficial to consider the multiple machines simultaneously with regards to predictive performance. For which of the four types of prediction points, and for which of the three machines are predictive performance improved?

Chapter 2 explains the simple machine in depth. In the end of the chapter, a descrip- tion of multiple simple machines is given. Chapter 3 gives background information about Bayesian statistics, latent Gaussian models, Gaussian fields and how the predictive quality can be measured. It also includes a summary of the relevant results obtained by Bryn- jarsd´ottir and O’Hagan (2014) for the simple machine. Chapter 4 describes the different models, the different machines, the different designs and an overview of how many times a model is applied to which machine for which design. Chapter 5 describes the results and finally, chapter 6 gives a discussion and conclusion of the results.

(16)

Chapter 1. Introduction

(17)

Chapter 2 Case: Multiple simple machines

2.1 The simple machine

One of the physical systems that is studied in this thesis, is the simple machine. It is an example found in Brynjarsd´ottir and O’Hagan (2014), so the model, the supplied termi- nology, the experimental setup and the results presented in this section, are based on the paper of Brynjarsd´ottir and O’Hagan (2014).

The simple machine is a physical system, where the work is modeled, and depends on the amount of effort put into the machine and the efficiency of the machine. Both the efficiency and effort put into the machine are inputs. The efficiencyθis a calibration parameter, a parameter with an unknown value, while the effort put into the machine is a control variablex, a variable controlling the output.

True process and data generation:

The workζ(x)done by the simple machine is given by ζ(x) = θx

1 +x/a, (2.1)

where

• ζ(x)represents the work done by the machine,

• xrepresents the effort,

• θ= 0.65represents the efficiency,

• a= 20is a constant.

The observations are synthetic and are generated from zi=ζ(xi) +i, i= 1, ..., n

_i∼ N(0,0.01²) (2.2)

(18)

Chapter 2. Case: Multiple simple machines where

• z_irepresents the work of thei’th observation,

• x_i∈[0.2,4]represent the effort of thei’th observation,

• ζ(xi)represents true work of the simple machine (equation (2.1)),

• i’s represents the observation errors and are independent and identically distributed (i.i.d.) normal random variables,

• n= 11,31and61represents the number of observations.

The experiment is performed three times. One time with 11 observations, one time with 31 observations and one time with 61 observations. For each experiment, the effortxi’s of the observations are evenly spaced over the interval[0.2,4].

After the observations are created, θ is treated as unknown, and is estimated later on.

Notice how the mean of (2.1) is the true process of the simple machine, as the observation errors_i’s have a mean of zero. The goal of predicting the work with a simulator (the simulation model used to learn about the physical process), is to predict the work from the true process.

Simulator:

The simulator for the simple machine, is set up so that effortxis proportional to the work, with the efficiencyθ as the proportionality constant. With the simulator outputη(x, θ) representing work, the equation for the simulator can be written as

η(x, θ) =θx. (2.3)

An important detail to notice, is that the efficiencyθin the simulator (2.3) represents the sameθas in the true process (2.1). From this fact, it is very obvious that there is a difference between the work predicted by the simulator and the work performed by the machine in the true process. Note thatζ(x)always has a smaller value than the linear equation (2.3) for the same value ofθ. This is due to the fact that the denominator of (2.1) is greater than or equal to1forx ≥0. As the denominator becomes larger with increasingx, the difference between the simulator outputη(x, θ)and the workζ(x)done by the true process, becomes larger. This difference is called the model discrepancy, and can be seen as loss of energy (e.g. due to friction). InFig. 2.1, one can see the model discrepancy as the difference between the blue and red curve. The figure shows the work versus effort for the true process using equation (2.1) (blue curve) and the simulator using equation (2.3) (red curve). (2.2) is used to plot the observations (n= 11, black points) and the two red points atx= 1.5andx= 6are the prediction points. The curve of the true process, the simulator and the black points of the observations in the figure, all used the true valueθ= 0.65.

(19)

2.2 Multiple simple machines

0 1 2 3 4

0 2 4 6

x (effort)

Work

Observations Prediction points Simulator True process

Work delivered by machine

Figure 2.1:Work vs. effort for the simulator (red curve), the true process (blue curve) andn= 11 observations evenly spaced over the interval[0.2,4](black points). The red points on the curve of the true process represent the prediction points. The calibration parameter is set toθ= 0.65for both the simulator and the true process, anda= 20for the true process.

Prediction:

Brynjarsd´ottir and O’Hagan (2014) perform interpolation at pointx= 1.5and extrapolation at pointx= 6.

2.2 Multiple simple machines

One of the goals of this thesis, is to perform inference for multiple simple machines simultaneously, and see if the predictions of work becomes better than if the inference is performed on the simple machines individually. The simple machines have the same fixed true value for the efficiencyθ, but different values fora. I.e. the machines are still simple machines, but with individual model discrepancies. The workζj(x)done by machine numberj, is given by

ζ_j(x) = θx 1 +x/aj

,

aj∼ N(0, σ²_a), j = 1,2, ..., N

(2.4)

whereθandσ²_ahave the same values for all theN machines, and theaj’s are generated from a normal distribution. For the multiple simple machines, the same simulator used for

(20)

Chapter 2. Case: Multiple simple machines the simple machine (equation (2.3)) is used.

Fig. 2.2, shows work vs. effort for 100 different simple machines. The machines have the same value for the efficiencyθ = 0.65, but thea_j’s have been generated from a normal distributionN(0,2²), and differ for all the machines. The red line represents the simulator θx, with the true value ofθ= 0.65. Notice that, not only do the machines differ more from the simulator with larger values ofx, but the simple machines differ more from each other with larger values ofxas well. This can also be seen from the prediction points, which are the black points in the figure. The prediction points get more spread with larger values ofx. In this thesis, observations are generally generated in the intervalx∈[0.2,4], so the prediction points atx= 1,2and3are interpolation points, and the points atx= 6and8 are extrapolation points.

0 2 4 6 8

012345

Work vs. effort for 100 simple machines

x (effort)

Work

Simulator True processes Prediction points

Figure 2.2: Work versus effort for 100 simple machines. The grey curves represent the true processes for the different simple machines, the red curve represent the simulatorθxand the black points represent the prediction points. The efficiency has the same value for all the machinesθ= 0.65and is also the same value that is used for the simulator in this plot.avaries for all the true processes of the different machines, and is generated fromN(0,2²).

(21)

Chapter 3 Background

The models used in this thesis are called Latent Gaussian models (LGM), where one or two of the models’ terms follow a Gaussian field (GF). To perform inference, Bayesian statistics is needed coupled with Markov chain Monte Carlo (MCMC) algorithms. However, there is a challenge with using MCMC algorithms, as they are computationally expensive, especially for a large system with many machines. To get around this, a package called INLA (stands for Integrated nested Laplace approximations) by Rue et al. (2009b) is used.

INLA offers an alternative to MCMC algorithms and for the models used in this thesis, is much less computationally expensive. In the following sections, the concepts of Bayesian statistics, LGM and GF are explained, as well as a brief introduction to INLA.

3.1 Bayesian statistics

The brief overview in this section is based on Robert (2007).

The main idea of Bayesian statistics, is to merge prior knowledge about an event, and the data obtained, to make inferences. This is done to predict new data points, or to estimate parameters.

Parameter estimation:

Consider data points z(this can be a vector of values), with the pdf f(z|θ) whereθ is a vector of parameters. Note that θ andz are general vectors of parameters and data points respectively, not necessarily the θandz used in the case of the simple machine.

Bayesian statistics is concerned with estimation ofθgiven the dataz. The value ofθcan be estimated with Bayes’ theorem:

f(θ|z) = f(z|θ)f(θ) R

Θf(z|θ)f(θ)dθ ∝f(z|θ)f(θ), (3.1) where

(22)

Chapter 3. Background

• zrepresents the data

• θrepresents the parameter(s)

• f(θ|z)is the posterior distribution (also called posterior) forθ, which is the distribution used to estimateθ.

• f(θ)is the prior distribution (also called prior) ofθ. This represents the knowledge aboutθbefore observing the dataz.

• f(z|θ)is the distribution for the data given the model. It is called the likelihood function, and is the same likelihood function used in frequentist statistics.

The integral in the denominator integrates over the domain ofθ, and because the integral becomes a constant, the posterior becomes proportional to the prior distribution and likelihood function. Notice that, as the posterior is proportional to a product of the prior and the likelihood function, it contains both information of what was thought to be the distribution ofθbefore obtaining the data, and information from the data itself. This is how the idea of merging prior knowledge and the information from the data to make inference (the main idea of Bayesian statistics) shows up in Bayes’ theorem.

Prediction:

Bayesian statistics uses the posterior predictive distribution, to predict a new data pointz.˜ The posterior predictive distribution has the following form

f(˜z|x) = Z

Θ

f(˜z|θ)f(θ|z)dθ, where

• z˜is the data point to be predicted.

• f(˜z|z)is the posterior predictive distribution, the distribution used to predict the pointz.˜

• f(θ|z)is the posterior distribution found in equation (3.1).

• f(˜z|θ)is the likelihood function.

Notice that the integral integrates over the domain ofθ, accounting for the uncertainty in θ. Both the prior knowledge ofθand the information from the data can be found in the posterior distributionf(˜z|x).

Credible interval (CI):

Credible intervals are the Bayesian statistics’ analogue to frequentist statistics’ confidence intervals. A100(1−α)% credible interval is an interval covering100(1−α)% of the posterior distribution. For prediction, it can be expressed as(˜zL,z˜U), where

P(˜zL< θ <z˜U|z) = Z z˜U

˜ zL

f(˜z|z)dθ= 1−α.

In this thesis, the mean and the 95% credible intervals using the 2.5th percentile to the 97.5th percentile of the posterior distributions, are used for predictions.

(23)

3.2 Latent Gaussian Models (LGM)

This section is based on Rue et al. (2009a) and (Blangiardo and Cameletti, 2015).

Letz = (z1, ..., zn)be the observed data, wherenis the number of data points. A distribution for the data pointszis specified, and in this thesis, the specified distribution is a Gaussian distribution. When the specified distribution is a Gaussian distribution, the mean E[z_i]is equal to the additive predictorη_i. It can be written as

ηi=β0+

M

X

m=1

βmxmi+

L

X

l=1

fl(yli), where

• iis the index for thei’th data point

• β0represents the intercept,

• β= (β1, ..., βM)represents the linear effects of explanatory variables,

• x1i, ..., xM iare covariates for data pointi,

• f = (f1(·), ..., fL(·))are functions on covariatesy1i, ..., yLi, that can take various forms (including non-linear effects),

• M andLrepresent the number of linear effects and number of functions respectively.

The model is a latent Gaussian model iff a Gaussian prior is assigned toβ₀, theβ_k’s and the fl(·)’s. In this thesis τ = (β0,β,f)is referred to as the latent field, and the other parametersψ = (ψ1, ..., ψK)are referred to as hyperparameters, whereKis the number of hyperparameters. It is assumed that the observations are conditionally independent, and thus, thenobservations have the following likelihood

p(z|τ,ψ) =

n

Y

i=1

p(z_i|τ_i,ψ).

Here, eachziis connected to aτi, only one element inτ.

τ follows a multivariate Normal priorτ ∼ N(0,Q(ψ)⁻¹), where the density function is given by

p(τ|ψ) = (2π)^−n/2|Q(ψ)|^1/2exp (−1

2τ⁰Q(ψ)τ),

where⁰ denotes the transpose and| · |the discriminant. Supposedly, the components of τ are conditionally independent, and this implies thatQ(ψ)is a sparse matrix. With this specification, the latent Gaussian field becomes a Gaussian Markov random field (GMRF).

Due to the precision matrix Q(ψ)being sparse, INLA saves considerable computation time as numerical linear algebra for sparse matrices can be used.

The objective is to find the marginal posterior distributions for each element inτ andψ, which can be calculated as

p(τi|z) = Z

p(τi,ψ|z)dψ= Z

p(τi|ψ,z)p(ψ|z)dψ, (3.2)

(24)

Chapter 3. Background and

p(ψ_k|z) = Z

p(ψ|y)dψ_−k (3.3)

respectively. Here,τiis one element in the latent fieldτ and the indexdψ_−kmeans that the integrand is integrated over all ofψexceptψk. Thus, to find the marginalsp(τi|z)and p(ψk|z),p(ψ|z)andp(τi|ψ,z)need to be computed. To do this, the method of Laplace approximation is used.

Laplace approximation

The objective of Laplace approximation is to approximate the following integral Z

f(x)dx= Z

exp(log(f(x))dx. (3.4)

Here,f(x)is the density function of a random variableX.log(f(x))can be represented as a Taylor series expansion

log(f(x)) = log(f(x^∗)) +(x−x^∗)² 2

∂²log(f(x))

∂x² _x=x_∗

(3.5) evaluated atx = x^∗, which is the mode off(x)(x^∗ = argmax_xlog(f(x)). Thus, the integral can be written as

Z

f(x)dx= exp(log(f(x^∗)) Z

exp

(x−x^∗)² 2

∂²log(f(x))

∂x² x=x^∗

dx, (3.6) which is found by substitutingf(x)by the exponential of the right side of (3.5). By letting σ²^∗ = −1/^∂²^log_∂x2^f(x)

_x=x_∗

, the integrand becomes similar to the density of a normal distribution, and (3.6) can be written as

Z

f(x)dx= exp(log(f(x^∗)) Z

exp

−(x−x^∗)² 2σ²^∗

dx.

Note that the integrand can be seen as the kernel of a normal distribution, wherex^∗is the mean andσ²^∗is the variance. Hence, the integral evaluated in the interval(α, β)becomes

Z β α

f(x)dx=f(x^∗)√

2πσ²^∗(Φ(β)−Φ(α)),

whereΦ(·)represents the cumulative density function of a normal distribution with mean x^∗andσ²^∗. Note that, although Laplace approximation is an approximation method, only equality signs are used. This is because for Gaussian likelihoods, the expressions become equal.

Inference onτandψ:

To find the marginalsp(τi|z)andp(ψk|z),p(ψ|z)andp(τi|ψ,z)need to be computed

(25)

3.2 Latent Gaussian Models (LGM) (this can be seen in equation (3.2) and (3.3)). The joint posterior of the hyperparameters can be calculated as follows

p(ψ|z) = p(τ,ψ|z) p(τ|ψ,z)

= p(z|τ,ψ)p(τ,ψ) p(z)

1 p(τ|ψ,z)

= p(z|τ,ψ)p(τ|ψ)p(ψ) p(z)

1 p(τ|ψ,z)

∝ p(z|τ,ψ)p(τ|ψ)p(ψ) p(τ|ψ,z)

= p(z|τ,ψ)p(τ|ψ)p(ψ)

˜

p(τ|ψ,z) _τ_=τ_∗_(ψ)

=: ˜p(ψ|z),

wherep(τ˜ |ψ,z)is the Laplace approximation described in the last paragraph, andτ^∗(ψ) is the mode givenψ.

To compute p(τi|ψ,z)is more complicated, as there are generally more elements inτ than inψ. One option is to rewriteτ = (τi,τ_−i), whereτ_−iare all the components ofτ exceptτi, and then calculate the posterior conditional distributionsp(τi|ψ,z)as follows

p(τi|ψ,z) =p((τi,τ−i)|ψ,z) p(τ_−i, τi,ψ,z)

=p(τ,ψ|z) p(ψ|z)

1 p(τ_−i|τ_i,ψ,z)

∝ p(τ,ψ|z) p(τ_−i|τi,ψ,z)

= p(τ,φ|z)

˜

p(τ_−i|τi,ψ,z) _τ

−i=τ^∗_−i(τi,ψ)

:= ˜p(τi|ψ,z).

(3.7)

Here,p(τ˜ _−i|τ_i,ψ,z)is the Laplace approximation described in the last paragraph, and τ^∗_−i(τ_i,ψ)is its mode. However, the standard option to calculatep(τ_i|ψ,z)is to perform the simplified Laplace approximation, which is related to a Taylor’s series expansion of

˜

p(τi|ψ,z)from (3.7). A more detailed explanation of this, can be found in Blangiardo and Cameletti (2015).

Oncep(τ˜ i|ψ,z)andp(ψ|z)are obtained,p(τi|z)can be approximated by

˜

p(τi|z)≈ Z

˜

p(τi|ψ,z)˜p(ψ|z)dψ, which are solved numerically through

˜

p(τi|z)≈X

j

˜

p(τi|ψ^(j),z)˜p(ψ^(j)|z)∆j,

for some integration points {ψ^(j)}with the related set of weights{∆j}. To find each marginal posterior p(ψ˜ k|z), an interpolation algorithm based on the density ofp(ψ,˜ z)

(26)

evaluated at the integration points {ψ^(j)} can be used. More on how INLA finds the integration points{ψ^(j)}, can be found in Blangiardo and Cameletti (2015).

3.3 Gaussian fields (GF) and Stochastic partial differen- tial equation (SPDE)

This section is based on Lindgren et al. (2015) and (Blangiardo and Cameletti, 2015).

A spatial process can be denoted as {ξ(x), x ∈ D ∈ R¹}, where xis a spatial index andD is the fixed domain whichxvaries continuously on. If for anyn ∈ Nand any set of locations (x1, ..., xn) the following vector (ξ(x1), ..., ξ(xn)) follows the multivariate normal distribution Nn(µ,Σ), where µis the mean and Σ is a spatially struc- tured covariance matrix, the spatial process is called a Gaussian field (GF). The elements of the covariance matrix Σare defined by the covariance function C(·,·) where Σij =Cov(ξ(xi), ξ(xj)) =C(ξ(xi), ξ(xj)).

A spatial process is called second-order stationary ifµ(xi) =µfor alli’s, andΣis only dependent on the distance vector(xi−xj), such that Cov(ξ(xi), ξ(xj)) =c(xi−xj). If the covariance function only depends on the euclidean distancekxi−xjk(which is the absolute value as only one dimension is considered in this thesis), then the spatial process is called isotropic.

Gaussian fields can be seen as solutions to the linear fractional stochastic partial differen- tial equation (SPDE)

(κ²−∆)^α/2(τ ξ(x)) =W(x), (3.8) whereκ > 0 is the scale parameter, ∆ is the Laplacian operator,αandτ control the smoothness and the variance respectively andW(x)is a Gaussian white noise process.

The solution to this SPDE is a GFξ(x)with the mat´ern covariance matrix Cov(ξ(x_i), ξ(x_j)) =Cov(ξ_i, ξ_j) = σ₀²

Γ(λ)2^(λ−1)(κ|x_i−x_j|)^λK_λ(κ|x_i−x_j|), (3.9) where| · |is the absolute value (euclidean distance in one dimension),σ₀²is the marginal variance and Kλ is the modified Bessel function of orderλ > 0 and second kind. λ measures the degree of smoothness of the process andκ >0is a scaling parameter. It is related to the rangerin the following way:r=

√ 8λ κ .

To link the SPDE from (3.8) with the parameters in the mat´ern covariance matrix (3.9), the following equations are used

λ=α−1/2 r=

√8λ κ

σ²₀= Γ(λ) Γ(α)(4π)^1/2κ^2λτ².

(3.10)

(27)

3.4 Bayesian calibration and inference for a single simple machine The default choice forαis 2 in R-INLA, and this is the value that is used in this thesis.

According to Rasmussen (2004), Whenα→ ∞, the mat´ern covariance matrix converges to the squared exponential covariance matrix with the form

Cov(ξ(x_i), ξ(x_j)) =σ₀²exp

−2d² r²

, (3.11)

whered=|xi−xj|. In R-INLA,0≤α≤2. As 2 is the maximum value forαin R-INLA, it is set toα= 2, so that the covariance matrix gets as similar to a squared exponential covariance matrix as possible. Hence, λ= 3/2(as seen from (3.10)), and according to Rasmussen (2004), the covariance matrix obtains the following form

Cov(ξ(xi), ξ(xj)) =σ²₀ 1 + 2√ 3d r

!

exp −2√ 3d r

!

. (3.12)

The solution to the SPDE (the GFξ(x)) can be approximated in the following way ξ(x) =

G

X

g=1

ϕ_g(x) ˜ξ_g.

Here,Grepresents the number of B-spline knot locations, the set{ξ˜g}are Gaussian distributed weights with mean zero and{ϕg}is the set of deterministic basis functions. The precision matrixQforξ˜={ξ˜1, ...,ξ˜G}is given by

Q=τ²(κ⁴C+ 2κ²G+GC⁻¹G), where the elements of the diagonal matrixCis given byC_ii=R

ϕ(x)dsand the elements of the sparse matrixGis given byG_ij =R

∇ϕ_i(x)∇ϕ_j(x)ds, where∇is the gradient operator. As the precision matrix Q is sparse, ξ is a GMRF and has the distribution N(0,Q⁻¹). It is an approximation to the solution of the SPDE.

3.4 Bayesian calibration and inference for a single simple machine

Model

One of the models that Brynjarsd´ottir and O’Hagan (2014) use to perform inference on the simple machine, the model that is of most interest in this thesis, is a model from Kennedy and O’Hagan (2001). After adjusting the model to fit the framework of the simple machine, the following model is obtained

z_i=θx_i+δ(x_i) +_i, (3.13)

whereziis the work for thei’th observation,θis the efficiency andi represents the observation error.δ(xi)represents the model discrepancy and follows a zero-mean Gaussian

(28)

process with a squared exponential covariance function. Mathematically, it can be written as

δ(·)∼ GP 0, σ₀²c(·,·|ψ) , where

c(x₁, x₂|ψ) = exp −

x1−x2

ψ

²!

. (3.14)

This is equivalent to a GF with a mat´ern covariance function, whereα → ∞(equation (3.11), where2/r²= 1/ψ²). They leti∼ N(0, σ²),σ²andσ₀²follow a inverse gamma distribution,ψfollow a truncated gamma distribution andθhave a non-informative prior.

By using Gibbs sampling and the Metropolis-Hastings algorithm, they perform inference to perform the prediction atx = 1.5 andx = 6, and the estimation of the calibration parameterθ.

Results:

The results of Brynjarsd´ottir and O’Hagan (2014) that is of interest for this thesis, can be summarized as follows:

• The model succeeded at interpolatingx= 1.5, as the posterior distribution centered around the true value, and the 90% credible interval became smaller with more observations.

• For extrapolation atx= 6, the posterior distribution did not center around the true value. However, the 90% credible interval covered the true value.

3.5 Evaluation of predictions

To measure the predictive performance, mean squared error (MSE), continuous rank probability score (CRPS), coverage probability and length of credible intervals are used.

Mean squared error (MSE):

According to Saigal and Mehrotra (2012), the MSE measures how accurate the predictions are and has the numerical value

MSE= 1 N

N

X

j=1

(ˆzj−zj)².

Here,zˆjandzj represent the prediction (the mean of the predictive distribution) and the true value respectively. In this thesis,N is the number of machines and the indexjrep- resents the machine number. The closer the predictions are to the true values, the smaller MSE becomes. Thus, a smaller MSE indicates better predictions.

Continuous rank probability score (CRPS):

According to Gneiting and Raftery (2007), the CRPS can be defined as follows CRPS(F, z) =

Z ∞

−∞

(F(y)−1{y≥z})²dy, (3.15)

(29)

3.5 Evaluation of predictions whereF(·)is the cumulative density function of the predictive distribution,z is the true value of what is being predicted and1{y≥z}represents the Heaviside function: the term takes value 1 ify ≥ zand0 otherwise. According to Gneiting and Raftery (2007), the CRPS of a normal predictive distribution can be calculated as follows

CRPS(N(µ, σ²), z) =σ z−µ

σ

2Φ

z−µ σ

−1

+ 2φ z−µ

σ

− 1

√π

,

whereµandσare the mean and standard deviation of the predictive distribution respectively, andφandΦare the probability density function and cumulative density function of a standard normal distribution respectively. A small CRPS indicates better predictions, as the term(F(y)−1{y ≥ z})²is smaller, which can intuitively be seen as the difference between the cumulative distribution function of the prediction and the cumulative distribution of the true value squared. CRPS can also be viewed as the sum of two terms, one representing sharpness and one representing calibration. According to Gneiting and Raftery (2007), sharpness is the concentration of the distribution, and calibration is the consistency between the predictive distribution and the true value.

Coverage probability:

According to Casella and Berger (2002), coverage probability is the probability that the credible interval contains the true value. Although the credible intervals of interest in this thesis are set to 95%, coverage probability is the true probability that the credible interval covers the true value, which might differ from 95%. To calculate this percentage, predictions are performedN times (one time for each machine) per prediction point. Then, by dividing the number of times the credible intervals cover the true value byN, an approximation of the coverage probability is found for the specific prediction point.

To evaluate the coverage probability found, the corresponding p-value is used. A p-value is the probability of obtaining the coverage probility found or a more extreme coverage probability, given that the true coverage probability is 95%. The p-value is calculated using only one tail. That is, if a coverage probability is smaller than 95% is found, the p- value is the probability of obtaining this specific coverage probability or a smaller coverage probability. If the coverage probability found is larger than 95%, then the p-value is the probability of obtaining this specific coverage probability or a larger coverage probability.

If the p-value is smaller than 5%, it is concluded that the credible interval is not a 95%.

Otherwice, it is concluded that the credible interval is 95% credible interval.

Length of CI:

The length of the credible intervals are also considered in this thesis. A smaller credible interval is considered as better than a larger credible interval, as the prediction is more certain of the true value. However, it is important to check if the credible intervals cover the true values for a method. Smaller credible intervals are not necessarily better if the intervals always miss the true values. Thus, it is important to also evaluate coverage probability, to check if the method producing shorter credible intervals does cover the true values the amount of times it is supposed to. To do this, the p-values of the coverage probabilities are evaluated.

(30)

(31)

Chapter 4 Bayesian calibration and inference for multiple machines

4.1 Multiple ideal machine models

There are three models that are used to perform inference on the multiple simple machines.

The first model, referred to as the individual model, is the model from Brynjarsd´ottir and O’Hagan (2014) with some modification to make it possible to run in INLA. This model assumes the machines have individual model discrepancy and individual parameters. The second model, referred to as Model 1, assumes the machines have individual discrepancy, while the parameters are the same. It is an extension of the individual model, as it is the same model, with the assumption that the parameters are identical included. The third model, referred to as Model 2, assumes the machines have an individual discrepancy term, common parameters and a common discrepancy term that is identical for all the machines.

It is an extension of Model 1, as it is the same model with a common discrepancy term added. The individual model evaluates the machines individually, as it does not assume that the machines have anything in common. Model 1 and Model 2 evaluates the machines simultaneously, as they assume that the machines have common parameters. Additionally, Model 2 assumes that the machines have a common discrepancy term.

Individual model: Individual discrepancy, Individual parameters.

The individual model is the model from Brynjarsd´ottir and O’Hagan (2014), and is the model from equation (3.13). However, it is modified to fit the framework of INLA: the Gaussian process is modeled with a Gaussian field (GF) with Mat´ern covariance function.

(32)

Chapter 4. Bayesian calibration and inference for multiple machines Mathematically, the individual model is written as

z_ij=θ_jx_ij+ξ_j(x_ij) +_ij,

ξj(·)∼GF(τξj, κξj), i= 1, ...nj, j= 1, ..., N, ij∼ N(0, σ_j²),

where

• zrepresents work,

• xrepresents effort,

• θjrepresents efficiency (calibration parameter),

• ξ_j(·)represents the model discrepancy, and follows a Gaussian field with mean 0 and an exponential covariance function (equation (3.12)),

• ijrepresents the observation error, and follows a normal distribution,

• τ_ξjandκ_ξjrepresent the parameters of the Gaussian field,

• σ_j²represents the variance of the observation errors,

• iandjrepresent observation number and machine number respectively,

• nj andN represent the number of observations for machinej and the number of machines respectively.

Note that all the parametersθj,τξj,κξjandσ_j²have different values for each machine.

The discrepancy termsξj(·)’s are also different for each machine, and as it has an exponential covariance function, the model is different than the model of Brynjarsd´ottir and O’Hagan (2014) (which uses a squared exponential covariance function). For the observation errors_ij, the values vary for each machine and for each observation, but follow the same distribution for the same machine (asσ²_jhave the same value for the same machine).

Model 1: Individual discrepancy, Common parameters.

Model 1 is constructed by extending the individual model to consider all the machines simultaneously. Thus, it assumes that all the parameters have the same values accross all the machines. Mathematically, Model 1 is written as

zij =θxij+ξj(xij) +ij,

ξ_j(·)∼GF(τ_ξ, κ_ξ), i= 1, ...n_j, j= 1, ..., N, ij ∼ N(0, σ²),

(4.1)

which is the exact same model as the individual model, except that the parametersθ,τ_ξ, κ_ξandσ²have the same value for all the machines, and thus, do not have the subscriptj.

The model discrepancy termξ_j are different for all the machines, following a GF with mean of 0, and an exponential covariance function (equation (3.12)). However, unlike the individual model, the model assumes the same parameter valuesτξ andκξ for all the machines. The observation errorsijare i.i.d. for all observations for all machines.

(33)

4.1 Multiple ideal machine models

Model 2: Common discrepancy, Individual discrepancy, Common parameters.

Model 2 is constructed by adding a common model discrepancy term to Model 1. Hence, it has two model discrepancy terms: one common and one individual. Mathematically, Model 2 is written as

zij=θxij+δ(xij) +ξj(xij) +ij,

δ(·)∼GF(τδ, κδ), i= 1, ...nj, ξj(·)∼GF(τξ, κξ), j= 1, ..., N,

ij∼ N(0, σ²), where

• δ(·)represents the model discrepancy term that is identical for all the machines, and follows a Gaussian field with mean0and an exponential covariance function (equation (3.12))

• τ_δandκ_δ represent the parameters of the Gaussian field thatδ(·)follows,

• The other terms are exactly the same as the terms in Model 1 (equation (4.1)).

Thatδ(·)is a common discrepancy term, means that not only are the values ofτδ andκδ

the same for all the machines, but δ(x)has the same value for all the machines for the same value ofx. Other than the added common discrepancy term, Model 2 is identical with Model 1.

The motive behind constructing the model with two model discrepancy terms, is thatδ(·) might represent the common discrepancy that all the simple machines have (the difference between a simple machine and the simulator θx), while the ξj(·)’s might represent the difference between each simple machine.

Multiple Ideal machines:

Before inference of the simple machines is performed using the individual model, Model 1 and Model 2, the models are tested on datasets generated from Model 1 and Model 2 themselves. The datasets generated from Model 1 and Model 2, represent work and effort for a machine referred to as the Ideal machine 1 and the Ideal machine 2 respectively.

For the Ideal machines 1, N different GFs (ξ_j(·)) are generated, one for each of theN machines. For the Ideal machines 2,N+ 1different GFs are generated, one for each of theN machines (ξ_j(·)), and one common for all the machines (δ(·)). More on how the GFs are generated, can be found in section A.1.

One thing to note, is that these machines are conceptional. The machines might produce negative work for some values of effort, which is not realistic in the real world. The motive behind generating observations from Model 1 and Model 2, and performing inference on them, is to see if the predictions on these machines are successful, and if it is possible to learn something about the inference methods applied to those machines. Then, the three models are used to perform inference on the multiple simple machines.

(34)

Chapter 4. Bayesian calibration and inference for multiple machines

4.2 Case studies:

There are three machines that are of interest in this thesis: the Ideal machine 1 and the Ideal machine 2 (both created using the framework of Model 1 and Model 2 from section 4.1 respectively), and the simple machine from Brynjarsd´ottir and O’Hagan (2014).

4.2.1 True process, observations and simulator

The Ideal machine 1: Individual discrepancy, Common parameters

The Ideal machines 1 are created using the framework of Model 1. Their work are given by

ζ_j(x) =θx+ξ_j(x),

ξj(·)∼GF(τ, κ), j= 1, ..., N, (4.2) where

• ζj(x)represents work from the true process of the Ideal machines 1,

• xrepresents effort,

• θrepresents the efficiency,

• ξj(·)represents the model discrepancy. It follows a GF and have a mean of0and an exponential covariance function (equation (3.12)),

• τandκrepresent the parameters of the Gaussian field,

• jandNrepresents the machine number and the number of machines respectively.

Note that, as with Model 1, the parametersθ, τ andκhave the same values for all the machines, and the GFsξj(·)are different for each machine.

Fig. 4.1shows work vs. effort for 100 different Ideal machines 1 represented by the grey lines, the simulatorθxrepresented by the red line and the prediction points represented by the black dots. All the machines have the same value for the efficiencyθ = 0.65and the same value is used for the simulator as well. As theξ_j(·)’s (model discrepancies) are different for all the 100 machines, the grey lines do not overlap. τ ≈0.46andκ≈2.45 for all the Gaussian fields used to generate the model discrepancies.

The Ideal machine 2: Common discrepancy, Individual discrepancy, Common parameters.

The Ideal machines 2 are created using the framework of Model 2. Their true processes depend on a common model discrepancy term that is identical for all theN machines as well as a model discrepancy term that varies for each machine. The work from the true processes of the Ideal machines 2, are given by

ζ_j(x) =θx+c₁δ(x) +c₂ξ_j(x), δ(·)∼GF(τ, κ),

ξ_j(·)∼GF(τ, κ), j= 1, ..., N,

(4.3)

Bayesian Calibration and Inference for Multiple Machines

Bayesian Calibration and Inference for Multiple Machines

Master's thesis

Hong-Tan Lam

Bayesian Calibration and Inference for Multiple Machines

Hong-Tan Lam

Abstract

Sammendrag

Preface

Table of Contents

Chapter 1

Introduction

Chapter 2

Case: Multiple simple machines

2.1 The simple machine

Work delivered by machine

2.2 Multiple simple machines

Chapter 3

Background

3.1 Bayesian statistics

3.2 Latent Gaussian Models (LGM)

3.3 Gaussian fields (GF) and Stochastic partial differen- tial equation (SPDE)

3.4 Bayesian calibration and inference for a single simple machine

3.5 Evaluation of predictions

Chapter 4

Bayesian calibration and inference for multiple machines

4.1 Multiple ideal machine models

4.2 Case studies:

4.2.1 True process, observations and simulator