

where ρ_t is the autocorrelation function at lag t. In a practical setting, the upper bound of the sum is a finite number t = T, at which the autocorrelation is close to zero.

To obtain numerical results for tasks (2.6) and (2.8), the samples are generally placed into bins according to their sample values, where each bin is weighted according to the number of samples it contains. The kernel of the target distribution can then easily be approximated with these points, yielding a solution to (2.6).

The optimization of the target distribution is solved simply by picking the bin containing the highest number of samples. It is important to note that these are approximations and will, in practice, carry some numerical errors. However, as the number of samples grows to infinity and the bin size tends towards zero, the approximations intuitively become exact.
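To make the binning idea concrete, a minimal NumPy sketch (illustrative only; the samples, bin count, and synthetic target below are assumed, not taken from the text) could look as follows:

```python
import numpy as np

# Hypothetical one-dimensional samples, e.g. the output of an MCMC run.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.5, scale=0.7, size=50_000)

# Place the samples into bins; each bin is weighted by the number of samples
# it contains (density=True rescales the counts so the histogram integrates
# to one, approximating the kernel of the target distribution).
counts, edges = np.histogram(samples, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Approximate mode: the centre of the bin with the largest weight.
mode_estimate = centers[np.argmax(counts)]
print(mode_estimate)  # close to 1.5 for this synthetic example
```

As the number of samples grows and the bin width shrinks, the histogram converges to the target kernel, in line with the limiting argument above.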

2.2.2 Importance Sampling

Importance sampling (IS) may be considered a precursor to MCMC methods, and was first introduced by Kahn (1950) to estimate the probability of nuclear particles penetrating shields. The method is based on the identity

$$
\int f(z)\,\pi(z \mid y)\,dz \;=\; \int \frac{f(z)\,\pi(z \mid y)}{q(z)}\, q(z)\,dz, \tag{2.12}
$$

and is commonly used to compute (2.7) in situations where the domain of f(z) lies in a region of low probability under the target distribution π(z|y). In this setting, the classical Monte Carlo approach, which generates samples from π(z|y), would obtain poor approximations of (2.7) because most of its samples would fall in a region where f(z) = 0. It is apparent that, similar to MCMC, a simpler distribution (or proposal distribution) q(z) must be employed to generate samples that cover the important region, i.e. the region where f(z)π(z|y) ≠ 0. Then, by taking advantage of the identity in (2.12), the estimate of (2.7) is adjusted to account for the use of this proposal distribution. The IS method is commonly used in high energy physics, rare event simulation, and rendering in computer graphics. Moreover, it can also substitute the accept-reject design in MCMC, and be used for Bayesian inference.
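To illustrate the rare-event setting, the following sketch (an assumed toy example, not the author's: a standard-normal target, f(z) the indicator of the tail z > 4, and a proposal shifted into the important region) contrasts plain Monte Carlo with the identity in (2.12):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
M = 100_000

# Target pi(z) = N(0, 1); quantity of interest P(Z > 4), i.e. f(z) = 1{z > 4}.
# Plain Monte Carlo: almost every sample falls where f(z) = 0.
z_mc = rng.normal(size=M)
plain_mc = np.mean(z_mc > 4.0)

# Importance sampling: a proposal q(z) = N(4, 1) covering the important
# region {z : f(z) * pi(z) != 0}, reweighted according to (2.12).
z_is = rng.normal(loc=4.0, size=M)
weights = norm.pdf(z_is) / norm.pdf(z_is, loc=4.0)   # pi(z) / q(z)
is_estimate = np.mean((z_is > 4.0) * weights)

print(plain_mc, is_estimate, 1.0 - norm.cdf(4.0))
```

With the same budget of samples, the reweighted estimate is typically orders of magnitude more accurate than the plain Monte Carlo one.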

In Bayesian inference, the interest lies in approximating the target distribution or a particular moment of it, such that the important region must cover the sample space of the target distribution rather than the domain of f(z). The question then arises: why invoke a proposal distribution to generate samples when the classical Monte Carlo method could be used on this problem? The answer is that, similar to the setting in MCMC, our target distribution is unknown, such that samples cannot be obtained from the target distribution itself.

Consider the M generated samples {z^(j)}_{j=1}^M from the proposal distribution; an unbiased and consistent estimator (Bugallo et al., 2017) of the expected value of a function f(z) with respect to the target distribution π(z|y) can then be expressed as

$$
\frac{1}{M}\sum_{j=1}^{M} \frac{\pi(z^{(j)} \mid y)}{q(z^{(j)})}\, f(z^{(j)}). \tag{2.13}
$$

Here, q(z) is a multivariate proposal distribution with q(z) > 0 whenever f(z)π(z|y) ≠ 0, such that the tail of the proposal is heavier than that of the target.
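As a sketch of the plain estimator (2.13) with the exact weights of (2.14) — assuming, purely for illustration, a target whose normalizing constant is available (a Gaussian) and a heavier-tailed Student-t proposal:

```python
import numpy as np
from scipy.stats import norm, t as student_t

rng = np.random.default_rng(2)
M = 50_000

# Assumed target pi(z | y): a fully normalized N(2, 0.5^2), so the exact
# weights of (2.14) are computable; f(z) = z^2 is the quantity of interest.
target = norm(loc=2.0, scale=0.5)

# Heavier-tailed proposal q(z): a Student-t with 3 degrees of freedom.
z = 2.0 + rng.standard_t(df=3, size=M)
q_pdf = student_t.pdf(z, df=3, loc=2.0, scale=1.0)

# Importance weights (2.14) and the estimator (2.13) of E[z^2 | y].
w = target.pdf(z) / q_pdf
estimate = np.mean(w * z**2)
print(estimate)  # close to 2^2 + 0.5^2 = 4.25
```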

The choice of proposal is important, as the variance of the estimator in (2.13) directly depends on the dissimilarity between the shape of the proposal and that of the target distribution (Robert et al., 2004). The significance of one sample in approximating the target is generally referred to as the importance weight,

$$
\omega^{(j)} \;=\; \frac{\pi(z^{(j)} \mid y)}{q(z^{(j)})}. \tag{2.14}
$$

To compute the importance weight in (2.14), the normalizing constant of π(z|y) needs to be obtainable. This is not the case in many practical situations and, in general, is not the case for LGMs. An alternative is then to employ the self-normalized importance weights,

$$
\bar{\omega}^{(j)} \;=\; \frac{\omega^{(j)}}{\sum_{i=1}^{M} \omega^{(i)}}, \tag{2.15}
$$

where the unknown normalizing constant conveniently cancels out. The resulting estimator for the quantity of interest is

$$
\hat{\pi}[f(z)] \;=\; \sum_{j=1}^{M} \bar{\omega}^{(j)} f(z^{(j)}). \tag{2.16}
$$

It can be shown that (2.16) is biased for finite M but consistent (Geweke, 1989).
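A minimal sketch of the self-normalized scheme (2.14)–(2.16), assuming an illustrative scalar setup where only the unnormalized posterior kernel (prior times likelihood) is available:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
M = 50_000
y = np.array([1.2, 0.8, 1.5])  # assumed observations

def log_kernel(z):
    # Unnormalized log target: log prior + log likelihood; the evidence
    # (normalizing constant) is never computed.
    log_prior = norm.logpdf(z, loc=0.0, scale=2.0)
    log_lik = norm.logpdf(y[:, None], loc=z, scale=1.0).sum(axis=0)
    return log_prior + log_lik

# Proposal q(z): a broad Gaussian covering the posterior mass.
z = rng.normal(loc=0.0, scale=3.0, size=M)
log_w = log_kernel(z) - norm.logpdf(z, loc=0.0, scale=3.0)

# Self-normalized weights (2.15): the unknown constant cancels out.
w = np.exp(log_w - log_w.max())
w_bar = w / w.sum()

# Estimator (2.16) of the posterior mean, i.e. f(z) = z.
posterior_mean = np.sum(w_bar * z)
print(posterior_mean)
```

Working on the log scale and subtracting the maximum before exponentiating avoids numerical underflow of the unnormalized weights.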

Considering that the importance weight ω̄^(j) represents the target distribution evaluated at z^(j), i.e. π(z^(j)|y) ≈ ω̄^(j), and assuming that the number of samples M → ∞, the approximation of the target distribution, and thus the solution of (2.6), can be found with the expression

$$
\tilde{\pi}(z \mid y) \;=\; \sum_{j=1}^{M} \bar{\omega}^{(j)}\, \delta\!\left(z - z^{(j)}\right),
$$

where δ(·) is the Dirac delta function. In practice, where M → ∞ is infeasible, the target distribution can be approximated with non-parametric kernel density estimation (see Silverman, 1986), using the self-normalized weights and their corresponding samples. Furthermore, the mode can be found as the argument of the maximum of the resulting kernel density estimate. The mode can also be found by taking the sample value corresponding to the maximum weight, but the accuracy of this method is highly dependent on the number of samples within the probability mass of the target distribution.
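Continuing the sketch above (the arrays z and w_bar are assumed to hold the proposal samples and their self-normalized weights), the kernel density route and both mode estimates could read:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Weighted kernel density estimate of the target, using the self-normalized
# weights; gaussian_kde accepts per-sample weights directly.
kde = gaussian_kde(z, weights=w_bar)

# Evaluate on a grid and take the argmax as the approximate mode.
grid = np.linspace(z.min(), z.max(), 2_000)
mode_kde = grid[np.argmax(kde(grid))]

# Cheaper alternative: the sample carrying the largest weight.
mode_max_weight = z[np.argmax(w_bar)]
print(mode_kde, mode_max_weight)
```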

In extreme settings, some weights might be significantly larger than others, leaving only a limited number of relevant samples; similarly, all weights might be zero, rendering all samples insignificant; in other settings, the conclusion about the quality of the samples might be difficult to draw. In this latter case, a common diagnostic is the effective sample size,

$$
\widehat{\mathrm{ESS}} \;=\; \frac{1}{\sum_{j=1}^{M} \big(\bar{\omega}^{(j)}\big)^{2}}, \tag{2.17}
$$

which approximates the number of independent samples generated from the target distribution required to obtain the same estimator variance for (2.7) as the self-normalized estimator (2.16) using the M generated samples from the proposal distribution. The theoretical development of (2.17) will not be detailed here, and the reader is referred to Martino et al. (2017) for a thorough account.
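Assuming the standard plug-in formula based on the self-normalized weights, (2.17) amounts to a one-line computation:

```python
import numpy as np

def effective_sample_size(w_bar):
    """Effective sample size (2.17) from self-normalized weights."""
    w_bar = np.asarray(w_bar)
    return 1.0 / np.sum(w_bar**2)

# Sanity checks: equal weights give ESS = M; one dominant weight gives ESS = 1.
print(effective_sample_size(np.full(1000, 1e-3)))        # 1000.0
print(effective_sample_size(np.r_[1.0, np.zeros(999)]))  # 1.0
```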

Compared to MCMC, IS has the advantage of being easily parallelized, since the samples are drawn independently of each other. On the other hand, the performance of the basic IS algorithm highly depends on the choice of proposal distribution, which remains constant during the whole simulation.