Simulation Study for Non-Bayesian Multiple Imputation.

(1)

Simulation Study for

Non-Bayesian Multiple Imputation

Yunsung Lee

Master’s Thesis, Spring 2017

(2)

This master’s thesis is submitted under the master’s programme Modelling and Data Analysis, with programme optionStatistics and Data Analysis, at the Department of Mathematics, University of Oslo. The scope of the thesis is 60 credits.

The front page depicts a section of the root system of the exceptional Lie group E8, projected into the plane. Lie groups were invented by the Norwegian mathematician Sophus Lie (1842–1899) to express symmetries in differential equations and today they play a central role in various parts of mathematics.

(3)

Abstract

Rubin (1987)’s combination formula for variance estimation in multiple imputation (MI) requires a imputation method to be Bayesian-proper. How- ever, many census bureau have heavily relied on non-Bayesian imputations.

Bjørnstad (2007) suggested an inflated factor (k1) in Rubin (1987)’s combination formula for non-Bayesian imputations. This paper aimed to verify the theoretical derivation of Bjørnstad (2007) in computer simulation. Within Bjørnstad (2007)’s pre-assumed environment, the inflated factor,k₁, closely approached the simulated true value, E(k), irrespective of sample size and missing rate. With California schools’ data, confidence intervals using k₁ also achieved a desired coverage, (1−α)%, across varying sample size and missing rate, except in case of MNAR because of biased imputation.

Key words: Multiple imputation; non-Bayesian method; variance estimation; inflated factor; simulation.

(4)

Acknowledgements

I would like to express my appreciation to professor Jan F. Bjørnstad for his endless support. It was such a great honor to be his last master student.

I would also like to thank to my family for their supports through my whole life.

Last, but most importantly, I could never have made it without God’s guidance.

Phillipians 4:6

Do not be anxious about anything, but in every situation, by prayer and petition, with thanksgiving, present your requests to God. 7 And the peace of God, which transcends all understanding, will guard your hearts and your minds in Christ Jesus.

(5)

List of Figures

1.1 Process of statistical inference in census bureau . . . 8

1.2 Difference between unit or item non-response . . . 9

1.3 Scatter plots for each case of missing mechanism . . . 14

2.1 Two track process . . . 30

2.2 Checking imputed values . . . 32

4.1 Structure of the simulation . . . 46

4.2 Characteristics of sample data . . . 53

4.3 Confidence intervals for hot-deck under MAR . . . 58

4.4 Confidence intervals for hot-deck under MNAR . . . 58

4.5 Confidence intervals for residual imputation under MNAR . . 59

B.1 Convergence of E(k) for population average under MCAR . . 77

B.2 Convergence of E(k) for population average under MAR . . . 78

B.3 Convergence of E(k) for population average under MNAR . . 79

B.4 Convergence of E(k) for a ratio model under MCAR . . . 80

B.5 Convergence of E(k) for a ratio model under MAR . . . 81

B.6 Convergence of E(k) for a ratio model under MNAR . . . 82

B.7 Convergence of E(k) for simple regression under MCAR . . . 83

B.8 Convergence of E(k) for simple regression under MAR . . . . 84

B.9 Convergence of E(k) for simple regression under MNAR . . . 85

(8)

List of Tables

1.1 Artificial sample data for understanding missing mechanism . 14

2.1 Options of the mice function . . . 31

2.2 Modelling method used for each data type. . . 42

4.1 Fixed and changeable factors . . . 48

4.2 E(k) for population average . . . 49

4.3 E(k) for ratio model . . . 50

4.4 E(k) for simple regression. . . 52

4.5 Fixed and changeable factors. . . 56

4.6 Coverage for California schools. . . 57

4.7 E( ¯Ys) for California schools . . . 57

(9)

Chapter 1

Sample surveys with non-response

Historically, many census bureaus (CB) have collected sample data through a format of survey in pursuit of estimating population parameters, such as a total sum, mean or standard deviation. In this process of data collection, non-response has always been one of the biggest challenges. Non-response indeed indicates unexpected information loss. Badly, this loss of information can lead us biased parameter estimation as well as large uncertainty in estimates. Accordingly, many remedies for missingness have been developed to achieve valid statistical inference. As widely known, imputation and weighting are typically used in practice.

The main process from sampling to inference can be depicted as shown in Figure 1.1. CB draws some samples based on a certain sampling method, such as simple random sampling, stratified random sample or etc. When non-respondents are detected in sample, the type of missingness has to be found to determine the most adequate remedy (e.g. imputation or weighting). More details will be discussed in the next sub-topic, "Types of missingness".

If imputation is the case, we need to go through some preliminary work in

(10)

Figure 1.1: Process of statistical inference in census bureau

order of successful imputation. The first work is called missing mechanism.

For some imputation methods, we need to assume how non-response possibly depends on other variables. In reality, people are more or less likely to respond certain questions relying on their personal profile. Secondly, we need an assumption for a variable of interest. Many imputation methods require the stochastic nature of a variable of interest, and model-based approach plays its role in assuming this nature. The later two sub-topics will also detail the assumption and missing mechanism.

1.1 Types of missingness

Non-response can be recognized in two different types, unit or item non- response. Unit non-response is failure of obtaining any information from participants for some reasons. Briefly speaking, people could/did not answer any of questionnaire. Data is usually stored in a form of rectangle or matrix, and then we usually recognize each row as a single unit/individual, while columns refer to the questioning items of survey. Here, we have empty rows for those who were expected to answer but did not actually.

(11)

Figure 1.2: Difference between unit or item non-response

questionnaire were not answered. Again, for some reasons, people are more or less likely to answer some certain types of questioning items. As an example, a questioning item asking people’s gross annual income has a lower response rate. Because, people may feel uncomfortable when a survey asks for their private information. In this case, we keep rows for those who partially answered a questionnaire, but still have some NA’s in columns (questioning items) that they did not want to respond.

There are two typical remedies, one for each case. For unit non-response, weighting method can be applied. Specifically, an under-represented group can be compensated by giving more weight on it, or vice versa. For item non-response, a method of imputation is widely used in modern statistics. A basic idea is to impute estimated values into missing spots so that completed data improves representativeness of sample. If the imputed sample turns out to be a well-representative miniature of the population, inference will gain more confidence. The estimates are more likely to get close to corresponding population parameters.

(12)

1.2 Design-based approach

In survey sample, there are two primary approaches for statistical inference of population parameters(e.g. population total, PN

i=1y_i, or mean,

1 N

PN

i=1yi). "Design-based" approach is a traditional method based on probabilistic unit selection. Define an indicator variable such that for all units, i= 1,2, ..., N,

I_i =







1, ifi∈sample s 0, otherwise

Then, the inclusion probability of jth unit can be written as P(j ∈ s) = P(Ij = 1) =πj. Here, strictly speaking, the indicator variable of each unit becomes probabilistic, and a variable of interest, y_i, is regarded fixed. For convenience, some literature simply describes a samplesas stochastic, since a sample constituted by probabilistic units is also probabilistic. As widely known, simple random sampling, stratified random sampling, systematic random sampling or etc are typical examples.

1.3 Model-based approach

On the other hand, "model-based" approach can be introduced when Y_i is assumed to be probabilistic and sample s is fixed. Särndal et al.

(1978) argued that a variable of interest,Yi, appears to be a random variable with a certain distribution (e.g. Gaussian, Binomial, Poisson, Weibull or so forth), and yi is the realized value of the corresponding random variable, Y_i. If necessary,Y_ican be statistical-modeled conditioned on other observed variables. For better understanding, the following will describe some of

(13)

1. Ratio model

Assume that an auxiliary variable, X is always available over all observations.

Yi=βxi+i where E(i) = 0,V ar(i) =σ²xi and Cov(i, j) = 0 βˆopt =

P

i∈sYi

P

i∈sxi

Tˆpred=P

i∈sYi+P

i /∈sYˆi = ˆβoptP

i∈sxi+ ˆβoptP

i /∈sxi = ˆβopttx

V ar( ˆTpred−T) =N^{2 1}^−f_n ^x^¯_x_¯^r^x^¯

s σ² 2. Simple linear regression

Y_i=β₁+β₂x_i+_i whereE(_i) = 0,V ar(_i) =σ² and Cov(_i, _j) = 0 βˆopt =

P

i∈sYi/xi

n

Tˆ_pred=P

i∈sYi+P

i /∈sYˆi =P

i∈sYi+ (_n¹P

i∈s Yi

xi)P

i /∈sxi

V ar( ˆT_pred−T) =σ²(⁽

P

i /∈sxi)²

n +P

i /∈sxi2) 3. Common mean model

Y_i=β+_i whereE(_i) = 0,V ar(_i) =σ² and Cov(_i, _j) = 0 βˆ_opt = _n¹P

i∈sY_i = ¯Y_s Tˆ_pred=P

i∈sY_i+P

i /∈sY¯_s=NY¯_s V ar( ˆT_pred−T) =N²(1−f)^σ_n²

Many imputation methods are heavily dependent on this model-based approach. The reason is that imputation toward non-respondent is basically a form of prediction. We have partial lack of information about a variable of interest, and imputation essentially fills it out with scientifically reasonable guess. And, the distributional assumption on Yi from a model-based approach makes possible to obtain plausible prediction for missing values.

For instance, imputations based on regression or random selection (e.g. hot- deck) all start with the distributional assumption.

(14)

1.4 Missing data mechanism

Rubin (1976) first introduced missing data mechanism that structured associations between missing and other variables. The links are recognized in three different ways, based on which variables affect missing occurrence.

When none of the variables has any portion in missing occurrence, we call it "missing completely at random"(MCAR). Here, a variable of interest Yi

and auxiliary variableX_i are fixed. AndR_i is a stochastic indicator variable for missingness such that

Ri =







1, ifi∈sr, 0, otherwise

wheresr denotes an observed sample. Thus,Ri has a binomial distribution with response rate. In mathematical notations, the response rate under MCAR can be expressed as

pi =P(Ri = 1|Y_i, Xi) =P(Ri = 1)

However, if the link between an auxiliary variable Xi and unobserved Y_i is found, we assume "missing at random"(MAR) mechanism. In reality, MAR has been detected quite frequently. The lower response rate of lower socio-economic class to a political poll, for instance, is the typical case of MAR. Likewise, the response rate of MAR mechanism can be written as

pi =P(Ri= 1|Y_i, Xi) =P(Ri= 1|X_i)

The third mechanism is called "missing not at random"(MNAR), when evenY_i, as well asX_i, influences missing. In addition toX_i, evenY_ipossibly affects its own missing occurrence. In the case thatY is missing, both X

(15)

Then, ther_i of MNAR can be simply written as p_i =P(R_i= 1|Y_i, X_i)

As a matter of fact, MNAR is widely known as the most tricky case to handle. J. L. Schafer and Graham (2002) also argued that it is almost impossible to detect MNAR in real world data, and one even needs the specified distribution of missing.

To visualize the pattern of mechanism introduced above, artificial sample data was generated. Two variables, Y_i and X_i for i=1, ..., 500, were present in the sample. And some of Y_i were missed out on purpose, based on the three mechanism respectively. In order for the intentional missing- out process, response rates,p_{M CAR,i},p_{M AR,i} and p_{M N AR,i} for all i, were in required. First, p_{M CAR,i} was simply set to be a fixed constant. In order for a clear association between missing and variables, the framework of logistic regression was used to define pM AR,i and pM N AR,i, such as

p_{M CAR,i}=c, p_{M AR,i} = e^β¹^Xⁱ

1 +e^β¹^Xⁱ, p_{M N AR,i} = e^β²^Yⁱ^+β³^Xⁱ 1 +e^β²^Yⁱ^+β³^Xⁱ

where c = 0.5, β1 = −0.005, β₂ = −0.003, and β3 = −0.005. All these parameters are arbitrarily chosen in this simulation. Then, Bernoulli random number generator creates the realized values of RM CAR,i, RM AR,i

and R_{M N AR,i}, based on the given response rates. Figure 1.1 shows the illustration of the sample discussed above.

(16)

Table 1.1: Artificial sample data for understanding missing mechanism

Figure 1.3: Scatter plots for each case of missing mechanism

The red dots indicate missing values under each mechanism, while the black ones are observed values. The left panel (MCAR) clearly has non- respondents over the whole x and y plain, since missing does not depend on any of variables. In the middle panel (MAR), more missing was found as

(17)

influence missing occurrence. The right panel (MNAR) showed the majority of non-responses appeared on the upper right part where both Yi and Xi

are high.

This mechanism truly matters in deciding what to do with missing. Be- cause, defining mechanism also means understanding of how missing hap- pened in data. As an example, the type of assumed mechanism determines specific details of imputation method. Depending on the defined mechanism, we decide whether to stratify our data, manipulate chances of random selection or more. We can fairly handle dependency of missing through these several techniques. Then, imputation will lead us to well-representing sample and, further, unbiased statistical inference.

(18)

Chapter 2

Imputation methods and multiple imputation

Intuitively, imputed values can be generated with the estimated conditional distribution of non-respondent,

f_θ,_ˆ_ψ_ˆ(y_i|R_i= 0, x_i), (2.1) where the θ and ψ denote the distributional parameter vector of Y and R.

Because, the original values at missing cells had been also generated through the distribution with the true parameters θ and ψ, before they became missing. This method artificially regenerates those numbers at missing cells based on an estimated distribution, not the true distribution. Monte Carlo Markov Chain (MCMC) can be a technical solution for this regeneration.

However, this method requires distributional assumptions for each random variable and reasonable parameter estimations. Of course, a R-package called "Amelia" successfully actualized this method by relying on missing at random mechanism, multivariate normality and EM algorithm. But, if one of the assumptions is wrongly addressed, this method possibly mislead

(19)

2.1 Imputation in MCAR/MAR model

Under MCAR or MAR, an attempt to impute missing cells with observed sample distribution has been done. The rationale for this attempt can be readily understood by doing some algebra with the conditional density from (2.1). Note thatP(R_i= 0|Y_i, X_i) =P(R_i= 0|X_i) under MAR.

f_θ,ψ(y_i|R_i = 0, x_i) = fθ(yi|x_i)Pψ(Ri = 0|y_i, xi) P_θ,ψ(R_i= 0|x_i)

= fθ(yi|x_i)

((Pψ(R((i = 0|x(((_i) ((((((((

P_θ,ψ(R_i = 0|x_i)

=f_θ(y_i|x_i)

Likewise, we can find that f_θ,ψ(yi|R_i = 1, xi) = f_θ(yi|x_i). This indicates that observed and missing cells all share the same distribution.

Hot-deck is heavily based on the fore-mentioned framework. Each missing cell is replaced with one random draw from the entire observed data.

Other advanced methods also go through a random draw from a group of donors (observed values), with a same probability. Only the way of form- ing donors’ group makes difference among each methodology. The following details several non-Bayesian imputations.

1. Residual regression imputation.

A core process is to drawe_i^∗randomly, with replacement, from the observed residuals,ej, based on a corresponding statistical model.Therefore, the very first job is to fit the most adequate model with observed data.

After fitting a model, fitted values and randomly drawn residuals con- stitute imputed values. So to speak,y_i^∗= ˆy_i+e_i^∗, whereyˆ_i is a fitted value. Here, an auxiliary variable ,Xi’s are all available. The following describes how to obtain imputed values using a ratio model.

- Ratio model

(20)

Fit a model and obtain parameter estimates based on observed data. i.e. βˆr=

P

i∈srYi

P

i∈srxi and ˆσ² = _n−1¹ P

i∈sr

1

xi(Yi−βxˆ i)². Stan- dardized residuals,e_j can be derived in a following fashion.

e_j = (y_j −βˆ_rx_j)/√

x_j, wherej ∈s_r

For i∈ s−s_r, we randomly select e_i^∗ from observed e_j for j ∈ sr, and then compute our imputed values yi∗ = ˆβrxi+ei∗√

xi

backward.

2. Nearest neighborhood with predictive mean matching

The basic idea is to down-size the pool of observed y_j for each non- respondent, where hot-deck is being processed. Theyj’s in the down- sized pool, also being called nearest neighbors, are selected through predictive mean matching. If the predictive means of some observed values are close to the one of a non-respondent, then the observed values belong to the down-sized pool. So, imputed valuesyi∗ are not selected from all observedy_j, but some close neighbors y_j instead.

Little (1988) suggested this operation with a clear notation. Assuming thati∈s−sr,

y_i^∗=y_j,

where (ˆµ_i−µˆ_j)² ≤ (ˆµ_i−µˆ_k)² for all observed j and k. And µˆ_i, µˆ_j andµˆk are fitted values of Y fori,j and k.

3. Stratified hot-deck

First, a population with size N is stratified into H strata. Then, hot- deck is implemented for each non-respondent, depending on which strata the non-respondents belong to. In hth strata, for instance,

(21)

withinhth strata,y_hi’s. In principle, stratified hot-deck resembles the nearest neighborhood method, in a way that random draws occurs in the reduced pool of observed valuesy_j.

2.2 Imputations in MNAR models

If both MCAR and MAR are not the case, we assume that MNAR holds.

MNAR is often called non-ignorable non-response and it is barely detectable whether it holds in data. However, MNAR occasionally happens in reality.

Those with severe depression are less likely to report their depression level, and if they are male, then the response rate drops more sharply. Individuals with higher income also typically have a lower response rate to the questions for annual earning. Unobserved values are somewhat associated with their own missing rate.

Imputed value, yi∗, is drawn at random from the estimated conditional distribution,f_θ,_ˆ_ψ_ˆ(y_i|R_i = 0, x_i). As far as probability density/mass is available, random numbers can be easily generated through basic functions of statistical packages or Monte Carlo Markov Chain (MCMC), if necessary.

More generally, we may also use yi∗ = Eθ,ˆψˆ(yi|R_i = 0, xi). With multiple imputation, we repeat the above-mentioned random draw many times and possibly obtain the empirical average over the multiple imputation.

If derivation of the conditional distribution matters here, we need to take a closer look at it.

f_θ,ψ(y_i|R_i = 0, x_i) = f_θ(yi|x_i)P_ψ(Ri= 0|y_i, xi)

Pθ,ψ(Ri = 0|x_i) , (2.2) whereP_θ,ψ(R_i = 0|x_i) =R

f_θ(y_i|x_i)P_ψ(R_i = 0|y_i, x_i)dy_i and this denominator can be ignored, since it can be regarded as a constant. Then, the two terms in the numerator, f_θ(yi|x_i) and P_ψ(Ri|y_i, xi), require distributional assumptions respectively. A typical case is thatR is assumed to be logistic-

(22)

distributed and continuous Y_i is normally distributed. One can also use a logistic regression for binaryYi.

Furthermore, the parameters of the conditional distribution, θ and ψ, has to be estimated. The θ and ψ denote a parameter vector for Y and R respectively and maximum likelihood estimation (MLE) is widely used in estimation process. The likelihood function ofθ andψ can be written as

L(θ, ψ|y_i, R_i, x_i) =f(y_obs,r|x)

= Y

i∈s_r

f_θ(y_i|x_i)P_ψ(R_i = 1|x_i, y_i)Y

i /∈s_r

P_ψ(R_i = 0|x_i)

θˆ and ψ, the MLE estimators that maximize the likelihood function areˆ plugged back into the probability density/mass function of the conditional distribution. Then,

f_θ,_ˆ_ψ_ˆ(y_i|R_i = 0, x_i)

Nowyi∗ is ready to be generated through the density/mass function.

To clarify the flow of MNAR modelling, a simple but straightforward example from Bjørnstad (2015) will be studied for extra. Consider thatY_i is assumed to be a binomial case,

P(Y_i = 1) =θ And, we also assumed that

P(R_i = 1|y_i) =







ψ, ify_i= 0 2ψ, ifyi= 1

whereψ≤ 1 2

Then, according to the equation (2.2), the conditional probability func-

(23)

tion of the missing values can be written as

P_θ,ψ(Yi|R_i = 0) = P(Yi = 1)P(Ri = 0|Y_i = 1)

P(Y_i= 1)P(R_i= 0|Y_i = 1) +P(Y_i = 0)P(R_i = 0|Y_i = 0)

= θ(1−2ψ)

θ(1−2ψ) + (1−θ)(1−ψ) = θ(1−2ψ) 1−ψ(1 +θ)

Now, as explained earlier, θ and ψ have to be estimated through maximum likelihood method.

L(θ, ψ|y_j, Ri) = Y

j∈sr

f_θ(yj)P_ψ(Rj = 1|y_j) Y

i∈s−sr

P_ψ(Ri= 0)

=θ^v(1−θ)ⁿ^r^−v(2ψ)^vψⁿ^r^−v(1−ψ−θψ)ⁿ⁻ⁿ^r where v = P

i∈srI(Y_i = 1), the number of successes, and n and n_r is the size ofsandsr. Like any other procedures of deriving closed form MLE, we find derivatives of log likelihood with respect toθandψ, and set them zero.

∂l

∂θ = 0 ⇔ v

θ −n_r−v

1−θ −ψ n−n_r

1−ψ(1 +θ) = 0

⇔ θˆ= v 2nr−v

∂l

∂ψ = 0 ⇔ nr

ψ −(1 +θ) n−nr

1−ψ(1 +θ) = 0

⇔ ψˆ=









 n_r

n(1 + ˆθ), if n_r

n(1 + ˆθ) ≤ ¹₂ 12, otherwise

We plug the MLE,θˆand ψ, back into the conditional probability,ˆ P_θ,_ˆ_ψ_ˆ(Y_i|R_i = 0) = θ(1ˆ −2ψ)

1−ψ(1 + ˆˆ θ)

Then, imputed valuesy_i^∗can be drawn randomly from this conditional prob-

(24)

ability. Otherwise,

y_i^∗=E_θ,_ˆ_ψ_ˆ(Y_i|R_i= 0)

=P_θ,_ˆ_ψ_ˆ(Y_i|R_i = 0) This form can be directly used as imputed values.

2.3 Bayesian imputation

Bayesian methods draw a parameter vector(θ^∗) from the posterior distribution of θ that is commonly defined in any other Bayesian context. In general, the posterior distribution can be defined as

p(θ|Y) = p(Y|θ)p(θ)

p(Y) (2.3)

Then, we draw imputed values with data distribution using the drawn parameter vector(θ^∗).

p(Y|θ^∗)

Bayesian Bootstrap Imputation - discrete data

Rubin et al. (1981) and Rubin and Schenker(1986) assumed that(d₁, ..., d_K), the possible values of Y, have probabilities (θ1, ..., θK) respectively. Here, this method draws a parameter vector(θ^∗) with posterior densityQ_K

k=1θ_k^q^k⁻¹, where qk = P

i∈srI(Yi=dk), when the prior of the parameter(θ) is set to be proportional to Q_K

k=1θ_k⁻¹. The first part of numerator of (2.3) is the probability mass function of multinomial distirbution,

p(Y|θ)∝ Y

i∈sr

P r(Y_i =y_i)

=

K

Y

k=1

θ^q_k^k

(2.4)

(25)

We plug the density of prior distribution into the second part of (2.3) to obtain the posterior distribution mentioned above.

p(θ)∝

K

Y

k=1

θ_k⁻¹ (2.5)

Having θ^∗ drawn with the posterior distribution,QK

k=1θ^q_k^k⁻¹, imputed values are then generated withp(Y|θ^∗).

Fully Normal Imputation - continuous data

Rubin and Schenker(1986) also discussed imputation for a continuous case that heavily relies on normality. Assuming that Y_i ∼iid N(µ, σ²) and f(θ)∝ 1

σ², whereθ = (µ, σ²), then the posterior distribution can be readily derived. We simply plug the probability density ofY|µ, σ² andθ into (2.3).

We replaceµ with the empirical sample meany¯=P

i∈sryi. f(σ²|Y,y) =¯ f(Y|σ²,y)f¯ (σ²|¯y)

f(Y|¯y)

∝(σ²)⁻ⁿ^r^/2exp n−1

2σ² X

i∈sr

(yi−y)¯ ² o

(σ²)⁻¹

= (σ²)⁻⁽ⁿ^r^/2+1)expn−1 2σ²

X

i∈sr

(y_i−y)¯ ²o

This posterior distribution ofσ² becomes Inverse-Gamma withα= nr

2 and β= ¹₂P

i∈s_r(y_i−y)¯², which can be rewritten asσ²|Y ∼ (nr−1)s²₁

χ²_n_r₋₁ , where s² = 1

nr−1 P

i∈sr(y_i−y)¯². The one value ofσ²_∗ is drawn with the posterior density.

By using σ_∗², we find another posterior distribution of µ this time. If Y¯ ∼N(µ,σ_∗²

nr) is already known, the posterior density ofµ can be obtained

(26)

in the same way right above.

f(µ|Y¯) = f( ¯Y|µ)f(µ) f( ¯Y)

∝expn −1

2σ²_∗/n_r(µ−Y¯)²o This also infers that the posterior density of µ is N( ¯Y ,σ²_∗

nr). Likewise, we draw one value forµas before. Those two selected parameters construct the conditional distribution, f(Y|µ^∗, σ_∗²), which imputed values are generated with.

Imputation Adjusted for Uncertainty in the Mean and Variance - continuous data

When the normality is uncertain in data, the idea of hot-deck can be employed here, taking the sample distribution of observed values into non- respondents. The parameters,µ^∗andσ²_∗can be drawn, as described in "fully normal imputation". Rubin and Schenker(1986) randomly drawsX ={y_i: i = 1, ..., n−n_r} from donors {Y_j: j = 1, ..., n_r} with replacement. In design-based approach

E(yi) = ¯Y = 1 nr

X

i∈sr

Yi

V ar(y_i) = (n_r−1)s²_r n_r = 1

n_r X

i∈sr

(Y_i−Y¯)²

If a new variable Zi is introduced such that Zi = y_i−Y¯ p(nr−1)s²_r/nr

, then it has zero expected value and one variance. Then, we calculate back imputed values that satisfyYi =µ^∗+σ∗Zi.

(27)

2.4 Missing mechanism and imputation

The type of missing mechanism is worthy to know prior to implementing imputation, if possible. At least, we can decide between an imputation method for "ignorable non-response"(MCAR and MAR) and "non- ignorable non-response"(MNAR). However, once the decision is made, missing mechanism no longer influences the calculation of imputed values unless non-ignorable non-response is the case. For imputations to ignorable non-response, the probability of non-response, P_ψ(Ri = 0|y_i, xi) does not contribute to finding imputed values.

For non-ignorable non-response, P_ψ(Ri = 0|y_i, xi), however, has a portion in deriving the conditional distribution, where random number genera- tion is based on:

Z

f_θ(y_i|x_i)P_ψ(R_i = 0|y_i, x_i)dy_i

Simply, the denominator term is a normalizing constant. The second term of the numerator can be determined by missing mechanism. To finalize the conditional distribution, the distributional assumption ofRi as well as dependency among R_i,y_i andx_i has to be determined.

2.5 Advantage and disadvantage of methods

Residual regression imputation is encouraging in a way that imputed values can come closer to real data points. Here, each imputed value includes the predictive mean E(Yi) as well as the corresponding residual e^∗_i. Given that the regression is adequately fitted, the imputed valuesy_i^∗= ˆE(Y_i) +e^∗_i will resemble the true values of non-respondents. However, the assumptions of regression are not easily met in practice. Particularly, in case that the true

(28)

distribution ofY_i is skewed, this method cannot precisely reflect skewness in the imputed values. Additionally, hot-deck over the entire observed residuals might be inefficient.

Nearest neighbor and stratified hot-deck imputation cover these draw- backs somehow. Their main idea is to pursue hot-deck within the pool of similar profiles to each non-respondent. First, hot-deck can roughly project the sample distribution of y_j to imputed values y_i^∗. Secondly, they rea- sonably modified the pool of donors through predictive mean matching or stratification. However, as expected, the precision of predictive mean or stratification still remains questionable.

Imputation in MNAR directly estimates the assumed true distribution of non-respondent Yi,fθ,ψ(yi|R_i = 0, xi). Since non-respondent is of interest, imputed values have to be drawn with the distribution of non-response, not those of observed(donors). Of course, estimated parameters,θˆandψˆare still based on observed values. Distributional assumptions, such as the specific form offθ(yi|x_i)andPψ(Ri|y_i, xi)are necessary. Examining whether MNAR actually underlies in data appears very challenging, almost impossible.

A common drawback across all introduced methods is that there is no consideration for uncertainty of imputed values. As long as imputed values were obtained through the random draw process, we should consider their uncertainty (variation). Because, the random draw results in a very different set of imputed values, even if we run the same imputation method. It is impossible to measure this variation unless we generate the same imputation many times. Hence, the absence of this variation term leads to the underestimated variance of our estimates in further analysis with complete data.

(29)

2.6 Multiple imputation

Multiple imputation (MI), developed by Rubin (1987), properly handles variance underestimation of single imputation (SI). As discussed earlier, the underestimated variance of SI originally comes from ignorance of imputation uncertainty. The problem is that the uncertainty (variation) between SI’s is not measurable, because SI basically runs a single time. Rubin (1987) suggested to run SImtimes, and this suggestion made it possible to measure the variation between imputations. This is called multiple imputation (MI).

Assume that

y={y_i:i∈sr, yj :j ∈s−sr}, (2.6) wheresandsr denote a full sample and observed sample respectively. After conducting MI, we derive mdifferent completed samples,

y_(k)^∗ ={y_i :i∈sr, y_j(k)^∗ :j∈s−sr}, k= 1, ..., m (2.7) Suppose that our aim is to estimate a population quantityθwith a given θ. Here, we findˆ θˆwith the multiply-imputed data instead of the original one with non-respondents. Apparently,mdifferent estimatesθˆ^∗₁, ..., θˆ^∗_mcan be computed according to each of the m sets of imputed data, y_(k)^∗ from (2.7). Thereafter, the grand mean and variance estimate of the grand mean can be defined as following.

θ¯^∗ = 1 m

m

X

i=1

θˆ^∗_i (2.8)

W = ¯V^∗+ (1 + 1

m)B^∗, (2.9)

whereW = ˆV(¯θ^∗),V¯^∗ = _m¹ Pm

i=1Vˆ_i^∗ and B^∗ = _m−1¹ Pm

i=1(ˆθ^∗_i −θ¯^∗)². Rubin (1987) also stated that a student-t approximation can be used for the dis-

(30)

tribution ofθˆ^∗,

W^1/2(¯θ^∗−θ)∼tv, (2.10) where v= (m−1)

h V¯^∗ (1 + 1/m)B^∗

i2

For the full validity of the (2.8), (2.9) and (2.10), Rubin (1987) required a "proper" imputation method to be processed. According to Rubin (1996), several conditions for being a "proper" imputation were briefly mentioned.

First, the MI estimators withm→ ∞, θ¯^∗ and V¯^∗ should be approximately unbiased for θ(y)ˆ and Vˆ(y), where they are estimators based on the full sampley in (2.6):

E(¯θ^∗|Y_obs,y)≈θ(y)ˆ and

E( ¯V^∗|Y_obs,y)≈Vˆ(y)

Likewise,B^∗ has to be approximately unbiased for the true variance of the MI estimatorθ¯^∗, when m→ ∞:

E(B^∗|Y_obs,y)≈V ar(¯θ^∗|Y_obs)

Theθ¯^∗ should, moreover, follow approximate normality with expectationθˆ^∗ and variance B^∗:

θ¯^∗|Y_obs,y∼N(ˆθ(y), B^∗).

Yet, Bjørnstad (2007) pointed out that many imputation methods used by national statistics bureaus barely meet these requirements. If it is the case, the latter term in (2.9) that mainly accounts for the between-variance of m different θˆ^∗ is believed to be under-measured. Bjørnstad (2007) compensated the under-measured term by finding the right k in the following fashion.

1

(31)

wherek= 1/(1−f)andf is non-response rate. However, Bjørnstad (2007)’s argument was built upon specific assumptions such as missing mechanism and the type of imputation method.

We still need to see whether thek= 1/(1−f)uniformly holds in diverse environments that we may come across in practice. Simulation studies can contribute to examining the validity ofk. Since the true values ofW,V¯^∗and B^∗ in (2.11) are all known in a setting of simulation, k can be accordingly calculated back.

k= W −V¯^∗ B^∗ − 1

m

One of the advantages being expected from simulation study is that we can observek values under various conditions. For example, we generate many cases by changing sample size (=n), non-response rate (= f), imputation method, the number of iteration in multiple imputation (=m) and so forth.

This operation will be actualized in the upcoming chapters of simulation study.

2.7 R-packages for multiple imputation

2.7.1 MICE

The package called "multivariate imputation by chained equations (mice)"

is widely used in R. In this "mice" package, the "mice" function performs multiple imputation (MI). The package and functions can be easily down- loaded and active in R, by using the following R commands.

> install.packages("mice")

> require(mice)

Including the "mice" function, a series of R commands is needed to finalize statistical analysis in presence of non-respondents. Depending on the

(32)

type of parameters of interest, we can take either of two different tracks. The first track aims to estimate a population total or mean, as census bureaus have done traditionally. The other track is used to find coefficient estimates from the modelling process. The figure below depicts how they flow.

Figure 2.1: Two track process

1. Track 1: Estimation of population total/mean.

2. Track 2: Estimation of coefficients from a model.

The specific details of the "mice" function will be given first. Then, follow-up procedures will be discussed for each of those two tracks. For clear understanding, some exemplary R codes will be enclosed as well.

1. mice()

This function pursues MI and returns complete data. The options in Table 2.1 mainly specify the number of imputations as well as the type of methodology to be used. The following table details the options, based on Buuren and Groothuis-Oudshoorn (2011).

(33)

Option Description

data A dataframe or matrix that contains imputation

target as well as auxiliary variables.

m The number of imputations to be pursed.

method "pmm" Predictive mean matching.(numeric)

"sample" Hot-deck.(any)

"norm.nob" Non-Bayesian linear regression.(numeric)

"norm" Bayesian linear regression.(numeric)

"mean" Grand mean imputation.(numeric)

"logreg" Logistic regression.(factor, 2 levels)

"polyreg" Multinomial regression.(factor, >2 levels)

"lda" Linear discriminant analysis.(factor)

predictorMatrix A p×p square matrix for defining which variables will be used as preditors in modelling, where p is the number of columns.

e.g. When variablesX₁, ..., X₃ are present in data, a 3×3 matrix can be derived, such as





0 1 0 1 0 1 1 1 0





Each row corresponds to a target variable, while each column does variables used in a model. And the element 1 means the column variable is used as a predictor.

With the given matrix above,X1∼X2,X2∼X1+X3

and X₃ ∼X₁+X₂ hold.

Table 2.1: Options of the mice function

2. Exemplary study, Track 1: Estimation of population mean

Assuming that our final aim is to estimate population mean, we need to develop several extra coding, as well as the "mice" function. It is indeed worthy to clarify how other functions of the "mice" package can be used in estimation process. An artificial sample data with non-respondent is created first, where Y_i and X_i are our target and auxiliary variable.

> n=2000 # The number of observations

(34)

> f=0.4 # Non-response rate

> y.i=runif(n,0,5)

> x.i=runif(n,0,5)

> sample.data=data.frame(y.i,x.i)

> missing=sample(1:n,f*n,replace=F) #Select where to miss out.

> sample.data[missing,1]=NA

Then, imputations are processed using the "mice" function. We specify the number of imputations,m= 3, and use predictive mean matching.

> require(mice)

> imputed.sample.data=mice(data=sample.data, m=3, method="pmm") It is good to check if the imputations are well processed by looking at

the small part of sample data and completed data.

> head(sample.data)

> head(data.frame(

m1=complete(imputed.sample.data,1)[,1]

,m2=complete(imputed.sample.data,2)[,1]

,m3=complete(imputed.sample.data,3)[,1]))

Figure 2.2: Checking imputed values

(35)

To obtain Rubin (1987)’s equation (2.8) and (2.9) in the Multiple Imputation section, "for" iteration and "pool.scalar" function can be used.

> N = 100000 ### The number of population observation

> m = imputed.sample.data$m

> theta_star_hat=c()

> V_star_hat=c()

> for (i in 1:m) {

theta_star_hat[i] = mean(

complete(imputed.sample.data, i)$y.i) V_star_hat[i] = (1-n/N)*

(var(complete(imputed.sample.data, i)$y.i) / n)

### (variance of estimate) }

> pool.scalar(Q=theta_star_hat, U=V_star_hat ,method = "rubin")$qbar

[1] 2.52628

> pool.scalar(Q=theta_star_hat, U=V_star_hat ,method = "rubin")$t

[1] 0.0015755

Note that the value of "qbar" and "t" correspond toθ¯^∗ and W in the equation (2.8) and (2.9).

3. Exemplary study, Track 2: Estimation of coefficient in modelling In practice, modelling also becomes our aim of analysis. Here, we assume that the main interest is to estimate coefficient of an ordinary regression under MI. For simplicity, the regression is to projectY_i on Xi. The code for creating artificial sample data and implementing

(36)

imputations is identical to that of Track 1. But, "with" and "pool"

function are used in estimating.

> fit = with(data=imputed.sample.data,exp=glm(y.i~x.i))

> pool(fit)$qbar

(Intercept) x.i

2.50004980 0.01040729

> pool(fit)$t

(Intercept) x.i

(Intercept) 0.02439523 -0.010614243 x.i -0.01061424 0.004830097 2.7.2 Amelia

According to Honaker, King, Blackwell, et al. (2011), "Amelia" was developed under the assumptions of multivariate normality and missing at random (MAR). As an example, let us assume that we have data matrix as follows,

Y =





 y₁ y₂ y₃ y₄ y₅ y₆







=







y11 y12 y13 y14 y15

y₂₁ y₂₂ y₂₃ y₂₄ y₂₅ y31 y32 y33 y34 y35

y₄₁ y₄₂ y₄₃ y₄₄ y₄₅ y51 y52 y53 y54 y55

y₆₁ y₆₂ y₆₃ y₆₄ y₆₅







M issing

=⇒







y11 y12 y13 − y15

y₂₁ − − y₂₄ y₂₅ y31 y32 y33 y34 −

− y₄₂ y₄₃ y₄₄ y₄₅ y51 − − − y55

− y₆₂ y₆₃ y₆₄ −







The ith row vector, Yi = y_i ∼ N5(µ,Σ), i = 1,2, ...,6. Also, we can split Y_i into two components such as Y_i = (Y_i,obs,Y_i,mis). For instance, Y2 = (Y_2,obs = (y21, y24, y25),Y2,mis = (y22, y23)). The main idea of this

(37)

package is to draw Y_i,mis with a conditional probability, P(Y_i,mis|Y_i,obs; ˆµ,Σ),ˆ

Derivation of the maximum likelihood estimates µ,ˆ Σˆ can be achieved through estimation maximization(EM) algorithm. It is widely known that EM algorithm performs well in estimating parameters with the presence of unobserved data. For the E-step, the conditional expectation ofY_mis given Yobs,E(Ymis|Y_obs; ˆµ_old,Σˆold), replaces the missing spots. The initial values of µˆ_old and Σˆ_old are arbitrarily chosen to find the conditional expectation.

The M-step is to find the updated parameter estimates, µˆ_new and Σˆ_new, which maximizes the likelihood of Y_obs and E(Ymis|Y_obs; ˆµ_old,Σˆ_old). Then, we go back to the E-step with the newly obtained parameter estimates and find another updated estimates and repeat this iteration over again until the parameter estimates converge.

More importantly, Amelia generatesmbootstrapped sub-samples of size nbefore implementing EM algorithm and imputation procedure mentioned above. That is why the entire algorithm built in Amelia is called expectation maximization with bootstrapping (EMB). Themdifferent sub-samples results in mdifferent imputed values for one missing unit. We can combine them sets of imputed data as we did in multiple imputation. Honaker and King (2010) provided further theoretical details behind Amelia package.

> ## Import "Amelia"

> install.packages("Amelia")

> require(Amelia)

> ## Import MASS to generate sample data

> install.packages("MASS")

> require(MASS)

(38)

> sigma=matrix(3,4,4)

> diag(sigma)=c(10,8,8,5)

> raw=mvrnorm(200, rep(0,4),sigma)

> ## Make NA’s

> for(i in 1:dim(raw)[2]){

+ miss_out=sample(1:200,40) + raw[miss_out,i]=NA

+ }

> ## Run Amelia

> output_raw=amelia(raw, m=5)

> ## Call imputed datasets

> output_raw$imputations$imp1

> output_raw$imputations$imp5 2.7.3 Hmisc

The functions, impute()and aregimpute(), from "Hmisc" package developed by Alzola and Harrell (2006) can be used for multiple imputation.

Theimpute()simply imputes mean, median, or max into missing, or im- plements hot-deck. Let us use the old sample data with non-respondent from the MICE section.

> ## Generate sample data

> n=2000

> f=0.4 # Non-response rate

(39)

> x.i=runif(n,0,5)

> sample.data=data.frame(y.i,x.i)

> ## Make NA’s

> missing=sample(1:n,f*n,replace=F)

> sample.data[missing,1]=NA

> ## Process imputation

> imputed.sample.data=impute(sample.data$y.i,fun = mean)

> imputed.sample.data=impute(sample.data$y.i,fun = median)

> imputed.sample.data=impute(sample.data$y.i,fun = "random")

> ## Result

> head(sample.data$y.i)

[1] NA NA 2.506376 NA 4.378280 3.932805

> head(imputed.sample.data)

1 2 3 4 5 6

2.553923* 2.553923* 2.506376 2.553923* 4.378280 3.932805 The checkpoint here is that an user should specify a single target variable to be imputed and a method to be used in the impute() function.

Then, it returns entire data including imputed values with an asterisk mark.

If multiple imputation is the case, one should build his/her own iteration coding manually.

The function aregimpute()based on bootstrapping, additive regression and predictive mean matching is also widely used in practice.

1. Bootstrapping : Among the non-missing part of a target variable being imputed, m different sub-samples are randomly drawn with replacement from the entire data.

2. Additive regression: In each sub-sample, we find one optimal additive

(40)

model with the largestR² than any other models. Then, each opitmal model calculates the predicted values of the target variable.

3. Preditive mean matching: Each missing spot is now filled with an observation whose predicted value is closest to the predicted value of the each missing spot.

> ## Get sample data with missing ready

> esoph_NA=esoph

> esoph_NA=data.frame(esoph_NA)

> for(i in 1:ncol(esoph)){

+ missing_spot=sample(1:nrow(esoph),round(nrow(esoph)*0.2)) + esoph_NA[missing_spot,i]=NA

+ }

> ## Run multiple imputation through aregImpute

> set.seed(500)

> (result=aregImpute(~agegp + alcgp + tobgp + ncases + ncontrols

+ ,n.impute=5

+ ,type = "pmm"

+ ,data=esoph_NA

+ ))

If we call the object "result", R displays summary information about multiple imputation processed by aregimpute()(Listing 2.1). The information includes, for instance, the list of target variables being imputed, the number of impuations(m), the number of knots(nk) used in additive models, theR² of each model and so on.

result$imputed

(41)

The upper command returns a data matrix of size n_mis ×m for each target variable being imputed. Also, we can extract the imputed values of a certain target variable by using the second command. Listing 2.2 shows us how the resulting matrix looks like. Since this package returns imputed values only, as mentioned earlier, some additional coding for replacing NA’s with imputed values will be necessary.

Listing 2.1: R output

M u l t i p l e I m p u t a t i o n u s i n g B o o t s t r a p and PMM

a r e g I m p u t e ( f o r m u l a = ~agegp + a l c g p + tobgp + n c a s e s

+ n c o n t r o l s , d a t a = esoph_NA , n . impute = 5 , t y p e = "pmm" )

n : 88 p : 5 I m p u t a t i o n s : 5 nk : 3

Number o f NAs :

agegp a l c g p tobgp n c a s e s n c o n t r o l s

18 18 18 18 18

t y p e d . f .

agegp c 5

a l c g p c 3

tobgp c 3

n c a s e s s 2

n c o n t r o l s s 1

T r a n s f o r m a t i o n o f T a r g e t V a r i a b l e s Forced t o be L i n e a r

R−s q u a r e s f o r P r e d i c t i n g Non−M i s s i n g V a l u e s f o r Each V a r i a b l e Using L a s t I m p u t a t i o n s o f P r e d i c t o r s

(42)

agegp a l c g p tobgp n c a s e s n c o n t r o l s 0 . 7 6 2 0 . 7 3 4 0 . 7 6 9 0 . 6 5 8 0 . 7 5 8

Listing 2.2: R output

$agegp

[ , 1 ] [ , 2 ] [ , 3 ] [ , 4 ] [ , 5 ]

19 2 4 6 3 2

25 6 4 1 6 1

28 5 2 1 2 1

30 3 2 4 4 5

32 3 3 6 2 6

37 3 6 1 1 2

41 1 2 6 6 2

$ a l c g p

[ , 1 ] [ , 2 ] [ , 3 ] [ , 4 ] [ , 5 ]

5 2 1 2 1 1

7 3 2 2 4 3

8 1 1 2 1 1

9 1 4 2 2 2

19 2 1 2 1 1

20 1 1 1 1 1

33 3 4 4 1 1

2.7.4 VIM

The "VIM" package also includes several well-developed functions for handling imputation procedure. Not merely hotdeck() and kNN(), which implement hot-deck and k-nearest neighbor imputation respectively, but the

(43)

Kowarik and Templ (2016) detailed the actual applications of the functions mentioned above. According to Templ, Kowarik, and Filzmoser (2011), the irmi() heavily relies on modelling methods that fill in missing cells with predicted values plus noise. Interestingly, it is capable of handling outliers through robust methods as well as semi-continuous variables on top of other ordinary data type.

Templ et al. (2011) also detailed several steps programmed in irmi():

Step 1. The irmi() function simply pursues either mean-imputation or k-nearest neighbor imputation for all the missing cells at the very beginning.

Step 2-1. One variable, called response, is regressed on the rest of variables, called predictors, from the entire data set. But, the variable with the most missing cells should be modelled first. Then, the one with the next most missing cells comes and so on.

Step 2-2. Model development is completely dependent upon observations whose response variable being regressed is non-missing. Cer- tainly, we may have observations whose response is non-missing but predictors are missing. Then, we use the initially imputed values in the step 1 instead in modelling.

Step 2-3. The irmi() automatically detects the data type of each variable being imputed and selects a correct modelling method accordingly.

Please refer to the following table for further details:

Step 3. The derived model updates the mean or kNN imputed cells with the sum of predicted values and noise. Then, it repeats the Step2- 1 to Step3 until the imputed values converge.

Since the actual usage of this function is quite straight forward, the brief introduction of available options in irmi() will be given here.

(44)

The type of data Modelling method

Continuous Robust regression with the identity link Categorical Generalized linear model(robust is optional)

Binary Logistic regression(robust is optional)

Semi-continuous 1. Logistic regression for imputing a constant(mostly zero) 2. For continuous, robust regression

Count Generalized linear model with Poisson and the log link Table 2.2: Modelling method used for each data type.

result=irmi(mydata,mixed=c("x1","x2")

,modelFormulas=list(x1=c("x2","x3"), x2=c("x1"),

x3=c("x1","x2")) ,trace=TRUE)

)

Even though irmi() find the right model for each variable type automatically, the "mix" option can still force irmi() to recognize the stated variables as semi-continuous. "ModelFormulas" is an option for specifying which re- gressors to be used in modelling, and "trace" shows us detailed information for each imputation.

Simulation Study for Non-Bayesian Multiple Imputation.