• No results found

Simulation Study for Non-Bayesian Multiple Imputation.

N/A
N/A
Protected

Academic year: 2022

Share "Simulation Study for Non-Bayesian Multiple Imputation."

Copied!
90
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

Simulation Study for

Non-Bayesian Multiple Imputation

Yunsung Lee

Master’s Thesis, Spring 2017

(2)

This master’s thesis is submitted under the master’s programme Modelling and Data Analysis, with programme optionStatistics and Data Analysis, at the Department of Mathematics, University of Oslo. The scope of the thesis is 60 credits.

The front page depicts a section of the root system of the exceptional Lie group E8, projected into the plane. Lie groups were invented by the Norwegian mathematician Sophus Lie (1842–1899) to express symmetries in differential equations and today they play a central role in various parts of mathematics.

(3)

Abstract

Rubin (1987)’s combination formula for variance estimation in multiple imputation (MI) requires a imputation method to be Bayesian-proper. How- ever, many census bureau have heavily relied on non-Bayesian imputations.

Bjørnstad (2007) suggested an inflated factor (k1) in Rubin (1987)’s combi- nation formula for non-Bayesian imputations. This paper aimed to verify the theoretical derivation of Bjørnstad (2007) in computer simulation. Within Bjørnstad (2007)’s pre-assumed environment, the inflated factor,k1, closely approached the simulated true value, E(k), irrespective of sample size and missing rate. With California schools’ data, confidence intervals using k1 also achieved a desired coverage, (1−α)%, across varying sample size and missing rate, except in case of MNAR because of biased imputation.

Key words: Multiple imputation; non-Bayesian method; variance esti- mation; inflated factor; simulation.

(4)

Acknowledgements

I would like to express my appreciation to professor Jan F. Bjørnstad for his endless support. It was such a great honor to be his last master student.

I would also like to thank to my family for their supports through my whole life.

Last, but most importantly, I could never have made it without God’s guidance.

Phillipians 4:6

Do not be anxious about anything, but in every situation, by prayer and petition, with thanksgiving, present your requests to God. 7 And the peace of God, which transcends all understand- ing, will guard your hearts and your minds in Christ Jesus.

(5)

Contents

1 Sample surveys with non-response 7

1.1 Types of missingness . . . 8

1.2 Design-based approach . . . 10

1.3 Model-based approach . . . 10

1.4 Missing data mechanism . . . 12

2 Imputation methods and multiple imputation 16 2.1 Imputation in MCAR/MAR model . . . 17

2.2 Imputations in MNAR models . . . 19

2.3 Bayesian imputation . . . 22

2.4 Missing mechanism and imputation . . . 25

2.5 Advantage and disadvantage of methods . . . 25

2.6 Multiple imputation . . . 27

2.7 R-packages for multiple imputation . . . 29

2.7.1 MICE . . . 29

2.7.2 Amelia . . . 34

2.7.3 Hmisc . . . 36

2.7.4 VIM . . . 40

3 The History of Multiple Imputation 43 3.1 Theoretical development . . . 43

3.2 Early application: projects and software . . . 44

(6)

4 Simulation studies using R 45

4.1 Estimating correctk . . . 45

4.1.1 Population Average with Hot-deck Imputation . . . . 48

4.1.2 The Regression Coefficient in the Ratio Model with Residual Imputation . . . 49

4.1.3 The Regression Coefficient in Simple Regression with Residual Imputation . . . 51

4.2 A survey of California schools . . . 53

4.2.1 Investigation of data . . . 53

4.2.2 Simulation . . . 54

5 Conclusion and discussion 60 5.1 Conclusion . . . 60

5.2 Discussion . . . 61

Appendices 62 A R code 63 A.1 Population average . . . 63

A.2 Ratio model . . . 66

A.3 Simple regression model . . . 68

A.4 California schools . . . 71

B The convergence plot of E(k) 76

References 86

(7)

List of Figures

1.1 Process of statistical inference in census bureau . . . 8

1.2 Difference between unit or item non-response . . . 9

1.3 Scatter plots for each case of missing mechanism . . . 14

2.1 Two track process . . . 30

2.2 Checking imputed values . . . 32

4.1 Structure of the simulation . . . 46

4.2 Characteristics of sample data . . . 53

4.3 Confidence intervals for hot-deck under MAR . . . 58

4.4 Confidence intervals for hot-deck under MNAR . . . 58

4.5 Confidence intervals for residual imputation under MNAR . . 59

B.1 Convergence of E(k) for population average under MCAR . . 77

B.2 Convergence of E(k) for population average under MAR . . . 78

B.3 Convergence of E(k) for population average under MNAR . . 79

B.4 Convergence of E(k) for a ratio model under MCAR . . . 80

B.5 Convergence of E(k) for a ratio model under MAR . . . 81

B.6 Convergence of E(k) for a ratio model under MNAR . . . 82

B.7 Convergence of E(k) for simple regression under MCAR . . . 83

B.8 Convergence of E(k) for simple regression under MAR . . . . 84

B.9 Convergence of E(k) for simple regression under MNAR . . . 85

(8)

List of Tables

1.1 Artificial sample data for understanding missing mechanism . 14

2.1 Options of the mice function . . . 31

2.2 Modelling method used for each data type. . . 42

4.1 Fixed and changeable factors . . . 48

4.2 E(k) for population average . . . 49

4.3 E(k) for ratio model . . . 50

4.4 E(k) for simple regression. . . 52

4.5 Fixed and changeable factors. . . 56

4.6 Coverage for California schools. . . 57

4.7 E( ¯Ys) for California schools . . . 57

(9)

Chapter 1

Sample surveys with non-response

Historically, many census bureaus (CB) have collected sample data through a format of survey in pursuit of estimating population parameters, such as a total sum, mean or standard deviation. In this process of data collection, non-response has always been one of the biggest challenges. Non-response indeed indicates unexpected information loss. Badly, this loss of informa- tion can lead us biased parameter estimation as well as large uncertainty in estimates. Accordingly, many remedies for missingness have been devel- oped to achieve valid statistical inference. As widely known, imputation and weighting are typically used in practice.

The main process from sampling to inference can be depicted as shown in Figure 1.1. CB draws some samples based on a certain sampling method, such as simple random sampling, stratified random sample or etc. When non-respondents are detected in sample, the type of missingness has to be found to determine the most adequate remedy (e.g. imputation or weight- ing). More details will be discussed in the next sub-topic, "Types of miss- ingness".

If imputation is the case, we need to go through some preliminary work in

(10)

Figure 1.1: Process of statistical inference in census bureau

order of successful imputation. The first work is called missing mechanism.

For some imputation methods, we need to assume how non-response possibly depends on other variables. In reality, people are more or less likely to respond certain questions relying on their personal profile. Secondly, we need an assumption for a variable of interest. Many imputation methods require the stochastic nature of a variable of interest, and model-based approach plays its role in assuming this nature. The later two sub-topics will also detail the assumption and missing mechanism.

1.1 Types of missingness

Non-response can be recognized in two different types, unit or item non- response. Unit non-response is failure of obtaining any information from participants for some reasons. Briefly speaking, people could/did not answer any of questionnaire. Data is usually stored in a form of rectangle or matrix, and then we usually recognize each row as a single unit/individual, while columns refer to the questioning items of survey. Here, we have empty rows for those who were expected to answer but did not actually.

(11)

Figure 1.2: Difference between unit or item non-response

questionnaire were not answered. Again, for some reasons, people are more or less likely to answer some certain types of questioning items. As an example, a questioning item asking people’s gross annual income has a lower response rate. Because, people may feel uncomfortable when a survey asks for their private information. In this case, we keep rows for those who partially answered a questionnaire, but still have some NA’s in columns (questioning items) that they did not want to respond.

There are two typical remedies, one for each case. For unit non-response, weighting method can be applied. Specifically, an under-represented group can be compensated by giving more weight on it, or vice versa. For item non-response, a method of imputation is widely used in modern statistics. A basic idea is to impute estimated values into missing spots so that completed data improves representativeness of sample. If the imputed sample turns out to be a well-representative miniature of the population, inference will gain more confidence. The estimates are more likely to get close to corresponding population parameters.

(12)

1.2 Design-based approach

In survey sample, there are two primary approaches for statistical in- ference of population parameters(e.g. population total, PN

i=1yi, or mean,

1 N

PN

i=1yi). "Design-based" approach is a traditional method based on prob- abilistic unit selection. Define an indicator variable such that for all units, i= 1,2, ..., N,

Ii =





1, ifi∈sample s 0, otherwise

Then, the inclusion probability of jth unit can be written as P(j ∈ s) = P(Ij = 1) =πj. Here, strictly speaking, the indicator variable of each unit becomes probabilistic, and a variable of interest, yi, is regarded fixed. For convenience, some literature simply describes a samplesas stochastic, since a sample constituted by probabilistic units is also probabilistic. As widely known, simple random sampling, stratified random sampling, systematic random sampling or etc are typical examples.

1.3 Model-based approach

On the other hand, "model-based" approach can be introduced when Yi is assumed to be probabilistic and sample s is fixed. Särndal et al.

(1978) argued that a variable of interest,Yi, appears to be a random variable with a certain distribution (e.g. Gaussian, Binomial, Poisson, Weibull or so forth), and yi is the realized value of the corresponding random variable, Yi. If necessary,Yican be statistical-modeled conditioned on other observed variables. For better understanding, the following will describe some of

(13)

1. Ratio model

Assume that an auxiliary variable, X is always available over all obser- vations.

Yi=βxi+i where E(i) = 0,V ar(i) =σ2xi and Cov(i, j) = 0 βˆopt =

P

i∈sYi

P

i∈sxi

pred=P

i∈sYi+P

i /∈si = ˆβoptP

i∈sxi+ ˆβoptP

i /∈sxi = ˆβopttx

V ar( ˆTpred−T) =N2 1−fn x¯x¯rx¯

s σ2 2. Simple linear regression

Yi12xi+i whereE(i) = 0,V ar(i) =σ2 and Cov(i, j) = 0 βˆopt =

P

i∈sYi/xi

n

pred=P

i∈sYi+P

i /∈si =P

i∈sYi+ (n1P

i∈s Yi

xi)P

i /∈sxi

V ar( ˆTpred−T) =σ2((

P

i /∈sxi)2

n +P

i /∈sxi2) 3. Common mean model

Yi=β+i whereE(i) = 0,V ar(i) =σ2 and Cov(i, j) = 0 βˆopt = n1P

i∈sYi = ¯Yspred=P

i∈sYi+P

i /∈ss=NY¯s V ar( ˆTpred−T) =N2(1−f)σn2

Many imputation methods are heavily dependent on this model-based approach. The reason is that imputation toward non-respondent is basically a form of prediction. We have partial lack of information about a variable of interest, and imputation essentially fills it out with scientifically reason- able guess. And, the distributional assumption on Yi from a model-based approach makes possible to obtain plausible prediction for missing values.

For instance, imputations based on regression or random selection (e.g. hot- deck) all start with the distributional assumption.

(14)

1.4 Missing data mechanism

Rubin (1976) first introduced missing data mechanism that structured associations between missing and other variables. The links are recognized in three different ways, based on which variables affect missing occurrence.

When none of the variables has any portion in missing occurrence, we call it "missing completely at random"(MCAR). Here, a variable of interest Yi

and auxiliary variableXi are fixed. AndRi is a stochastic indicator variable for missingness such that

Ri =





1, ifi∈sr, 0, otherwise

wheresr denotes an observed sample. Thus,Ri has a binomial distribution with response rate. In mathematical notations, the response rate under MCAR can be expressed as

pi =P(Ri = 1|Yi, Xi) =P(Ri = 1)

However, if the link between an auxiliary variable Xi and unobserved Yi is found, we assume "missing at random"(MAR) mechanism. In reality, MAR has been detected quite frequently. The lower response rate of lower socio-economic class to a political poll, for instance, is the typical case of MAR. Likewise, the response rate of MAR mechanism can be written as

pi =P(Ri= 1|Yi, Xi) =P(Ri= 1|Xi)

The third mechanism is called "missing not at random"(MNAR), when evenYi, as well asXi, influences missing. In addition toXi, evenYipossibly affects its own missing occurrence. In the case thatY is missing, both X

(15)

Then, theri of MNAR can be simply written as pi =P(Ri= 1|Yi, Xi)

As a matter of fact, MNAR is widely known as the most tricky case to handle. J. L. Schafer and Graham (2002) also argued that it is almost impossible to detect MNAR in real world data, and one even needs the specified distribution of missing.

To visualize the pattern of mechanism introduced above, artificial sam- ple data was generated. Two variables, Yi and Xi for i=1, ..., 500, were present in the sample. And some of Yi were missed out on purpose, based on the three mechanism respectively. In order for the intentional missing- out process, response rates,pM CAR,i,pM AR,i and pM N AR,i for all i, were in required. First, pM CAR,i was simply set to be a fixed constant. In order for a clear association between missing and variables, the framework of logistic regression was used to define pM AR,i and pM N AR,i, such as

pM CAR,i=c, pM AR,i = eβ1Xi

1 +eβ1Xi, pM N AR,i = eβ2Yi3Xi 1 +eβ2Yi3Xi

where c = 0.5, β1 = −0.005, β2 = −0.003, and β3 = −0.005. All these parameters are arbitrarily chosen in this simulation. Then, Bernoulli random number generator creates the realized values of RM CAR,i, RM AR,i

and RM N AR,i, based on the given response rates. Figure 1.1 shows the illustration of the sample discussed above.

(16)

Table 1.1: Artificial sample data for understanding missing mechanism

Figure 1.3: Scatter plots for each case of missing mechanism

The red dots indicate missing values under each mechanism, while the black ones are observed values. The left panel (MCAR) clearly has non- respondents over the whole x and y plain, since missing does not depend on any of variables. In the middle panel (MAR), more missing was found as

(17)

influence missing occurrence. The right panel (MNAR) showed the majority of non-responses appeared on the upper right part where both Yi and Xi

are high.

This mechanism truly matters in deciding what to do with missing. Be- cause, defining mechanism also means understanding of how missing hap- pened in data. As an example, the type of assumed mechanism determines specific details of imputation method. Depending on the defined mecha- nism, we decide whether to stratify our data, manipulate chances of random selection or more. We can fairly handle dependency of missing through these several techniques. Then, imputation will lead us to well-representing sample and, further, unbiased statistical inference.

(18)

Chapter 2

Imputation methods and multiple imputation

Intuitively, imputed values can be generated with the estimated condi- tional distribution of non-respondent,

fθ,ˆψˆ(yi|Ri= 0, xi), (2.1) where the θ and ψ denote the distributional parameter vector of Y and R.

Because, the original values at missing cells had been also generated through the distribution with the true parameters θ and ψ, before they became missing. This method artificially regenerates those numbers at missing cells based on an estimated distribution, not the true distribution. Monte Carlo Markov Chain (MCMC) can be a technical solution for this regeneration.

However, this method requires distributional assumptions for each ran- dom variable and reasonable parameter estimations. Of course, a R-package called "Amelia" successfully actualized this method by relying on missing at random mechanism, multivariate normality and EM algorithm. But, if one of the assumptions is wrongly addressed, this method possibly mislead

(19)

2.1 Imputation in MCAR/MAR model

Under MCAR or MAR, an attempt to impute missing cells with observed sample distribution has been done. The rationale for this attempt can be readily understood by doing some algebra with the conditional density from (2.1). Note thatP(Ri= 0|Yi, Xi) =P(Ri= 0|Xi) under MAR.

fθ,ψ(yi|Ri = 0, xi) = fθ(yi|xi)Pψ(Ri = 0|yi, xi) Pθ,ψ(Ri= 0|xi)

= fθ(yi|xi)

((Pψ(R((i = 0|x(((i) ((((((((

Pθ,ψ(Ri = 0|xi)

=fθ(yi|xi)

Likewise, we can find that fθ,ψ(yi|Ri = 1, xi) = fθ(yi|xi). This indicates that observed and missing cells all share the same distribution.

Hot-deck is heavily based on the fore-mentioned framework. Each miss- ing cell is replaced with one random draw from the entire observed data.

Other advanced methods also go through a random draw from a group of donors (observed values), with a same probability. Only the way of form- ing donors’ group makes difference among each methodology. The following details several non-Bayesian imputations.

1. Residual regression imputation.

A core process is to draweirandomly, with replacement, from the ob- served residuals,ej, based on a corresponding statistical model.Therefore, the very first job is to fit the most adequate model with observed data.

After fitting a model, fitted values and randomly drawn residuals con- stitute imputed values. So to speak,yi= ˆyi+ei, whereyˆi is a fitted value. Here, an auxiliary variable ,Xi’s are all available. The following describes how to obtain imputed values using a ratio model.

- Ratio model

(20)

Fit a model and obtain parameter estimates based on observed data. i.e. βˆr=

P

i∈srYi

P

i∈srxi and ˆσ2 = n−11 P

i∈sr

1

xi(Yi−βxˆ i)2. Stan- dardized residuals,ej can be derived in a following fashion.

ej = (yj −βˆrxj)/√

xj, wherej ∈sr

For i∈ s−sr, we randomly select ei from observed ej for j ∈ sr, and then compute our imputed values yi = ˆβrxi+ei

xi

backward.

2. Nearest neighborhood with predictive mean matching

The basic idea is to down-size the pool of observed yj for each non- respondent, where hot-deck is being processed. Theyj’s in the down- sized pool, also being called nearest neighbors, are selected through predictive mean matching. If the predictive means of some observed values are close to the one of a non-respondent, then the observed values belong to the down-sized pool. So, imputed valuesyi are not selected from all observedyj, but some close neighbors yj instead.

Little (1988) suggested this operation with a clear notation. Assuming thati∈s−sr,

yi=yj,

where (ˆµi−µˆj)2 ≤ (ˆµi−µˆk)2 for all observed j and k. And µˆi, µˆj andµˆk are fitted values of Y fori,j and k.

3. Stratified hot-deck

First, a population with size N is stratified into H strata. Then, hot- deck is implemented for each non-respondent, depending on which strata the non-respondents belong to. In hth strata, for instance,

(21)

withinhth strata,yhi’s. In principle, stratified hot-deck resembles the nearest neighborhood method, in a way that random draws occurs in the reduced pool of observed valuesyj.

2.2 Imputations in MNAR models

If both MCAR and MAR are not the case, we assume that MNAR holds.

MNAR is often called non-ignorable non-response and it is barely detectable whether it holds in data. However, MNAR occasionally happens in reality.

Those with severe depression are less likely to report their depression level, and if they are male, then the response rate drops more sharply. Individuals with higher income also typically have a lower response rate to the questions for annual earning. Unobserved values are somewhat associated with their own missing rate.

Imputed value, yi, is drawn at random from the estimated conditional distribution,fθ,ˆψˆ(yi|Ri = 0, xi). As far as probability density/mass is avail- able, random numbers can be easily generated through basic functions of statistical packages or Monte Carlo Markov Chain (MCMC), if necessary.

More generally, we may also use yi = Eθ,ˆψˆ(yi|Ri = 0, xi). With multiple imputation, we repeat the above-mentioned random draw many times and possibly obtain the empirical average over the multiple imputation.

If derivation of the conditional distribution matters here, we need to take a closer look at it.

fθ,ψ(yi|Ri = 0, xi) = fθ(yi|xi)Pψ(Ri= 0|yi, xi)

Pθ,ψ(Ri = 0|xi) , (2.2) wherePθ,ψ(Ri = 0|xi) =R

fθ(yi|xi)Pψ(Ri = 0|yi, xi)dyi and this denomina- tor can be ignored, since it can be regarded as a constant. Then, the two terms in the numerator, fθ(yi|xi) and Pψ(Ri|yi, xi), require distributional assumptions respectively. A typical case is thatR is assumed to be logistic-

(22)

distributed and continuous Yi is normally distributed. One can also use a logistic regression for binaryYi.

Furthermore, the parameters of the conditional distribution, θ and ψ, has to be estimated. The θ and ψ denote a parameter vector for Y and R respectively and maximum likelihood estimation (MLE) is widely used in estimation process. The likelihood function ofθ andψ can be written as

L(θ, ψ|yi, Ri, xi) =f(yobs,r|x)

= Y

i∈sr

fθ(yi|xi)Pψ(Ri = 1|xi, yi)Y

i /∈sr

Pψ(Ri = 0|xi)

θˆ and ψ, the MLE estimators that maximize the likelihood function areˆ plugged back into the probability density/mass function of the conditional distribution. Then,

fθ,ˆψˆ(yi|Ri = 0, xi)

Nowyi is ready to be generated through the density/mass function.

To clarify the flow of MNAR modelling, a simple but straightforward example from Bjørnstad (2015) will be studied for extra. Consider thatYi is assumed to be a binomial case,

P(Yi = 1) =θ And, we also assumed that

P(Ri = 1|yi) =





ψ, ifyi= 0 2ψ, ifyi= 1

whereψ≤ 1 2

Then, according to the equation (2.2), the conditional probability func-

(23)

tion of the missing values can be written as

Pθ,ψ(Yi|Ri = 0) = P(Yi = 1)P(Ri = 0|Yi = 1)

P(Yi= 1)P(Ri= 0|Yi = 1) +P(Yi = 0)P(Ri = 0|Yi = 0)

= θ(1−2ψ)

θ(1−2ψ) + (1−θ)(1−ψ) = θ(1−2ψ) 1−ψ(1 +θ)

Now, as explained earlier, θ and ψ have to be estimated through maxi- mum likelihood method.

L(θ, ψ|yj, Ri) = Y

j∈sr

fθ(yj)Pψ(Rj = 1|yj) Y

i∈s−sr

Pψ(Ri= 0)

v(1−θ)nr−v(2ψ)vψnr−v(1−ψ−θψ)n−nr where v = P

i∈srI(Yi = 1), the number of successes, and n and nr is the size ofsandsr. Like any other procedures of deriving closed form MLE, we find derivatives of log likelihood with respect toθandψ, and set them zero.

∂l

∂θ = 0 ⇔ v

θ −nr−v

1−θ −ψ n−nr

1−ψ(1 +θ) = 0

⇔ θˆ= v 2nr−v

∂l

∂ψ = 0 ⇔ nr

ψ −(1 +θ) n−nr

1−ψ(1 +θ) = 0

⇔ ψˆ=





 nr

n(1 + ˆθ), if nr

n(1 + ˆθ) ≤ 12 12, otherwise

We plug the MLE,θˆand ψ, back into the conditional probability,ˆ Pθ,ˆψˆ(Yi|Ri = 0) = θ(1ˆ −2ψ)

1−ψ(1 + ˆˆ θ)

Then, imputed valuesyican be drawn randomly from this conditional prob-

(24)

ability. Otherwise,

yi=Eθ,ˆψˆ(Yi|Ri= 0)

=Pθ,ˆψˆ(Yi|Ri = 0) This form can be directly used as imputed values.

2.3 Bayesian imputation

Bayesian methods draw a parameter vector(θ) from the posterior dis- tribution of θ that is commonly defined in any other Bayesian context. In general, the posterior distribution can be defined as

p(θ|Y) = p(Y|θ)p(θ)

p(Y) (2.3)

Then, we draw imputed values with data distribution using the drawn pa- rameter vector(θ).

p(Y|θ)

Bayesian Bootstrap Imputation - discrete data

Rubin et al. (1981) and Rubin and Schenker(1986) assumed that(d1, ..., dK), the possible values of Y, have probabilities (θ1, ..., θK) respectively. Here, this method draws a parameter vector(θ) with posterior densityQK

k=1θkqk−1, where qk = P

i∈srI(Yi=dk), when the prior of the parameter(θ) is set to be proportional to QK

k=1θk−1. The first part of numerator of (2.3) is the probability mass function of multinomial distirbution,

p(Y|θ)∝ Y

i∈sr

P r(Yi =yi)

=

K

Y

k=1

θqkk

(2.4)

(25)

We plug the density of prior distribution into the second part of (2.3) to obtain the posterior distribution mentioned above.

p(θ)∝

K

Y

k=1

θk−1 (2.5)

Having θ drawn with the posterior distribution,QK

k=1θqkk−1, imputed val- ues are then generated withp(Y|θ).

Fully Normal Imputation - continuous data

Rubin and Schenker(1986) also discussed imputation for a continuous case that heavily relies on normality. Assuming that Yi ∼iid N(µ, σ2) and f(θ)∝ 1

σ2, whereθ = (µ, σ2), then the posterior distribution can be readily derived. We simply plug the probability density ofY|µ, σ2 andθ into (2.3).

We replaceµ with the empirical sample meany¯=P

i∈sryi. f(σ2|Y,y) =¯ f(Y|σ2,y)f¯ (σ2|¯y)

f(Y|¯y)

∝(σ2)−nr/2exp n−1

2 X

i∈sr

(yi−y)¯ 2 o

2)−1

= (σ2)−(nr/2+1)expn−1 2σ2

X

i∈sr

(yi−y)¯ 2o

This posterior distribution ofσ2 becomes Inverse-Gamma withα= nr

2 and β= 12P

i∈sr(yi−y)¯2, which can be rewritten asσ2|Y ∼ (nr−1)s21

χ2nr−1 , where s2 = 1

nr−1 P

i∈sr(yi−y)¯2. The one value ofσ2 is drawn with the posterior density.

By using σ2, we find another posterior distribution of µ this time. If Y¯ ∼N(µ,σ2

nr) is already known, the posterior density ofµ can be obtained

(26)

in the same way right above.

f(µ|Y¯) = f( ¯Y|µ)f(µ) f( ¯Y)

∝expn −1

2/nr(µ−Y¯)2o This also infers that the posterior density of µ is N( ¯Y ,σ2

nr). Likewise, we draw one value forµas before. Those two selected parameters construct the conditional distribution, f(Y|µ, σ2), which imputed values are generated with.

Imputation Adjusted for Uncertainty in the Mean and Variance - continuous data

When the normality is uncertain in data, the idea of hot-deck can be employed here, taking the sample distribution of observed values into non- respondents. The parameters,µandσ2can be drawn, as described in "fully normal imputation". Rubin and Schenker(1986) randomly drawsX ={yi: i = 1, ..., n−nr} from donors {Yj: j = 1, ..., nr} with replacement. In design-based approach

E(yi) = ¯Y = 1 nr

X

i∈sr

Yi

V ar(yi) = (nr−1)s2r nr = 1

nr X

i∈sr

(Yi−Y¯)2

If a new variable Zi is introduced such that Zi = yi−Y¯ p(nr−1)s2r/nr

, then it has zero expected value and one variance. Then, we calculate back imputed values that satisfyYiZi.

(27)

2.4 Missing mechanism and imputation

The type of missing mechanism is worthy to know prior to implement- ing imputation, if possible. At least, we can decide between an imputa- tion method for "ignorable non-response"(MCAR and MAR) and "non- ignorable non-response"(MNAR). However, once the decision is made, miss- ing mechanism no longer influences the calculation of imputed values un- less non-ignorable non-response is the case. For imputations to ignorable non-response, the probability of non-response, Pψ(Ri = 0|yi, xi) does not contribute to finding imputed values.

For non-ignorable non-response, Pψ(Ri = 0|yi, xi), however, has a por- tion in deriving the conditional distribution, where random number genera- tion is based on:

fθ,ψ(yi|Ri = 0, xi) = fθ(yi|xi)Pψ(Ri= 0|yi, xi) Pθ,ψ(Ri = 0|xi) , wherePθ,ψ(Ri= 0|xi) =

Z

fθ(yi|xi)Pψ(Ri = 0|yi, xi)dyi

Simply, the denominator term is a normalizing constant. The second term of the numerator can be determined by missing mechanism. To finalize the conditional distribution, the distributional assumption ofRi as well as dependency among Ri,yi andxi has to be determined.

2.5 Advantage and disadvantage of methods

Residual regression imputation is encouraging in a way that imputed values can come closer to real data points. Here, each imputed value includes the predictive mean E(Yi) as well as the corresponding residual ei. Given that the regression is adequately fitted, the imputed valuesyi= ˆE(Yi) +ei will resemble the true values of non-respondents. However, the assumptions of regression are not easily met in practice. Particularly, in case that the true

(28)

distribution ofYi is skewed, this method cannot precisely reflect skewness in the imputed values. Additionally, hot-deck over the entire observed residuals might be inefficient.

Nearest neighbor and stratified hot-deck imputation cover these draw- backs somehow. Their main idea is to pursue hot-deck within the pool of similar profiles to each non-respondent. First, hot-deck can roughly project the sample distribution of yj to imputed values yi. Secondly, they rea- sonably modified the pool of donors through predictive mean matching or stratification. However, as expected, the precision of predictive mean or stratification still remains questionable.

Imputation in MNAR directly estimates the assumed true distribution of non-respondent Yi,fθ,ψ(yi|Ri = 0, xi). Since non-respondent is of interest, imputed values have to be drawn with the distribution of non-response, not those of observed(donors). Of course, estimated parameters,θˆandψˆare still based on observed values. Distributional assumptions, such as the specific form offθ(yi|xi)andPψ(Ri|yi, xi)are necessary. Examining whether MNAR actually underlies in data appears very challenging, almost impossible.

A common drawback across all introduced methods is that there is no consideration for uncertainty of imputed values. As long as imputed values were obtained through the random draw process, we should consider their uncertainty (variation). Because, the random draw results in a very differ- ent set of imputed values, even if we run the same imputation method. It is impossible to measure this variation unless we generate the same impu- tation many times. Hence, the absence of this variation term leads to the underestimated variance of our estimates in further analysis with complete data.

(29)

2.6 Multiple imputation

Multiple imputation (MI), developed by Rubin (1987), properly handles variance underestimation of single imputation (SI). As discussed earlier, the underestimated variance of SI originally comes from ignorance of imputation uncertainty. The problem is that the uncertainty (variation) between SI’s is not measurable, because SI basically runs a single time. Rubin (1987) suggested to run SImtimes, and this suggestion made it possible to measure the variation between imputations. This is called multiple imputation (MI).

Assume that

y={yi:i∈sr, yj :j ∈s−sr}, (2.6) wheresandsr denote a full sample and observed sample respectively. After conducting MI, we derive mdifferent completed samples,

y(k) ={yi :i∈sr, yj(k) :j∈s−sr}, k= 1, ..., m (2.7) Suppose that our aim is to estimate a population quantityθwith a given θ. Here, we findˆ θˆwith the multiply-imputed data instead of the original one with non-respondents. Apparently,mdifferent estimatesθˆ1, ..., θˆmcan be computed according to each of the m sets of imputed data, y(k) from (2.7). Thereafter, the grand mean and variance estimate of the grand mean can be defined as following.

θ¯ = 1 m

m

X

i=1

θˆi (2.8)

W = ¯V+ (1 + 1

m)B, (2.9)

whereW = ˆV(¯θ),V¯ = m1 Pm

i=1i and B = m−11 Pm

i=1(ˆθi −θ¯)2. Rubin (1987) also stated that a student-t approximation can be used for the dis-

(30)

tribution ofθˆ,

W1/2(¯θ−θ)∼tv, (2.10) where v= (m−1)

h V¯ (1 + 1/m)B

i2

For the full validity of the (2.8), (2.9) and (2.10), Rubin (1987) required a "proper" imputation method to be processed. According to Rubin (1996), several conditions for being a "proper" imputation were briefly mentioned.

First, the MI estimators withm→ ∞, θ¯ and V¯ should be approximately unbiased for θ(y)ˆ and Vˆ(y), where they are estimators based on the full sampley in (2.6):

E(¯θ|Yobs,y)≈θ(y)ˆ and

E( ¯V|Yobs,y)≈Vˆ(y)

Likewise,B has to be approximately unbiased for the true variance of the MI estimatorθ¯, when m→ ∞:

E(B|Yobs,y)≈V ar(¯θ|Yobs)

Theθ¯ should, moreover, follow approximate normality with expectationθˆ and variance B:

θ¯|Yobs,y∼N(ˆθ(y), B).

Yet, Bjørnstad (2007) pointed out that many imputation methods used by national statistics bureaus barely meet these requirements. If it is the case, the latter term in (2.9) that mainly accounts for the between-variance of m different θˆ is believed to be under-measured. Bjørnstad (2007) com- pensated the under-measured term by finding the right k in the following fashion.

1

(31)

wherek= 1/(1−f)andf is non-response rate. However, Bjørnstad (2007)’s argument was built upon specific assumptions such as missing mechanism and the type of imputation method.

We still need to see whether thek= 1/(1−f)uniformly holds in diverse environments that we may come across in practice. Simulation studies can contribute to examining the validity ofk. Since the true values ofW,V¯and B in (2.11) are all known in a setting of simulation, k can be accordingly calculated back.

k= W −V¯ B − 1

m

One of the advantages being expected from simulation study is that we can observek values under various conditions. For example, we generate many cases by changing sample size (=n), non-response rate (= f), imputation method, the number of iteration in multiple imputation (=m) and so forth.

This operation will be actualized in the upcoming chapters of simulation study.

2.7 R-packages for multiple imputation

2.7.1 MICE

The package called "multivariate imputation by chained equations (mice)"

is widely used in R. In this "mice" package, the "mice" function performs multiple imputation (MI). The package and functions can be easily down- loaded and active in R, by using the following R commands.

> install.packages("mice")

> require(mice)

Including the "mice" function, a series of R commands is needed to finalize statistical analysis in presence of non-respondents. Depending on the

(32)

type of parameters of interest, we can take either of two different tracks. The first track aims to estimate a population total or mean, as census bureaus have done traditionally. The other track is used to find coefficient estimates from the modelling process. The figure below depicts how they flow.

Figure 2.1: Two track process

1. Track 1: Estimation of population total/mean.

2. Track 2: Estimation of coefficients from a model.

The specific details of the "mice" function will be given first. Then, follow-up procedures will be discussed for each of those two tracks. For clear understanding, some exemplary R codes will be enclosed as well.

1. mice()

This function pursues MI and returns complete data. The options in Table 2.1 mainly specify the number of imputations as well as the type of methodology to be used. The following table details the options, based on Buuren and Groothuis-Oudshoorn (2011).

(33)

Option Description

data A dataframe or matrix that contains imputation

target as well as auxiliary variables.

m The number of imputations to be pursed.

method "pmm" Predictive mean matching.(numeric)

"sample" Hot-deck.(any)

"norm.nob" Non-Bayesian linear regression.(numeric)

"norm" Bayesian linear regression.(numeric)

"mean" Grand mean imputation.(numeric)

"logreg" Logistic regression.(factor, 2 levels)

"polyreg" Multinomial regression.(factor, >2 levels)

"lda" Linear discriminant analysis.(factor)

predictorMatrix A p×p square matrix for defining which variables will be used as preditors in modelling, where p is the number of columns.

e.g. When variablesX1, ..., X3 are present in data, a 3×3 matrix can be derived, such as

0 1 0 1 0 1 1 1 0

Each row corresponds to a target variable, while each column does variables used in a model. And the element 1 means the column variable is used as a predictor.

With the given matrix above,X1∼X2,X2∼X1+X3

and X3 ∼X1+X2 hold.

Table 2.1: Options of the mice function

2. Exemplary study, Track 1: Estimation of population mean

Assuming that our final aim is to estimate population mean, we need to develop several extra coding, as well as the "mice" function. It is indeed worthy to clarify how other functions of the "mice" package can be used in estimation process. An artificial sample data with non-respondent is created first, where Yi and Xi are our target and auxiliary variable.

> n=2000 # The number of observations

(34)

> f=0.4 # Non-response rate

> y.i=runif(n,0,5)

> x.i=runif(n,0,5)

> sample.data=data.frame(y.i,x.i)

> missing=sample(1:n,f*n,replace=F) #Select where to miss out.

> sample.data[missing,1]=NA

Then, imputations are processed using the "mice" function. We specify the number of imputations,m= 3, and use predictive mean matching.

> require(mice)

> imputed.sample.data=mice(data=sample.data, m=3, method="pmm") It is good to check if the imputations are well processed by looking at

the small part of sample data and completed data.

> head(sample.data)

> head(data.frame(

m1=complete(imputed.sample.data,1)[,1]

,m2=complete(imputed.sample.data,2)[,1]

,m3=complete(imputed.sample.data,3)[,1]))

Figure 2.2: Checking imputed values

(35)

To obtain Rubin (1987)’s equation (2.8) and (2.9) in the Multiple Imputation section, "for" iteration and "pool.scalar" function can be used.

> N = 100000 ### The number of population observation

> m = imputed.sample.data$m

> theta_star_hat=c()

> V_star_hat=c()

> for (i in 1:m) {

theta_star_hat[i] = mean(

complete(imputed.sample.data, i)$y.i) V_star_hat[i] = (1-n/N)*

(var(complete(imputed.sample.data, i)$y.i) / n)

### (variance of estimate) }

> pool.scalar(Q=theta_star_hat, U=V_star_hat ,method = "rubin")$qbar

[1] 2.52628

> pool.scalar(Q=theta_star_hat, U=V_star_hat ,method = "rubin")$t

[1] 0.0015755

Note that the value of "qbar" and "t" correspond toθ¯ and W in the equation (2.8) and (2.9).

3. Exemplary study, Track 2: Estimation of coefficient in modelling In practice, modelling also becomes our aim of analysis. Here, we assume that the main interest is to estimate coefficient of an ordinary regression under MI. For simplicity, the regression is to projectYi on Xi. The code for creating artificial sample data and implementing

(36)

imputations is identical to that of Track 1. But, "with" and "pool"

function are used in estimating.

> fit = with(data=imputed.sample.data,exp=glm(y.i~x.i))

> pool(fit)$qbar

(Intercept) x.i

2.50004980 0.01040729

> pool(fit)$t

(Intercept) x.i

(Intercept) 0.02439523 -0.010614243 x.i -0.01061424 0.004830097 2.7.2 Amelia

According to Honaker, King, Blackwell, et al. (2011), "Amelia" was de- veloped under the assumptions of multivariate normality and missing at random (MAR). As an example, let us assume that we have data matrix as follows,

Y =

 y1 y2 y3 y4 y5 y6

=

y11 y12 y13 y14 y15

y21 y22 y23 y24 y25 y31 y32 y33 y34 y35

y41 y42 y43 y44 y45 y51 y52 y53 y54 y55

y61 y62 y63 y64 y65

M issing

=⇒

y11 y12 y13 − y15

y21 − − y24 y25 y31 y32 y33 y34

− y42 y43 y44 y45 y51 − − − y55

− y62 y63 y64

The ith row vector, Yi = yi ∼ N5(µ,Σ), i = 1,2, ...,6. Also, we can split Yi into two components such as Yi = (Yi,obs,Yi,mis). For instance, Y2 = (Y2,obs = (y21, y24, y25),Y2,mis = (y22, y23)). The main idea of this

(37)

package is to draw Yi,mis with a conditional probability, P(Yi,mis|Yi,obs; ˆµ,Σ),ˆ

Derivation of the maximum likelihood estimates µ,ˆ Σˆ can be achieved through estimation maximization(EM) algorithm. It is widely known that EM algorithm performs well in estimating parameters with the presence of unobserved data. For the E-step, the conditional expectation ofYmis given Yobs,E(Ymis|Yobs; ˆµold,Σˆold), replaces the missing spots. The initial values of µˆold and Σˆold are arbitrarily chosen to find the conditional expectation.

The M-step is to find the updated parameter estimates, µˆnew and Σˆnew, which maximizes the likelihood of Yobs and E(Ymis|Yobs; ˆµold,Σˆold). Then, we go back to the E-step with the newly obtained parameter estimates and find another updated estimates and repeat this iteration over again until the parameter estimates converge.

More importantly, Amelia generatesmbootstrapped sub-samples of size nbefore implementing EM algorithm and imputation procedure mentioned above. That is why the entire algorithm built in Amelia is called expecta- tion maximization with bootstrapping (EMB). Themdifferent sub-samples results in mdifferent imputed values for one missing unit. We can combine them sets of imputed data as we did in multiple imputation. Honaker and King (2010) provided further theoretical details behind Amelia package.

> ## Import "Amelia"

> install.packages("Amelia")

> require(Amelia)

> ## Import MASS to generate sample data

> install.packages("MASS")

> require(MASS)

(38)

> sigma=matrix(3,4,4)

> diag(sigma)=c(10,8,8,5)

> raw=mvrnorm(200, rep(0,4),sigma)

> ## Make NA’s

> for(i in 1:dim(raw)[2]){

+ miss_out=sample(1:200,40) + raw[miss_out,i]=NA

+ }

> ## Run Amelia

> output_raw=amelia(raw, m=5)

> ## Call imputed datasets

> output_raw$imputations$imp1

> output_raw$imputations$imp2

> output_raw$imputations$imp3

> output_raw$imputations$imp4

> output_raw$imputations$imp5 2.7.3 Hmisc

The functions, impute()and aregimpute(), from "Hmisc" package developed by Alzola and Harrell (2006) can be used for multiple imputation.

Theimpute()simply imputes mean, median, or max into missing, or im- plements hot-deck. Let us use the old sample data with non-respondent from the MICE section.

> ## Generate sample data

> n=2000

> f=0.4 # Non-response rate

(39)

> x.i=runif(n,0,5)

> sample.data=data.frame(y.i,x.i)

> ## Make NA’s

> missing=sample(1:n,f*n,replace=F)

> sample.data[missing,1]=NA

> ## Process imputation

> imputed.sample.data=impute(sample.data$y.i,fun = mean)

> imputed.sample.data=impute(sample.data$y.i,fun = median)

> imputed.sample.data=impute(sample.data$y.i,fun = "random")

> ## Result

> head(sample.data$y.i)

[1] NA NA 2.506376 NA 4.378280 3.932805

> head(imputed.sample.data)

1 2 3 4 5 6

2.553923* 2.553923* 2.506376 2.553923* 4.378280 3.932805 The checkpoint here is that an user should specify a single target vari- able to be imputed and a method to be used in the impute() function.

Then, it returns entire data including imputed values with an asterisk mark.

If multiple imputation is the case, one should build his/her own iteration coding manually.

The function aregimpute()based on bootstrapping, additive regres- sion and predictive mean matching is also widely used in practice.

1. Bootstrapping : Among the non-missing part of a target variable being imputed, m different sub-samples are randomly drawn with replace- ment from the entire data.

2. Additive regression: In each sub-sample, we find one optimal additive

(40)

model with the largestR2 than any other models. Then, each opitmal model calculates the predicted values of the target variable.

3. Preditive mean matching: Each missing spot is now filled with an observation whose predicted value is closest to the predicted value of the each missing spot.

> ## Get sample data with missing ready

> esoph_NA=esoph

> esoph_NA=data.frame(esoph_NA)

> for(i in 1:ncol(esoph)){

+ missing_spot=sample(1:nrow(esoph),round(nrow(esoph)*0.2)) + esoph_NA[missing_spot,i]=NA

+ }

> ## Run multiple imputation through aregImpute

> set.seed(500)

> (result=aregImpute(~agegp + alcgp + tobgp + ncases + ncontrols

+ ,n.impute=5

+ ,type = "pmm"

+ ,data=esoph_NA

+ ))

If we call the object "result", R displays summary information about multiple imputation processed by aregimpute()(Listing 2.1). The infor- mation includes, for instance, the list of target variables being imputed, the number of impuations(m), the number of knots(nk) used in additive models, theR2 of each model and so on.

result$imputed

(41)

The upper command returns a data matrix of size nmis ×m for each target variable being imputed. Also, we can extract the imputed values of a certain target variable by using the second command. Listing 2.2 shows us how the resulting matrix looks like. Since this package returns imputed values only, as mentioned earlier, some additional coding for replacing NA’s with imputed values will be necessary.

Listing 2.1: R output

M u l t i p l e I m p u t a t i o n u s i n g B o o t s t r a p and PMM

a r e g I m p u t e ( f o r m u l a = ~agegp + a l c g p + tobgp + n c a s e s

+ n c o n t r o l s , d a t a = esoph_NA , n . impute = 5 , t y p e = "pmm" )

n : 88 p : 5 I m p u t a t i o n s : 5 nk : 3

Number o f NAs :

agegp a l c g p tobgp n c a s e s n c o n t r o l s

18 18 18 18 18

t y p e d . f .

agegp c 5

a l c g p c 3

tobgp c 3

n c a s e s s 2

n c o n t r o l s s 1

T r a n s f o r m a t i o n o f T a r g e t V a r i a b l e s Forced t o be L i n e a r

R−s q u a r e s f o r P r e d i c t i n g Non−M i s s i n g V a l u e s f o r Each V a r i a b l e Using L a s t I m p u t a t i o n s o f P r e d i c t o r s

(42)

agegp a l c g p tobgp n c a s e s n c o n t r o l s 0 . 7 6 2 0 . 7 3 4 0 . 7 6 9 0 . 6 5 8 0 . 7 5 8

Listing 2.2: R output

$agegp

[ , 1 ] [ , 2 ] [ , 3 ] [ , 4 ] [ , 5 ]

19 2 4 6 3 2

25 6 4 1 6 1

28 5 2 1 2 1

30 3 2 4 4 5

32 3 3 6 2 6

37 3 6 1 1 2

41 1 2 6 6 2

$ a l c g p

[ , 1 ] [ , 2 ] [ , 3 ] [ , 4 ] [ , 5 ]

5 2 1 2 1 1

7 3 2 2 4 3

8 1 1 2 1 1

9 1 4 2 2 2

19 2 1 2 1 1

20 1 1 1 1 1

33 3 4 4 1 1

2.7.4 VIM

The "VIM" package also includes several well-developed functions for handling imputation procedure. Not merely hotdeck() and kNN(), which implement hot-deck and k-nearest neighbor imputation respectively, but the

(43)

Kowarik and Templ (2016) detailed the actual applications of the functions mentioned above. According to Templ, Kowarik, and Filzmoser (2011), the irmi() heavily relies on modelling methods that fill in missing cells with predicted values plus noise. Interestingly, it is capable of handling outliers through robust methods as well as semi-continuous variables on top of other ordinary data type.

Templ et al. (2011) also detailed several steps programmed in irmi():

Step 1. The irmi() function simply pursues either mean-imputation or k-nearest neighbor imputation for all the missing cells at the very beginning.

Step 2-1. One variable, called response, is regressed on the rest of variables, called predictors, from the entire data set. But, the variable with the most missing cells should be modelled first. Then, the one with the next most missing cells comes and so on.

Step 2-2. Model development is completely dependent upon obser- vations whose response variable being regressed is non-missing. Cer- tainly, we may have observations whose response is non-missing but predictors are missing. Then, we use the initially imputed values in the step 1 instead in modelling.

Step 2-3. The irmi() automatically detects the data type of each vari- able being imputed and selects a correct modelling method accordingly.

Please refer to the following table for further details:

Step 3. The derived model updates the mean or kNN imputed cells with the sum of predicted values and noise. Then, it repeats the Step2- 1 to Step3 until the imputed values converge.

Since the actual usage of this function is quite straight forward, the brief introduction of available options in irmi() will be given here.

(44)

The type of data Modelling method

Continuous Robust regression with the identity link Categorical Generalized linear model(robust is optional)

Binary Logistic regression(robust is optional)

Semi-continuous 1. Logistic regression for imputing a constant(mostly zero) 2. For continuous, robust regression

Count Generalized linear model with Poisson and the log link Table 2.2: Modelling method used for each data type.

result=irmi(mydata,mixed=c("x1","x2")

,modelFormulas=list(x1=c("x2","x3"), x2=c("x1"),

x3=c("x1","x2")) ,trace=TRUE)

)

Even though irmi() find the right model for each variable type automati- cally, the "mix" option can still force irmi() to recognize the stated variables as semi-continuous. "ModelFormulas" is an option for specifying which re- gressors to be used in modelling, and "trace" shows us detailed information for each imputation.

Referanser

RELATERTE DOKUMENTER

By elaborating on the essential MSaaS infrastructure capabilities; that is, simulation data management capabil- ities, simulation composition capabilities and simulation

The name indicates that the source is in position 304, the sensor in position 306, and that the measured time series of the pressure is from the detonation with file number

Extending Carlsson et al’s 16 research, the aims of this paper were to simulate cross-country skiing on varying terrain by using a power balance model, compare a skier’s

(i) estimating a population average from simple random samples using hot-deck imputation, (ii) estimating the regression coefficient in the ratio model using residual

Section 3 has two applications with random nonresponse, (i) estimating a population average from simple random samples using hot-deck imputation and (ii) estimating a regression

Thermal desorption spectra from titanium dihydrides: (a) TDS of insitu rehydro- genated samples with different heating rate, except for sample 10 K/min-B which was

Odds ratios and 95% confidence intervals of logistic regression models predicting autism spectrum disorder from quartile categories of each PFAS in a nested case–control study

Odds ratios and 95% confidence intervals of logistic regression models predicting autism spectrum disorder from quartile categories of each PFAS in a nested case – control study