Non-Bayesian Multiple Imputation


Jan F. Bjørnstad¹

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data. Rubin’s combination formula requires that the imputation method is “proper,” which essentially means that the imputations are random draws from a posterior distribution in a Bayesian framework. In national statistical institutes (NSI’s) like Statistics Norway, the methods used for imputing for nonresponse are typically non-Bayesian, e.g., some kind of stratified hot-deck. Hence, Rubin’s method of multiple imputation is not valid and cannot be applied in NSI’s. This article deals with the problem of deriving an alternative combination formula that can be applied for imputation methods typically used in NSI’s and suggests an approach for studying this problem. Alternative combination formulas are derived for certain response mechanisms and hot-deck type imputation methods.

Key words: Variance estimation; survey sampling; stratified sampling; logistic regression; nonresponse; hot-deck imputation.

1. Introduction

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data, developed by Rubin (1987). Two more recent references with further discussions and studies are Rubin (1996) and Schafer (1997). The basic idea is to create $m$ imputed values for each missing value and combine the $m$ completed data sets by Rubin’s combination formula for variance estimation. For the estimator to be valid, the imputations must display an appropriate level of variability. In Rubin’s terminology, the imputation method is required to be “proper.” In national statistical institutes (NSI’s) the methods used for imputing for nonresponse very seldom, if ever, satisfy the requirement of being “proper.” However, the idea of creating multiple imputations to measure the imputation uncertainty and use it for variance estimation and for computing confidence intervals is still of interest. The problem is then that Rubin’s combination formula is no longer valid with the usual nonproper imputations used by NSI’s. The reason is that the variability in nonproper imputations is too small, so the between-imputation component must be given a larger weight in the variance estimate. The task is to determine what this weight should be to give valid statistical inference, and also for what kind of nonresponse mechanisms and estimation problems it is possible to determine a simple combination formula not dependent on unknown parameters. This article suggests an approach for studying this problem.

© Statistics Sweden

¹ Statistics Norway, Division for Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo, Norway. Email: jab@ssb.no

Acknowledgment: The problem of deriving a non-Bayesian multiple imputation method was studied in a Master’s thesis in 1999 by Tonje Braaten with the author as her adviser. The present research began within the DACSEIS research project, and started originally when the author was contacted by Tonje Braaten regarding this issue in her doctoral studies in epidemiology.

In Section 2 an approach for determining the combination of the imputed completed data sets is suggested. Section 3 has three applications with random nonresponse: (i) estimating a population average from simple random samples using hot-deck imputation, (ii) estimating the regression coefficient in the ratio model using residual regression imputation and (iii) estimating the regression coefficient in simple linear regression with residual regression imputation. Section 4 deals with the general problem of multiple imputation for stratified samples. In Section 5 we apply the theory in Section 4 to stratified samples with random nonresponse within strata, covering (i) estimation of population average using stratified hot-deck imputation and (ii) estimation of log(odds ratios) in logistic regression with missingness both for the dependent variable and the explanatory variable. Section 6 takes up the problem of using the same combination rule for all estimation problems with a given imputation method and data and response model. A general result for hot-deck imputation and linear estimates is presented.

2. An Approach for Determining an Alternative Combination Formula for Variance Estimation in Multiple Imputation

Let $s = (1, \ldots, n)$ denote the full sample, with $y = (y_1, \ldots, y_n)$ denoting the full sample data, values of random variables $Y_1, \ldots, Y_n$. In the case of sampling from a finite population under a design model, a renumbering of the selected units has been performed, of course, and the stochastic nature of $y$ is determined by the sampling plan. The objective is to estimate some parameter $\theta$. The observed data are denoted by $y_{obs} = \{(y_i : i \in s_r), s_r\}$, consisting of the observed part of $y$ and the response sample $s_r$ of size $n_r$.

Let $\hat\theta$ be the estimator based on the full sample data $y$, with $\operatorname{Var}(\hat\theta)$ estimated by $\hat V(y)$. For $i \in s - s_r$ we impute by some method $y_i^*$ and let $y^*$ denote the complete data $(y_i : i \in s_r;\ y_i^* : i \in s - s_r)$. Based on $y^*$, we have $\hat\theta^* = \hat\theta(y^*)$ and $\hat V^* = \hat V(y^*)$.

Multiple imputation of $m$ repeated imputations leads to $m$ completed data sets with $m$ estimates $\hat\theta_i^*$, $i = 1, \ldots, m$, and related variance estimates $\hat V_i^*$, $i = 1, \ldots, m$.

The combined estimate is given by $\bar\theta^* = \sum_{i=1}^m \hat\theta_i^*/m$. The within-imputation variance is defined as $\bar V^* = \sum_{i=1}^m \hat V_i^*/m$ and the between-imputation component is $B^* = \sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2/(m - 1)$. The total estimated variance of $\bar\theta^*$ is then proposed to be

$$W = \bar V^* + \left(k + \frac{1}{m}\right) B^* \tag{1}$$

That is, we need to determine $k$ such that

$$E(W) = \operatorname{Var}(\bar\theta^*) \tag{2}$$

Rubin (1987) has shown that $k = 1$ can be used with proper imputations, which essentially means drawing imputed values from a posterior distribution in a Bayesian framework.


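In general, one has to determine the terms in (2). One way to do this is to use double expectation, conditioning on $y_{obs}$, that is, $E(W) = E\{E(W \mid Y_{obs})\}$ and $\operatorname{Var}(\bar\theta^*) = E\{\operatorname{Var}(\bar\theta^* \mid Y_{obs})\} + \operatorname{Var}\{E(\bar\theta^* \mid Y_{obs})\}$. Typically,

$$E(\bar V^*) \approx \operatorname{Var}(\hat\theta) \tag{3}$$

and $E(B^* \mid y_{obs}) = \operatorname{Var}(\hat\theta^* \mid y_{obs})$. Hence, approximately,

$$E(W) = \operatorname{Var}(\hat\theta) + \left(E(k) + \frac{1}{m}\right) E\operatorname{Var}(\hat\theta^* \mid Y_{obs}) \tag{4}$$

Moreover, $\operatorname{Var}(\bar\theta^* \mid y_{obs}) = \operatorname{Var}(\hat\theta^* \mid y_{obs})/m$ and $E(\bar\theta^* \mid y_{obs}) = E(\hat\theta^* \mid y_{obs})$. This implies that $\operatorname{Var}(\bar\theta^*) = m^{-1} E\{\operatorname{Var}(\hat\theta^* \mid Y_{obs})\} + \operatorname{Var}\{E(\hat\theta^* \mid Y_{obs})\}$. From (3) and (4), Equation (2) becomes $\operatorname{Var}(\hat\theta) + E(k)\, E\operatorname{Var}(\hat\theta^* \mid Y_{obs}) = \operatorname{Var}\{E(\hat\theta^* \mid Y_{obs})\}$, which gives the following general expression:

$$E(k) = \frac{\operatorname{Var} E(\hat\theta^* \mid Y_{obs}) - \operatorname{Var}(\hat\theta)}{E\operatorname{Var}(\hat\theta^* \mid Y_{obs})} \tag{5}$$

For this to be of interest, $k$ must be, at least approximately, determined independently of unknown parameters. In addition, one needs to check that (3) holds. To illustrate how (5) can be used we shall in the next section consider three special cases with random nonresponse.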
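The combination rule in Formula (1) can be sketched in a few lines of Python. This is only an illustration: the function name `combine` and the numbers in the example are hypothetical, and the factor `k` must be supplied by the user (for instance `k = 1` for proper imputations, as in Rubin's rule).

```python
from statistics import mean, variance


def combine(estimates, variances, k):
    """Combine m completed-data results with W = V_bar + (k + 1/m) * B.

    estimates: the m imputation-based estimates of the parameter
    variances: the m imputation-based variance estimates
    k:         between-imputation factor (k = 1 gives Rubin's rule)
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variance estimates")
    theta_bar = mean(estimates)       # combined estimate
    v_bar = mean(variances)           # within-imputation variance
    b = variance(estimates)           # between-imputation component, divisor m - 1
    w = v_bar + (k + 1.0 / m) * b     # Formula (1)
    return theta_bar, w


# Hypothetical example: three imputations, k = 1.
# Here B = 1, so W = 4 + (1 + 1/3) * 1.
theta_bar, w = combine([10.0, 12.0, 11.0], [4.0, 4.0, 4.0], k=1.0)
```

Keeping `k` as an explicit argument makes it easy to swap in the factors derived in the later sections without changing the combination step.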

3. Three Applications for Random Nonresponse

3.1. Estimating Population Average with Hot-deck Imputation

Consider a simple random sample from a finite population of size $N$, where the aim is to estimate the population average $\mu$ of some variable $y$. We shall assume completely random nonresponse. In the terminology of Rubin (1987) and Little and Rubin (2002), the missingness mechanism is said to be MCAR (missing completely at random). We note that MCAR means that the response indicators $R_1, \ldots, R_N$ are independent with the same response probability $p_r = P(R_i = 1)$. The imputation method is the hot-deck method, where $y_i^*$ is drawn at random from $y_{obs}$ with replacement, and the estimate is the sample mean. Let $\bar y_r$ be the observed sample mean and $\hat\sigma_r^2 = \frac{1}{n_r - 1}\sum_{i \in s_r}(y_i - \bar y_r)^2$ the observed sample variance. Then $\bar Y^*$ is the imputation-based sample mean for the completed sample, and the combined estimator is given by $\bar{\bar Y}^* = \sum_{i=1}^m \bar Y_i^*/m$. Let $\bar Y_s$ denote the sample mean based on a full sample. Then $\operatorname{Var}(\bar Y_s) = \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)$, with $\sigma^2 = (N - 1)^{-1}\sum_{i=1}^N (y_i - \mu)^2$ being the population variance. We have further that $E(\bar Y^* \mid y_{obs}) = \bar y_r$ and $\operatorname{Var}(\bar Y^* \mid y_{obs}) = \{(n - n_r)/n^2\}\{(n_r - 1)/n_r\}\hat\sigma_r^2$, using that $E(Y_i^* \mid y_{obs}) = \bar y_r$ and $\operatorname{Var}(Y_i^* \mid y_{obs}) = \hat\sigma_r^2 (n_r - 1)/n_r$.

In this case, $\hat V^* = \hat\sigma_*^2\left(\frac{1}{n} - \frac{1}{N}\right)$ where $\hat\sigma_*^2 = \frac{1}{n - 1}\left\{\sum_{s_r}(y_i - \bar y^*)^2 + \sum_{s - s_r}(y_i^* - \bar y^*)^2\right\}$. It can be shown that

$$E(\hat\sigma_*^2 \mid y_{obs}) = \hat\sigma_r^2\left(1 - \frac{1}{n_r}\right)\left(1 + \frac{n_r}{n(n - 1)}\right) \approx \hat\sigma_r^2$$

and (3) holds.


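We find, from (5),

$$E(k) = \frac{\operatorname{Var}(\bar Y_r) - \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)}{E\left\{\frac{n - n_r}{n^2}\,\frac{n_r - 1}{n_r}\,E(\hat\sigma_r^2 \mid n_r)\right\}} = \frac{\sigma^2\left\{E\left(\frac{1}{n_r}\right) - \frac{1}{N}\right\} - \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)}{E\left\{\frac{n - n_r}{n^2}\,\frac{n_r - 1}{n_r}\right\}\sigma^2} \approx \frac{(1 - p_r)/p_r}{1 - p_r} = \frac{1}{p_r}$$

which is satisfied approximately, with $f = (n - n_r)/n$ being the rate of nonresponse, by letting $k = 1/(1 - f)$.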
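As a rough illustration of this section, the following Python sketch runs hot-deck multiple imputation for a sample mean and combines the results with $k = 1/(1 - f)$. The function name `hot_deck_mi` and its arguments are hypothetical; the sketch assumes at least two distinct observed values and at least one missing value.

```python
import random
from statistics import mean, variance


def hot_deck_mi(y_obs, n, N, m, rng):
    """Hot-deck multiple imputation for a sample mean (simple random sample).

    y_obs: list of observed values (the response sample)
    n:     full sample size, N: population size, m: number of imputations
    rng:   a random.Random instance, so that runs are reproducible
    """
    n_r = len(y_obs)
    f = (n - n_r) / n               # rate of nonresponse
    k = 1.0 / (1.0 - f)             # between-imputation factor from this section
    means, var_ests = [], []
    for _ in range(m):
        imputed = rng.choices(y_obs, k=n - n_r)    # draws with replacement
        completed = list(y_obs) + imputed
        means.append(mean(completed))
        # imputation-based variance estimate: sigma_*^2 * (1/n - 1/N)
        var_ests.append(variance(completed) * (1.0 / n - 1.0 / N))
    w = mean(var_ests) + (k + 1.0 / m) * variance(means)   # Formula (1)
    return mean(means), w, k


# Hypothetical data: 6 respondents out of a sample of 10.
estimate, w, k = hot_deck_mi([3.0, 5.0, 4.0, 6.0, 2.0, 7.0], n=10, N=1000, m=5,
                             rng=random.Random(0))
```

With 4 of 10 values missing, $f = 0.4$ and the between-imputation component receives the weight $k + 1/m = 1/0.6 + 1/5$ instead of Rubin's $1 + 1/5$.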

3.2. Estimating the Regression Coefficient in the Ratio Model with Residual Imputation

We shall assume completely random nonresponse as in Section 3.1. We consider a ratio model, i.e., regression through the origin: $Y_i = \beta x_i + \varepsilon_i$, with $\operatorname{Var}(\varepsilon_i) = \sigma^2 x_i$, $i = 1, \ldots, n$. It is assumed that all $x_i$’s are known, also in the nonresponse sample. The full data estimator of $\beta$ is given by $\hat\beta = \sum_{i=1}^n Y_i / \sum_{i=1}^n x_i$. The unbiased estimator of $\sigma^2$ is given by $\hat\sigma^2 = \sum_{i=1}^n \frac{1}{x_i}(y_i - \hat\beta x_i)^2/(n - 1)$.

We shall consider residual regression imputation. Let $\hat\beta_r$ be the $\hat\beta$-estimate based on the observed sample $s_r$. Define the standardized residuals $e_i = (y_i - \hat\beta_r x_i)/\sqrt{x_i}$, for $i \in s_r$. For $i \in s - s_r$: draw the value of $e_i^*$ at random, with replacement, from the set of observed residuals $e_i$, $i \in s_r$. The imputed $y$-value is given by $y_i^* = \hat\beta_r x_i + e_i^*\sqrt{x_i}$. Let $X = \sum_{i=1}^n x_i$, $X_r = \sum_{i \in s_r} x_i$ and $X_{nr} = \sum_{i \in s - s_r} x_i = X - X_r$. All considerations from now on are conditional on $n_r$ and $X_r$, and we aim to determine $k$ directly from (5).

The proportion of the $x$-total in the nonresponse group is denoted as $f_X = X_{nr}/X$.

We now have $\hat\beta^* = \left(\sum_{s_r} y_i + \sum_{s - s_r} y_i^*\right)/X$ and

$$\hat\sigma_*^2 = \frac{1}{n - 1}\left\{\sum_{s_r}\frac{1}{x_i}(y_i - \hat\beta^* x_i)^2 + \sum_{s - s_r}\frac{1}{x_i}(y_i^* - \hat\beta^* x_i)^2\right\}.$$

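In order to determine $k$ from (5) we need to check the validity of (3) and derive $E\operatorname{Var}(\hat\beta^* \mid y_{obs})$, $\operatorname{Var}E(\hat\beta^* \mid y_{obs})$ and $\operatorname{Var}(\hat\beta)$. We note that $\operatorname{Var}(\hat\beta) = \sigma^2/X$. In Appendix A.1 it is shown that condition (3) holds for moderate and large $n_r$, and that

$$\operatorname{Var}E(\hat\beta^* \mid y_{obs}) = \frac{\sigma^2}{X_r} + (1 - d_1)d_2\,\frac{n_{nr}X_{nr}}{X^2}\,\frac{\sigma^2}{n_r} \tag{6}$$

$$E\operatorname{Var}(\hat\beta^* \mid y_{obs}) = \frac{X_{nr}}{X^2}\,\frac{\sigma^2}{n_r}\,(n_r + d_1 - 2) \tag{7}$$

Here, $0 \le d_1, d_2 \le 1$. From (5), using (6) and (7), we find

$$k = \frac{n_r X^2 - n_r X X_r + (1 - d_1)d_2\, n_{nr}X_{nr}X_r}{X_r X_{nr}(n_r + d_1 - 2)} \approx \frac{X}{X_r} + (1 - d_1)d_2\,\frac{n_{nr}}{n_r}$$

We note that if all $x_i = 1$, then $d_1 = d_2 = 1$. Now, with $f_X = X_{nr}/X$ being the proportion of the $x$-total in the nonresponse group and $f = n_{nr}/n$ the rate of nonresponse, we finally get, since typically $(1 - d_1)d_2 \approx 0$,

$$k \approx \frac{1}{1 - f_X} + (1 - d_1)d_2\,\frac{f}{1 - f} \approx \frac{1}{1 - f_X}$$

for usual $x$-values and nonresponse rates.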
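One imputation round for the ratio model, together with the approximate factor $k \approx 1/(1 - f_X)$, might be sketched as follows. The function name `impute_ratio_model` is hypothetical, and the sketch assumes all $x_i > 0$ so that the square roots are defined.

```python
import math
import random


def impute_ratio_model(x_obs, y_obs, x_mis, rng):
    """One round of residual regression imputation in the ratio model.

    x_obs, y_obs: values for the response sample (all x > 0 assumed)
    x_mis:        known x-values for the nonrespondents
    rng:          a random.Random instance
    Returns the imputed y-values, the imputation-based estimate of beta,
    and the approximate factor k = 1 / (1 - f_X).
    """
    beta_r = sum(y_obs) / sum(x_obs)                 # estimate from respondents
    # standardized residuals e_i = (y_i - beta_r * x_i) / sqrt(x_i)
    resid = [(y - beta_r * x) / math.sqrt(x) for x, y in zip(x_obs, y_obs)]
    # imputed value: beta_r * x_i + e_i^* * sqrt(x_i), e_i^* drawn with replacement
    y_star = [beta_r * x + rng.choice(resid) * math.sqrt(x) for x in x_mis]
    x_total = sum(x_obs) + sum(x_mis)
    beta_star = (sum(y_obs) + sum(y_star)) / x_total
    f_x = sum(x_mis) / x_total                       # share of x-total that is missing
    return y_star, beta_star, 1.0 / (1.0 - f_x)


# Hypothetical data with an exact fit (all residuals are zero).
y_star, beta_star, k = impute_ratio_model(
    [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [5.0, 5.0], random.Random(0))
```

Repeating the call $m$ times and feeding the resulting estimates into Formula (1) with this `k` gives the combined variance estimate.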

3.3. Estimating the Regression Coefficient in Simple Linear Regression with Residual Imputation

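As in Sections 3.1 and 3.2 the nonresponse mechanism is assumed to be MCAR with $p_r = P(R_i = 1)$. The simple linear regression model is assumed: $Y_i = \alpha + \beta x_i + \varepsilon_i$, with $\operatorname{Var}(\varepsilon_i) = \sigma^2$, $i = 1, \ldots, n$. All $x_i$’s are assumed to be known. We may assume that $\bar x = \sum_{i=1}^n x_i/n = 0$. Then the full data estimates are given by $\hat\beta = \sum_{i=1}^n x_i y_i/SS_x$, where $SS_x = \sum_{i=1}^n x_i^2$, and $\hat\alpha = \bar y = \sum_{i=1}^n y_i/n$. The unbiased estimator of $\sigma^2$ is given by $\hat\sigma^2 = \frac{1}{n - 2}\sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2$. Let $\hat\alpha_r, \hat\beta_r$ be the estimates based on the response sample, $\hat\alpha_r = \bar y_r - \hat\beta_r \bar x_r$ and $\hat\beta_r = \sum_{i \in s_r}(x_i - \bar x_r)y_i/SS_{x,r}$. Here, $\bar y_r = \sum_{i \in s_r} y_i/n_r$, $\bar x_r = \sum_{i \in s_r} x_i/n_r$ and $SS_{x,r} = \sum_{i \in s_r}(x_i - \bar x_r)^2$.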

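Simple residual imputation is defined as follows: The observed residuals are $e_j = y_j - \hat\alpha_r - \hat\beta_r x_j$, for $j \in s_r$. For $i \in s - s_r$: draw $e_i^*$ at random, with replacement, from $(e_j, j \in s_r)$. The imputed $y$-value is given by $y_i^* = \hat\alpha_r + \hat\beta_r x_i + e_i^*$.

The imputation-based estimates are $\hat\beta^* = \left(\sum_{i \in s_r} x_i y_i + \sum_{i \in s - s_r} x_i y_i^*\right)/SS_x$ and $\hat\alpha^* = \{n_r \bar y_r + (n - n_r)\bar y_{nr}^*\}/n$, where $\bar y_{nr}^* = \sum_{s - s_r} y_i^*/(n - n_r)$, and

$$\hat\sigma_*^2 = \frac{1}{n - 2}\left\{\sum_{s_r}(y_i - \hat\alpha^* - \hat\beta^* x_i)^2 + \sum_{s - s_r}(y_i^* - \hat\alpha^* - \hat\beta^* x_i)^2\right\}.$$

It can be shown (see Appendix A.2 for a summary proof) that $E(\hat\sigma_*^2) = \sigma^2 E\left(\frac{n_r - 2}{n_r}\cdot\frac{n - 2f}{n - 2}\right) \approx \sigma^2$ where, as in Section 3.1, $f = (n - n_r)/n$. Since $\operatorname{Var}(\hat\beta) = \sigma^2/SS_x$, (3) holds. It is readily seen that $E(\hat\beta^* \mid y_{obs}) = \hat\beta_r$ and $\operatorname{Var}(\hat\beta^* \mid y_{obs}) = s_e^2 c_r/SS_x$, where $s_e^2 = \sum_{s_r} e_i^2/n_r$ and $c_r = \sum_{s - s_r} x_i^2/SS_x \in \langle 0, 1\rangle$. It can be shown that $E(s_e^2 \mid s_r) = \frac{n_r - 2}{n_r}\sigma^2$. Moreover, clearly $E(c_r) = 1 - p_r$ and $\operatorname{Var}(\hat\beta_r \mid s_r) = \sigma^2/SS_{x,r}$. It follows, from (5), that

$$E(k) = \frac{E(1/SS_{x,r}) - 1/SS_x}{(1 - p_r)E\{(n_r - 2)/n_r\}/SS_x} \approx \frac{1/E(SS_{x,r}) - 1/SS_x}{(1 - p_r)/SS_x}$$

Using the fact that conditional on $n_r$, $s_r$ is a simple random sample such that the response indicators are correlated with $\operatorname{Cov}(R_i, R_j) = -f(1 - f)/(n - 1)$, we find that $E(SS_{x,r}) = \left(p_r - \frac{1 - p_r}{n - 1}\right)SS_x$. It follows that, approximately, $E(k) = \frac{1}{p_r - 1/n} \approx 1/p_r$ and we can use $k = 1/(1 - f)$.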
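The last approximation can be checked by a short calculation. The sketch below fills in the intermediate algebra; the shorthand $a$ is introduced here only for convenience and is not part of the original notation.

```latex
\begin{align*}
a &:= p_r - \frac{1 - p_r}{n - 1} = \frac{n p_r - 1}{n - 1},
  \qquad E(SS_{x,r}) = a\, SS_x, \\
E(k) &\approx \frac{1/(a\, SS_x) - 1/SS_x}{(1 - p_r)/SS_x}
  = \frac{1 - a}{a\,(1 - p_r)}, \\
1 - a &= \frac{n(1 - p_r)}{n - 1}
  \;\Longrightarrow\;
  E(k) \approx \frac{n}{(n - 1)\,a} = \frac{n}{n p_r - 1} = \frac{1}{p_r - 1/n}.
\end{align*}
```

For moderate and large $n$ the term $1/n$ is negligible next to $p_r$, which gives $E(k) \approx 1/p_r$.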

4. Multiple Imputation for Stratified Samples

4.1. Separate Combinations

One way to combine the $m$ completed data sets is to do it separately for each stratum, i.e., determine a separate $k$ for each stratum. The general setup is then as follows:

The sample $s$ is divided into $H$ sample strata, $s_1, \ldots, s_H$. Let $y_h$ be the planned full data from subsample $s_h$ of size $n_h$. It is assumed that $y_1, \ldots, y_H$ are independent. The observed part of $y_h$ is denoted by $y_{h,obs}$, with $s_{hr}$ being the response sample from $s_h$ of size $n_{hr}$. The estimator based on the full sample data is the sum of independent terms, $\hat\theta = \sum_{h=1}^H \hat\theta_h$, where $\hat\theta_h$ is based on $y_h$. $\operatorname{Var}(\hat\theta) = \sum_{h=1}^H \operatorname{Var}(\hat\theta_h)$ is estimated by $\hat V(\hat\theta) = \sum_{h=1}^H \hat V_h(y_h)$, where $\hat V_h(y_h)$ is the variance estimate of $\hat\theta_h$ based on $y_h$. For $i \in s_h - s_{hr}$ we impute by some method $y_i^*$ based on $y_{h,obs}$ and let $y_h^*$ denote the complete data $(y_{h,obs};\ y_i^*, i \in s_h - s_{hr})$. Based on $y_h^*$, we have $\hat\theta_h^* = \hat\theta_h(y_h^*)$ and $\hat V_h^* = \hat V_h(y_h^*)$. Then the imputation-based estimator is given by $\hat\theta^* = \sum_{h=1}^H \hat\theta_h^*$ and $\hat V^* = \sum_{h=1}^H \hat V_h^*$.

Multiple imputation of $m$ repeated imputations leads to $m$ completed data sets with $m$ estimates for each stratum $h$, $\hat\theta_{h,i}^*$, $i = 1, \ldots, m$, and related variance estimates $\hat V_{h,i}^*$, $i = 1, \ldots, m$. The total estimates and related variances are $\hat\theta_i^* = \sum_{h=1}^H \hat\theta_{h,i}^*$ and $\hat V_i^* = \sum_{h=1}^H \hat V_{h,i}^*$, for $i = 1, \ldots, m$. The combined estimate for stratum $h$ is given by $\bar\theta_h^* = \sum_{i=1}^m \hat\theta_{h,i}^*/m$. The within-imputation variance for stratum $h$ is $\bar V_h^* = \sum_{i=1}^m \hat V_{h,i}^*/m$ and the between-imputation component is given by $B_h^* = \sum_{i=1}^m (\hat\theta_{h,i}^* - \bar\theta_h^*)^2/(m - 1)$. Following the same idea as in Section 2, Formula (1), the total estimated variance of $\bar\theta_h^*$ is then proposed to be $W_h = \bar V_h^* + (k_h + 1/m)B_h^*$. The combined total estimate is given by $\bar\theta^* = \sum_{i=1}^m \hat\theta_i^*/m = \sum_{h=1}^H \bar\theta_h^*$. It follows that the total estimated variance of $\bar\theta^*$ can be expressed as

$$W_{sep} = \sum_{h=1}^H W_h = \bar V^* + \sum_{h=1}^H \left(k_h + \frac{1}{m}\right)B_h^* \tag{8}$$

where $\bar V^* = \sum_{i=1}^m \hat V_i^*/m = \sum_{h=1}^H \bar V_h^*$. Provided (3) holds for each stratum $h$,

$$E(\bar V_h^*) \approx \operatorname{Var}(\hat\theta_h) \tag{9}$$

we have from (5) that $k_h$ must satisfy

$$E(k_h) = \frac{\operatorname{Var}E(\hat\theta_h^* \mid Y_{h,obs}) - \operatorname{Var}(\hat\theta_h)}{E\operatorname{Var}(\hat\theta_h^* \mid Y_{h,obs})} \tag{10}$$

The combination Formula (8) is an alternative to the usual combination Formula (1), especially useful when we get simple expressions for $k_h$ but not for $k$. The next section develops an expression for $k$ in this situation.

4.2. An Overall Combination Formula

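Now let $W$ be given by (1). We shall determine the between-imputation factor $k$. Since $E(W) = E(W_{sep})$ we have

$$E\left\{\sum_{h=1}^H \left(k_h + \frac{1}{m}\right)B_h^*\right\} = E\left\{\left(k + \frac{1}{m}\right)B^*\right\} \tag{11}$$

Here, $B^* = \frac{1}{m - 1}\sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2 = \frac{1}{m - 1}\sum_{i=1}^m \left\{\sum_h (\hat\theta_{h,i}^* - \bar\theta_h^*)\right\}^2$. Note that $E(B^* \mid y_{obs}) = E\left(\sum_{h=1}^H B_h^* \mid y_{obs}\right)$, since $E(B^* \mid y_{obs}) = \operatorname{Var}(\hat\theta^* \mid y_{obs}) = \sum_{h=1}^H \operatorname{Var}(\hat\theta_h^* \mid y_{obs})$ and $E(B_h^* \mid y_{obs}) = \operatorname{Var}(\hat\theta_h^* \mid y_{obs})$.

Hence, the identity (11) becomes $E\left\{\sum_{h=1}^H k_h E(B_h^* \mid Y_{obs})\right\} = E\{k\,E(B^* \mid Y_{obs})\}$. This gives us the solution $k = \sum_{h=1}^H k_h E(B_h^* \mid y_{obs})/E(B^* \mid y_{obs})$ if we want to use the usual combination Formula (1) and hence

$$k = \frac{\sum_{h=1}^H k_h \operatorname{Var}(\hat\theta_h^* \mid y_{obs})}{\operatorname{Var}(\hat\theta^* \mid y_{obs})} = \sum_{h=1}^H k_h\,\frac{\operatorname{Var}(\hat\theta_h^* \mid y_{obs})}{\operatorname{Var}(\hat\theta^* \mid y_{obs})} \tag{12}$$

a weighted average of the $k_h$. We get a simple expression for $k$ only when all $k_h$ are equal, say $k_h = k_0$. Then $k = k_0$.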

5. Four Applications to Stratified Samples and Random Nonresponse within Strata

5.1. Estimating Population Average from Stratified Sample with Stratified Hot-deck Imputation

Consider stratified simple random samples from a finite population of size $N$, with $H$ strata of sizes $N_h$, $h = 1, \ldots, H$. The aim is to estimate the population average $\mu$ of some variable $y$. We assume completely random nonresponse within each stratum, denoted as MAR (missing at random) by Rubin (1987) and Little and Rubin (2002). This means that the response indicators in stratum $h$, $R_{h,1}, \ldots, R_{h,N_h}$, are independent with $p_{hr} = P(R_{h,i} = 1)$. The imputation method is stratified hot-deck. Let $y_{h,obs}$ be the observed part from the response sample $s_{hr}$ of size $n_{hr}$ from stratum $h$, $y_{h,obs} = (y_i : i \in s_{hr})$. Then an imputed value $y_i^*$ in stratum $h$ is drawn at random from $y_{h,obs}$. The estimator based on the full sample data is the usual stratified weighted average $\bar Y_{strat} = \sum_{h=1}^H N_h \bar y_h/N = \sum_{h=1}^H v_h \bar y_h$. Here, $v_h = N_h/N$ and $\bar y_h = \sum_{s_h} y_i/n_h$, where $s_h$ is the sample from stratum $h$ and $n_h = |s_h|$. Then $\operatorname{Var}(\bar Y_{strat}) = \sum_{h=1}^H v_h^2 \sigma_h^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right)$, with $\sigma_h^2 = \sum_{i \in U_h}(y_i - \mu_h)^2/(N_h - 1)$ being the population variance in stratum $h$. Here $U_h$ is stratum population $h$ and $\mu_h$ is the average in $U_h$.

Let $\bar y_{hr}$ be the observed sample mean from stratum $h$ and $\hat\sigma_{hr}^2 = \frac{1}{n_{hr} - 1}\sum_{i \in s_{hr}}(y_i - \bar y_{hr})^2$ the observed sample variance. The imputation-based estimator is given by $\bar Y_{strat}^* = \sum_{h=1}^H N_h \bar y_h^*/N$, where $\bar y_h^* = \left(\sum_{s_{hr}} y_i + \sum_{s_h - s_{hr}} y_i^*\right)/n_h = \left(n_{hr}\bar y_{hr} + \sum_{s_h - s_{hr}} y_i^*\right)/n_h$. Let the $m$ imputation replicates of $\bar Y_{strat}^*$ be denoted by $\bar Y_{strat,i}^*$ for $i = 1, \ldots, m$. The combined estimator is given by $\bar{\bar Y}_{strat}^* = \sum_{i=1}^m \bar Y_{strat,i}^*/m$.

5.1.1. Separate Strata Combinations

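It follows from Section 3.1 that $k_h = 1/(1 - f_h)$, where $f_h = (n_h - n_{hr})/n_h$ is the rate of nonresponse in stratum $h$. The combination formula for the variance estimate of $\bar{\bar Y}_{strat}^*$ becomes, from (8),

$$W_{sep} = \bar V^* + \sum_{h=1}^H \left(\frac{1}{1 - f_h} + \frac{1}{m}\right)B_h^*$$

Here, $\bar V^* = \sum_{h=1}^H \bar V_h^*$ and $\bar V_h^*$ is the average of the $m$ values of the imputation-based variance estimate $\hat V_h^* = v_h^2 \hat\sigma_{h*}^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right)$, where $\hat\sigma_{h*}^2 = \frac{1}{n_h - 1}\left\{\sum_{s_{hr}}(y_i - \bar y_h^*)^2 + \sum_{s_h - s_{hr}}(y_i^* - \bar y_h^*)^2\right\}$.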

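5.1.2. Overall Combination Formula. Determination of $k$ in (1)

From (12) we need to determine $\operatorname{Var}(v_h \bar Y_h^* \mid y_{obs})$ and $\operatorname{Var}(\bar Y_{strat}^* \mid y_{obs}) = \sum_{h=1}^H \operatorname{Var}(v_h \bar Y_h^* \mid y_{obs})$. Then

$$k = \sum_{h=1}^H \frac{1}{1 - f_h}\,\frac{\operatorname{Var}(v_h \bar Y_h^* \mid y_{obs})}{\operatorname{Var}(\bar Y_{strat}^* \mid y_{obs})}$$

Now, $E(\bar Y_h^* \mid y_{h,obs}) = \bar y_{hr}$ and $\operatorname{Var}(\bar Y_h^* \mid y_{h,obs}) = \{(n_h - n_{hr})/n_h^2\}\{(n_{hr} - 1)/n_{hr}\}\hat\sigma_{hr}^2 \approx f_h \hat\sigma_{hr}^2/n_h$. Hence we can determine $k$ as

$$k = \sum_{h=1}^H \frac{1}{1 - f_h}\,\frac{f_h v_h^2 \hat\sigma_{hr}^2/n_h}{\sum_{k=1}^H f_k v_k^2 \hat\sigma_{kr}^2/n_k}$$

If the stratum sizes $N_h$ are large then we can let $\hat V(v_h \bar Y_h) = v_h^2 \hat\sigma_{hr}^2/n_h$. Let also $b_h = f_h \hat V(v_h \bar Y_h)/\sum_{k=1}^H f_k \hat V(v_k \bar Y_k)$. Then

$$k = \frac{\sum_{h=1}^H \hat V(v_h \bar Y_h)\, f_h\,\frac{1}{1 - f_h}}{\sum_{h=1}^H \hat V(v_h \bar Y_h)\, f_h} = \sum_{h=1}^H b_h\,\frac{1}{1 - f_h} \tag{13}$$

Since $\sum_{h=1}^H b_h = 1$, we see that $k$ is a weighted average of the inverse of the response rates. If all $f_h = f$, the overall nonresponse rate, we get as for simple random sample that $k = 1/(1 - f)$. Otherwise, a stratum response rate $1 - f_h$ has large weight if either the nonresponse rate is large and/or the estimated variance of $v_h \bar Y_h$ is large.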
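Formula (13) is straightforward to evaluate from stratum summaries. The following Python sketch is one possible way to do so; the function name `overall_k` and the input numbers are hypothetical, and the variance estimate $v_h^2\hat\sigma_{hr}^2/n_h$ assumes large stratum sizes $N_h$ as in the text.

```python
def overall_k(f, v, sigma2_r, n):
    """Overall factor k from Formula (13): k = sum_h b_h / (1 - f_h).

    f:        nonresponse rates f_h per stratum (each < 1)
    v:        stratum weights v_h = N_h / N
    sigma2_r: observed sample variances per stratum
    n:        full sample sizes n_h per stratum
    """
    # f_h * V_hat(v_h * Ybar_h), with V_hat(v_h * Ybar_h) = v_h^2 * sigma_hr^2 / n_h
    fv = [fh * vh ** 2 * s2 / nh for fh, vh, s2, nh in zip(f, v, sigma2_r, n)]
    total = sum(fv)
    if total <= 0.0:
        raise ValueError("at least one stratum needs nonresponse and positive variance")
    b = [t / total for t in fv]                      # weights b_h, summing to 1
    return sum(bh / (1.0 - fh) for bh, fh in zip(b, f))


# Hypothetical two-stratum example with unequal nonresponse rates.
k = overall_k(f=[0.1, 0.5], v=[0.6, 0.4], sigma2_r=[2.0, 3.0], n=[100, 80])
```

When all `f` entries are equal the function returns $1/(1 - f)$, in line with the remark above.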

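5.1.3. An Alternative Expression for $k$ in (1)

By directly applying (5) we can get an alternative expression for $k$. Given $y_{obs}$, the imputed sample means $\bar Y_h^*$ are independent, which implies that $E(\bar Y_{strat}^* \mid y_{obs}) = \sum_{h=1}^H N_h \bar y_{hr}/N = \bar y_{strat,r}$ and $\operatorname{Var}(\bar Y_{strat}^* \mid y_{obs}) \approx \sum_{h=1}^H v_h^2 f_h \hat\sigma_{hr}^2/n_h$. Just like in Section 3.1, (3) holds. From (5) we get

$$
\begin{aligned}
E(k) &\approx \frac{\operatorname{Var}(\bar Y_{strat,r}) - \operatorname{Var}(\bar Y_{strat})}{E\left(\sum_h v_h^2 f_h \hat\sigma_{hr}^2/n_h\right)}
= \frac{\sum_{h=1}^H v_h^2 \sigma_h^2\left\{E\left(\frac{1}{n_{hr}}\right) - \frac{1}{N_h}\right\} - \sum_{h=1}^H v_h^2 \sigma_h^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right)}{\sum_{h=1}^H v_h^2\, E\left\{\frac{f_h}{n_h}E(\hat\sigma_{hr}^2 \mid n_{hr})\right\}} \\
&\approx \frac{\sum_{h=1}^H v_h^2 \sigma_h^2\,\frac{1 - p_{hr}}{n_h}\,\frac{1}{p_{hr}}}{\sum_{h=1}^H v_h^2 \sigma_h^2\,\frac{1 - p_{hr}}{n_h}}
= \frac{\sum_{h=1}^H \frac{v_h^2 \sigma_h^2}{n_{hr}}\,E(f_h)\,\frac{1 - f_h}{E(1 - f_h)}}{\sum_{h=1}^H \frac{v_h^2 \sigma_h^2}{n_{hr}}\,E(f_h)(1 - f_h)}
\end{aligned}
\tag{14}
$$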


Now, $\operatorname{Var}(\bar Y_{hr}) = E\operatorname{Var}(\bar Y_{hr} \mid n_{hr}) = \sigma_h^2 E(1/n_{hr})$. Let $\hat V(v_h \bar Y_{hr}) = v_h^2 \hat\sigma_{hr}^2/n_{hr}$. Then we see that the expression for $E(k)$ is satisfied approximately, if the stratum sizes $N_h$ are large, by letting

$$\frac{1}{k} = \frac{\sum_{h=1}^H (1 - f_h) f_h \hat V(v_h \bar Y_{hr})}{\sum_{h=1}^H f_h \hat V(v_h \bar Y_{hr})} = \sum_{h=1}^H a_h (1 - f_h) \tag{15}$$

where the weights are $a_h = f_h \hat V(v_h \bar Y_{hr})/\sum_{k=1}^H f_k \hat V(v_k \bar Y_{kr})$. Since $\sum_{h=1}^H a_h = 1$, we see that $1/k$ is a weighted average of the response rates. If all $f_h = f$, the overall nonresponse rate, we have, as shown in Section 5.1.2, that $k = 1/(1 - f)$. As seen in Section 5.1.2, we note also in Expression (15) that a stratum response rate $1 - f_h$ has large weight if either the nonresponse rate is large and/or the estimated variance of $v_h \bar Y_{hr}$ is large. The estimate of the total based on the response sample is given by $\bar Y_{strat,r} = \sum_h v_h \bar Y_{hr}$. We obtain Formula (13) for $k$ by noting from (14) that we have $E(k) \approx \sum_{h=1}^H \operatorname{Var}(v_h \bar Y_h)E(f_h)\frac{1}{E(1 - f_h)} \Big/ \sum_{h=1}^H \operatorname{Var}(v_h \bar Y_h)E(f_h)$. Then we see that the expression for $E(k)$ is satisfied approximately, if the stratum sizes $N_h$ are large, by letting $k$ be given by (13).

5.2. Logistic Regression with Binary Explanatory Variable. Estimating Log(Odds Ratio)

The variables $Y_1, \ldots, Y_n$ are independent 0/1-variables, and we have an explanatory 0/1-variable $x$ with fixed known values $x_1, \ldots, x_n$. The class probabilities are given by $p_1 = P(Y_i = 1 \mid x_i = 1)$ and $p_0 = P(Y_i = 1 \mid x_i = 0)$. We assume a MAR (missing at random) model for the response variables $R_1, \ldots, R_n$, with $P(R_i = 1 \mid x_i = 1) = p_{1r}$ and $P(R_i = 1 \mid x_i = 0) = p_{0r}$. We can reparametrize the model in a logit version, $\log\{P(Y = 1 \mid x)/P(Y = 0 \mid x)\} = \alpha + \beta x$, where $\alpha = \log\{p_0/(1 - p_0)\}$ and $\beta = \log\frac{p_1/(1 - p_1)}{p_0/(1 - p_0)} = \log(\text{odds ratio})$. The aim is to estimate $\beta$. Let $s = (1, \ldots, n)$ denote the full sample with strata $s_1 = \{i \in s : x_i = 1\}$ and $s_0 = \{i \in s : x_i = 0\}$. The sizes of $s_1$ and $s_0$ are denoted by $n_1$ and $n_0$. We note that $n_1 = \sum_{i=1}^n x_i = X$ and $n_0 = n - X$. The response samples in the strata are $s_{1r} = \{i \in s_1 : R_i = 1\}$ and $s_{0r} = \{i \in s_0 : R_i = 1\}$, with the total response sample being $s_r$ of size $n_r$. Let also $n_{1r} = |s_{1r}|$ and $n_{0r} = |s_{0r}|$. We see that $n_{1r} = \sum_{s_r} x_i = X_r$ and $n_{0r} = n_r - X_r$. The data from $s_r$ can be represented as follows, where $n_{ijr}$ denotes the number of observations with $x = i$ and $y = j$; see Table 1.

Table 1. The observed data and nonresponse totals for the two classes

| $x \backslash y$ | $y = 0$ | $y = 1$ | Totals | Nonresponse |
|---|---|---|---|---|
| $x = 0$ | $n_{00r}$ | $n_{01r}$ | $n_{0r}$ | $n_0 - n_{0r}$ |
| $x = 1$ | $n_{10r}$ | $n_{11r}$ | $n_{1r}$ | $n_1 - n_{1r}$ |

We then have the maximum likelihood estimates (MLE) $\hat p_{1r} = n_{11r}/n_{1r}$ and $\hat p_{0r} = n_{01r}/n_{0r}$, and the MLE of $\beta$ equals $\hat\beta_r = \log\frac{\hat p_{1r}/(1 - \hat p_{1r})}{\hat p_{0r}/(1 - \hat p_{0r})} = \log(n_{11r}n_{00r}/n_{10r}n_{01r})$. Similarly, the estimator based on the full sample is given by $\hat\beta = \log\frac{\hat p_1/(1 - \hat p_1)}{\hat p_0/(1 - \hat p_0)} = \log(n_{11}n_{00}/n_{10}n_{01})$ with obvious analogue notation. We can express this estimate as $\hat\beta = \log\{\hat p_1/(1 - \hat p_1)\} - \log\{\hat p_0/(1 - \hat p_0)\} = \hat\beta_1 - \hat\beta_0$, of the same form as in Section 4.1.

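We also have that $\hat\beta_1$ and $\hat\beta_0$ are independent, being based on the separate sample strata $s_1$ and $s_0$. For large $n_0$, $n_1$, $\hat\beta$ is approximately $N(\beta, \sigma_{\hat\beta}^2)$ where $\sigma_{\hat\beta}^2 = \{n_1 p_1(1 - p_1)\}^{-1} + \{n_0 p_0(1 - p_0)\}^{-1}$. So, approximately, $\operatorname{Var}(\hat\beta_1) = 1/\{n_1 p_1(1 - p_1)\}$ and $\operatorname{Var}(\hat\beta_0) = 1/\{n_0 p_0(1 - p_0)\}$, and an estimate of $\operatorname{Var}(\hat\beta)$ is given by

$$\hat V(\hat\beta) = \frac{1}{n_1 \hat p_1(1 - \hat p_1)} + \frac{1}{n_0 \hat p_0(1 - \hat p_0)} = \frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}$$

such that $\hat V(\hat\beta) = \hat V_1 + \hat V_0$, where $\hat V_1 = \frac{1}{n_{11}} + \frac{1}{n_{10}}$ and $\hat V_0 = \frac{1}{n_{01}} + \frac{1}{n_{00}}$ are the variance estimates of $\hat\beta_1$ and $\hat\beta_0$, respectively.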

We shall consider the following imputation method: For each missing value in $s_1 - s_{1r}$, the imputed value $y^*$ is drawn at random from the estimated distribution of $Y$ given $x = 1$: $y^* = 1$ with probability $\hat p_{1r} = n_{11r}/n_{1r}$ and $y^* = 0$ with probability $1 - \hat p_{1r}$. The same imputation method is used for $s_0 - s_{0r}$, with $y^*$ drawn at random from the estimated distribution of $Y$ given $x = 0$. This is the same as stratified hot-deck imputation: imputed values are drawn at random, with replacement, from $y_{1,obs} = (y_i : i \in s_{1r})$ and $y_{0,obs} = (y_i : i \in s_{0r})$.

The imputed values in $s - s_r$ can be represented in the same form as the original data, where now $n_{ij}^*$ denotes the number of imputed values with $x = i$ and $y = j$; see Table 2.

The imputation-based estimate of $p_1$ is given by $\hat p_1^* = (n_{11r} + n_{11}^*)/n_1$, such that the imputation-based estimate is $\hat\beta_1^* = \log\{\hat p_1^*/(1 - \hat p_1^*)\} = \log\{(n_{11r} + n_{11}^*)/(n_1 - n_{11r} - n_{11}^*)\}$. Similarly, the imputation-based estimates for $\beta_0$ and $\beta$ are given by $\hat\beta_0^* = \log\{(n_{01r} + n_{01}^*)/(n_0 - n_{01r} - n_{01}^*)\}$ and $\hat\beta^* = \hat\beta_1^* - \hat\beta_0^*$.

The $m$ repeated imputations lead to $m$ estimates $\hat\beta_{1,i}^*, \hat\beta_{0,i}^*, \hat\beta_i^*$, for $i = 1, \ldots, m$. The combined estimate is given by $\bar\beta^* = \sum_{i=1}^m \hat\beta_i^*/m = \sum_{i=1}^m \hat\beta_{1,i}^*/m - \sum_{i=1}^m \hat\beta_{0,i}^*/m = \bar\beta_1^* - \bar\beta_0^*$. The imputed variance estimate $\hat V^*$ for $\hat\beta$ is given by

$$\hat V^* = \frac{1}{n_{11r} + n_{11}^*} + \frac{1}{n_{10r} + n_{10}^*} + \frac{1}{n_{01r} + n_{01}^*} + \frac{1}{n_{00r} + n_{00}^*} \tag{16}$$

We see that $E(\hat V^* \mid y_{obs}) \approx \frac{1}{n_1 \hat p_{1r}(1 - \hat p_{1r})} + \frac{1}{n_0 \hat p_{0r}(1 - \hat p_{0r})}$ and (3) holds. We also note that (9) holds separately for each class.

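Table 2. The imputed totals for the two classes

| $x \backslash y$ | $y = 0$ | $y = 1$ | Totals |
|---|---|---|---|
| $x = 0$ | $n_{00}^*$ | $n_{01}^*$ | $n_0 - n_{0r}$ |
| $x = 1$ | $n_{10}^*$ | $n_{11}^*$ | $n_1 - n_{1r}$ |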

5.2.1. Separate Classes Combination

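Let us first use the approach in Section 4.1 and determine separate $k_1$, $k_0$ for the two classes. Consider first stratum $s_1 = \{i \in s : x_i = 1\}$. In Appendix A.3 it is shown that $E(\hat\beta_1^* \mid y_{1,obs}) \approx \hat\beta_{1r}$ and $\operatorname{Var}(\hat\beta_1^* \mid y_{1,obs}) \approx f_1(1 - f_1)\hat V(\hat\beta_{1r})$. From (10), we find approximately:

$$
\begin{aligned}
E(k_1) &= \frac{\operatorname{Var}(\hat\beta_{1r}) - \operatorname{Var}(\hat\beta_1)}{E\{f_1(1 - f_1)\hat V(\hat\beta_{1r})\}}
= \frac{E\operatorname{Var}(\hat\beta_{1r} \mid n_{1r}) - \operatorname{Var}(\hat\beta_1)}{E\{f_1(1 - f_1)E[\hat V(\hat\beta_{1r}) \mid n_{1r}]\}} \\
&\approx \frac{\frac{1}{p_1(1 - p_1)}\left\{E\left(\frac{1}{n_{1r}}\right) - \frac{1}{n_1}\right\}}{E\left\{f_1(1 - f_1)\frac{1}{n_{1r}p_1(1 - p_1)}\right\}}
\approx \frac{(1 - p_{1r})/p_{1r}}{1 - p_{1r}} = \frac{1}{p_{1r}}
\end{aligned}
$$

which is satisfied approximately by letting $k_1 = 1/(1 - f_1)$. In exactly the same way, we find that $k_0 = 1/(1 - f_0)$, where $f_0 = (n_0 - n_{0r})/n_0$ is the rate of nonresponse in stratum $s_0$. The between-imputation component for $\hat\beta_1^*$ is given by $B_1^* = \frac{1}{m - 1}\sum_{i=1}^m (\hat\beta_{1,i}^* - \bar\beta_1^*)^2$ and likewise $B_0^*$ is the between-imputation component for $\hat\beta_0^*$. Then an estimated variance of the combined imputation-based estimate $\bar\beta^*$ for $\beta$ is given by, from (8),

$$W_{sep} = \bar V^* + \sum_{x=0}^{1}\left(\frac{1}{1 - f_x} + \frac{1}{m}\right)B_x^*$$

where $\bar V^*$ is the average of $m$ replicates of the imputed variance estimate $\hat V^*$ given by (16).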
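The separate-classes procedure can be sketched in Python as follows. This is an illustrative implementation under stated assumptions: the function name `mi_log_odds_ratio` and the data layout are hypothetical, and all observed cell counts are assumed to be positive so that the logarithms and reciprocals are defined.

```python
import math
import random
from statistics import mean, variance


def mi_log_odds_ratio(obs, n_mis, m, rng):
    """Multiple imputation of log(odds ratio) with separate factors k_x.

    obs[x]   = (n_x0r, n_x1r): observed counts of y = 0 and y = 1 in class x
    n_mis[x] = n_x - n_xr:     number of missing y-values in class x
    Returns the combined estimate and W_sep.
    """
    betas = {0: [], 1: []}          # imputation-based logits per class
    v_hats = []                     # imputed variance estimates, Formula (16)
    for _ in range(m):
        v = 0.0
        for x in (0, 1):
            n0r, n1r = obs[x]
            p_hat = n1r / (n0r + n1r)
            # y* = 1 with probability p_hat, independently for each missing value
            n1_star = sum(rng.random() < p_hat for _ in range(n_mis[x]))
            n0_star = n_mis[x] - n1_star
            betas[x].append(math.log((n1r + n1_star) / (n0r + n0_star)))
            v += 1.0 / (n1r + n1_star) + 1.0 / (n0r + n0_star)
        v_hats.append(v)
    beta_bar = mean(betas[1]) - mean(betas[0])
    w_sep = mean(v_hats)
    for x in (0, 1):
        n_x = sum(obs[x]) + n_mis[x]
        f_x = n_mis[x] / n_x                         # nonresponse rate in class x
        w_sep += (1.0 / (1.0 - f_x) + 1.0 / m) * variance(betas[x])
    return beta_bar, w_sep


# Hypothetical counts: class 0 has (30, 10) observed, class 1 has (20, 20).
beta_bar, w_sep = mi_log_odds_ratio({0: (30, 10), 1: (20, 20)},
                                    {0: 10, 1: 15}, m=5, rng=random.Random(0))
```

With no missing values the between-imputation components vanish and the function returns the complete-data log odds ratio with its usual variance estimate.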

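5.2.2. Overall Combination Formula. Determination of $k$ in (1)

Since $\operatorname{Var}(\hat\beta_1^* \mid y_{1,obs}) = f_1(1 - f_1)\hat V(\hat\beta_{1r})$ and $\operatorname{Var}(\hat\beta_0^* \mid y_{0,obs}) = f_0(1 - f_0)\hat V(\hat\beta_{0r})$, we have from (12)

$$k = \frac{1}{1 - f_1}\,\frac{f_1(1 - f_1)\hat V(\hat\beta_{1r})}{\sum_{x=0}^{1} f_x(1 - f_x)\hat V(\hat\beta_{xr})} + \frac{1}{1 - f_0}\,\frac{f_0(1 - f_0)\hat V(\hat\beta_{0r})}{\sum_{x=0}^{1} f_x(1 - f_x)\hat V(\hat\beta_{xr})} \tag{17}$$

Now, $\operatorname{Var}(\hat\beta_1) \approx (n_{1r}/n_1)\operatorname{Var}(\hat\beta_{1r} \mid n_{1r}) = (1 - f_1)\operatorname{Var}(\hat\beta_{1r} \mid n_{1r})$. Similarly, $\operatorname{Var}(\hat\beta_0) \approx (1 - f_0)\operatorname{Var}(\hat\beta_{0r} \mid n_{0r})$. We can therefore estimate the variance of the full sample estimates $\hat\beta_1$ and $\hat\beta_0$ by $\hat V(\hat\beta_1) = (1 - f_1)\hat V(\hat\beta_{1r})$ and $\hat V(\hat\beta_0) = (1 - f_0)\hat V(\hat\beta_{0r})$, respectively. Then

$$k = \frac{1}{1 - f_1}\,\frac{f_1 \hat V(\hat\beta_1)}{\sum_{x=0}^{1} f_x \hat V(\hat\beta_x)} + \frac{1}{1 - f_0}\,\frac{f_0 \hat V(\hat\beta_0)}{\sum_{x=0}^{1} f_x \hat V(\hat\beta_x)} = \frac{1}{1 - f_1}\,b_1 + \frac{1}{1 - f_0}\,(1 - b_1)$$

where $b_1 = f_1 \hat V(\hat\beta_1)/\sum_{x=0}^{1} f_x \hat V(\hat\beta_x)$.

Just like in Section 5.1.2 we see that $k$ is a weighted average of the inverse of the response rates. If $f_1 = f_0 = f$, the overall nonresponse rate, we get that $k = 1/(1 - f)$. Otherwise, a stratum response rate $1 - f_x$ has large weight if either the nonresponse rate is large and/or the estimated variance of $\hat\beta_x$ is large.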


Alternatively, from (17), $1/k = \sum_{x=0}^{1}(1 - f_x) f_x \hat V(\hat\beta_{xr}) \Big/ \sum_{x=0}^{1} f_x \hat V(\hat\beta_{xr}) = \sum_{x=0}^{1} a_x(1 - f_x)$, where the weights are $a_x = f_x \hat V(\hat\beta_{xr})/\{f_1 \hat V(\hat\beta_{1r}) + f_0 \hat V(\hat\beta_{0r})\}$. So we can alternatively express $1/k$ as a weighted average of the response rates.

If the aim is to estimate $p_1$ and $p_0$ we obtain, of course, $k = 1/(1 - f_1)$ for $p_1$ and $k = 1/(1 - f_0)$ for $p_0$.

5.3. Logistic Regression with Categorical Explanatory Variable. Estimating Log(Odds Ratios)

If the explanatory $x$ is categorical, defining, say, $H$ classes, we can generalize the results as follows:

Let $p_h = P(Y = 1 \mid x = h)$, $h = 0, \ldots, H - 1$. Logistic regression defining the categories is done by introducing $H - 1$ binary explanatory variables $x_1, \ldots, x_{H-1}$, where $x_h = 1$ if the observation belongs to Class $h$, and 0 otherwise, for $h = 1, \ldots, H - 1$. Then an observation belongs to Class 0 if $x_1 = x_2 = \cdots = x_{H-1} = 0$. The logit version of the model becomes, with $x = (x_1, x_2, \ldots, x_{H-1})$: $\log\{P(Y = 1 \mid x)/P(Y = 0 \mid x)\} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{H-1}x_{H-1}$. We see that $\alpha = \log\frac{p_0}{1 - p_0}$ and $\beta_h = \log\frac{p_h/(1 - p_h)}{p_0/(1 - p_0)} = \log(\text{odds ratio})$ for Class $h$ versus Class 0. Estimating $\beta_h$ by multiple imputation is done in exactly the same manner as for binary $x$, with Class $h$ replacing Class 1.

5.4. Logistic Regression with Missing Values in a Binary Explanatory Variable

The situation is as in Section 5.2, except that $y$ is fully observed in $s$, $y = (y_1, \ldots, y_n)$, and we have missing values for the $x$-variable. $Y_1, \ldots, Y_n$ are independent 0/1-variables and we have an explanatory 0/1-variable $x$ with fixed values $x_1, \ldots, x_n$, some of which are missing. The response variables indicate missingness of the $x_i$’s, but now with MAR model $P(R_i = 1 \mid y_i = 1) = q_{1r}$ and $P(R_i = 1 \mid y_i = 0) = q_{0r}$.

Otherwise, the model is the same as in Section 5.2 with class probabilities $p_1 = P(Y_i = 1 \mid x_i = 1)$ and $p_0 = P(Y_i = 1 \mid x_i = 0)$, and the logit version $\log\{P(Y = 1 \mid x)/P(Y = 0 \mid x)\} = \alpha + \beta x$ with $\beta = \log\frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}$. The aim is still to estimate $\beta$.

Let now $s^1 = \{i \in s : y_i = 1\}$ and $s^0 = \{i \in s : y_i = 0\}$ with sizes $n_{+1}$ and $n_{+0}$. The response samples in the strata are $s_r^1 = \{i \in s^1 : R_i = 1\}$ and $s_r^0 = \{i \in s^0 : R_i = 1\}$, with the total response sample being $s_r = \{i \in s : R_i = 1\} = s_r^1 \cup s_r^0$. The data can now be represented as before, except that nonresponse totals are for each $y$-stratum. See Table 3.

Table 3. The observed data and nonresponse totals for the y-strata

| $x \backslash y$ | $y = 0$ | $y = 1$ |
|---|---|---|
| $x = 0$ | $n_{00r}$ | $n_{01r}$ |
| $x = 1$ | $n_{10r}$ | $n_{11r}$ |
| Totals | $n_{+0r}$ | $n_{+1r}$ |
| Nonresponse | $n_{+0} - n_{+0r}$ | $n_{+1} - n_{+1r}$ |

The MLE $\hat p_{1r}, \hat p_{0r}, \hat\beta_r$ based on $s_r$ are the same as before, as is the full sample estimate $\hat\beta$. The imputation method is stratified hot-deck for the $y$-strata. For each missing value of $x$ in $s^1 - s_r^1$, the imputed value $x^*$ is drawn at random from $x_{1,obs} = (x_i : i \in s_r^1)$. Similarly, imputed values in $s^0 - s_r^0$ are drawn at random from $x_{0,obs} = (x_i : i \in s_r^0)$. The imputed values in $s - s_r$ can be represented in the same form as the original data, where now $n_{ij}^*$ denotes the number of imputed values with $x = i$ and $y = j$. See Table 4.

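We need to find an approximate expression for the expectation and variance of $\hat\beta^*$, now denoted $\hat\beta_*$, conditional on the observed data. We defer to Appendix A.4 to show that $\operatorname{Var}(\hat\beta_* \mid y, x_{obs}) \approx f^1(1 - f^1)\left(\frac{1}{n_{11r}} + \frac{1}{n_{01r}}\right) + f^0(1 - f^0)\left(\frac{1}{n_{10r}} + \frac{1}{n_{00r}}\right)$ and $E(\hat\beta_* \mid y, x_{obs}) \approx \hat\beta_r$. Here $f^1 = (n_{+1} - n_{+1r})/n_{+1}$ is the nonresponse rate in stratum $s^1$ and $f^0 = (n_{+0} - n_{+0r})/n_{+0}$ the nonresponse rate in $s^0$. We note that $\hat q_{1r} = n_{+1r}/n_{+1} = 1 - f^1$ and $\hat q_{0r} = n_{+0r}/n_{+0}$. So the denominator in (5) becomes

$$E\left\{f^1(1 - f^1)\left(\frac{1}{n_{11r}} + \frac{1}{n_{01r}}\right) + f^0(1 - f^0)\left(\frac{1}{n_{10r}} + \frac{1}{n_{00r}}\right)\right\} \tag{18}$$

The numerator in (5) equals, as before, $\operatorname{Var}(\hat\beta_r) - \operatorname{Var}(\hat\beta)$, and we have approximately

$$\operatorname{Var}(\hat\beta_r) - \operatorname{Var}(\hat\beta) = \frac{1}{n_1 p_1(1 - p_1)}\,\frac{1 - p_{1r}}{p_{1r}} + \frac{1}{n_0 p_0(1 - p_0)}\,\frac{1 - p_{0r}}{p_{0r}} \tag{19}$$

where, as before, $p_{1r} = P(R_i = 1 \mid x_i = 1)$ and $p_{0r} = P(R_i = 1 \mid x_i = 0)$. We need alternative estimates of $p_{1r}$ and $p_{0r}$. Since $p_{1r} = p_1 q_{1r} + (1 - p_1)q_{0r}$, we have $\hat p_{1r} = \hat p_1(1 - f^1) + (1 - \hat p_1)(1 - f^0)$. Similarly, $\hat p_{0r} = \hat p_0(1 - f^1) + (1 - \hat p_0)(1 - f^0)$.

We can also use that $n_1 \hat p_{1r} \approx n_{1r}$ and $n_0 \hat p_{0r} \approx n_{0r}$. From (18) and (19) it follows that we can use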

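$$
\begin{aligned}
k &= \frac{\left(\frac{1}{n_{11r}} + \frac{1}{n_{10r}}\right)\{\hat p_{1r}f^1 + (1 - \hat p_{1r})f^0\} + \left(\frac{1}{n_{01r}} + \frac{1}{n_{00r}}\right)\{\hat p_{0r}f^1 + (1 - \hat p_{0r})f^0\}}{f^1(1 - f^1)\left(\frac{1}{n_{11r}} + \frac{1}{n_{01r}}\right) + f^0(1 - f^0)\left(\frac{1}{n_{10r}} + \frac{1}{n_{00r}}\right)} \\
&= \frac{f^1\left(\frac{1}{n_{10r}} + \frac{1}{n_{00r}}\right) + f^0\left(\frac{1}{n_{11r}} + \frac{1}{n_{01r}}\right)}{f^1(1 - f^1)\left(\frac{1}{n_{11r}} + \frac{1}{n_{01r}}\right) + f^0(1 - f^0)\left(\frac{1}{n_{10r}} + \frac{1}{n_{00r}}\right)}
\end{aligned}
$$

Here $\hat p_{1r} = n_{11r}/n_{1r}$ and $\hat p_{0r} = n_{01r}/n_{0r}$ in the numerator are the MLE from Section 5.2, and the denominator is the expression inside the expectation in (18).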
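As a sketch of how this factor could be computed from the observed cell counts, the following Python function evaluates the second form of the expression, with the denominator taken from (18). The function name `k_missing_x` and the counts in the example are hypothetical; the sums of reciprocal cell counts are formed separately for the $y = 1$ and $y = 0$ strata, and all counts are assumed positive.

```python
def k_missing_x(n00r, n01r, n10r, n11r, f1, f0):
    """Between-imputation factor k when the binary x has missing values.

    n_ijr: observed count with x = i and y = j
    f1:    nonresponse rate in the y = 1 stratum
    f0:    nonresponse rate in the y = 0 stratum (at least one rate must be > 0)
    """
    w1 = 1.0 / n11r + 1.0 / n01r      # reciprocal counts in the y = 1 stratum
    w0 = 1.0 / n10r + 1.0 / n00r      # reciprocal counts in the y = 0 stratum
    numerator = f1 * w0 + f0 * w1
    denominator = f1 * (1.0 - f1) * w1 + f0 * (1.0 - f0) * w0   # as in (18)
    return numerator / denominator


# Hypothetical counts with equal nonresponse rates in both y-strata.
k = k_missing_x(n00r=30, n01r=10, n10r=20, n11r=20, f1=0.25, f0=0.25)
```

With equal nonresponse rates the common factor cancels and the function returns $1/(1 - f)$.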

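Table 4. The imputed totals for the y-strata

| $x \backslash y$ | $y = 0$ | $y = 1$ |
|---|---|---|
| $x = 0$ | $n_{00}^*$ | $n_{01}^*$ |
| $x = 1$ | $n_{10}^*$ | $n_{11}^*$ |
| Totals | $n_{+0} - n_{+0r}$ | $n_{+1} - n_{+1r}$ |

We note that if $f^1 = f^0 = f$, then $k = 1/(1 - f)$. Otherwise, we can express $1/k$ as a linear combination of the response rates $(1 - f^1, 1 - f^0)$. Let $w_1 = \frac{1}{n_{11r}} + \frac{1}{n_{01r}}$ and $w_0 = \frac{1}{n_{10r}} + \frac{1}{n_{00r}}$. Then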