Non-Bayesian Multiple Imputation
Jan F. Bjørnstad1
Multiple imputation is a method specifically designed for variance estimation in the presence of missing data. Rubin’s combination formula requires that the imputation method is
“proper,” which essentially means that the imputations are random draws from a posterior distribution in a Bayesian framework. In national statistical institutes (NSI’s) like Statistics Norway, the methods used for imputing for nonresponse are typically non-Bayesian, e.g., some kind of stratified hot-deck. Hence, Rubin’s method of multiple imputation is not valid and cannot be applied in NSI’s. This article deals with the problem of deriving an alternative combination formula that can be applied for imputation methods typically used in NSI’s and suggests an approach for studying this problem. Alternative combination formulas are derived for certain response mechanisms and hot-deck type imputation methods.
Key words: Variance estimation; survey sampling; stratified sampling; logistic regression;
nonresponse; hot-deck imputation.
1. Introduction
Multiple imputation is a method specifically designed for variance estimation in the presence of missing data, developed by Rubin (1987). Two more recent references with further discussions and studies are Rubin (1996) and Schafer (1997). The basic idea is to createmimputed values for each missing value and combine themcompleted data sets by Rubin’s combination formula for variance estimation. For the estimator to be valid, the imputations must display an appropriate level of variability. In Rubin’s term, the imputation method is required to be “proper.” In national statistical institutes (NSI’s) the methods used for imputing for nonresponse very seldom if ever satisfy the requirement of being “proper.” However, the idea of creating multiple imputations to measure the imputation uncertainty and use it for variance estimation and for computing confidence intervals is still of interest. The problem is then that Rubin’s combination formula is no longer valid with the usual nonproper imputations used by NSI’s. The reason is that the variability in nonproper imputations is too small and the between-imputation component must be given a larger weight in the variance estimate. The problem is then to determine what this weight should be to give valid statistical inference, and also for what kind of nonresponse mechanisms and estimation problems it is possible to determine a simple
qStatistics Sweden
1Statistics Norway, Division for Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo, Norway.
Email: jab@ssb.no
Acknowledgment:The problem of deriving a non-Bayesian multiple imputation method was studied in a Master’s thesis in 1999 by Tonje Braaten with the author as her adviser. The present research began within the DACSEIS research project, and started originally when the author was contacted by Tonje Braaten regarding this issue in her doctoral studies in epidemiology.
combination formula not dependent on unknown parameters. This article suggests an approach for studying this problem.
In Section 2 an approach for determining the combination of the imputed completed data sets is suggested. Section 3 has three applications with random nonresponse:
(i) estimating a population average from simple random samples using hot-deck imputation, (ii) estimating the regression coefficient in the ratio model using residual regression imputation and (iii) estimating the regression coefficient in simple linear regression with residual regression imputation. Section 4 deals with the general problem of multiple imputation for stratified samples. In Section 5 we apply the theory in Section 4 to stratified samples with random nonresponse within strata, covering (i) estimation of population average using stratified hot-deck imputation and (ii) estimation of log (odds ratios) in logistic regression with missingness both for the dependent variable and the explanatory variable. Section 6 takes up the problem of using the same combination rule for all estimation problems with a given imputation method and data and response model. A general result for hot-deck imputation and linear estimates is presented.
2. An Approach for Determining an Alternative Combination Formula for Variance Estimation in Multiple Imputation
Let s¼ ð1; : : : ;nÞ denote the full sample, with y¼ ðy1; : : : ;ynÞ denoting the full sample data, values of random variable Y1; : : : :;Yn. In the case of sampling from a finite population under a design model, a renumbering of the selected units has been performed, of course, and the stochastic nature of y is determined by the sampling plan. The objective is to estimate some parameter u. The observed data is denoted by yobs ¼{ðyi:i[srÞ;sr}, being the observed part of y and the response sample sr of size nr.
Letu^be the estimator based on the full sample datay, withVarðu^Þestimated byVðyÞ.^ For i[s2sr we impute by some method yi* and let y* denote the complete data ðyi:i[sr;yi*:i[s2srÞ. Based ony*, we haveu^*¼u^ðy*ÞandV^*¼Vð^ y*Þ.
Multiple imputation of m repeated imputations leads to m completed data-sets with m estimates u^i*;i¼1; : : : ;m; and related variance estimates V^i*;i¼1; : : : ;m.
The combined estimate is given by u*¼Pm
i¼1u^i*=m. The within-imputation variance is defined as V*¼Pm
i¼1V^i*=m and the between-imputation component is B*¼ Pm
i¼1ðu^i*2u*Þ2=ðm21Þ:The total estimated variance ofu* is then proposed to be W ¼V*þ kþ1
m
B* ð1Þ
That is, we need to determineksuch that
EðWÞ ¼Varðu*Þ ð2Þ
Rubin (1987) has shown that k¼1 can be used with proper imputations, which essentially means drawing imputed values from a posterior distribution in a Bayesian framework.
In general, one has to determine the terms in (2). One way to try and do this is to use double expectation, conditioning on yobs, that is, EðWÞ ¼E{EðWjYobsÞ} and Varðu*Þ ¼E{Varðu*jYobsÞ}þVar{Eðu*jYobsÞ}. Typically,
EðV*Þ<Varðu^Þ ð3Þ
andEðB*jyobsÞ ¼Varðu^*jyobsÞ. Hence, approximately EðWÞ ¼Varðu^Þ þ EðkÞ þ1
m
EVarðu^*jYobsÞ ð4Þ
Moreover,Varðu*jyobsÞ ¼Varðu^*jyobsÞ=mandEðu*jyobsÞ ¼Eðu^*jyobsÞ. This implies that Varðu*Þ ¼m21E{Varðu^*jYobsÞ}þVar{Eðu^*jYobsÞ}. From (3) and (4), Equation (2) becomes Varðu^Þ þEðkÞEVarðu^*jYobsÞ ¼Var{Eðu^*jYobsÞ}, which gives the following general expression:
EðkÞ ¼VarEðu^*jYobsÞ2Varðu^Þ
EVarðu^*jYobsÞ ð5Þ
For this to be of interest,kmust be, at least approximately, determined independently of unknown parameters. In addition, one needs to check that (3) holds. To illustrate how (5) can be used we shall in the next section consider three special cases with random nonresponse.
3. Three Applications for Random Nonresponse
3.1. Estimating Population Average with Hot-deck Imputation
Consider a simple random sample from a finite population of sizeN, where the aim is to estimate the population averagemof some variabley. We shall assume completely random nonresponse. In the terminology of Rubin (1987) and Little and Rubin (2002), the missingness mechanism is said to be MCAR (missing completely at random). We note that MCAR means that the response indicators R1; : : : ;RN are independent with the same response probability pr¼PðRi¼1Þ. The imputation method is the hot-deck method, whereyi*is drawn at random fromyobswith replacement, and the estimate is the sample mean. Letyr be the observed sample mean ands^r2¼n1
r21
P
i[srðyi2yrÞ2 the observed sample variance. ThenY*is the imputation-based sample mean for the completed sample, and the combined estimator is given byY*¼Pm
i¼1Yi*=m. LetYsdenote the sample mean based on a full sample. Then,VarðYsÞ ¼s2ð1n2N1Þ, withs2¼ ðN21Þ21PN
i¼1ðyi2mÞ2 being the population variance. We have further that EðY*jyobsÞ ¼yr and VarðY*jyobsÞ ¼{ðn2nrÞ=n2}{ðnr21Þ=nr}s^r2 using that EðYi*jyobsÞ ¼yr and VarðYi*jyobsÞ ¼s^2rðnr21Þ=nr.
In this case, V^*¼s^2*ð1n2N1Þ where s^2*¼n211 P
srðyi2y*Þ2þP
s2srðyi*2y*Þ2
. It can be shown that Eðs^2*jyobsÞ ¼s^r2 12n1
r
1þnðn21Þnr
<s^2r and (3) holds.
We find, from (5),
EðkÞ ¼
Var Yr 2s2 1 n21
N
E n2nr
n2 nr21 nr
Eðs^r2jnrÞ
¼
s2 E 1 nr
21 N
2s2 1 n21
N
E n2nr
n2 nr21 nr
s2
<ð12prÞ=pr 12pr
¼ 1 pr
which is satisfied approximately, withf¼ ðn2nrÞ=n being the rate of nonresponse, by letting k¼1=ð12fÞ:
3.2. Estimating the Regression Coefficient in the Ratio Model with Residual Imputation We shall assume completely random nonresponse as in Section 3.1. We consider a ratio model, i.e., regression through the origin: Yi¼bxiþ1i, with Varð1iÞ ¼s2xi; i¼1,: : :,n.It is assumed thatall xi’s are known, also in the nonresponse sample. The full data estimator of b is given byb^¼Pn
i¼1Yi=Pn
i¼1xi. The unbiased estimator ofs2is given bys^2¼Pn
i¼11
xiðyi2b^xiÞ2=ðn21Þ.
We shall consider residual regression imputation. Let b^r be the b^ - estimate based on observed sample sr. Define the standardized residuals ei¼ ðyi2b^rxiÞ=pffiffiffiffixi
, for i[sr. Fori[s2sr: draw the value ofe*i at random, with replacement, from the set of observed residuals ei;i[sr. The imputed y-value is given by yi*¼b^rxiþei* ffiffiffiffixi
p . Let X¼Pn
i¼1xi;Xr¼P
i[srxi and Xnr¼P
i[s2srxi¼X2Xr. All considerations from now on are conditional onnrandXr, and we aim to determinekdirectly from (5).
The proportion of thex-total in the nonresponse group is denoted asfX ¼Xnr=X.
We now have b^*¼ P
sryiþP
s2sryi*
=X and s^2*¼n211 P
sr 1
xiðyi2b^*xiÞ2 þP
s2sr1
xiðyi*2b^*xiÞ2 .
In order to determine k from (5) we need to check the validity of (3) and derive EVarðb^*jyobsÞ;VarEðb^*jyobsÞandVarðb^Þ. We note thatVarðb^Þ ¼s2=X. In Appendix A.1 it is shown that condition (3) holds for moderate and largenr, and that
VarEðb^*jyobsÞ ¼s2 Xr
þð12d1Þd2nnrXnr
X2 s2
nr
ð6Þ EVarðb^*jyobsÞ ¼Xnr
X2s2
nr ðnrþd122Þ ð7Þ
Here, 0#d1;d2#1. From (5), using (6) and (7), we find k¼nrX22nrXXrþ ð12d1Þd2nnrXnrXr
XrXnrðnrþd122Þ < X
Xrþ ð12d1Þd2
nnr
nr
We note that if all xi ¼1, then d1¼d2 ¼1. Now, with fX ¼Xnr=X being the proportion of thex-total in the nonresponse group andf ¼nnr=nthe rate of nonresponse, we finally get, since typicallyð12d1Þd2<0,
k< 1 12fX
þ ð12d1Þd2 f
12f < 1 12fX
for usualx-values and nonresponse rates.
3.3. Estimating the Regression Coefficient in Simple Linear Regression with Residual Imputation
As in Sections 3.1 and 3.2 the nonresponse mechanism is assumed to be MCAR with pr¼PðRi¼1Þ. The simple linear regression model is assumed: Yi¼aþbxiþ 1i;withVarð1iÞ ¼s2;i¼1; : : : ;n:Allxi’s are assumed to be known. We may assume, that x¼Pn
i¼1xi=n¼0. Then the full data estimates are given by b^¼Pn
i¼1xiyi=SSx, whereSSx¼Pn
i¼1xi2;anda^¼y¼Pn
i¼1yi=n:The unbiased estimator ofs2is given by s^2¼n221 Pn
i¼1ðyi2a^2b^xiÞ2. Leta^r;b^rbe the estimates based on the response sample, a^r¼yr2b^rxrandb^r¼P
i[srðxi2xrÞyi=SSx;r. Here,yr¼P
i[sryi=nr,xr ¼P
i[srxi=nr
andSSx;r¼P
i[srðxi2xrÞ2.
Simple residual imputation is defined as follows: The observed residuals are ej¼ ðyj2a^r2b^rxjÞ;forj[sr. Fori[s2sr: drawei*at random, with replacement from ðej;j[sr). The imputedy-value is given byyi*¼a^rþb^rxiþei*:
The imputation based estimates are b^* ¼ P
i[srxiyiþP
i[s2srxiyi*
=SSx, a^*¼ ðnryrþ ðn2nrÞynr*Þ=n where ynr* ¼P
s2sryi*=ðn2nrÞ and s^2* ¼n221 P
srðyi2 n
a^*2b^*xiÞ2þP
s2srðyi*2a^*2b^*xiÞ2o
: It can be shown (see Appendix A.2 for a summary proof) that Eðs^2*Þ ¼s2Eðnrn22
r n22fn22Þ<s2 where, as in Section 3.1, f ¼ ðn2nrÞ=n. Since Varðb^Þ ¼s2=SSx, (3) holds. It is readily seen that Eðb^*jyobsÞ ¼ b^r and Varðb^*jyobsÞ ¼se2cr=SSx, where se2¼P
srei2=nr and cr ¼P
s2srxi2=SSx
[k0;1l. It can be shown thatEðse2jsrÞ ¼nrn22
r s2. Moreover, clearlyEðcrÞ ¼12pr and Varðb^rjsrÞ ¼s2=SSx;r:It follows, from (5), that
EðkÞ ¼ Eð1=SSx;rÞ21=SSx ð12prÞE{ðnr22Þ=nr}=SSx
<1=EðSSx;rÞ21=SSx ð12prÞ=SSx
Using the fact that conditional on nr, sr is a simple random sample such that the response indicators are correlated with Cov(Ri,Rj)¼ 2f(12f)/(n21), we find that EðSSx;rÞ ¼ ðpr212pn21rÞSSx. It follows that, approximately,E(k)¼p1
r21n<1=prand we can usek¼1/(12f).
4. Multiple Imputation for Stratified Samples
4.1. Separate Combinations
One way to combine themcompleted data sets is to do it separately for each stratum, i.e., determine a separate k for each stratum. The general setup is then as follows:
The sample s is divided into H sample strata, s1,: : :, sH. Let yh be the planned full data from subsample shof size nh. It is assumed that y1,: : :,yH are independent. The observed part of yhis denoted byyh,obs withshrbeing the response sample from shof size nhr. The estimator based on the full sample data is the sum of independent terms, u^¼PH
h¼1u^h where u^h is based on the yh. Varðu^Þ ¼PH
h¼1Varðu^hÞ is estimated by Vð^ u^Þ ¼PH
h¼1V^hðyhÞ where V^hðyhÞ is the variance estimate of u^h based on yh. For i[sh2shr we impute by some method yi* based on yh,obs and let yh* denote the complete data ðyh;obs;yi*;i[sh2shrÞ. Based on yh*, we have u^h*¼u^hðyh*Þ and V^h*¼ V^hðyh*Þ: Then the imputation based estimator is given by u^*¼PH
h¼1u^h* and V^*¼PH
h¼1V^h*. Multiple imputation of m repeated imputations leads to m completed data sets with mestimates for each stratum h,u^h;i;i¼1; : : : ;m and related variance estimates V^h;i*;i¼1; : : : ;m: The total estimates and related variances are u^i*¼ PH
h¼1u^h;i* andV^i*¼PH
h¼1V^h;i*; fori ¼ 1,: : :,m. The combined estimate for stratumh is given by uh*¼Pm
i¼1u^h;i*=m. The within-imputation variance for stratum h is Vh*¼ Pm
i¼1V^h;i*=m and the between-imputation component is given by Bh*¼Pm
i¼1ðu^h;i* 2uh*Þ2=ðm21Þ. Following the same idea as in Section 2, Formula (1), the total estimated variance of uh* is then proposed to be Wh¼Vh*þ ðkhþm1ÞBh*. The combined total estimate is given byu* ¼Pm
i¼1u^i*=m¼PH
h¼1uh*. It follows that the total estimated variance of u* can be expressed as
Wsep¼XH
h¼1
Wh¼V*þXH
h¼1
khþ1 m
Bh* ð8Þ
where V*¼Pm
i¼1V^i*=m¼PH
h¼1Vh*. Provided (3) holds for each stratum h,
EðVh*Þ<Varðu^hÞ ð9Þ
we have from (5) that khmust satisfy EðkhÞ ¼VarEðu^h*jYh;obsÞ2Varðu^hÞ
EVarðu^*hjYh;obsÞ ð10Þ
The combination Formula (8) is an alternative to the usual combination Formula (1), especially useful when we get simple expressions forkhbut not fork.The next section develops an expression for k in this situation.
4.2. An Overall Combination Formula
Now letW be given by (1). We shall determine the between-imputation factork. Since EðWÞ ¼EðWsepÞwe have
E XH
h¼1
khþ1 m
Bh*
( )
¼E kþ1 m
B* ð11Þ
Here, B*¼m211 Pm
i¼1ðu^i*2u*Þ2¼m211 Pm i¼1
P
hðu^h;i* 2uh*Þ n o2
. Note that EðB*jyobsÞ ¼ E PH
h¼1Bh*jyobs
, sinceEðB*jyobsÞ ¼Varðu^*jyobsÞ ¼PH
h¼1Varðu^h*jyobsÞandEðBh*jyobsÞ ¼ Varðu^h*jyobsÞ.
Hence, the identity (11) becomes E{PH
h¼1khEðBh*jYobsÞ}¼E{kEðB*jYobsÞ}: This gives us the solution k¼PH
h¼1khEðBh*jyobsÞ=EðB*jyobsÞ if we want to use the usual combination Formula (1) and hence
k¼ XH
h¼1
khVarðu^h*jyobsÞ Varðu^*jyobsÞ ¼XH
h¼1
khVarðu^h*jyobsÞ
Varðu^*jyobsÞ ð12Þ
a weighted average ofkh. We get a simple expression forkonly when allkhare equal, saykh¼k0. Thenk¼k0.
5. Four Applications to Stratified Samples and Random Nonresponse within Strata 5.1. Estimating Population Average from Stratified Sample with Stratified Hot-deck
Imputation
Consider stratified simple random samples from a finite population of sizeN, withHstrata of sizes Nh, h¼1,: : :,H. The aim is to estimate the population average m of some variabley. We assume completely random nonresponse within each stratum, denoted as MAR (missing at random) by Rubin (1987) and Little and Rubin (2002). This means that the response indicators in stratum h, Rh;1; : : : ;Rh;Nh are independent with phr¼PðRh;i¼1Þ. The imputation method is stratified hot-deck. Letyh,obsbe the observed part from the response sampleshrof sizenhrfrom stratumh,yh;obs¼ ðyi:i[shrÞ. Then an imputed valueyi* in stratum h is drawn at random fromyh,obs. The estimator based on the full sample data is the usual stratified weighted average Ystrat¼PH
h¼1Nhyh=N¼PH
h¼1vhyh. Here, vh¼Nh=N and yh¼P
shyi=nh, where sh is the sample from stratum h and nh¼ jshj. Then VarðYstratÞ ¼PH
h¼1vh2sh2ðn1
h2N1
hÞ, with sh2 ¼
i[Uh
Pðyi2mhÞ2=ðNh21Þ being the population variance in stratum h. Here Uh is stratum populationhandmhis the average inUh.
Letyhrbe the observed sample mean from stratumhands^hr2 ¼n1
hr21
P
i[shrðyi2yhrÞ2 the observed sample variance. The imputation-based estimator is given by Ystrat* ¼ PH
h¼1Nhyh*=N where yh*¼ P
shryiþP
sh2shryi*
=nh¼ nhryhrþP
sh2shryi*
=nh. Let themimputation replicates ofYstrat* be denoted byYstrat;i* fori¼1, : : :,m. The combined estimator is given byY*
strat¼Pm
i¼1Ystrat;i* =m:
5.1.1. Separate Strata Combinations
It follows from Section 3.1 thatkh¼1=ð12fhÞ, wherefh¼ ðnh2nhrÞ=nh is the rate of nonresponse in stratum h. The combination formula for the variance estimate of Ystrat* becomes, from (8),
Wsep¼V*þXH
h¼1
1 12fh
þ1 m
Bh* Here,V*¼PH
h¼1Vh*andVh*is the average of themvalues of the imputation-based variance estimateV^h* ¼vh2s^h*2 n1
h2N1
h
wheres^h*2 ¼n1
h21
P
shrðyi2yh*Þ2þP
sh2shrðyi*2yh*Þ2
.
5.1.2. Overall Combination Formula. Determination ofkin (1)
From (12) we need to determine VarðvhYh*jyo b sÞ and VarðYs t r a t* jyo b sÞ ¼ PH
h¼1Var ðvhYh*jyo b sÞ. Then k¼XH
h¼1
1
12fhVarðvhYh*jyobsÞ VarðYstrat* jyobsÞ
Now, EðYh*jyh;obsÞ ¼yhr and VarðYh*jyh;obsÞ ¼{ðnh2nhrÞ=nh2}{ðnhr21Þ=nhr}s^hr2 <
fhs^hr2=nh. Hence we can determine k as k¼XH
h¼1
1 12fh
fhvh2s^hr2=nh XH
k¼1
fkvk2s^kr2=nh
If the stratum sizes Nh are large then we can let Vðv^ hYhÞ ¼vh2s^hr2=nh. Let also bh ¼fhVðv^ hYhÞ=PH
k¼1fkVðv^ kYkÞ. Then
k¼ XH
h¼1
Vðv^ hYhÞfh
1 12fh XH
h¼1
Vðv^ hYhÞfh
¼XH
h¼1
bh 1 12fh
ð13Þ
Since PH
h¼1bh ¼1, we see that kis a weighted average of the inverse of the response rates. If all fh¼f, the overall nonresponse rate, we get as for simple random sample thatk¼1/(12f). Otherwise, a stratum response rate 12fhhas large weight if either the nonresponse rate is large and/or the estimated variance of vhYh is large.
5.1.3. An Alternative Expression forkin (1)
By directly applying (5) we can get an alternative expression fork.Givenyobs, the imputed sample means Yh* are independent, which implies that EðYstrat* jyobsÞ ¼PH
h¼1Nhyhr=N¼
ystrat;r andVarðYstrat* jyobsÞ<PH
h¼1vh2fhs^hr2=nh:Just like in Section 3.1, (3) holds. From (5) we get
EðkÞ<VarðYstrat;rÞ2VarðYstratÞ E
h
Xvh2fhs^hr2=nh
!
¼ XH
h¼1
vh2sh2 E 1 nhr
2 1 Nh
2XH
h¼1
vh2sh2 1 nh2 1
Nh
XH
h¼1
vh2E fh nh
Eðs^hr2jnhrÞ
<
XH
h¼1
vh2sh212phr nh
1 phr
XH
h¼1
vh2sh212phr nh
¼ XH
h¼1
vh2sh2 nhr
EðfhÞ 12fh
Eð12fhÞ XH
h¼1
vh2sh2
nhrEðfhÞð12fhÞ
ð14Þ
Now,VarðYhrÞ ¼EVarðYhrjnhrÞ ¼sh2Eð1=nhrÞ. Let Vðv^ hYhrÞ ¼vh2s^hr2=nhr. Then we see that the expression forE(k) is satisfied approximately, if the stratum sizesNhare large, by letting
1 k¼
XH
h21
ð12fhÞfhVðv^ hYhrÞ XH
h21
fhVðv^ hYhrÞ
¼XH
h21
ahð12fhÞ ð15Þ
where the weights ah¼fhVðv^ hYhrÞ=PH
k¼1fkVðv^ kYkrÞ. Since PH
h¼1ah¼1, we see that 1/k is a weighted average of the response rates. If all fh ¼f, the overall nonresponse rate, we have, as shown in Section 5.1.2, that k ¼ 1/(12f). As seen in Section 5.1.2, we note also in Expression (15) that a stratum response rate 12fh has large weight if either the nonresponse rate is large and/or the estimated variance of vhYhr is large. The estimate of the total based on the response sample is given by Ystrat;r¼P
hvhYhr: We obtain Formula (13) for k by noting from (14) that we have EðkÞ<PH
h¼1 VarðvhYhÞEðfhÞEð12f1
hÞ=PH
h¼1VarðvhYhÞEðfhÞ. Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes Nh are large, by letting k be given by (13).
5.2. Logistic Regression with Binary Explanatory Variable. Estimating Log(Odds Ratio) The variables Y1; : : : ;Yn are independent 0/1 -variables, and we have explanatory 0/1-variable x with fixed known values x1; : : : ;xn. The class probabilities are given by p1¼PðYi¼1jxi¼1Þ and p0¼PðYi¼1jxi¼0Þ. We assume a MAR(missing at random) model for the response variables R1; : : : ;Rn, with PðRi¼1jxi¼1Þ ¼p1r and PðRi¼1jxi¼0Þ ¼p0r. We can reparametrize the model in a logit version, log {PðY ¼1jxÞ=PðY¼0jxÞ}¼aþbx, where a¼ log {p0=ð12p0Þ} and b¼ logpp1=ð12p1Þ
0=ð12p0Þ¼ log(odds ratio). The aim is to estimateb:Lets¼(1,: : :,n) denote the
full sample with stratas1¼{i[s:xi¼1} ands0¼{i[s:xi¼0}. The sizes ofs1and s0are denoted byn1andn0. We note thatn1¼Pn
i¼1xi¼Xandn0 ¼n –X.The response samples in the strata ares1r¼{i[s1:Ri¼1} and s0r¼{i[s0:Ri¼1} with total response sample being sr of size nr. Let also n1r ¼ js1rj and n0r¼ js0rj. We see that n1r¼P
srxi¼Xrandn0r ¼nr2Xr. The data fromsrcan be represented as follows where nijrdenotes the number of observations withx ¼ iandy ¼ j: see (Table 1).
We then have the maximum likelihood estimates (MLE)p^1r¼n11r=n1r and p^0r¼ n01r=n0rand MLE ofbequalsb^r¼ logpp^^1r=ð12p^1rÞ
0r=ð12p^0rÞ¼ logðn11rn00r=n10rn01rÞ. Similarly, the
Table 1. The observed data and nonresponse totals for the two classes
x\y y¼0 y¼1 Totals Nonresponse
x¼0 n00r n01r n0r n02n0r x¼1 n10r n11r n1r n12n1r
estimator based on the full sample is given byb^¼ logpp^^1=ð12p^1Þ
0=ð12p^0Þ¼ logðn11n00=n10n01Þwith obvious analogue notation. We can express this estimate as b^¼ log {p^1=ð12p^1Þ}2 log {p^0=ð12p^0Þ}¼b^12b^0, of the same form as in Section 4.1.
We also have thatb^1andb^0are independent based on the separate sample stratas1ands0. For large n0, n1, b^ is approximately Nðb;s2^
bÞ where s2^
b ¼{n1p1ð12p1Þ}21þ {n0p0ð12p0Þ}21. So, approximately, Varðb^1Þ ¼1={n1p1ð12p1Þ} and Varðb^0Þ ¼ 1={n0p0ð12p0Þ} and an estimate ofVarðb^Þis given by
Vð^ b^Þ ¼ 1
n1p^1ð12p^1Þþ 1
n0p^0ð12p^0Þ¼ 1 n11þ 1
n10
þ 1
n01þ 1 n00
such thatVð^ b^Þ ¼V^1þV^0, whereV^1¼ n1
11þn1
10
and V^0 ¼ n1
01þn1
00
are the variance estimates ofb^1 and b^0, respectively.
We shall consider the following imputation method: For each missing value in s1 – s1r, the imputed value y* is drawn at random from the estimated distribution of Y given x¼1:
y* ¼1 with probability p^1r ¼n11r=n1r and y* ¼0 with probability 12p^1r: The same imputation method is used for s0 – s0r with y* drawn at random from the estimated distribution of Y given x¼0. This is the same as stratified hot-deck imputation, imputed values are drawn at random, with replacement, from y1;obs¼ ðyi: i[s1rÞ and y0;obs ¼ ðyi:i[s0rÞ.
The imputed values ins– srcan be represented in the same form as the original data where nownij*denotes the number of imputed values withx ¼ iandy ¼ j: see (Table 2).
The imputation-based estimate of p1 is given by p^1*¼ ðn11rþn11*Þ=n1 such that the imputation-based estimate b^1*¼ log {p^1*=ð12p^1*Þ}¼ log {ðn11rþn11*Þ=
ðn12n11r2n11*Þ}. Similarly, the imputation-based estimates forb0 andbare given by b^0*¼ log {ðn01rþn01*Þ=ðn02n01r2n01*Þ} andb^*¼b^1*2b^0*.
Themrepeated imputations lead tomestimatesb^1;i*;b^0;i*;b^i*, for i ¼ 1,: : :,m.The combined estimate is given by b*¼Pm
i¼1b^i*=m¼Pm
i¼1b^1;i*=m2Pm
i¼1b^0;i*=m¼ b1*2b0*. The imputed variance estimateV^*forb^is given by
V^* ¼ 1
n11rþn11* þ 1
n10rþn10* þ 1
n01rþn01* þ 1
n00rþn00* ð16Þ
We see thatEðV^*jyobsÞ<n 1
1p^1rð12p^1rÞþn 1
0p^0rð12p^0rÞand (3) hold. We also note that (9) holds separately for each class.
Table 2. The imputed totals for the two classes
x\y y¼0 y¼1 Totals
x¼0 n*00 n*01 n02n0r
x¼1 n*10 n*11 n12n1r
5.2.1. Separate Classes Combination
Let us first use the approach in Section 4.1 and determine separatek1, k0for the two classes. Consider first stratums1¼{i[s:xi¼1}. In Appendix A.3 it is shown that Eðb^1*jy1;obsÞ<b^1r and Varðb^1*jy1;obsÞ<f1ð12f1ÞVð^ b^1rÞ. From (10), we find approxi- mately:
Eðk1Þ ¼ Varðb^1rÞ2Varðb^1Þ
E{f1ð12f1ÞVð^ b^1rÞ}¼ E Varðb^1rjn1rÞ2Varðb^1Þ E{f1ð12f1ÞE½Vð^ b^1rÞjn1r}
<
1
p1ð12p1Þ E 1 n1r
2 1 n1
E f1ð12f1Þ 1 n1rp1ð12p1Þ
<ð12p1rÞ=p1r 12p1r
¼ 1 p1r
which is satisfied approximately by lettingk1¼1=ð12f1Þ. In exactly the same way, we find thatk0 ¼1=ð12f0Þwheref0 ¼ ðn02n0rÞ=n0is the rate of nonresponse in stratum s0. The between-imputation component forb^1*is given byB1*¼m211 Pm
i¼1ðb^1;i* 2b1*Þ2and likewiseB0*is the between-imputation component forb^0*. Then an estimated variance of the combined imputation-based estimateb* forbis given by, from (8),
Wsep¼V*þX1 x¼0
1 12fxþ1
m
Bx*
whereV*is the average ofmreplicates of the imputed variance estimateV^*given by (16).
5.2.2. Overall Combination Formula. Determination ofkin (1)
SinceVarðb^1*jy1;obsÞ ¼f1ð12f1ÞVð^ b^1rÞandVarðb^0*jy0;obsÞ ¼f0ð12f0ÞVð^ b^0rÞ, we have from (12)
k¼ 1 12f1
f1ð12f1ÞVð^ b^1rÞ X1
x¼0
fxð12fxÞVð^ b^xrÞ
þ 1
12f0
f0ð12f0ÞVð^ b^0rÞ X1
x¼0
fxð12fxÞVð^ b^xrÞ
ð17Þ
Varðb^1Þ < ðn1r=n1ÞVarðb^1r jn1rÞ ¼ ð12f1ÞVarðb^1rjn1rÞ. Similarly, Varðb^0Þ<
ð12f0ÞVarðb^0rjn0rÞ. We can therefore estimate the variance of the full sample estimates b^1andb^0byVð^ b^1Þ ¼ ð12f1ÞVð^ b^1rÞandVð^ b^0Þ ¼ ð12f0ÞVð^ b^0rÞ, respectively. Then
k¼ 1
12f1 f1Vð^ b^1Þ X1
x¼0
fxVð^ b^xÞ
þ 1
12f0 f0Vð^ b^0Þ X1
x¼0
fxVð^ b^xrÞ
¼ 1
12f1b1þ 1
12f0ð12b1Þ
Just like in Section 5.1.2 we see thatkis a weighted average of the inverse of the response rates. If allfh ¼f,the overall nonresponse rate, we get thatk ¼ 1/(12f). Otherwise, a stratum response rate 1 –fxhas large weight if either the nonresponse rate is large and/or the estimated variance ofb^xis large.
Alternatively, from (17), 1=k¼P1
x¼0ð12fxÞfxVð^ b^xrÞ=P1
x¼0fxVð^ b^xrÞ ¼P1 x¼0
axð12fxÞ, where the weights are ax¼fxVð^ b^xrÞ={f1Vð^ b^1rÞ þf0Vð^ b^0rÞ}. So we can alternatively express 1/kas a weighted average of the response rates.
If the aim is to estimate p1 andp0 we obtain, of course, k¼1/(12f1) for p1 and k¼1=(12f0) forp0.
5.3. Logistic Regression with Categorical Explanatory Variable. Estimating Log(Odds Ratios)
If the explanatoryxis categorical defining, say,Hclasses, we can generalize the results as follows:
Let ph¼PðY¼1jx¼hÞ, h¼0,: : :,H – 1. Logistic regression defining the categories is done by introducing H21 binary explanatory variables x1, : : :, xH-1 wherexh ¼1 if observation belongs to Class h, and 0 otherwise forh¼1, : : :,H – 1. Then an observation belongs to Class 0 if x1¼x2¼: : :¼xH21¼0. The logit version of the model becomes, with x¼ ðx1;x2; : : : ;xH21Þ: log {PðY¼1jxÞ=
PðY ¼0jxÞ}¼aþb1x1þb2x2þ: : :þxH21bH21. We see that a¼ log12pp0
0 and
bh¼ logpph=ð12phÞ
0=ð12p0Þ¼ log (odds ratio) for Class h versus Class 0. Estimating bh by
multiple imputation is done in exactly the same manner as for binary x, with Class h replacing Class 1.
5.4. Logistic Regression with Missing Values in a Binary Explanatory Variable The situation is as in Section 5.2, except thatyis fully observed ins,y¼ ðy1; : : : ;ynÞ, and we have missing values for thex-variable.Y1; : : : ;Ynare independent 0/1-variables and we have an explanatory 0/1-variablexwith fixed valuesx1; : : : ;xn, some of which are missing. The response variables indicate missingness of thexi’s but now with MAR model PðRi¼1jyi¼1Þ ¼q1r andPðRi¼1jyi¼0Þ ¼q0r.
Otherwise, the model is the same as in Section 5.2 with class prob- abilities: p1¼PðYi¼1jxi¼1Þ and p0¼PðYi¼1jxi¼0Þ, and the logit version
log {PðY¼1jxÞ= PðY¼0jxÞ}¼aþbx with b¼ log pp1=ð12p1Þ
0=ð12p0Þ. The aim is still to
estimate b.
Let nows1 ¼ {i[s:yi¼1} ands0 ¼{i[s:yi¼0} with sizes n+1 andn+0. The response samples in the strata ares1r ¼{i[s1:Ri¼1} ands0r ¼{i[s0 :Ri¼1} with total response sample being sr¼{i[s:Ri¼1}¼s1r <s0r. The data can now be represented as before, except that nonresponse totals are for eachy-stratum. See Table 3.
The MLE p^1r;p^0r;b^r, based on sr are the same as before, as is the full sample estimate b^. The imputation method is stratified hot-deck for the y-strata. For each
Table 3. The observed data and nonresponse totals for the y-strata
x\y y¼0 y¼1
x¼0 n00r n01r
x¼1 n10r n11r
Totals n+0r n+1r
Nonresponse n+02n+0r n+12n+1r
missing value of x in s12s1r, the imputed value x* is drawn at random from x1;obs ¼ ðxi:i[s1rÞ. Similarly, imputed values in s02s0r are drawn at random from x0;obs ¼ ðxi:i[s0rÞ. The imputed values in s – sr can be represented in the same form as the original data where now nij* denotes the number of imputed values with x ¼ i and y ¼ j. See Table 4.
We need to find an approximate expression for the expectation and variance ofb^*, now denoted b^*, conditional on the observed data. We defer to Appendix A.4 to show that Varðb^*jy;xobsÞ<f1ð12f1Þðn1
11rþn1
01rÞ þf0ð12f0Þðn1
10rþn1
00rÞandEðb^*jy;xobsÞ<b^r. Heref1¼ ðn+12n+1rÞ=n+1 is the nonresponse rate in Stratums1andf0¼ ðn+02n+0rÞ=n+0 the nonresponse rate ins0. We note thatq^1r ¼n+1r=n+1¼12f1andq^0r¼n+0r=n+0. So the denominator in (5) becomes
E f112f1 1 n11r
þ 1 n01r
þf012f0 1 n10r
þ 1 n00r
ð18Þ
The numerator in (5) equals, as before,Varðb^rÞ2Varðb^Þ, and we have approximately Varðb^rÞ2Varðb^Þ ¼ 1
n1p1ð12p1Þ12p1r p1r
þ 1
n0p0ð12p0Þ12p0r p0r
ð19Þ where, as before,p1r¼PðRi¼1jxi¼1Þandp0r ¼PðRi¼1jxi¼0Þ. We need alternative estimates of p1r and p0r. Since p1r¼p1q1rþ ð12p1Þq0r; we have
^
p1r¼p^1ð12f1Þ þ ð12p^1Þð12f0Þ. Similarly,p^0r ¼p^0ð12f1Þ þ ð12p^0Þð12f0Þ.
We can also use thatn1p^1r <n1randn0p^0r<n0r. From (18) and (19) it follows that we can use
k¼ 1 n11rþ 1
n10r
p^1rf1þð12p^1rÞf0
þ 1
n01rþ 1 n00r
p^0rf1þð12p^0rÞf0
f112f1 1 n11rþ 1
n01r
þf012f0 1 n01rþ 1
n00r
¼
f1 1 n10r
þ 1 n00r
þf0 1 n11r
þ 1 n01r
f112f1 1 n11r
þ 1 n01r
þf012f0 1 n01r
þ 1 n00r
We note that iff1 ¼ f0¼f,thenk ¼ 1/(1 – f). Otherwise, we can express 1/kas a linear combination of the response rates (1 –f1, 1 –f0). Let w1¼n1
11rþn1
01r and
w0 ¼n1
10rþn1
00r. Then
Table 4. The imputed totals for the y-strata
x\y y¼0 y¼1
x¼0 n*00 n*01
x¼1 n*10 n*11
Totals n+02n+0r n+12n+1r