• No results found

Non-Bayesian Multiple Imputation

N/A
N/A
Protected

Academic year: 2022

Share "Non-Bayesian Multiple Imputation"

Copied!
20
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

Non-Bayesian Multiple Imputation

Jan F. Bjørnstad1

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data. Rubin’s combination formula requires that the imputation method is

“proper,” which essentially means that the imputations are random draws from a posterior distribution in a Bayesian framework. In national statistical institutes (NSI’s) like Statistics Norway, the methods used for imputing for nonresponse are typically non-Bayesian, e.g., some kind of stratified hot-deck. Hence, Rubin’s method of multiple imputation is not valid and cannot be applied in NSI’s. This article deals with the problem of deriving an alternative combination formula that can be applied for imputation methods typically used in NSI’s and suggests an approach for studying this problem. Alternative combination formulas are derived for certain response mechanisms and hot-deck type imputation methods.

Key words: Variance estimation; survey sampling; stratified sampling; logistic regression;

nonresponse; hot-deck imputation.

1. Introduction

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data, developed by Rubin (1987). Two more recent references with further discussions and studies are Rubin (1996) and Schafer (1997). The basic idea is to createmimputed values for each missing value and combine themcompleted data sets by Rubin’s combination formula for variance estimation. For the estimator to be valid, the imputations must display an appropriate level of variability. In Rubin’s term, the imputation method is required to be “proper.” In national statistical institutes (NSI’s) the methods used for imputing for nonresponse very seldom if ever satisfy the requirement of being “proper.” However, the idea of creating multiple imputations to measure the imputation uncertainty and use it for variance estimation and for computing confidence intervals is still of interest. The problem is then that Rubin’s combination formula is no longer valid with the usual nonproper imputations used by NSI’s. The reason is that the variability in nonproper imputations is too small and the between-imputation component must be given a larger weight in the variance estimate. The problem is then to determine what this weight should be to give valid statistical inference, and also for what kind of nonresponse mechanisms and estimation problems it is possible to determine a simple

qStatistics Sweden

1Statistics Norway, Division for Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo, Norway.

Email: jab@ssb.no

Acknowledgment:The problem of deriving a non-Bayesian multiple imputation method was studied in a Master’s thesis in 1999 by Tonje Braaten with the author as her adviser. The present research began within the DACSEIS research project, and started originally when the author was contacted by Tonje Braaten regarding this issue in her doctoral studies in epidemiology.

(2)

combination formula not dependent on unknown parameters. This article suggests an approach for studying this problem.

In Section 2 an approach for determining the combination of the imputed completed data sets is suggested. Section 3 has three applications with random nonresponse:

(i) estimating a population average from simple random samples using hot-deck imputation, (ii) estimating the regression coefficient in the ratio model using residual regression imputation and (iii) estimating the regression coefficient in simple linear regression with residual regression imputation. Section 4 deals with the general problem of multiple imputation for stratified samples. In Section 5 we apply the theory in Section 4 to stratified samples with random nonresponse within strata, covering (i) estimation of population average using stratified hot-deck imputation and (ii) estimation of log (odds ratios) in logistic regression with missingness both for the dependent variable and the explanatory variable. Section 6 takes up the problem of using the same combination rule for all estimation problems with a given imputation method and data and response model. A general result for hot-deck imputation and linear estimates is presented.

2. An Approach for Determining an Alternative Combination Formula for Variance Estimation in Multiple Imputation

Let s¼ ð1; : : : ;nÞ denote the full sample, with y¼ ðy1; : : : ;ynÞ denoting the full sample data, values of random variable Y1; : : : :;Yn. In the case of sampling from a finite population under a design model, a renumbering of the selected units has been performed, of course, and the stochastic nature of y is determined by the sampling plan. The objective is to estimate some parameter u. The observed data is denoted by yobs ¼{ðyi:i[srÞ;sr}, being the observed part of y and the response sample sr of size nr.

Letu^be the estimator based on the full sample datay, withVarðu^Þestimated byVðyÞ.^ For i[s2sr we impute by some method yi* and let y* denote the complete data ðyi:i[sr;yi*:i[s2srÞ. Based ony*, we haveu^*¼u^ðy*ÞandV^*¼Vð^ y*Þ.

Multiple imputation of m repeated imputations leads to m completed data-sets with m estimates u^i*;i¼1; : : : ;m; and related variance estimates V^i*;i¼1; : : : ;m.

The combined estimate is given by u*¼Pm

i¼1u^i*=m. The within-imputation variance is defined as V*¼Pm

i¼1V^i*=m and the between-imputation component is B*¼ Pm

i¼1ðu^i*2u*Þ2=ðm21Þ:The total estimated variance ofu* is then proposed to be W ¼V*þ kþ1

m

B* ð1Þ

That is, we need to determineksuch that

EðWÞ ¼Varðu*Þ ð2Þ

Rubin (1987) has shown that k¼1 can be used with proper imputations, which essentially means drawing imputed values from a posterior distribution in a Bayesian framework.

(3)

In general, one has to determine the terms in (2). One way to try and do this is to use double expectation, conditioning on yobs, that is, EðWÞ ¼E{EðWjYobsÞ} and Varðu*Þ ¼E{Varðu*jYobsÞ}þVar{Eðu*jYobsÞ}. Typically,

EðV*Þ<Varðu^Þ ð3Þ

andEðB*jyobsÞ ¼Varðu^*jyobsÞ. Hence, approximately EðWÞ ¼Varðu^Þ þ EðkÞ þ1

m

EVarðu^*jYobsÞ ð4Þ

Moreover,Varðu*jyobsÞ ¼Varðu^*jyobsÞ=mandEðu*jyobsÞ ¼Eðu^*jyobsÞ. This implies that Varðu*Þ ¼m21E{Varðu^*jYobsÞ}þVar{Eðu^*jYobsÞ}. From (3) and (4), Equation (2) becomes Varðu^Þ þEðkÞEVarðu^*jYobsÞ ¼Var{Eðu^*jYobsÞ}, which gives the following general expression:

EðkÞ ¼VarEðu^*jYobsÞ2Varðu^Þ

EVarðu^*jYobsÞ ð5Þ

For this to be of interest,kmust be, at least approximately, determined independently of unknown parameters. In addition, one needs to check that (3) holds. To illustrate how (5) can be used we shall in the next section consider three special cases with random nonresponse.

3. Three Applications for Random Nonresponse

3.1. Estimating Population Average with Hot-deck Imputation

Consider a simple random sample from a finite population of sizeN, where the aim is to estimate the population averagemof some variabley. We shall assume completely random nonresponse. In the terminology of Rubin (1987) and Little and Rubin (2002), the missingness mechanism is said to be MCAR (missing completely at random). We note that MCAR means that the response indicators R1; : : : ;RN are independent with the same response probability pr¼PðRi¼1Þ. The imputation method is the hot-deck method, whereyi*is drawn at random fromyobswith replacement, and the estimate is the sample mean. Letyr be the observed sample mean ands^r2¼n1

r21

P

i[srðyi2yrÞ2 the observed sample variance. ThenY*is the imputation-based sample mean for the completed sample, and the combined estimator is given byY*¼Pm

i¼1Yi*=m. LetYsdenote the sample mean based on a full sample. Then,VarðYsÞ ¼s2ð1n2N1Þ, withs2¼ ðN21Þ21PN

i¼1ðyi2mÞ2 being the population variance. We have further that EðY*jyobsÞ ¼yr and VarðY*jyobsÞ ¼{ðn2nrÞ=n2}{ðnr21Þ=nr}s^r2 using that EðYi*jyobsÞ ¼yr and VarðYi*jyobsÞ ¼s^2rðnr21Þ=nr.

In this case, V^*¼s^2*ð1n2N1Þ where s^2*¼n211 P

srðyi2y*Þ2þP

s2srðyi*2y*Þ2

. It can be shown that Eðs^2*jyobsÞ ¼s^r2 12n1

r

nðn21Þnr

<s^2r and (3) holds.

(4)

We find, from (5),

EðkÞ ¼

Var Yr 2s2 1 n21

N

E n2nr

n2 nr21 nr

Eðs^r2jnrÞ

¼

s2 E 1 nr

21 N

2s2 1 n21

N

E n2nr

n2 nr21 nr

s2

<ð12prÞ=pr 12pr

¼ 1 pr

which is satisfied approximately, withf¼ ðn2nrÞ=n being the rate of nonresponse, by letting k¼1=ð12fÞ:

3.2. Estimating the Regression Coefficient in the Ratio Model with Residual Imputation We shall assume completely random nonresponse as in Section 3.1. We consider a ratio model, i.e., regression through the origin: Yi¼bxiþ1i, with Varð1iÞ ¼s2xi; i¼1,: : :,n.It is assumed thatall xi’s are known, also in the nonresponse sample. The full data estimator of b is given byb^¼Pn

i¼1Yi=Pn

i¼1xi. The unbiased estimator ofs2is given bys^2¼Pn

i¼11

xiðyi2b^xiÞ2=ðn21Þ.

We shall consider residual regression imputation. Let b^r be the b^ - estimate based on observed sample sr. Define the standardized residuals ei¼ ðyi2b^rxiÞ=pffiffiffiffixi

, for i[sr. Fori[s2sr: draw the value ofe*i at random, with replacement, from the set of observed residuals ei;i[sr. The imputed y-value is given by yi*¼b^rxiþei* ffiffiffiffixi

p . Let X¼Pn

i¼1xi;Xr¼P

i[srxi and Xnr¼P

i[s2srxi¼X2Xr. All considerations from now on are conditional onnrandXr, and we aim to determinekdirectly from (5).

The proportion of thex-total in the nonresponse group is denoted asfX ¼Xnr=X.

We now have b^*¼ P

sryiþP

s2sryi*

=X and s^2*¼n211 P

sr 1

xiðyi2b^*xiÞ2 þP

s2sr1

xiðyi*2b^*xiÞ2 .

In order to determine k from (5) we need to check the validity of (3) and derive EVarðb^*jyobsÞ;VarEðb^*jyobsÞandVarðb^Þ. We note thatVarðb^Þ ¼s2=X. In Appendix A.1 it is shown that condition (3) holds for moderate and largenr, and that

VarEðb^*jyobsÞ ¼s2 Xr

þð12d1Þd2nnrXnr

X2 s2

nr

ð6Þ EVarðb^*jyobsÞ ¼Xnr

X2s2

nr ðnrþd122Þ ð7Þ

Here, 0#d1;d2#1. From (5), using (6) and (7), we find k¼nrX22nrXXrþ ð12d1Þd2nnrXnrXr

XrXnrðnrþd122Þ < X

Xrþ ð12d1Þd2

nnr

nr

(5)

We note that if all xi ¼1, then d1¼d2 ¼1. Now, with fX ¼Xnr=X being the proportion of thex-total in the nonresponse group andf ¼nnr=nthe rate of nonresponse, we finally get, since typicallyð12d1Þd2<0,

k< 1 12fX

þ ð12d1Þd2 f

12f < 1 12fX

for usualx-values and nonresponse rates.

3.3. Estimating the Regression Coefficient in Simple Linear Regression with Residual Imputation

As in Sections 3.1 and 3.2 the nonresponse mechanism is assumed to be MCAR with pr¼PðRi¼1Þ. The simple linear regression model is assumed: Yi¼aþbxiþ 1i;withVarð1iÞ ¼s2;i¼1; : : : ;n:Allxi’s are assumed to be known. We may assume, that x¼Pn

i¼1xi=n¼0. Then the full data estimates are given by b^¼Pn

i¼1xiyi=SSx, whereSSx¼Pn

i¼1xi2;anda^¼y¼Pn

i¼1yi=n:The unbiased estimator ofs2is given by s^2¼n221 Pn

i¼1ðyi2a^2b^xiÞ2. Leta^r;b^rbe the estimates based on the response sample, a^r¼yr2b^rxrandb^r¼P

i[srðxi2xrÞyi=SSx;r. Here,yr¼P

i[sryi=nr,xr ¼P

i[srxi=nr

andSSx;r¼P

i[srðxi2xrÞ2.

Simple residual imputation is defined as follows: The observed residuals are ej¼ ðyj2a^r2b^rxjÞ;forj[sr. Fori[s2sr: drawei*at random, with replacement from ðej;j[sr). The imputedy-value is given byyi*¼a^rþb^rxiþei*:

The imputation based estimates are b^* ¼ P

i[srxiyiþP

i[s2srxiyi*

=SSx, a^*¼ ðnryrþ ðn2nrÞynr*Þ=n where ynr* ¼P

s2sryi*=ðn2nrÞ and s^2* ¼n221 P

srðyi2 n

a^*2b^*xiÞ2þP

s2srðyi*2a^*2b^*xiÞ2o

: It can be shown (see Appendix A.2 for a summary proof) that Eðs^2*Þ ¼s2nrn22

r n22fn22Þ<s2 where, as in Section 3.1, f ¼ ðn2nrÞ=n. Since Varðb^Þ ¼s2=SSx, (3) holds. It is readily seen that Eðb^*jyobsÞ ¼ b^r and Varðb^*jyobsÞ ¼se2cr=SSx, where se2¼P

srei2=nr and cr ¼P

s2srxi2=SSx

[k0;1l. It can be shown thatEðse2jsrÞ ¼nrn22

r s2. Moreover, clearlyEðcrÞ ¼12pr and Varðb^rjsrÞ ¼s2=SSx;r:It follows, from (5), that

EðkÞ ¼ Eð1=SSx;rÞ21=SSx ð12prÞE{ðnr22Þ=nr}=SSx

<1=EðSSx;rÞ21=SSx ð12prÞ=SSx

Using the fact that conditional on nr, sr is a simple random sample such that the response indicators are correlated with Cov(Ri,Rj)¼ 2f(12f)/(n21), we find that EðSSx;rÞ ¼ ðpr212pn21rÞSSx. It follows that, approximately,E(k)¼p1

r21n<1=prand we can usek¼1/(12f).

4. Multiple Imputation for Stratified Samples

4.1. Separate Combinations

One way to combine themcompleted data sets is to do it separately for each stratum, i.e., determine a separate k for each stratum. The general setup is then as follows:

(6)

The sample s is divided into H sample strata, s1,: : :, sH. Let yh be the planned full data from subsample shof size nh. It is assumed that y1,: : :,yH are independent. The observed part of yhis denoted byyh,obs withshrbeing the response sample from shof size nhr. The estimator based on the full sample data is the sum of independent terms, u^¼PH

h¼1u^h where u^h is based on the yh. Varðu^Þ ¼PH

h¼1Varðu^hÞ is estimated by Vð^ u^Þ ¼PH

h¼1V^hðyhÞ where V^hðyhÞ is the variance estimate of u^h based on yh. For i[sh2shr we impute by some method yi* based on yh,obs and let yh* denote the complete data ðyh;obs;yi*;i[sh2shrÞ. Based on yh*, we have u^h*¼u^hðyh*Þ and V^h*¼ V^hðyh*Þ: Then the imputation based estimator is given by u^*¼PH

h¼1u^h* and V^*¼PH

h¼1V^h*. Multiple imputation of m repeated imputations leads to m completed data sets with mestimates for each stratum h,u^h;i;i¼1; : : : ;m and related variance estimates V^h;i*;i¼1; : : : ;m: The total estimates and related variances are u^i*¼ PH

h¼1u^h;i* andV^i*¼PH

h¼1V^h;i*; fori ¼ 1,: : :,m. The combined estimate for stratumh is given by uh*¼Pm

i¼1u^h;i*=m. The within-imputation variance for stratum h is Vh*¼ Pm

i¼1V^h;i*=m and the between-imputation component is given by Bh*¼Pm

i¼1ðu^h;i* 2uh*Þ2=ðm21Þ. Following the same idea as in Section 2, Formula (1), the total estimated variance of uh* is then proposed to be Wh¼Vh*þ ðkhþm1ÞBh*. The combined total estimate is given byu* ¼Pm

i¼1u^i*=m¼PH

h¼1uh*. It follows that the total estimated variance of u* can be expressed as

Wsep¼XH

h¼1

Wh¼V*þXH

h¼1

khþ1 m

Bh* ð8Þ

where V*¼Pm

i¼1V^i*=m¼PH

h¼1Vh*. Provided (3) holds for each stratum h,

EðVh*Þ<Varðu^hÞ ð9Þ

we have from (5) that khmust satisfy EðkhÞ ¼VarEðu^h*jYh;obsÞ2Varðu^hÞ

EVarðu^*hjYh;obsÞ ð10Þ

The combination Formula (8) is an alternative to the usual combination Formula (1), especially useful when we get simple expressions forkhbut not fork.The next section develops an expression for k in this situation.

4.2. An Overall Combination Formula

Now letW be given by (1). We shall determine the between-imputation factork. Since EðWÞ ¼EðWsepÞwe have

E XH

h¼1

khþ1 m

Bh*

( )

¼E kþ1 m

B* ð11Þ

Here, B*¼m211 Pm

i¼1ðu^i*2u*Þ2¼m211 Pm i¼1

P

hðu^h;i* 2uh*Þ n o2

. Note that EðB*jyobsÞ ¼ E PH

h¼1Bh*jyobs

, sinceEðB*jyobsÞ ¼Varðu^*jyobsÞ ¼PH

h¼1Varðu^h*jyobsÞandEðBh*jyobsÞ ¼ Varðu^h*jyobsÞ.

(7)

Hence, the identity (11) becomes E{PH

h¼1khEðBh*jYobsÞ}¼E{kEðB*jYobsÞ}: This gives us the solution k¼PH

h¼1khEðBh*jyobsÞ=EðB*jyobsÞ if we want to use the usual combination Formula (1) and hence

k¼ XH

h¼1

khVarðu^h*jyobsÞ Varðu^*jyobsÞ ¼XH

h¼1

khVarðu^h*jyobsÞ

Varðu^*jyobsÞ ð12Þ

a weighted average ofkh. We get a simple expression forkonly when allkhare equal, saykh¼k0. Thenk¼k0.

5. Four Applications to Stratified Samples and Random Nonresponse within Strata 5.1. Estimating Population Average from Stratified Sample with Stratified Hot-deck

Imputation

Consider stratified simple random samples from a finite population of sizeN, withHstrata of sizes Nh, h¼1,: : :,H. The aim is to estimate the population average m of some variabley. We assume completely random nonresponse within each stratum, denoted as MAR (missing at random) by Rubin (1987) and Little and Rubin (2002). This means that the response indicators in stratum h, Rh;1; : : : ;Rh;Nh are independent with phr¼PðRh;i¼1Þ. The imputation method is stratified hot-deck. Letyh,obsbe the observed part from the response sampleshrof sizenhrfrom stratumh,yh;obs¼ ðyi:i[shrÞ. Then an imputed valueyi* in stratum h is drawn at random fromyh,obs. The estimator based on the full sample data is the usual stratified weighted average Ystrat¼PH

h¼1Nhyh=N¼PH

h¼1vhyh. Here, vh¼Nh=N and yh¼P

shyi=nh, where sh is the sample from stratum h and nh¼ jshj. Then VarðYstratÞ ¼PH

h¼1vh2sh2ðn1

h2N1

hÞ, with sh2 ¼

i[Uh

Pðyi2mhÞ2=ðNh21Þ being the population variance in stratum h. Here Uh is stratum populationhandmhis the average inUh.

Letyhrbe the observed sample mean from stratumhands^hr2 ¼n1

hr21

P

i[shrðyi2yhrÞ2 the observed sample variance. The imputation-based estimator is given by Ystrat* ¼ PH

h¼1Nhyh*=N where yh*¼ P

shryiþP

sh2shryi*

=nh¼ nhryhrþP

sh2shryi*

=nh. Let themimputation replicates ofYstrat* be denoted byYstrat;i* fori¼1, : : :,m. The combined estimator is given byY*

strat¼Pm

i¼1Ystrat;i* =m:

5.1.1. Separate Strata Combinations

It follows from Section 3.1 thatkh¼1=ð12fhÞ, wherefh¼ ðnh2nhrÞ=nh is the rate of nonresponse in stratum h. The combination formula for the variance estimate of Ystrat* becomes, from (8),

Wsep¼V*þXH

h¼1

1 12fh

þ1 m

Bh* Here,V*¼PH

h¼1Vh*andVh*is the average of themvalues of the imputation-based variance estimateV^h* ¼vh2s^h*2 n1

h2N1

h

wheres^h*2 ¼n1

h21

P

shrðyi2yh*Þ2þP

sh2shrðyi*2yh*Þ2

.

(8)

5.1.2. Overall Combination Formula. Determination ofkin (1)

From (12) we need to determine VarðvhYh*jyo b sÞ and VarðYs t r a t* jyo b sÞ ¼ PH

h¼1Var ðvhYh*jyo b sÞ. Then k¼XH

h¼1

1

12fhVarðvhYh*jyobsÞ VarðYstrat* jyobsÞ

Now, EðYh*jyh;obsÞ ¼yhr and VarðYh*jyh;obsÞ ¼{ðnh2nhrÞ=nh2}{ðnhr21Þ=nhr}s^hr2 <

fhs^hr2=nh. Hence we can determine k as k¼XH

h¼1

1 12fh

fhvh2s^hr2=nh XH

k¼1

fkvk2s^kr2=nh

If the stratum sizes Nh are large then we can let Vðv^ hYhÞ ¼vh2s^hr2=nh. Let also bh ¼fhVðv^ hYhÞ=PH

k¼1fkVðv^ kYkÞ. Then

k¼ XH

h¼1

Vðv^ hYhÞfh

1 12fh XH

h¼1

Vðv^ hYhÞfh

¼XH

h¼1

bh 1 12fh

ð13Þ

Since PH

h¼1bh ¼1, we see that kis a weighted average of the inverse of the response rates. If all fh¼f, the overall nonresponse rate, we get as for simple random sample thatk¼1/(12f). Otherwise, a stratum response rate 12fhhas large weight if either the nonresponse rate is large and/or the estimated variance of vhYh is large.

5.1.3. An Alternative Expression forkin (1)

By directly applying (5) we can get an alternative expression fork.Givenyobs, the imputed sample means Yh* are independent, which implies that EðYstrat* jyobsÞ ¼PH

h¼1Nhyhr=N¼

ystrat;r andVarðYstrat* jyobsÞ<PH

h¼1vh2fhs^hr2=nh:Just like in Section 3.1, (3) holds. From (5) we get

EðkÞ<VarðYstrat;rÞ2VarðYstratÞ E

h

Xvh2fhs^hr2=nh

!

¼ XH

h¼1

vh2sh2 E 1 nhr

2 1 Nh

2XH

h¼1

vh2sh2 1 nh2 1

Nh

XH

h¼1

vh2E fh nh

Eðs^hr2jnhrÞ

<

XH

h¼1

vh2sh212phr nh

1 phr

XH

h¼1

vh2sh212phr nh

¼ XH

h¼1

vh2sh2 nhr

EðfhÞ 12fh

Eð12fhÞ XH

h¼1

vh2sh2

nhrEðfhÞð12fhÞ

ð14Þ

(9)

Now,VarðYhrÞ ¼EVarðYhrjnhrÞ ¼sh2Eð1=nhrÞ. Let Vðv^ hYhrÞ ¼vh2s^hr2=nhr. Then we see that the expression forE(k) is satisfied approximately, if the stratum sizesNhare large, by letting

1 k¼

XH

h21

ð12fhÞfhVðv^ hYhrÞ XH

h21

fhVðv^ hYhrÞ

¼XH

h21

ahð12fhÞ ð15Þ

where the weights ah¼fhVðv^ hYhrÞ=PH

k¼1fkVðv^ kYkrÞ. Since PH

h¼1ah¼1, we see that 1/k is a weighted average of the response rates. If all fh ¼f, the overall nonresponse rate, we have, as shown in Section 5.1.2, that k ¼ 1/(12f). As seen in Section 5.1.2, we note also in Expression (15) that a stratum response rate 12fh has large weight if either the nonresponse rate is large and/or the estimated variance of vhYhr is large. The estimate of the total based on the response sample is given by Ystrat;r¼P

hvhYhr: We obtain Formula (13) for k by noting from (14) that we have EðkÞ<PH

h¼1 VarðvhYhÞEðfhÞEð12f1

hÞ=PH

h¼1VarðvhYhÞEðfhÞ. Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes Nh are large, by letting k be given by (13).

5.2. Logistic Regression with Binary Explanatory Variable. Estimating Log(Odds Ratio) The variables Y1; : : : ;Yn are independent 0/1 -variables, and we have explanatory 0/1-variable x with fixed known values x1; : : : ;xn. The class probabilities are given by p1¼PðYi¼1jxi¼1Þ and p0¼PðYi¼1jxi¼0Þ. We assume a MAR(missing at random) model for the response variables R1; : : : ;Rn, with PðRi¼1jxi¼1Þ ¼p1r and PðRi¼1jxi¼0Þ ¼p0r. We can reparametrize the model in a logit version, log {PðY ¼1jxÞ=PðY¼0jxÞ}¼aþbx, where a¼ log {p0=ð12p0Þ} and b¼ logpp1=ð12p1Þ

0=ð12p0Þ¼ log(odds ratio). The aim is to estimateb:Lets¼(1,: : :,n) denote the

full sample with stratas1¼{i[s:xi¼1} ands0¼{i[s:xi¼0}. The sizes ofs1and s0are denoted byn1andn0. We note thatn1¼Pn

i¼1xi¼Xandn0 ¼n –X.The response samples in the strata ares1r¼{i[s1:Ri¼1} and s0r¼{i[s0:Ri¼1} with total response sample being sr of size nr. Let also n1r ¼ js1rj and n0r¼ js0rj. We see that n1r¼P

srxi¼Xrandn0r ¼nr2Xr. The data fromsrcan be represented as follows where nijrdenotes the number of observations withx ¼ iandy ¼ j: see (Table 1).

We then have the maximum likelihood estimates (MLE)p^1r¼n11r=n1r and p^0r¼ n01r=n0rand MLE ofbequalsb^r¼ logpp^^1r=ð12p^1rÞ

0r=ð12p^0rÞ¼ logðn11rn00r=n10rn01rÞ. Similarly, the

Table 1. The observed data and nonresponse totals for the two classes

x\y y¼0 y¼1 Totals Nonresponse

x¼0 n00r n01r n0r n02n0r x¼1 n10r n11r n1r n12n1r

(10)

estimator based on the full sample is given byb^¼ logpp^^1=ð12p^1Þ

0=ð12p^0Þ¼ logðn11n00=n10n01Þwith obvious analogue notation. We can express this estimate as b^¼ log {p^1=ð12p^1Þ}2 log {p^0=ð12p^0Þ}¼b^12b^0, of the same form as in Section 4.1.

We also have thatb^1andb^0are independent based on the separate sample stratas1ands0. For large n0, n1, b^ is approximately Nðb;s2^

bÞ where s2^

b ¼{n1p1ð12p1Þ}21þ {n0p0ð12p0Þ}21. So, approximately, Varðb^1Þ ¼1={n1p1ð12p1Þ} and Varðb^0Þ ¼ 1={n0p0ð12p0Þ} and an estimate ofVarðb^Þis given by

Vð^ b^Þ ¼ 1

n1p^1ð12p^1Þþ 1

n0p^0ð12p^0Þ¼ 1 n11þ 1

n10

þ 1

n01þ 1 n00

such thatVð^ b^Þ ¼V^1þV^0, whereV^1¼ n1

11þn1

10

and V^0 ¼ n1

01þn1

00

are the variance estimates ofb^1 and b^0, respectively.

We shall consider the following imputation method: For each missing value in s1 – s1r, the imputed value y* is drawn at random from the estimated distribution of Y given x¼1:

y* ¼1 with probability p^1r ¼n11r=n1r and y* ¼0 with probability 12p^1r: The same imputation method is used for s0 – s0r with y* drawn at random from the estimated distribution of Y given x¼0. This is the same as stratified hot-deck imputation, imputed values are drawn at random, with replacement, from y1;obs¼ ðyi: i[s1rÞ and y0;obs ¼ ðyi:i[s0rÞ.

The imputed values ins– srcan be represented in the same form as the original data where nownij*denotes the number of imputed values withx ¼ iandy ¼ j: see (Table 2).

The imputation-based estimate of p1 is given by p^1*¼ ðn11rþn11*Þ=n1 such that the imputation-based estimate b^1*¼ log {p^1*=ð12p^1*Þ}¼ log {ðn11rþn11*Þ=

ðn12n11r2n11*Þ}. Similarly, the imputation-based estimates forb0 andbare given by b^0*¼ log {ðn01rþn01*Þ=ðn02n01r2n01*Þ} andb^*¼b^1*2b^0*.

Themrepeated imputations lead tomestimatesb^1;i*;b^0;i*;b^i*, for i ¼ 1,: : :,m.The combined estimate is given by b*¼Pm

i¼1b^i*=m¼Pm

i¼1b^1;i*=m2Pm

i¼1b^0;i*=m¼ b1*2b0*. The imputed variance estimateV^*forb^is given by

V^* ¼ 1

n11rþn11* þ 1

n10rþn10* þ 1

n01rþn01* þ 1

n00rþn00* ð16Þ

We see thatEðV^*jyobsÞ<n 1

1p^1rð12p^1rÞþn 1

0p^0rð12p^0rÞand (3) hold. We also note that (9) holds separately for each class.

Table 2. The imputed totals for the two classes

x\y y¼0 y¼1 Totals

x¼0 n*00 n*01 n02n0r

x¼1 n*10 n*11 n12n1r

(11)

5.2.1. Separate Classes Combination

Let us first use the approach in Section 4.1 and determine separatek1, k0for the two classes. Consider first stratums1¼{i[s:xi¼1}. In Appendix A.3 it is shown that Eðb^1*jy1;obsÞ<b^1r and Varðb^1*jy1;obsÞ<f1ð12f1ÞVð^ b^1rÞ. From (10), we find approxi- mately:

Eðk1Þ ¼ Varðb^1rÞ2Varðb^1Þ

E{f1ð12f1ÞVð^ b^1rÞ}¼ E Varðb^1rjn1rÞ2Varðb^1Þ E{f1ð12f1ÞE½Vð^ b^1rÞjn1r}

<

1

p1ð12p1Þ E 1 n1r

2 1 n1

E f1ð12f1Þ 1 n1rp1ð12p1Þ

<ð12p1rÞ=p1r 12p1r

¼ 1 p1r

which is satisfied approximately by lettingk1¼1=ð12f1Þ. In exactly the same way, we find thatk0 ¼1=ð12f0Þwheref0 ¼ ðn02n0rÞ=n0is the rate of nonresponse in stratum s0. The between-imputation component forb^1*is given byB1*¼m211 Pm

i¼1ðb^1;i* 2b1*Þ2and likewiseB0*is the between-imputation component forb^0*. Then an estimated variance of the combined imputation-based estimateb* forbis given by, from (8),

Wsep¼V*þX1 x¼0

1 12fxþ1

m

Bx*

whereV*is the average ofmreplicates of the imputed variance estimateV^*given by (16).

5.2.2. Overall Combination Formula. Determination ofkin (1)

SinceVarðb^1*jy1;obsÞ ¼f1ð12f1ÞVð^ b^1rÞandVarðb^0*jy0;obsÞ ¼f0ð12f0ÞVð^ b^0rÞ, we have from (12)

k¼ 1 12f1

f1ð12f1ÞVð^ b^1rÞ X1

x¼0

fxð12fxÞVð^ b^xrÞ

þ 1

12f0

f0ð12f0ÞVð^ b^0rÞ X1

x¼0

fxð12fxÞVð^ b^xrÞ

ð17Þ

Varðb^1Þ < ðn1r=n1ÞVarðb^1r jn1rÞ ¼ ð12f1ÞVarðb^1rjn1rÞ. Similarly, Varðb^0Þ<

ð12f0ÞVarðb^0rjn0rÞ. We can therefore estimate the variance of the full sample estimates b^1andb^0byVð^ b^1Þ ¼ ð12f1ÞVð^ b^1rÞandVð^ b^0Þ ¼ ð12f0ÞVð^ b^0rÞ, respectively. Then

k¼ 1

12f1 f1Vð^ b^1Þ X1

x¼0

fxVð^ b^xÞ

þ 1

12f0 f0Vð^ b^0Þ X1

x¼0

fxVð^ b^xrÞ

¼ 1

12f1b1þ 1

12f0ð12b1Þ

Just like in Section 5.1.2 we see thatkis a weighted average of the inverse of the response rates. If allfh ¼f,the overall nonresponse rate, we get thatk ¼ 1/(12f). Otherwise, a stratum response rate 1 –fxhas large weight if either the nonresponse rate is large and/or the estimated variance ofb^xis large.

(12)

Alternatively, from (17), 1=k¼P1

x¼0ð12fxÞfxVð^ b^xrÞ=P1

x¼0fxVð^ b^xrÞ ¼P1 x¼0

axð12fxÞ, where the weights are ax¼fxVð^ b^xrÞ={f1Vð^ b^1rÞ þf0Vð^ b^0rÞ}. So we can alternatively express 1/kas a weighted average of the response rates.

If the aim is to estimate p1 andp0 we obtain, of course, k¼1/(12f1) for p1 and k¼1=(12f0) forp0.

5.3. Logistic Regression with Categorical Explanatory Variable. Estimating Log(Odds Ratios)

If the explanatoryxis categorical defining, say,Hclasses, we can generalize the results as follows:

Let ph¼PðY¼1jx¼hÞ, h¼0,: : :,H – 1. Logistic regression defining the categories is done by introducing H21 binary explanatory variables x1, : : :, xH-1 wherexh ¼1 if observation belongs to Class h, and 0 otherwise forh¼1, : : :,H – 1. Then an observation belongs to Class 0 if x1¼x2¼: : :¼xH21¼0. The logit version of the model becomes, with x¼ ðx1;x2; : : : ;xH21Þ: log {PðY¼1jxÞ=

PðY ¼0jxÞ}¼aþb1x1þb2x2þ: : :þxH21bH21. We see that a¼ log12pp0

0 and

bh¼ logpph=ð12phÞ

0=ð12p0Þ¼ log (odds ratio) for Class h versus Class 0. Estimating bh by

multiple imputation is done in exactly the same manner as for binary x, with Class h replacing Class 1.

5.4. Logistic Regression with Missing Values in a Binary Explanatory Variable The situation is as in Section 5.2, except thatyis fully observed ins,y¼ ðy1; : : : ;ynÞ, and we have missing values for thex-variable.Y1; : : : ;Ynare independent 0/1-variables and we have an explanatory 0/1-variablexwith fixed valuesx1; : : : ;xn, some of which are missing. The response variables indicate missingness of thexi’s but now with MAR model PðRi¼1jyi¼1Þ ¼q1r andPðRi¼1jyi¼0Þ ¼q0r.

Otherwise, the model is the same as in Section 5.2 with class prob- abilities: p1¼PðYi¼1jxi¼1Þ and p0¼PðYi¼1jxi¼0Þ, and the logit version

log {PðY¼1jxÞ= PðY¼0jxÞ}¼aþbx with b¼ log pp1=ð12p1Þ

0=ð12p0Þ. The aim is still to

estimate b.

Let nows1 ¼ {i[s:yi¼1} ands0 ¼{i[s:yi¼0} with sizes n+1 andn+0. The response samples in the strata ares1r ¼{i[s1:Ri¼1} ands0r ¼{i[s0 :Ri¼1} with total response sample being sr¼{i[s:Ri¼1}¼s1r <s0r. The data can now be represented as before, except that nonresponse totals are for eachy-stratum. See Table 3.

The MLE p^1r;p^0r;b^r, based on sr are the same as before, as is the full sample estimate b^. The imputation method is stratified hot-deck for the y-strata. For each

Table 3. The observed data and nonresponse totals for the y-strata

x\y y¼0 y¼1

x¼0 n00r n01r

x¼1 n10r n11r

Totals n+0r n+1r

Nonresponse n+02n+0r n+12n+1r

(13)

missing value of x in s12s1r, the imputed value x* is drawn at random from x1;obs ¼ ðxi:i[s1rÞ. Similarly, imputed values in s02s0r are drawn at random from x0;obs ¼ ðxi:i[s0rÞ. The imputed values in s – sr can be represented in the same form as the original data where now nij* denotes the number of imputed values with x ¼ i and y ¼ j. See Table 4.

We need to find an approximate expression for the expectation and variance ofb^*, now denoted b^*, conditional on the observed data. We defer to Appendix A.4 to show that Varðb^*jy;xobsÞ<f1ð12f1Þðn1

11rþn1

01rÞ þf0ð12f0Þðn1

10rþn1

00rÞandEðb^*jy;xobsÞ<b^r. Heref1¼ ðn+12n+1rÞ=n+1 is the nonresponse rate in Stratums1andf0¼ ðn+02n+0rÞ=n+0 the nonresponse rate ins0. We note thatq^1r ¼n+1r=n+1¼12f1andq^0r¼n+0r=n+0. So the denominator in (5) becomes

E f112f1 1 n11r

þ 1 n01r

þf012f0 1 n10r

þ 1 n00r

ð18Þ

The numerator in (5) equals, as before,Varðb^rÞ2Varðb^Þ, and we have approximately Varðb^rÞ2Varðb^Þ ¼ 1

n1p1ð12p1Þ12p1r p1r

þ 1

n0p0ð12p0Þ12p0r p0r

ð19Þ where, as before,p1r¼PðRi¼1jxi¼1Þandp0r ¼PðRi¼1jxi¼0Þ. We need alternative estimates of p1r and p0r. Since p1r¼p1q1rþ ð12p1Þq0r; we have

^

p1r¼p^1ð12f1Þ þ ð12p^1Þð12f0Þ. Similarly,p^0r ¼p^0ð12f1Þ þ ð12p^0Þð12f0Þ.

We can also use thatn1p^1r <n1randn0p^0r<n0r. From (18) and (19) it follows that we can use

k¼ 1 n11rþ 1

n10r

p^1rf1þð12p^1rÞf0

þ 1

n01rþ 1 n00r

p^0rf1þð12p^0rÞf0

f112f1 1 n11rþ 1

n01r

þf012f0 1 n01rþ 1

n00r

¼

f1 1 n10r

þ 1 n00r

þf0 1 n11r

þ 1 n01r

f112f1 1 n11r

þ 1 n01r

þf012f0 1 n01r

þ 1 n00r

We note that iff1 ¼ f0¼f,thenk ¼ 1/(1 – f). Otherwise, we can express 1/kas a linear combination of the response rates (1 –f1, 1 –f0). Let w1¼n1

11rþn1

01r and

w0 ¼n1

10rþn1

00r. Then

Table 4. The imputed totals for the y-strata

x\y y¼0 y¼1

x¼0 n*00 n*01

x¼1 n*10 n*11

Totals n+02n+0r n+12n+1r

Referanser

RELATERTE DOKUMENTER

In conclusion, random regression models using Legendre polynomials functions of varying order for the fixed mean curve and for the random animal effects were

Using Australia and Norway as representative case studies, we take the theory to the data by developing and estimating a Bayesian Dynamic Factor model, that includes

Using Australia and Norway as representative case studies, we take the theory to the data by developing and estimating a Bayesian Dynamic Factor model, that includes separate

We pro- pose a calibrated imputation approach so that valid point and variance estimates of the population (or domain) totals can be computed by the secondary users using

Two of the approximate methods are based on the hazardous distance found for single charges, whereas one approximation is based on transforming the true hazardous area (zone) into

Furthermore, using the extended temporal coverage nighttime light data, three simple regression models (the linear model, quadratic polynomial model and power function

Kernel density plots of observed and predicted N indicators using simple regression (SR) models based on the best-performing modified Multiplex indicators, multiple

Using ordinary multiple imputation it is quite straightforward to impute the missing values and to use for instance LISREL on the imputed datasets analysing the appropriate SEM