Non-Bayesian multiple imputation


Discussion Papers No. 421, May 2005 Statistics Norway, Research Department

Jan F. Bjørnstad

Non-Bayesian Multiple Imputation

Abstract:

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data. Rubin’s combination formula requires that the imputation method is “proper” which essentially means that the imputations are random draws from a posterior distribution in a Bayesian framework. In national statistical institutes (NSI’s) like Statistics Norway, the methods used for imputing for nonresponse are typically non-Bayesian, e.g., some kind of stratified hot-deck. Hence, Rubin’s method of multiple imputation is not valid and cannot be applied in NSI’s. This paper deals with the problem of deriving an alternative combination formula that can be applied for imputation methods typically used in NSI’s and suggests an approach for studying this problem. Alternative combination formulas are derived for certain response mechanisms and hot-deck type imputation methods.

Keywords: Multiple imputation, survey sampling, nonresponse, hot-deck imputation

JEL classification: C42, C13, C15

Address: Jan F. Bjørnstad, Statistics Norway, Division for Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo, Norway. E-mail: [email protected]



1. Introduction

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data, developed by Rubin (1987). The basic idea is to create m imputed values for each missing value and to combine the m completed data sets by Rubin’s combination formula for variance estimation. For the estimator to be valid, the imputations must display an appropriate level of variability. In Rubin’s terminology, the imputation method is required to be “proper”. In national statistical institutes (NSI’s) the methods used for imputing for nonresponse very seldom, if ever, satisfy the requirement of being “proper”. However, the idea of creating multiple imputations to measure the imputation uncertainty, and of using it for variance estimation and for computing confidence intervals, is still of interest. The problem is that Rubin’s combination formula is no longer valid with the usual nonproper imputations used by NSI’s: the variability in nonproper imputations is too small, so the between-imputation component must be given a larger weight in the variance estimate. The task is then to determine what this weight should be to give valid statistical inference, and also for what kinds of nonresponse mechanisms and estimation problems it is possible to determine a simple combination formula that does not depend on unknown parameters. This paper suggests an approach for studying this problem.

In Section 2 an approach for determining the combination of the imputed completed data sets is suggested. Section 3 has two applications with random nonresponse, (i) estimating a population average from simple random samples using hot-deck imputation and (ii) estimating a regression coefficient using residual regression imputation. Section 4 deals with the general problem of multiple imputation for stratified samples. In Section 5 we apply the theory in Section 4 to stratified samples with random nonresponse within strata, covering (i) estimation of population average using stratified hot-deck imputation and (ii) estimation of log(odds ratios) in logistic regression with missingness both for the dependent variable and the explanatory variable. Section 6 takes up the problem of using the same combination rule for all estimation problems with a given imputation method and data and response model.


2. An approach for determining an alternative combination formula for variance estimation in multiple imputation

Let $s = (1, \ldots, n)$ denote the full sample, with $y = (y_1, \ldots, y_n)$ denoting the full sample data, values of the random variables $Y_1, \ldots, Y_n$. The objective is to estimate some parameter $\theta$. Now, let $y_{obs}$ be the observed part of $y$, with $s_r$ being the response sample of size $n_r$,

$$y_{obs} = (y_i : i \in s_r).$$

Let $\hat\theta$ be the estimator based on the full sample data $y$, with $\mathrm{Var}(\hat\theta)$ estimated by $\hat V(y)$. For $i \in s - s_r$ we impute by some method $y_i^*$ and let $y^*$ denote the complete data $(y_{obs}, y_i^*, i \in s - s_r)$. Based on $y^*$, we have $\hat\theta^* = \hat\theta(y^*)$ and $\hat V^* = \hat V(y^*)$.

Multiple imputation of m repeated imputations leads to m completed data sets with m estimates $\hat\theta_i^*$, $i = 1, \ldots, m$, and related variance estimates $\hat V_i^*$, $i = 1, \ldots, m$. The combined estimate is given by

$$\bar\theta^* = \sum_{i=1}^m \hat\theta_i^* / m.$$

The within-imputation variance is defined as

$$\bar V^* = \sum_{i=1}^m \hat V_i^* / m$$

and the between-imputation component is

$$B^* = \frac{1}{m-1} \sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2.$$

The total estimated variance of $\bar\theta^*$ is then proposed to be

$$W = \bar V^* + \left(k + \frac{1}{m}\right) B^*. \tag{1}$$

That is, we need to determine k such that

$$E(W) = \mathrm{Var}(\bar\theta^*). \tag{2}$$

Rubin (1987) has shown that k = 1 can be used with proper imputations, which essentially means drawing imputed values from a posterior distribution in a Bayesian framework.
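The combination rule (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine` and its argument names are hypothetical, and the factor k has to be supplied by the user (k = 1 reproduces Rubin's rule for proper imputations).

```python
from statistics import mean, variance


def combine(estimates, variances, k=1.0):
    """Combine m completed-data results with W = V_bar + (k + 1/m) * B.

    estimates: the m point estimates, one per completed data set
    variances: the m completed-data variance estimates
    k:         weight on the between-imputation component (assumed given)
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")
    theta_bar = mean(estimates)   # combined estimate
    v_bar = mean(variances)       # within-imputation variance
    b = variance(estimates)       # between-imputation variance, divisor m - 1
    w = v_bar + (k + 1.0 / m) * b
    return theta_bar, w
```

Increasing k by one unit adds exactly one more B to the total variance estimate, which is how a nonproper imputation method is compensated for its lack of variability.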

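In general, one has to determine the terms in (2). One way to try and do this is to use double expectation, conditioning on $y_{obs}$, that is,

$$\mathrm{Var}(\bar\theta^*) = E\{\mathrm{Var}(\bar\theta^* \mid Y_{obs})\} + \mathrm{Var}\{E(\bar\theta^* \mid Y_{obs})\}.$$

Typically,

$$E(\bar V^*) \approx \mathrm{Var}(\hat\theta) \tag{3}$$

and $E(B^* \mid y_{obs}) = \mathrm{Var}(\hat\theta^* \mid y_{obs})$. Hence, approximately

$$E(W) = \mathrm{Var}(\hat\theta) + \left(E(k) + \frac{1}{m}\right) E\,\mathrm{Var}(\hat\theta^* \mid Y_{obs}). \tag{4}$$

Moreover, $\mathrm{Var}(\bar\theta^* \mid y_{obs}) = \mathrm{Var}(\hat\theta^* \mid y_{obs})/m$ and $E(\bar\theta^* \mid y_{obs}) = E(\hat\theta^* \mid y_{obs})$.

This implies that

$$\mathrm{Var}(\bar\theta^*) = \frac{1}{m} E\{\mathrm{Var}(\hat\theta^* \mid Y_{obs})\} + \mathrm{Var}\{E(\hat\theta^* \mid Y_{obs})\}.$$

From (3) and (4), the equation (2) becomes

$$\mathrm{Var}(\hat\theta) + E(k)\, E\,\mathrm{Var}(\hat\theta^* \mid Y_{obs}) = \mathrm{Var}\{E(\hat\theta^* \mid Y_{obs})\},$$

which gives the following general expression for E(k):

$$E(k) = \frac{\mathrm{Var}\,E(\hat\theta^* \mid Y_{obs}) - \mathrm{Var}(\hat\theta)}{E\,\mathrm{Var}(\hat\theta^* \mid Y_{obs})}. \tag{5}$$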

For this to be of interest, k must be, at least approximately, determined independent of unknown parameters. In addition, one needs to check that (3) holds.

To illustrate how (5) can be used we shall in the next section consider two special cases with random nonresponse.


3. Two applications to simple random samples and random nonresponse

3.1. Estimating population average with hot-deck imputation

Consider a simple random sample from a finite population of size N, where the aim is to estimate the population average $\mu$ of some variable y. We shall assume completely random nonresponse, in Rubin’s terminology MCAR (missing completely at random). We note that MCAR means that the response indicators $R_1, \ldots, R_N$ are independent with the same response probability $p_r = P(R_i = 1)$. The imputation method is the hot-deck method, where $y_i^*$ is drawn at random from $y_{obs}$, and the estimate is the sample mean. Let $\bar y_r$ be the observed sample mean and

$$\hat\sigma_r^2 = \frac{1}{n_r - 1} \sum_{i \in s_r} (y_i - \bar y_r)^2$$

the observed sample variance. Then $\bar Y^*$ is the imputation-based sample mean for the completed sample, and the combined estimator is given by

$$\bar{\bar Y}^* = \sum_{i=1}^m \bar Y_i^* / m.$$

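Let $\bar Y_s$ denote the sample mean based on a full sample. Then

$$\mathrm{Var}(\bar Y_s) = \sigma^2 \left(\frac{1}{n} - \frac{1}{N}\right), \quad \text{with} \quad \sigma^2 = \frac{1}{N-1} \sum_{i=1}^N (y_i - \mu)^2$$

being the population variance. We have further that

$$E(\bar Y^* \mid y_{obs}) = \bar y_r \quad \text{and} \quad \mathrm{Var}(\bar Y^* \mid y_{obs}) = \frac{n - n_r}{n^2} \cdot \frac{n_r - 1}{n_r}\, \hat\sigma_r^2,$$

using that

$$E(Y_i^* \mid y_{obs}) = \bar y_r \quad \text{and} \quad \mathrm{Var}(Y_i^* \mid y_{obs}) = \frac{n_r - 1}{n_r}\, \hat\sigma_r^2.$$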

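In this case,

$$\hat V^* = \hat\sigma_*^2 \left(\frac{1}{n} - \frac{1}{N}\right), \quad \text{where} \quad \hat\sigma_*^2 = \frac{1}{n-1} \left( \sum_{i \in s_r} (y_i - \bar y^*)^2 + \sum_{i \in s - s_r} (y_i^* - \bar y^*)^2 \right).$$

It can be shown that

$$E(\hat\sigma_*^2 \mid y_{obs}) = \hat\sigma_r^2 \left(1 - \frac{1}{n_r}\right) \left(1 + \frac{n_r}{n(n-1)}\right) \approx \hat\sigma_r^2$$

and (3) holds. We find, from (5),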

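$$E(k) = \frac{\mathrm{Var}(\bar Y_r) - \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)}{E\left\{\frac{n - n_r}{n^2} \cdot \frac{n_r - 1}{n_r}\, \hat\sigma_r^2\right\}} = \frac{\sigma^2 E\left(\frac{1}{n_r} - \frac{1}{N}\right) - \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)}{\sigma^2 E\left(\frac{n - n_r}{n^2} \cdot \frac{n_r - 1}{n_r}\right)} \approx \frac{1/p_r - 1}{1 - p_r} = \frac{1}{p_r},$$

which is satisfied approximately by letting

$$k = \frac{1}{1 - f},$$

where $f = (n - n_r)/n$ is the rate of nonresponse.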
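A minimal sketch of this subsection's procedure, using only the Python standard library. It assumes that hot-deck donors are drawn with replacement from the respondents (which matches the conditional variance used above) and that at least one value is missing and one observed; the function name `hot_deck_mi` and its arguments are illustrative and not taken from the paper.

```python
import random
from statistics import mean, variance


def hot_deck_mi(y_obs, n, N, m=5, seed=0):
    """Hot-deck multiple imputation for a mean under MCAR, with k = 1/(1 - f).

    y_obs: observed values (the response sample), n: planned sample size,
    N: population size, m: number of imputations, seed: RNG seed.
    """
    rng = random.Random(seed)
    n_r = len(y_obs)
    f = (n - n_r) / n              # rate of nonresponse
    k_factor = 1.0 / (1.0 - f)     # weight on the between-imputation component
    means, var_ests = [], []
    for _ in range(m):
        # Impute each missing value by a random draw from the respondents.
        completed = list(y_obs) + rng.choices(y_obs, k=n - n_r)
        means.append(mean(completed))
        var_ests.append(variance(completed) * (1.0 / n - 1.0 / N))
    v_bar = mean(var_ests)         # within-imputation variance
    b = variance(means)            # between-imputation variance
    w = v_bar + (k_factor + 1.0 / m) * b
    return mean(means), w, k_factor
```

For example, with four respondents out of a planned sample of eight, f = 0.5 and the between-imputation component receives weight 2 + 1/m instead of Rubin's 1 + 1/m.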

3.2. Estimating regression coefficient with residual imputation

We shall assume completely random nonresponse as in Section 3.1. We consider a ratio model, i.e., regression through the origin:

$$Y_i = \beta x_i + \varepsilon_i, \quad \text{with } \mathrm{Var}(\varepsilon_i) = \sigma^2 x_i; \quad i = 1, \ldots, n.$$

It is assumed that all $x_i$’s are known, also in the nonresponse sample. The full data estimator of $\beta$ is given by

$$\hat\beta = \sum_{i=1}^n Y_i \Big/ \sum_{i=1}^n x_i.$$

The unbiased estimator of $\sigma^2$ is given by

$$\hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n \frac{1}{x_i} (y_i - \hat\beta x_i)^2.$$

We shall consider residual regression imputation:

Let $\hat\beta_r$ be the $\hat\beta$-estimate based on the observed sample $s_r$. Define the standardized residuals

$$e_i = (y_i - \hat\beta_r x_i)/\sqrt{x_i}, \quad \text{for } i \in s_r.$$

For $i \in s - s_r$: Draw the value of $e_i^*$ at random from the set of observed residuals $e_i, i \in s_r$, and the imputed y-value is given by

$$y_i^* = \hat\beta_r x_i + e_i^* \sqrt{x_i}.$$
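The residual imputation step can be sketched as follows. This is an illustration only: the function name `residual_impute` and its arguments are hypothetical, all x-values are assumed strictly positive, and residuals are assumed to be drawn with replacement.

```python
import math
import random


def residual_impute(x_obs, y_obs, x_mis, seed=0):
    """Residual regression imputation for the ratio model Y = beta*x + eps.

    x_obs, y_obs: data for the respondents; x_mis: x-values of nonrespondents.
    Returns the respondent-based slope, the standardized residuals and the
    imputed y-values.
    """
    rng = random.Random(seed)
    beta_r = sum(y_obs) / sum(x_obs)            # ratio estimator from respondents
    resid = [(y - beta_r * x) / math.sqrt(x)    # standardized residuals e_i
             for x, y in zip(x_obs, y_obs)]
    y_imp = [beta_r * x + rng.choice(resid) * math.sqrt(x)  # y* = beta_r*x + e*sqrt(x)
             for x in x_mis]
    return beta_r, resid, y_imp
```

Scaling each drawn residual by the square root of the recipient's x-value keeps the imputed values consistent with the assumed variance structure, in which the error variance is proportional to x.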

Let $X = \sum_{i=1}^n x_i$, $X_r = \sum_{i \in s_r} x_i$ and $X_{nr} = \sum_{i \in s - s_r} x_i = X - X_r$. All considerations from now on are conditional on $n_r$ and $X_r$, and we aim to determine k directly from (5). Define the proportion of the x-total in the nonresponse group to be $f_X = X_{nr}/X$. We now have

$$\hat\beta^* = \left( \sum_{i \in s_r} y_i + \sum_{i \in s - s_r} y_i^* \right) \Big/ X,$$

$$\hat\sigma_*^2 = \frac{1}{n-1} \left( \sum_{i \in s_r} \frac{1}{x_i} (y_i - \hat\beta^* x_i)^2 + \sum_{i \in s - s_r} \frac{1}{x_i} (y_i^* - \hat\beta^* x_i)^2 \right).$$

In order to determine k from (5) we need to check the validity of (3) and derive the following quantities: $\mathrm{Var}(\hat\beta^* \mid y_{obs})$, $E(\hat\beta^* \mid y_{obs})$ and $\mathrm{Var}(\hat\beta)$. We note that $\mathrm{Var}(\hat\beta) = \sigma^2/X$. Consider (3), which is equivalent to

$$E(\hat\sigma_*^2) \approx \sigma^2.$$

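Let

$$\hat\beta_{nr} = \sum_{i \in s - s_r} y_i^* \Big/ X_{nr} \quad \text{and} \quad \hat\sigma_{nr}^2 = \frac{1}{n_{nr} - 1} \sum_{i \in s - s_r} \frac{1}{x_i} (y_i^* - \hat\beta_{nr} x_i)^2.$$

Here, $n_{nr} = n - n_r$. Then, after some algebra, one can express $\hat\sigma_*^2$ in the following way:

$$\hat\sigma_*^2 = \frac{1}{n-1} \left( (n_r - 1)\hat\sigma_r^2 + (n_{nr} - 1)\hat\sigma_{nr}^2 + \frac{X_r X_{nr}}{X} (\hat\beta_r - \hat\beta_{nr})^2 \right).$$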

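In this case,

$$E(Y_i^* \mid y_{obs}) = \hat\beta_r x_i + \bar e \sqrt{x_i}, \quad \text{where } \bar e = \sum_{i \in s_r} e_i / n_r,$$

$$\mathrm{Var}(Y_i^* \mid y_{obs}) = x_i s_e^2, \quad \text{where } s_e^2 = \frac{1}{n_r} \sum_{i \in s_r} (e_i - \bar e)^2.$$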

Using this, it can be shown that

1) )

1 (

4 1 1

( ˆ )

( ² ² ¹ ² ₃

r

r n n

f n n c n

c n

E c

⋅

− −

∗ =σ σ

where c₁, c₂, c₃ lie in the interval (0,1).

Hence, E(σˆ_∗²)≈σ² and (3) follows, at least for moderate and large n_r.


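Next, we look at $\mathrm{Var}(\hat\beta^* \mid y_{obs})$ and $E(\hat\beta^* \mid y_{obs})$: We see that $\hat\beta^* = (\hat\beta_r X_r + \hat\beta_{nr} X_{nr})/X$, and

$$E(\hat\beta_{nr} \mid y_{obs}) = \hat\beta_r + \bar e\, \frac{\sum_{i \in s - s_r} \sqrt{x_i}}{X_{nr}}, \qquad \mathrm{Var}(\hat\beta_{nr} \mid y_{obs}) = s_e^2 / X_{nr}.$$

This gives us

$$E(\hat\beta^* \mid y_{obs}) = \hat\beta_r + \bar e\, \frac{\sum_{i \in s - s_r} \sqrt{x_i}}{X}, \qquad \mathrm{Var}(\hat\beta^* \mid y_{obs}) = \frac{X_{nr}}{X^2}\, s_e^2.$$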

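Next, we need to find $E\,\mathrm{Var}(\hat\beta^* \mid y_{obs})$ and $\mathrm{Var}\,E(\hat\beta^* \mid y_{obs})$:

$$\mathrm{Var}\,E(\hat\beta^* \mid y_{obs}) = \mathrm{Var}(\hat\beta_r) + \frac{\left(\sum_{i \in s - s_r} \sqrt{x_i}\right)^2}{X^2}\, \mathrm{Var}(\bar e) + 2\, \frac{\sum_{i \in s - s_r} \sqrt{x_i}}{X}\, \mathrm{Cov}(\hat\beta_r, \bar e).$$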

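Using the Cauchy-Schwarz inequality,

$$\left( \sum_{i=1}^n a_i b_i \right)^2 \le \sum_{i=1}^n a_i^2 \sum_{i=1}^n b_i^2$$

with $a_i = \sqrt{x_i}$ and $b_i = 1$, we see that

$$\left( \sum_{i=1}^n \sqrt{x_i} \right)^2 \le nX. \tag{6}$$

Now, after some algebra we find that $\mathrm{Cov}(\hat\beta_r, \bar e) = 0$ and

$$\mathrm{Var}(\bar e) = \frac{\sigma^2}{n_r} \left( 1 - \frac{\left(\sum_{i \in s_r} \sqrt{x_i}\right)^2}{n_r X_r} \right) = (1 - d_1)\, \frac{\sigma^2}{n_r}, \quad 0 \le d_1 \le 1.$$

Moreover, from (6),

$$\frac{\left(\sum_{i \in s - s_r} \sqrt{x_i}\right)^2}{X^2} = d_2\, \frac{n_{nr} X_{nr}}{X^2}, \quad 0 \le d_2 \le 1.$$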

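Hence,

$$\mathrm{Var}\,E(\hat\beta^* \mid y_{obs}) = \frac{\sigma^2}{X_r} + d_2 (1 - d_1)\, \frac{n_{nr} X_{nr}}{X^2} \cdot \frac{\sigma^2}{n_r}.$$

Next we find that

$$E(s_e^2) = \frac{n_r - 1}{n_r}\, \sigma^2 - \mathrm{Var}(\bar e) = \frac{\sigma^2}{n_r} (n_r + d_1 - 2),$$

which gives us

$$E\,\mathrm{Var}(\hat\beta^* \mid y_{obs}) = \frac{X_{nr}}{X^2} \cdot \frac{\sigma^2}{n_r} (n_r + d_1 - 2).$$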

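From (5),

$$k = \frac{\dfrac{\sigma^2}{X_r} + d_2 (1 - d_1)\, \dfrac{n_{nr} X_{nr}}{X^2} \cdot \dfrac{\sigma^2}{n_r} - \dfrac{\sigma^2}{X}}{\dfrac{X_{nr}}{X^2} \cdot \dfrac{\sigma^2}{n_r} (n_r + d_1 - 2)} = \frac{\dfrac{X n_r}{X_r} + d_2 (1 - d_1)\, n_{nr}}{n_r + d_1 - 2} \approx \frac{X}{X_r} + d_2 (1 - d_1)\, \frac{n_{nr}}{n_r}.$$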

We note that if all $x_i = 1$, then $d_1 = d_2 = 1$. Now, with $f_X = X_{nr}/X$ being the proportion of the x-total in the nonresponse group and $f = n_{nr}/n$ the rate of nonresponse, we finally get, since typically $(1 - d_1) d_2 \approx 0$,

$$k \approx \frac{1}{1 - f_X} + d_2 (1 - d_1)\, \frac{f}{1 - f} \approx \frac{1}{1 - f_X}$$

for usual x-values and nonresponse rates.
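To see how close the two approximations for k are for a given set of x-values, the quantities can be computed directly. The sketch below assumes that d₁ and d₂ are the ratios implied by the Cauchy-Schwarz bounds in this subsection, namely the squared sum of √x divided by (group size × group x-total) for the response and nonresponse groups respectively; the function name `k_ratio_model` is hypothetical, and both groups are assumed non-empty with positive x-values.

```python
import math


def k_ratio_model(x_resp, x_nonresp):
    """Return d1, d2, k with the correction term, and the simple k = 1/(1 - f_X)."""
    n_r, n_nr = len(x_resp), len(x_nonresp)
    X_r, X_nr = sum(x_resp), sum(x_nonresp)
    X = X_r + X_nr
    # d1 and d2 as read from the derivation (assumed definitions, both in [0, 1]).
    d1 = sum(math.sqrt(x) for x in x_resp) ** 2 / (n_r * X_r)
    d2 = sum(math.sqrt(x) for x in x_nonresp) ** 2 / (n_nr * X_nr)
    f = n_nr / (n_r + n_nr)        # rate of nonresponse
    f_X = X_nr / X                 # share of the x-total among nonrespondents
    k_simple = 1.0 / (1.0 - f_X)
    k_corrected = k_simple + d2 * (1.0 - d1) * f / (1.0 - f)
    return d1, d2, k_corrected, k_simple
```

When all x-values are equal the correction term vanishes and k reduces to 1/(1 − f), the hot-deck result of Section 3.1.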


4. Multiple imputation for stratified samples

4.1. Separate combinations

One way to combine the m completed data sets is to do it separately for each stratum, that is, to determine $k_h$. The general setup is then as follows: The sample s is divided into H sample strata, $s_1, \ldots, s_H$. Let $y_h$ be the planned full data from subsample $s_h$ of size $n_h$. It is assumed that $y_1, \ldots, y_H$ are independent.

The observed part of $y_h$ is denoted by $y_{h,obs}$ with $s_{hr}$ being the response sample from $s_h$ of size $n_{hr}$. The estimator based on the full sample data is the sum of independent terms:

$$\hat\theta = \sum_{h=1}^H \hat\theta_h,$$

where $\hat\theta_h$ is based on $y_h$. The variance $\mathrm{Var}(\hat\theta) = \sum_{h=1}^H \mathrm{Var}(\hat\theta_h)$ is estimated by $\hat V(\hat\theta) = \sum_{h=1}^H \hat V_h(y_h)$, where $\hat V_h(y_h)$ is the variance estimate of $\hat\theta_h$ based on $y_h$. For $i \in s_h - s_{hr}$ we impute by some method $y_i^*$ based on $y_{h,obs}$ and let $y_h^*$ denote the complete data $(y_{h,obs}, y_i^*, i \in s_h - s_{hr})$. Based on $y_h^*$, we have $\hat\theta_h^* = \hat\theta_h(y_h^*)$ and $\hat V_h^* = \hat V_h(y_h^*)$. Then the imputation based estimator is given by

$$\hat\theta^* = \sum_{h=1}^H \hat\theta_h^* \quad \text{and} \quad \hat V^* = \sum_{h=1}^H \hat V_h^*.$$

Multiple imputation of m repeated imputations leads to m completed data sets with m estimates for each stratum h, $\hat\theta_{h,i}^*$, $i = 1, \ldots, m$, and related variance estimates $\hat V_{h,i}^*$, $i = 1, \ldots, m$. The total estimates and related variances are $\hat\theta_i^* = \sum_{h=1}^H \hat\theta_{h,i}^*$ and $\hat V_i^* = \sum_{h=1}^H \hat V_{h,i}^*$, for $i = 1, \ldots, m$. The combined estimate for stratum h is given by

$$\bar\theta_h^* = \sum_{i=1}^m \hat\theta_{h,i}^* / m.$$

The within-imputation variance for stratum h is

$$\bar V_h^* = \sum_{i=1}^m \hat V_{h,i}^* / m$$

and the between-imputation component is

$$B_h^* = \frac{1}{m-1} \sum_{i=1}^m (\hat\theta_{h,i}^* - \bar\theta_h^*)^2.$$

Following the same idea as in Section 2, formula (1), the total estimated variance of $\bar\theta_h^*$ is then proposed to be

$$W_h = \bar V_h^* + \left(k_h + \frac{1}{m}\right) B_h^*.$$

The combined total estimate is given by

$$\bar\theta^* = \sum_{i=1}^m \hat\theta_i^* / m = \sum_{h=1}^H \bar\theta_h^*.$$

It follows that the total estimated variance of $\bar\theta^*$ can be expressed as

$$W_{sep} = \sum_{h=1}^H W_h = \bar V^* + \sum_{h=1}^H \left(k_h + \frac{1}{m}\right) B_h^*, \tag{7}$$

where

$$\bar V^* = \sum_{i=1}^m \hat V_i^* / m = \sum_{h=1}^H \bar V_h^*.$$

Provided (3) holds for each stratum h,

$$E(\bar V_h^*) \approx \mathrm{Var}(\hat\theta_h), \tag{8}$$

we have from (5) that $k_h$ must satisfy

$$E(k_h) = \frac{\mathrm{Var}\,E(\hat\theta_h^* \mid Y_{h,obs}) - \mathrm{Var}(\hat\theta_h)}{E\,\mathrm{Var}(\hat\theta_h^* \mid Y_{h,obs})}. \tag{9}$$

The combination formula (7) is an alternative to the usual combination formula (1), especially useful when we get simple expressions for $k_h$, but not for k. The next section develops an expression for k in this situation.

4.2. An overall combination formula

Now let W be given by (1). We shall determine the between-imputation factor k. Since $E(W) = E(W_{sep})$ we have

$$E\left\{ \sum_{h=1}^H \left(k_h + \frac{1}{m}\right) B_h^* \right\} = E\left\{ \left(k + \frac{1}{m}\right) B^* \right\}. \tag{10}$$

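Here,

$$B^* = \frac{1}{m-1} \sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2 = \frac{1}{m-1} \sum_{i=1}^m \left\{ \sum_{h=1}^H (\hat\theta_{h,i}^* - \bar\theta_h^*) \right\}^2.$$

We note that

$$E(B^* \mid y_{obs}) = E\left( \sum_{h=1}^H B_h^* \,\Big|\, y_{obs} \right).$$

This follows from the fact that $E(B^* \mid y_{obs}) = \mathrm{Var}(\hat\theta^* \mid y_{obs}) = \sum_{h=1}^H \mathrm{Var}(\hat\theta_h^* \mid y_{obs})$ and $E(B_h^* \mid y_{obs}) = \mathrm{Var}(\hat\theta_h^* \mid y_{obs})$.

Hence, the identity (10) becomes

$$E\left\{ \sum_{h=1}^H k_h E(B_h^* \mid Y_{obs}) \right\} = E\{ k\, E(B^* \mid Y_{obs}) \}.$$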

This gives us a solution for k if we want to use the usual combination formula (1):

$$k = \frac{\sum_{h=1}^H k_h E(B_h^* \mid y_{obs})}{E(B^* \mid y_{obs})} = \frac{\sum_{h=1}^H k_h \mathrm{Var}(\hat\theta_h^* \mid y_{obs})}{\mathrm{Var}(\hat\theta^* \mid y_{obs})} = \sum_{h=1}^H k_h \cdot \frac{\mathrm{Var}(\hat\theta_h^* \mid y_{obs})}{\mathrm{Var}(\hat\theta^* \mid y_{obs})}, \tag{11}$$

a weighted average of the $k_h$. We get a simple expression for k only when all $k_h$ are equal, say $k_h = k_0$. Then $k = k_0$.
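Formula (11) is a variance-weighted average, which the following short sketch makes explicit. The function name `overall_k` is hypothetical, and the conditional variances of the stratum estimators given the observed data are assumed to be supplied by the caller.

```python
def overall_k(k_h, cond_var_h):
    """Overall factor k as in (11): stratum factors weighted by the stratum
    shares of the conditional variance given the observed data."""
    if len(k_h) != len(cond_var_h):
        raise ValueError("k_h and cond_var_h must have the same length")
    total = sum(cond_var_h)
    if total <= 0:
        raise ValueError("total conditional variance must be positive")
    return sum(k * v for k, v in zip(k_h, cond_var_h)) / total
```

Because the weights sum to one, k always lies between the smallest and the largest stratum factor, and equals the common value when all stratum factors coincide.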

5. Four applications to stratified samples and random nonresponse within strata

5.1. Estimating population average from stratified sample with stratified hot-deck imputation

Consider stratified simple random samples from a finite population of size N, with H strata of sizes $N_h$, $h = 1, \ldots, H$. The aim is to estimate the population average $\mu$ of some variable y. We assume completely random nonresponse within each stratum, typically denoted as MAR (missing at random). This means that the response indicators in stratum h, $R_{h,1}, \ldots, R_{h,N_h}$, are independent with the same response probability $p_{hr} = P(R_{h,i} = 1)$. The imputation method is stratified hot-deck. Let $y_{h,obs}$ be the observed part from the response sample $s_{hr}$ of size $n_{hr}$ from stratum h. Then an imputed value $y_i^*$ in stratum h is drawn at random from $y_{h,obs}$.

The estimator based on the full sample data is the usual stratified weighted average

$$\bar Y_{strat} = \frac{1}{N} \sum_{h=1}^H N_h \bar y_h = \sum_{h=1}^H v_h \bar y_h.$$

Here, $v_h = N_h/N$ and $\bar y_h = \sum_{i \in s_h} y_i / n_h$, where $s_h$ is the sample from stratum h and $n_h = |s_h|$. Then

$$\mathrm{Var}(\bar Y_{strat}) = \sum_{h=1}^H v_h^2 \sigma_h^2 \left( \frac{1}{n_h} - \frac{1}{N_h} \right), \quad \text{with} \quad \sigma_h^2 = \frac{1}{N_h - 1} \sum_{i \in U_h} (y_i - \mu_h)^2$$

being the population variance in stratum h. Here $U_h$ is stratum population h and $\mu_h$ is the average in $U_h$.

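Let $\bar y_{hr}$ be the observed sample mean from stratum h and

$$\hat\sigma_{hr}^2 = \frac{1}{n_{hr} - 1} \sum_{i \in s_{hr}} (y_i - \bar y_{hr})^2$$

the observed sample variance. The imputation-based estimator is given by

$$\bar Y_{strat}^* = \frac{1}{N} \sum_{h=1}^H N_h \bar y_h^*, \quad \text{where} \quad \bar y_h^* = \frac{1}{n_h} \left( \sum_{i \in s_{hr}} y_i + \sum_{i \in s_h - s_{hr}} y_i^* \right) = \frac{1}{n_h} \left( n_{hr} \bar y_{hr} + \sum_{i \in s_h - s_{hr}} y_i^* \right).$$

Let the m imputation replicates of $\bar Y_{strat}^*$ be denoted by $\bar Y_{strat,i}^*$ for $i = 1, \ldots, m$. The combined estimator is given by

$$\bar{\bar Y}_{strat}^* = \frac{1}{m} \sum_{i=1}^m \bar Y_{strat,i}^*.$$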

5.1.1. Separate strata combinations

It follows from Section 3.1 that

$$k_h = \frac{1}{1 - f_h},$$

where $f_h = (n_h - n_{hr})/n_h$ is the rate of nonresponse in stratum h. The combination formula for the variance estimate of $\bar{\bar Y}_{strat}^*$ becomes, from (7),

$$W = \bar V^* + \sum_{h=1}^H \left( \frac{1}{1 - f_h} + \frac{1}{m} \right) B_h^*.$$

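Here, $\bar V^* = \sum_{h=1}^H \bar V_h^*$ and $\bar V_h^*$ is the average of the m values of the imputation based variance estimate

$$\hat V_h^* = v_h^2\, \hat\sigma_{*h}^2 \left( \frac{1}{n_h} - \frac{1}{N_h} \right), \quad \text{where} \quad \hat\sigma_{*h}^2 = \frac{1}{n_h - 1} \left( \sum_{i \in s_{hr}} (y_i - \bar y_h^*)^2 + \sum_{i \in s_h - s_{hr}} (y_i^* - \bar y_h^*)^2 \right).$$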

5.1.2. Overall combination formula. Determination of k in (1)

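From (11) we need to determine $\mathrm{Var}(v_h \bar Y_h^* \mid y_{obs})$ and $\mathrm{Var}(\bar Y_{strat}^* \mid y_{obs}) = \sum_{h=1}^H \mathrm{Var}(v_h \bar Y_h^* \mid y_{obs})$. Then we have that

$$k = \sum_{h=1}^H \frac{1}{1 - f_h} \cdot \frac{\mathrm{Var}(v_h \bar Y_h^* \mid y_{obs})}{\mathrm{Var}(\bar Y_{strat}^* \mid y_{obs})}.$$

Now, for $i \in s_h - s_{hr}$:

$$E(Y_i^* \mid y_{h,obs}) = \bar y_{hr} \quad \text{and} \quad \mathrm{Var}(Y_i^* \mid y_{h,obs}) = \frac{n_{hr} - 1}{n_{hr}}\, \hat\sigma_{hr}^2.$$

This gives the following results:

$$E(\bar Y_h^* \mid y_{h,obs}) = \bar y_{hr} \quad \text{and} \quad \mathrm{Var}(\bar Y_h^* \mid y_{h,obs}) = \frac{n_h - n_{hr}}{n_h^2} \cdot \frac{n_{hr} - 1}{n_{hr}}\, \hat\sigma_{hr}^2 \approx f_h\, \frac{\hat\sigma_{hr}^2}{n_h}.$$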

Hence we can determine k as

$$k = \sum_{h=1}^H \frac{1}{1 - f_h} \cdot \frac{f_h v_h^2 \hat\sigma_{hr}^2 / n_h}{\sum_{j=1}^H f_j v_j^2 \hat\sigma_{jr}^2 / n_j}.$$

If the stratum sizes $N_h$ are large then we can let $\hat V(v_h \bar Y_h) = v_h^2 \hat\sigma_{hr}^2 / n_h$. Let also $b_h = f_h \hat V(v_h \bar Y_h) / \sum_{j=1}^H f_j \hat V(v_j \bar Y_j)$. Then

$$k = \frac{\sum_{h=1}^H \frac{1}{1 - f_h}\, f_h \hat V(v_h \bar Y_h)}{\sum_{h=1}^H f_h \hat V(v_h \bar Y_h)} = \sum_{h=1}^H \frac{b_h}{1 - f_h}. \tag{12}$$

Since $\sum_{h=1}^H b_h = 1$, we see that k is a weighted average of the inverse of the response rates. If all $f_h = f$, the overall nonresponse rate, we get as for simple random samples that $k = 1/(1 - f)$. Otherwise, a stratum response rate $1 - f_h$ has large weight if the nonresponse rate is large and/or the estimated variance of $v_h \bar Y_h$ is large.
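Formula (12) needs only the stratum sample sizes, the numbers of respondents, the stratum weights and the respondent variances. The sketch below is an illustration under the large-stratum approximation stated above; the function name `k_stratified` and its argument names are hypothetical.

```python
def k_stratified(n_h, n_hr, v_h, s2_hr):
    """Overall k as in (12): weighted average of 1/(1 - f_h).

    n_h:   planned sample sizes per stratum
    n_hr:  numbers of respondents per stratum (each > 0)
    v_h:   stratum weights N_h / N
    s2_hr: respondent sample variances per stratum
    """
    f = [(n - nr) / n for n, nr in zip(n_h, n_hr)]                 # nonresponse rates
    var_hat = [v * v * s2 / n for v, s2, n in zip(v_h, s2_hr, n_h)]  # V_hat(v_h * Ybar_h)
    raw = [fh * vh for fh, vh in zip(f, var_hat)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one stratum must have nonresponse and positive variance")
    b = [r / total for r in raw]                                    # weights b_h, sum to 1
    return sum(bh / (1.0 - fh) for bh, fh in zip(b, f))
```

With two equally weighted strata of size 10, unit respondent variances and 5 and 8 respondents, the weights are 5/7 and 2/7, so k = (5/7)·2 + (2/7)·1.25 = 12.5/7.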

5.1.3. An alternative expression for k in (1)

By directly applying (5) we can get an alternative expression for k. Given $y_{obs}$, the imputed sample means $\bar Y_h^*$ are independent, which implies that

$$E(\bar Y_{strat}^* \mid y_{obs}) = \frac{1}{N} \sum_{h=1}^H N_h \bar y_{hr} = \bar y_{strat,r} \quad \text{and} \quad \mathrm{Var}(\bar Y_{strat}^* \mid y_{obs}) = \sum_{h=1}^H v_h^2 \cdot \frac{n_h - n_{hr}}{n_h^2} \cdot \frac{n_{hr} - 1}{n_{hr}}\, \hat\sigma_{hr}^2.$$

It follows that

$$\mathrm{Var}(\bar Y_{strat}^* \mid y_{obs}) \approx \sum_{h=1}^H v_h^2 f_h \cdot \frac{\hat\sigma_{hr}^2}{n_h}.$$

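Just like in Section 3.1, (3) holds. From (5) we get

$$E(k) \approx \frac{\mathrm{Var}(\bar Y_{strat,r}) - \mathrm{Var}(\bar Y_{strat})}{E\left( \sum_{h=1}^H v_h^2 f_h\, \hat\sigma_{hr}^2 / n_h \right)} = \frac{\sum_{h=1}^H v_h^2 \sigma_h^2 E\left( \frac{1}{n_{hr}} - \frac{1}{N_h} \right) - \sum_{h=1}^H v_h^2 \sigma_h^2 \left( \frac{1}{n_h} - \frac{1}{N_h} \right)}{\sum_{h=1}^H v_h^2 E\left\{ E\left( \frac{f_h}{n_h}\, \hat\sigma_{hr}^2 \,\Big|\, n_{hr} \right) \right\}}$$

$$= \frac{\sum_{h=1}^H v_h^2 \sigma_h^2 \left[ E\left( \frac{1}{n_{hr}} \right) - \frac{1}{n_h} \right]}{\sum_{h=1}^H v_h^2 \sigma_h^2\, \frac{E(f_h)}{n_h}} \approx \frac{\sum_{h=1}^H v_h^2 \sigma_h^2\, \frac{1}{n_h} \cdot \frac{1 - p_{hr}}{p_{hr}}}{\sum_{h=1}^H v_h^2 \sigma_h^2\, \frac{1}{n_h} (1 - p_{hr})} \approx \frac{\sum_{h=1}^H v_h^2 \sigma_h^2\, E\left( \frac{1}{n_{hr}} \right) E(f_h)}{\sum_{h=1}^H v_h^2 \sigma_h^2\, E\left( \frac{1}{n_{hr}} \right) E(f_h)\, E(1 - f_h)}. \tag{13}$$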

Now, $\mathrm{Var}(\bar Y_{hr}) = E\,\mathrm{Var}(\bar Y_{hr} \mid n_{hr}) = \sigma_h^2 E(1/n_{hr})$. Let $\hat V(v_h \bar Y_{hr}) = v_h^2 \hat\sigma_{hr}^2 / n_{hr}$. Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes $N_h$ are large, by letting

$$k = \frac{\sum_{h=1}^H f_h \hat V(v_h \bar Y_{hr})}{\sum_{h=1}^H f_h (1 - f_h) \hat V(v_h \bar Y_{hr})} = \frac{1}{\sum_{h=1}^H a_h (1 - f_h)}, \tag{14}$$

where the weights are $a_h = f_h \hat V(v_h \bar Y_{hr}) / \sum_{j=1}^H f_j \hat V(v_j \bar Y_{jr})$. Since $\sum_{h=1}^H a_h = 1$, we see that 1/k is a weighted average of the response rates. If all $f_h = f$, the overall nonresponse rate, we have, as shown in Section 5.1.2, that $k = 1/(1 - f)$. As seen in Section 5.1.2, we note also in expression (14) that a stratum response rate $1 - f_h$ has large weight if the nonresponse rate is large and/or the estimated variance of $v_h \bar Y_{hr}$ is large. We note that the estimate of the total based on the response sample is given by

$$\bar Y_{strat,r} = \sum_{h=1}^H v_h \bar Y_{hr}.$$

We obtain formula (12) for k by noting from (13) that we can express E(k) as

$$E(k) \approx \frac{\sum_{h=1}^H \mathrm{Var}(v_h \bar Y_h)\, E(f_h)\, E\left( \frac{1}{1 - f_h} \right)}{\sum_{h=1}^H \mathrm{Var}(v_h \bar Y_h)\, E(f_h)}.$$

Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes $N_h$ are large, by letting k be given by (12).
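Expression (14) can be evaluated in the same way as (12), now with the variance estimate of the respondent mean in each stratum. The sketch below is illustrative only: the function name `k_from_response_means` and its arguments are hypothetical, and large strata are assumed so that finite population corrections are ignored.

```python
def k_from_response_means(n_h, n_hr, v_h, s2_hr):
    """Overall k as in (14), so that 1/k is a weighted average of 1 - f_h.

    n_h:   planned sample sizes per stratum
    n_hr:  numbers of respondents per stratum (each > 0)
    v_h:   stratum weights N_h / N
    s2_hr: respondent sample variances per stratum
    """
    f = [(n - nr) / n for n, nr in zip(n_h, n_hr)]                    # nonresponse rates
    var_r = [v * v * s2 / nr for v, s2, nr in zip(v_h, s2_hr, n_hr)]  # V_hat(v_h * Ybar_hr)
    numerator = sum(fh * vr for fh, vr in zip(f, var_r))
    denominator = sum(fh * (1.0 - fh) * vr for fh, vr in zip(f, var_r))
    if denominator <= 0:
        raise ValueError("at least one stratum must have nonresponse and positive variance")
    return numerator / denominator
```

For two equally weighted strata of size 10 with unit respondent variances and 5 and 8 respondents, this gives 0.03125/0.0175 = 12.5/7, the same value that (12) yields for these inputs, since $(1 - f_h)/n_{hr} = 1/n_h$.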

5.2. Logistic regression with binary explanatory variable. Estimating log(odds ratio)

The model is as follows:

$Y_1, \ldots, Y_n$ are independent 0/1-variables

Explanatory 0/1-variable x with fixed known values $x_1, \ldots, x_n$

Class probabilities: $\pi_1 = P(Y_i = 1 \mid x_i = 1)$ and $\pi_0 = P(Y_i = 1 \mid x_i = 0)$

Response variables: $R_1, \ldots, R_n$ with MAR (missing at random) model: $P(R_i = 1 \mid x_i = 1) = p_{1r}$ and $P(R_i = 1 \mid x_i = 0) = p_{0r}$

We can reparametrize the model in a logit version:

$$\log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = \alpha + \beta x,$$

giving us the following 1-1 relationships:

$$\alpha = \log \frac{\pi_0}{1 - \pi_0} \iff \pi_0 = \frac{1}{1 + e^{-\alpha}},$$

$$\beta = \log \frac{\pi_1 / (1 - \pi_1)}{\pi_0 / (1 - \pi_0)} = \log(\text{odds ratio}), \quad \text{and} \quad \pi_1 = \frac{1}{1 + e^{-(\alpha + \beta)}}.$$
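The one-to-one relationship between the class probabilities and the logit parameters can be checked numerically with a small sketch; the function names `to_logit` and `from_logit` are hypothetical and the probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def to_logit(pi0, pi1):
    """Map class probabilities (pi0, pi1) to (alpha, beta)."""
    alpha = math.log(pi0 / (1.0 - pi0))            # log odds when x = 0
    beta = math.log(pi1 / (1.0 - pi1)) - alpha     # log(odds ratio)
    return alpha, beta


def from_logit(alpha, beta):
    """Inverse map from (alpha, beta) back to (pi0, pi1)."""
    pi0 = 1.0 / (1.0 + math.exp(-alpha))
    pi1 = 1.0 / (1.0 + math.exp(-(alpha + beta)))
    return pi0, pi1
```

For instance, class probabilities 0.5 and 0.8 correspond to odds 1 and 4, so alpha = 0 and beta = log 4.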

The aim is to estimate $\beta$. Let $s = (1, \ldots, n)$ denote the full sample with strata $s_1 = \{i \in s : x_i = 1\}$ and $s_0 = \{i \in s : x_i = 0\}$. The sizes of $s_1$ and $s_0$ are denoted by $n_1$ and $n_0$. We note that $n_1 = \sum_{i=1}^n x_i$ and