Discussion Papers No. 421, May 2005 Statistics Norway, Research Department
Jan F. Bjørnstad
Non-Bayesian Multiple Imputation
Abstract:
Multiple imputation is a method specifically designed for variance estimation in the presence of missing data. Rubin’s combination formula requires that the imputation method is “proper” which essentially means that the imputations are random draws from a posterior distribution in a Bayesian framework. In national statistical institutes (NSI’s) like Statistics Norway, the methods used for imputing for nonresponse are typically non-Bayesian, e.g., some kind of stratified hot-deck. Hence, Rubin’s method of multiple imputation is not valid and cannot be applied in NSI’s. This paper deals with the problem of deriving an alternative combination formula that can be applied for imputation methods typically used in NSI’s and suggests an approach for studying this problem. Alternative combination formulas are derived for certain response mechanisms and hot-deck type imputation methods.
Keywords: Multiple imputation, survey sampling, nonresponse, hot-deck imputation
JEL classification: C42, C13, C15
Address: Jan F. Bjørnstad, Statistics Norway, Division for Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo, Norway. E-mail: [email protected]
Discussion Papers comprise research papers intended for international journals or books. A preprint of a Discussion Paper may be longer and more elaborate than a standard journal article, as it may include intermediate calculations and background material etc.
Abstracts with downloadable Discussion Papers in PDF are available on the Internet:
http://www.ssb.no
http://ideas.repec.org/s/ssb/dispap.html
1. Introduction
Multiple imputation is a method specifically designed for variance estimation in the presence of missing data, developed by Rubin (1987). The basic idea is to create m imputed values for each missing value and combine the m completed data sets by Rubin’s combination formula for variance estimation. For the estimator to be valid, the imputations must display an appropriate level of variability. In Rubin’s terminology, the imputation method is required to be “proper”. In national statistical institutes (NSI’s) the methods used for imputing for nonresponse very seldom, if ever, satisfy the requirement of being “proper”. However, the idea of creating multiple imputations to measure the imputation uncertainty, and of using it for variance estimation and for computing confidence intervals, is still of interest. The problem is that Rubin’s combination formula is no longer valid with the usual nonproper imputations used by NSI’s. The reason is that nonproper imputations display too little variability, so the between-imputation component must be given a larger weight in the variance estimate. The task is then to determine what this weight should be to give valid statistical inference, and also for what kinds of nonresponse mechanisms and estimation problems it is possible to determine a simple combination formula that does not depend on unknown parameters. This paper suggests an approach for studying this problem.
In Section 2 an approach for determining the combination of the imputed completed data sets is suggested. Section 3 has two applications with random nonresponse: (i) estimating a population average from simple random samples using hot-deck imputation and (ii) estimating a regression coefficient using residual regression imputation. Section 4 deals with the general problem of multiple imputation for stratified samples. In Section 5 we apply the theory in Section 4 to stratified samples with random nonresponse within strata, covering (i) estimation of the population average using stratified hot-deck imputation and (ii) estimation of log(odds ratios) in logistic regression with missingness both for the dependent variable and the explanatory variable. Section 6 takes up the problem of using the same combination rule for all estimation problems with a given imputation method and a given data and response model.
2. An approach for determining an alternative combination formula for variance estimation in multiple imputation
Let s = (1,…,n) denote the full sample, with $y = (y_1,\dots,y_n)$ denoting the full sample data, values of the random variables $Y_1,\dots,Y_n$. The objective is to estimate some parameter $\theta$. Now, let $y_{obs}$ be the observed part of y, with $s_r$ being the response sample of size $n_r$,
$$ y_{obs} = (y_i : i \in s_r). $$
Let $\hat\theta$ be the estimator based on the full sample data y, with $Var(\hat\theta)$ estimated by $\hat V(y)$. For $i \in s - s_r$ we impute $y_i^*$ by some method and let $y^*$ denote the complete data $(y_{obs}, y_i^*, i \in s - s_r)$. Based on $y^*$, we have $\hat\theta^* = \hat\theta(y^*)$ and $\hat V^* = \hat V(y^*)$.
Multiple imputation with m repeated imputations leads to m completed data sets with m estimates $\hat\theta_i^*$, $i = 1,\dots,m$, and related variance estimates $\hat V_i^*$, $i = 1,\dots,m$. The combined estimate is given by
$$ \bar\theta^* = \sum_{i=1}^m \hat\theta_i^*/m. $$
The within-imputation variance is defined as
$$ \bar V^* = \sum_{i=1}^m \hat V_i^*/m $$
and the between-imputation component is
$$ B^* = \frac{1}{m-1}\sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2. $$
The total estimated variance of $\bar\theta^*$ is then proposed to be
$$ W = \bar V^* + \left(k + \frac{1}{m}\right) B^*. \tag{1} $$
That is, we need to determine k such that
$$ E(W) = Var(\bar\theta^*). \tag{2} $$
Rubin (1987) has shown that k = 1 can be used with proper imputations, which essentially means drawing imputed values from a posterior distribution in a Bayesian framework.
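The combination rule (1) is easy to state in code. The following is a minimal sketch in Python, assuming the m completed-data estimates and their variance estimates have already been computed; the function name `combine` and its return layout are illustrative choices, not part of the paper.

```python
from statistics import mean, variance


def combine(estimates, variances, k=1.0):
    """Combine m completed-data results with W = Vbar + (k + 1/m) * B.

    estimates: the m imputation-based estimates of theta
    variances: the m imputation-based variance estimates
    k:         between-imputation factor (k = 1 is Rubin's rule)
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")
    theta_bar = mean(estimates)       # combined estimate
    v_bar = mean(variances)           # within-imputation variance
    b = variance(estimates)           # between-imputation component, divisor m - 1
    w = v_bar + (k + 1.0 / m) * b     # total estimated variance, formula (1)
    return theta_bar, v_bar, b, w
```

With k = 1 the function reproduces Rubin's rule; a larger k gives the between-imputation component more weight, which is what nonproper imputation calls for.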
In general, one has to determine the terms in (2). One way to try to do this is to use double expectation, conditioning on $y_{obs}$, that is,
$$ Var(\bar\theta^*) = E\{Var(\bar\theta^*|Y_{obs})\} + Var\{E(\bar\theta^*|Y_{obs})\}. $$
Typically,
$$ E(\bar V^*) \approx Var(\hat\theta) \tag{3} $$
and $E(B^*|y_{obs}) = Var(\hat\theta^*|y_{obs})$. Hence, approximately,
$$ E(W) = Var(\hat\theta) + \left(E(k) + \frac{1}{m}\right) E\,Var(\hat\theta^*|Y_{obs}). \tag{4} $$
Moreover, $Var(\bar\theta^*|y_{obs}) = Var(\hat\theta^*|y_{obs})/m$ and $E(\bar\theta^*|y_{obs}) = E(\hat\theta^*|y_{obs})$.
This implies that
$$ Var(\bar\theta^*) = \frac{1}{m} E\{Var(\hat\theta^*|Y_{obs})\} + Var\{E(\hat\theta^*|Y_{obs})\}. $$
From (3) and (4), the equation (2) becomes
$$ Var(\hat\theta) + E(k)\, E\,Var(\hat\theta^*|Y_{obs}) = Var\{E(\hat\theta^*|Y_{obs})\}, $$
which gives the following general expression for E(k):
$$ E(k) = \frac{Var\,E(\hat\theta^*|Y_{obs}) - Var(\hat\theta)}{E\,Var(\hat\theta^*|Y_{obs})}. \tag{5} $$
For this to be of interest, k must be, at least approximately, determined independently of unknown parameters. In addition, one needs to check that (3) holds.
To illustrate how (5) can be used we shall in the next section consider two special cases with random nonresponse.
3. Two applications to simple random samples and random nonresponse
3.1. Estimating population average with hot-deck imputation
Consider a simple random sample from a finite population of size N, where the aim is to estimate the population average µ of some variable y. We shall assume completely random nonresponse, in Rubin’s terminology MCAR (missing completely at random). We note that MCAR means that the response indicators $R_1,\dots,R_N$ are independent with the same response probability $p_r = P(R_i = 1)$. The imputation method is the hot-deck method, where $y_i^*$ is drawn at random from $y_{obs}$, and the estimate is the sample mean. Let $\bar y_r$ be the observed sample mean and
$$ \hat\sigma_r^2 = \frac{1}{n_r-1}\sum_{i\in s_r}(y_i - \bar y_r)^2 $$
the observed sample variance. Let $\bar Y^*$ be the imputation-based sample mean for the completed sample. The combined estimator is then given by
$$ \bar{\bar Y}^* = \sum_{i=1}^m \bar Y_i^*/m. $$
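Let $\bar Y_s$ denote the sample mean based on a full sample. Then
$$ Var(\bar Y_s) = \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right), \quad\text{with}\quad \sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i - \mu)^2 $$
being the population variance. We have further that
$$ E(\bar Y^*|y_{obs}) = \bar y_r \quad\text{and}\quad Var(\bar Y^*|y_{obs}) = \frac{n-n_r}{n^2}\cdot\frac{n_r-1}{n_r}\,\hat\sigma_r^2, $$
using that $E(Y_i^*|y_{obs}) = \bar y_r$ and
$$ Var(Y_i^*|y_{obs}) = \frac{n_r-1}{n_r}\,\hat\sigma_r^2. $$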
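In this case,
$$ \hat V^* = \hat\sigma_*^2\left(\frac{1}{n} - \frac{1}{N}\right), \quad\text{where}\quad \hat\sigma_*^2 = \frac{1}{n-1}\left(\sum_{s_r}(y_i - \bar y^*)^2 + \sum_{s-s_r}(y_i^* - \bar y^*)^2\right). $$
It can be shown that
$$ E(\hat\sigma_*^2|y_{obs}) = \hat\sigma_r^2\left(1 - \frac{1}{n_r}\right)\left(1 + \frac{n_r}{n(n-1)}\right) \approx \hat\sigma_r^2 $$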
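and (3) holds. We find, from (5),
$$ E(k) = \frac{Var(\bar Y_r) - \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)}{E\left(\frac{n-n_r}{n^2}\cdot\frac{n_r-1}{n_r}\,\hat\sigma_r^2\right)} = \frac{\sigma^2 E\left(\frac{1}{n_r} - \frac{1}{N}\right) - \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)}{\sigma^2 E\left(\frac{n-n_r}{n^2}\cdot\frac{n_r-1}{n_r}\right)} \approx \frac{(1-p_r)/p_r}{1-p_r} = \frac{1}{p_r}, $$
which is satisfied approximately by letting
$$ k = \frac{1}{1-f}, $$
where f = (n−nr)/n is the rate of nonresponse.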
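The hot-deck case can be sketched end to end in Python. The example below assumes the respondent values are held in a list and draws donors with replacement; the function name `hot_deck_mi`, the seed handling and the returned dictionary are illustrative assumptions, not the paper's notation.

```python
import random
from statistics import mean, variance


def hot_deck_mi(y_obs, n, N, m=5, seed=0):
    """Hot-deck multiple imputation for a simple random sample.

    y_obs: observed values (the response sample, size n_r >= 2)
    n:     full sample size, N: population size, m: number of imputations
    """
    n_r = len(y_obs)
    if n_r < 2 or n_r > n or m < 2:
        raise ValueError("need 2 <= n_r <= n and m >= 2")
    rng = random.Random(seed)
    f = (n - n_r) / n                  # rate of nonresponse
    k = 1.0 / (1.0 - f)                # between-imputation factor of Section 3.1
    means, v_hats = [], []
    for _ in range(m):
        # each missing value is drawn at random (with replacement) from y_obs
        imputed = [rng.choice(y_obs) for _ in range(n - n_r)]
        completed = list(y_obs) + imputed
        means.append(mean(completed))
        # completed-data variance estimate: sigma*^2 (1/n - 1/N)
        v_hats.append(variance(completed) * (1.0 / n - 1.0 / N))
    v_bar = mean(v_hats)               # within-imputation variance
    b = variance(means)                # between-imputation component
    w = v_bar + (k + 1.0 / m) * b      # formula (1) with k = 1/(1 - f)
    return {"estimate": mean(means), "k": k, "V": v_bar, "B": b, "W": w}
```

With full response the between-imputation component is zero and W reduces to the usual variance estimate, which is a useful sanity check on any implementation.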
3.2. Estimating regression coefficient with residual imputation
We shall assume completely random nonresponse as in Section 3.1. We consider a ratio model, i.e., regression through the origin:
$$ Y_i = \beta x_i + \varepsilon_i, \quad\text{with } Var(\varepsilon_i) = \sigma^2 x_i;\ i = 1,\dots,n. $$
It is assumed that all $x_i$’s are known, also in the nonresponse sample. The full data estimator of β is given by
$$ \hat\beta = \sum_{i=1}^n Y_i \Big/ \sum_{i=1}^n x_i. $$
The unbiased estimator of $\sigma^2$ is given by
$$ \hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n \frac{1}{x_i}(y_i - \hat\beta x_i)^2. $$
We shall consider residual regression imputation:
Let $\hat\beta_r$ be the $\hat\beta$-estimate based on the observed sample $s_r$. Define the standardized residuals
$$ e_i = (y_i - \hat\beta_r x_i)/\sqrt{x_i}, \quad\text{for } i \in s_r. $$
For $i \in s - s_r$: Draw the value of $e_i^*$ at random from the set of observed residuals $e_i, i \in s_r$, and the imputed y-value is given by
$$ y_i^* = \hat\beta_r x_i + e_i^*\sqrt{x_i}. $$
Let
$$ X = \sum_{i=1}^n x_i, \quad X_r = \sum_{i\in s_r} x_i \quad\text{and}\quad X_{nr} = \sum_{i\in s-s_r} x_i = X - X_r. $$
All considerations from now on are conditional on $n_r$ and $X_r$, and we aim to determine k directly from (5). Define the proportion of the x-total in the nonresponse group to be $f_X = X_{nr}/X$. We now have
$$ \hat\beta^* = \Big(\sum_{s_r} y_i + \sum_{s-s_r} y_i^*\Big)\Big/X, $$
$$ \hat\sigma_*^2 = \frac{1}{n-1}\left(\sum_{s_r}\frac{1}{x_i}(y_i - \hat\beta^* x_i)^2 + \sum_{s-s_r}\frac{1}{x_i}(y_i^* - \hat\beta^* x_i)^2\right). $$
In order to determine k from (5) we need to check the validity of (3) and derive the following quantities: $Var(\hat\beta^*|y_{obs})$, $E(\hat\beta^*|y_{obs})$ and $Var(\hat\beta)$. We note that $Var(\hat\beta) = \sigma^2/X$. Consider (3), which is equivalent to
$$ E(\hat\sigma_*^2) \approx \sigma^2. $$
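Let
$$ \hat\beta_{nr}^* = \sum_{s-s_r} y_i^*/X_{nr} \quad\text{and}\quad \hat\sigma_{nr}^{*2} = \frac{1}{n_{nr}-1}\sum_{s-s_r}\frac{1}{x_i}(y_i^* - \hat\beta_{nr}^* x_i)^2. $$
Here, $n_{nr} = n - n_r$. Then, after some algebra, one can express $\hat\sigma_*^2$ in the following way:
$$ \hat\sigma_*^2 = \frac{1}{n-1}\left((n_r-1)\hat\sigma_r^2 + (n_{nr}-1)\hat\sigma_{nr}^{*2} + \frac{X_r X_{nr}}{X}(\hat\beta_r - \hat\beta_{nr}^*)^2\right). $$
In this case,
$$ E(Y_i^*|y_{obs}) = \hat\beta_r x_i + \bar e\sqrt{x_i}, \quad\text{where } \bar e = \sum_{s_r} e_i/n_r, $$
$$ Var(Y_i^*|y_{obs}) = x_i s_e^2, \quad\text{where } s_e^2 = \frac{1}{n_r}\sum_{s_r}(e_i - \bar e)^2. $$
Using this, it can be shown that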
1) )
1 (
4 1 1
( ˆ )
( 2 2 1 2 3
r
r n n
f n n c n
c n
E c
⋅
− −
− −
− −
∗ =σ σ
where c1, c2, c3 lie in the interval (0,1).
Hence, E(σˆ∗2)≈σ2 and (3) follows, at least for moderate and large nr.
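Next, we look at $Var(\hat\beta^*|y_{obs})$ and $E(\hat\beta^*|y_{obs})$: We see that $\hat\beta^* = (\hat\beta_r X_r + \hat\beta_{nr}^* X_{nr})/X$, and
$$ E(\hat\beta_{nr}^*|y_{obs}) = \hat\beta_r + \bar e\,\frac{\sum_{s-s_r}\sqrt{x_i}}{X_{nr}}, \qquad Var(\hat\beta_{nr}^*|y_{obs}) = s_e^2/X_{nr}. $$
This gives us
$$ E(\hat\beta^*|y_{obs}) = \hat\beta_r + \bar e\,\frac{\sum_{s-s_r}\sqrt{x_i}}{X}, \qquad Var(\hat\beta^*|y_{obs}) = \frac{X_{nr}}{X^2}\, s_e^2. $$
Next, we need to find $E\,Var(\hat\beta^*|y_{obs})$ and $Var\,E(\hat\beta^*|y_{obs})$:
$$ Var\,E(\hat\beta^*|y_{obs}) = Var(\hat\beta_r) + \frac{\big(\sum_{s-s_r}\sqrt{x_i}\big)^2}{X^2}\,Var(\bar e) + 2\,\frac{\sum_{s-s_r}\sqrt{x_i}}{X}\,Cov(\hat\beta_r, \bar e). $$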
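Using the Cauchy-Schwarz inequality,
$$ \Big(\sum_{i=1}^n a_i b_i\Big)^2 \le \sum_{i=1}^n a_i^2 \sum_{i=1}^n b_i^2, $$
with $a_i = \sqrt{x_i}$ and $b_i = 1$, we see that
$$ \Big(\sum_{i=1}^n \sqrt{x_i}\Big)^2 \le nX. \tag{6} $$
Now, after some algebra we find that $Cov(\hat\beta_r, \bar e) = 0$ and
$$ Var(\bar e) = \frac{\sigma^2}{n_r}\left(1 - \frac{\big(\sum_{s_r}\sqrt{x_i}\big)^2}{n_r X_r}\right) = (1 - d_1)\frac{\sigma^2}{n_r}, \quad 0 \le d_1 \le 1. $$
Moreover, from (6),
$$ \frac{\big(\sum_{s-s_r}\sqrt{x_i}\big)^2}{X^2} = d_2\,\frac{n_{nr} X_{nr}}{X^2}, \quad 0 \le d_2 \le 1. $$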
Hence,
$$ Var\,E(\hat\beta^*|y_{obs}) = \frac{\sigma^2}{X_r} + d_2(1 - d_1)\,\frac{n_{nr} X_{nr}}{n_r X^2}\,\sigma^2. $$
Next we find that
$$ E(s_e^2) = \frac{n_r - 1}{n_r}\,\sigma^2 - Var(\bar e) = \sigma^2\,\frac{n_r + d_1 - 2}{n_r}, $$
which gives us
$$ E\,Var(\hat\beta^*|y_{obs}) = \frac{X_{nr}}{X^2}\,\sigma^2\,\frac{n_r + d_1 - 2}{n_r}. $$
From (5),
$$ E(k) = \frac{\frac{\sigma^2}{X_r} + d_2(1-d_1)\frac{n_{nr}X_{nr}}{n_r X^2}\sigma^2 - \frac{\sigma^2}{X}}{\frac{X_{nr}}{X^2}\,\sigma^2\,\frac{n_r + d_1 - 2}{n_r}} = \frac{n_r X X_{nr} + d_2(1-d_1)\, n_{nr} X_r X_{nr}}{X_r X_{nr}(n_r + d_1 - 2)} \approx \frac{X}{X_r} + d_2(1-d_1)\frac{n_{nr}}{n_r}. $$
We note that if all $x_i = 1$, then $d_1 = d_2 = 1$. Now, with $f_X = X_{nr}/X$ being the proportion of the x-total in the nonresponse group and $f = n_{nr}/n$ the rate of nonresponse, we finally get, since typically $(1-d_1)d_2 \approx 0$,
$$ k \approx \frac{1}{1-f_X} + d_2(1-d_1)\frac{f}{1-f} \approx \frac{1}{1-f_X} $$
for usual x-values and nonresponse rates.
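To see how close the two approximations for k are on concrete x-values, the quantities $d_1$, $d_2$ and $f_X$ can be computed directly. The following Python sketch takes $d_1 = (\sum_{s_r}\sqrt{x_i})^2/(n_r X_r)$ and $d_2 = (\sum_{s-s_r}\sqrt{x_i})^2/(n_{nr} X_{nr})$, as implied by the Cauchy-Schwarz step (6); the function name and the returned dictionary are illustrative assumptions.

```python
from math import sqrt


def residual_imputation_k(x_resp, x_nonresp):
    """Between-imputation factor for residual regression imputation.

    x_resp:    positive x-values in the response sample
    x_nonresp: positive x-values in the nonresponse sample
    Returns the fuller approximation X/X_r + d2 (1 - d1) n_nr/n_r
    and the simple rule 1 / (1 - f_X).
    """
    n_r, n_nr = len(x_resp), len(x_nonresp)
    if n_r == 0 or n_nr == 0:
        raise ValueError("both groups must be non-empty")
    x_r_tot, x_nr_tot = sum(x_resp), sum(x_nonresp)
    x_tot = x_r_tot + x_nr_tot
    # d1 and d2 lie in [0, 1] by the Cauchy-Schwarz inequality
    d1 = sum(sqrt(x) for x in x_resp) ** 2 / (n_r * x_r_tot)
    d2 = sum(sqrt(x) for x in x_nonresp) ** 2 / (n_nr * x_nr_tot)
    f_x = x_nr_tot / x_tot             # share of the x-total among nonrespondents
    k_full = x_tot / x_r_tot + d2 * (1.0 - d1) * n_nr / n_r
    k_simple = 1.0 / (1.0 - f_x)
    return {"d1": d1, "d2": d2, "f_X": f_x, "k_full": k_full, "k_simple": k_simple}
```

When all x-values equal 1 both expressions reduce to 1/(1 − f), the hot-deck result of Section 3.1.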
4. Multiple imputation for stratified samples
4.1. Separate combinations
One way to combine the m completed data sets is to do it separately for each stratum, that is, to determine a separate factor $k_h$ for each stratum h. The general setup is then as follows: The sample s is divided into H sample strata, $s_1,\dots,s_H$. Let $y_h$ be the planned full data from subsample $s_h$ of size $n_h$. It is assumed that $y_1,\dots,y_H$ are independent.
The observed part of $y_h$ is denoted by $y_{h,obs}$, with $s_{hr}$ being the response sample from $s_h$ of size $n_{hr}$. The estimator based on the full sample data is the sum of independent terms:
$$ \hat\theta = \sum_{h=1}^H \hat\theta_h, \quad\text{where } \hat\theta_h \text{ is based on } y_h. $$
$Var(\hat\theta) = \sum_{h=1}^H Var(\hat\theta_h)$ is estimated by $\hat V(\hat\theta) = \sum_{h=1}^H \hat V_h(y_h)$, where $\hat V_h(y_h)$ is the variance estimate of $\hat\theta_h$ based on $y_h$. For $i \in s_h - s_{hr}$ we impute $y_i^*$ by some method based on $y_{h,obs}$ and let $y_h^*$ denote the complete data $(y_{h,obs}, y_i^*, i \in s_h - s_{hr})$. Based on $y_h^*$, we have $\hat\theta_h^* = \hat\theta_h(y_h^*)$ and $\hat V_h^* = \hat V_h(y_h^*)$. Then the imputation-based estimator is given by $\hat\theta^* = \sum_{h=1}^H \hat\theta_h^*$ and $\hat V^* = \sum_{h=1}^H \hat V_h^*$. Multiple imputation with m repeated imputations leads to m completed data sets with m estimates for each stratum h, $\hat\theta_{h,i}^*$, $i = 1,\dots,m$, and related variance estimates $\hat V_{h,i}^*$, $i = 1,\dots,m$. The total estimates and related variances are $\hat\theta_i^* = \sum_{h=1}^H \hat\theta_{h,i}^*$ and $\hat V_i^* = \sum_{h=1}^H \hat V_{h,i}^*$, for i = 1,…,m. The combined estimate for stratum h is given by
$$ \bar\theta_h^* = \sum_{i=1}^m \hat\theta_{h,i}^*/m. $$
The within-imputation variance for stratum h is
$$ \bar V_h^* = \sum_{i=1}^m \hat V_{h,i}^*/m $$
and the between-imputation component is
$$ B_h^* = \frac{1}{m-1}\sum_{i=1}^m (\hat\theta_{h,i}^* - \bar\theta_h^*)^2. $$
Following the same idea as in Section 2, formula (1), the total estimated variance of $\bar\theta_h^*$ is then proposed to be
$$ W_h = \bar V_h^* + \left(k_h + \frac{1}{m}\right) B_h^*. $$
The combined total estimate is given by
$$ \bar\theta^* = \sum_{i=1}^m \hat\theta_i^*/m = \sum_{h=1}^H \bar\theta_h^*. $$
It follows that the total estimated variance of $\bar\theta^*$ can be expressed as
$$ W_{sep} = \sum_{h=1}^H W_h = \bar V^* + \sum_{h=1}^H \left(k_h + \frac{1}{m}\right) B_h^*, \tag{7} $$
where
$$ \bar V^* = \sum_{i=1}^m \hat V_i^*/m = \sum_{h=1}^H \bar V_h^*. $$
Provided (3) holds for each stratum h,
$$ E(\bar V_h^*) \approx Var(\hat\theta_h), \tag{8} $$
we have from (5) that $k_h$ must satisfy
$$ E(k_h) = \frac{Var\,E(\hat\theta_h^*|Y_{h,obs}) - Var(\hat\theta_h)}{E\,Var(\hat\theta_h^*|Y_{h,obs})}. \tag{9} $$
The combination formula (7) is an alternative to the usual combination formula (1), especially useful when we get simple expressions for $k_h$, but not for k. The next section develops an expression for k in this situation.
4.2. An overall combination formula
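Now let W be given by (1). We shall determine the between-imputation factor k. Since $E(W) = E(W_{sep})$ we have
$$ E\Big\{\sum_{h=1}^H \Big(k_h + \frac{1}{m}\Big) B_h^*\Big\} = E\Big\{\Big(k + \frac{1}{m}\Big) B^*\Big\}. \tag{10} $$
Here,
$$ B^* = \frac{1}{m-1}\sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2 = \frac{1}{m-1}\sum_{i=1}^m \Big\{\sum_{h=1}^H (\hat\theta_{h,i}^* - \bar\theta_h^*)\Big\}^2. $$
We note that
$$ E(B^*|y_{obs}) = E\Big(\sum_{h=1}^H B_h^* \Big| y_{obs}\Big). $$
This follows from the fact that $E(B^*|y_{obs}) = Var(\hat\theta^*|y_{obs}) = \sum_{h=1}^H Var(\hat\theta_h^*|y_{obs})$ and $E(B_h^*|y_{obs}) = Var(\hat\theta_h^*|y_{obs})$.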
Hence, the identity (10) becomes
$$ E\Big\{\sum_{h=1}^H k_h E(B_h^*|Y_{obs})\Big\} = E\{k\,E(B^*|Y_{obs})\}. $$
This gives us a solution for k if we want to use the usual combination formula (1):
$$ k = \frac{\sum_{h=1}^H k_h E(B_h^*|y_{obs})}{E(B^*|y_{obs})} = \frac{\sum_{h=1}^H k_h Var(\hat\theta_h^*|y_{obs})}{Var(\hat\theta^*|y_{obs})} = \sum_{h=1}^H k_h\cdot\frac{Var(\hat\theta_h^*|y_{obs})}{Var(\hat\theta^*|y_{obs})}, \tag{11} $$
a weighted average of the $k_h$. We get a simple expression for k only when all $k_h$ are equal, say $k_h = k_0$. Then $k = k_0$.
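Formula (11) is a variance-weighted average, which the following Python sketch makes explicit. It assumes the stratum factors $k_h$ and the conditional variances $Var(\hat\theta_h^*|y_{obs})$ are available as two lists of equal length; the function name `overall_k` is a hypothetical choice.

```python
def overall_k(k_h, cond_var_h):
    """Overall between-imputation factor k from formula (11).

    k_h:        stratum-specific factors
    cond_var_h: conditional variances Var(theta_h* | y_obs), one per stratum
    """
    if len(k_h) != len(cond_var_h) or not k_h:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(cond_var_h)            # Var(theta* | y_obs) under independence
    if total <= 0:
        raise ValueError("total conditional variance must be positive")
    return sum(k * v for k, v in zip(k_h, cond_var_h)) / total
```

For example, `overall_k([2.0, 4.0], [1.0, 3.0])` weights the second stratum three times as heavily as the first.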
5. Four applications to stratified samples and random nonresponse within strata
5.1. Estimating population average from stratified sample with stratified hot-deck imputation
Consider stratified simple random samples from a finite population of size N, with H strata of sizes $N_h$, h = 1,...,H. The aim is to estimate the population average µ of some variable y. We assume completely random nonresponse within each stratum, typically denoted as MAR (missing at random). This means that the response indicators in stratum h, $R_{h,1},\dots,R_{h,N_h}$, are independent with the same response probability $p_{hr} = P(R_{h,i} = 1)$. The imputation method is stratified hot-deck. Let $y_{h,obs}$ be the observed part from the response sample $s_{hr}$ of size $n_{hr}$ from stratum h. Then an imputed value $y_i^*$ in stratum h is drawn at random from $y_{h,obs}$.
The estimator based on the full sample data is the usual stratified weighted average
$$ \bar Y_{strat} = \frac{1}{N}\sum_{h=1}^H N_h \bar y_h = \sum_{h=1}^H v_h \bar y_h. $$
Here, $v_h = N_h/N$ and $\bar y_h = \sum_{i\in s_h} y_i/n_h$, where $s_h$ is the sample from stratum h and $n_h = |s_h|$. Then
$$ Var(\bar Y_{strat}) = \sum_{h=1}^H v_h^2\sigma_h^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right), \quad\text{with}\quad \sigma_h^2 = \frac{1}{N_h - 1}\sum_{i\in U_h}(y_i - \mu_h)^2 $$
being the population variance in stratum h. Here $U_h$ is stratum population h and $\mu_h$ is the average in $U_h$.
Let $\bar y_{hr}$ be the observed sample mean from stratum h and
$$ \hat\sigma_{hr}^2 = \frac{1}{n_{hr}-1}\sum_{i\in s_{hr}}(y_i - \bar y_{hr})^2 $$
the observed sample variance. The imputation-based estimator is given by
$$ \bar Y_{strat}^* = \frac{1}{N}\sum_{h=1}^H N_h \bar y_h^*, \quad\text{where}\quad \bar y_h^* = \frac{1}{n_h}\Big(\sum_{i\in s_{hr}} y_i + \sum_{i\in s_h - s_{hr}} y_i^*\Big) = \frac{1}{n_h}\Big(n_{hr}\bar y_{hr} + \sum_{i\in s_h - s_{hr}} y_i^*\Big). $$
Let the m imputation replicates of $\bar Y_{strat}^*$ be denoted by $\bar Y_{strat,i}^*$ for i = 1,…,m. The combined estimator is given by
$$ \bar{\bar Y}_{strat}^* = \frac{1}{m}\sum_{i=1}^m \bar Y_{strat,i}^*. $$
5.1.1. Separate strata combinations
It follows from Section 3.1 that
$$ k_h = \frac{1}{1 - f_h}, $$
where $f_h = (n_h - n_{hr})/n_h$ is the rate of nonresponse in stratum h. The combination formula for the variance estimate of $\bar{\bar Y}_{strat}^*$ becomes, from (7),
$$ W = \bar V^* + \sum_{h=1}^H \left(\frac{1}{1 - f_h} + \frac{1}{m}\right) B_h^*. $$
Here, $\bar V^* = \sum_{h=1}^H \bar V_h^*$ and $\bar V_h^*$ is the average of the m values of the imputation-based variance estimate
$$ \hat V_h^* = v_h^2\,\hat\sigma_{*h}^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right), \quad\text{where}\quad \hat\sigma_{*h}^2 = \frac{1}{n_h - 1}\left(\sum_{s_{hr}}(y_i - \bar y_h^*)^2 + \sum_{s_h - s_{hr}}(y_i^* - \bar y_h^*)^2\right). $$
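The separate-strata rule can be sketched in Python once the per-stratum within- and between-imputation components are in hand. The function below is illustrative only: it assumes the caller supplies, for each stratum, the averaged variance estimate, the between-imputation component and the nonresponse rate $f_h < 1$; the name `w_separate` is hypothetical.

```python
def w_separate(v_bar_h, b_h, f_h, m):
    """Total variance estimate from formula (7) with k_h = 1 / (1 - f_h).

    v_bar_h: within-imputation variances per stratum
    b_h:     between-imputation components per stratum
    f_h:     nonresponse rates per stratum (each below 1)
    m:       number of imputations
    """
    if not (len(v_bar_h) == len(b_h) == len(f_h)):
        raise ValueError("per-stratum inputs must have equal length")
    if any(f >= 1.0 or f < 0.0 for f in f_h):
        raise ValueError("nonresponse rates must lie in [0, 1)")
    between = sum((1.0 / (1.0 - f) + 1.0 / m) * b for f, b in zip(f_h, b_h))
    return sum(v_bar_h) + between
```

A stratum with full response (f_h = 0) contributes with the usual factor 1 + 1/m, although its between-imputation component is then zero in practice.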
5.1.2. Overall combination formula. Determination of k in (1)
From (11) we need to determine $Var(v_h\bar Y_h^*|y_{obs})$ and $Var(\bar Y_{strat}^*|y_{obs}) = \sum_{h=1}^H Var(v_h\bar Y_h^*|y_{obs})$. Then we have that
$$ k = \sum_{h=1}^H \frac{1}{1 - f_h}\cdot\frac{Var(v_h\bar Y_h^*|y_{obs})}{Var(\bar Y_{strat}^*|y_{obs})}. $$
Now, for $i \in s_h - s_{hr}$:
$$ E(Y_i^*|y_{h,obs}) = \bar y_{hr} \quad\text{and}\quad Var(Y_i^*|y_{h,obs}) = \frac{n_{hr}-1}{n_{hr}}\,\hat\sigma_{hr}^2. $$
This gives the following results:
$$ E(\bar Y_h^*|y_{h,obs}) = \bar y_{hr} \quad\text{and}\quad Var(\bar Y_h^*|y_{h,obs}) = \frac{n_h - n_{hr}}{n_h^2}\cdot\frac{n_{hr}-1}{n_{hr}}\,\hat\sigma_{hr}^2 \approx f_h\,\frac{\hat\sigma_{hr}^2}{n_h}. $$
Hence we can determine k as
$$ k = \sum_{h=1}^H \frac{1}{1 - f_h}\cdot\frac{f_h v_h^2\hat\sigma_{hr}^2/n_h}{\sum_{j=1}^H f_j v_j^2\hat\sigma_{jr}^2/n_j}. $$
If the stratum sizes $N_h$ are large then we can let $\hat V(v_h\bar Y_h) = v_h^2\hat\sigma_{hr}^2/n_h$. Let also
$$ b_h = f_h\hat V(v_h\bar Y_h)\Big/\sum_{j=1}^H f_j\hat V(v_j\bar Y_j). $$
Then
$$ k = \frac{\sum_{h=1}^H \hat V(v_h\bar Y_h)\,\frac{f_h}{1 - f_h}}{\sum_{h=1}^H \hat V(v_h\bar Y_h)\,f_h} = \sum_{h=1}^H b_h\,\frac{1}{1 - f_h}. \tag{12} $$
Since $\sum_{h=1}^H b_h = 1$, we see that k is a weighted average of the inverses of the response rates. If all $f_h = f$, the overall nonresponse rate, we get, as for a simple random sample, that k = 1/(1−f). Otherwise, the inverse of a stratum response rate $1 - f_h$ gets a large weight if the nonresponse rate is large, if the estimated variance of $v_h\bar Y_h$ is large, or both.
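Formula (12) needs only the stratum weights, observed variances, sample sizes and nonresponse rates. The Python sketch below assumes large stratum sizes, so that $\hat V(v_h\bar Y_h) = v_h^2\hat\sigma_{hr}^2/n_h$, and at least one stratum with nonresponse; the function name `k_formula_12` is a hypothetical label.

```python
def k_formula_12(v_h, sigma2_hr, n_h, f_h):
    """Overall factor k as a weighted average of 1 / (1 - f_h), formula (12).

    v_h:       stratum weights N_h / N
    sigma2_hr: observed sample variances per stratum
    n_h:       full sample sizes per stratum
    f_h:       nonresponse rates per stratum (each below 1)
    """
    # estimated variance of v_h * Ybar_h for large stratum sizes
    v_hat = [v * v * s2 / n for v, s2, n in zip(v_h, sigma2_hr, n_h)]
    raw = [f * vh for f, vh in zip(f_h, v_hat)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("need nonresponse in at least one stratum")
    b = [r / total for r in raw]       # weights b_h, summing to 1
    return sum(bh / (1.0 - f) for bh, f in zip(b, f_h))
```

Strata without nonresponse get weight zero, so k is driven entirely by the strata where imputation actually takes place.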
5.1.3. An alternative expression for k in (1)
By directly applying (5) we can get an alternative expression for k. Given $y_{obs}$, the imputed sample means $\bar Y_h^*$ are independent, which implies that
$$ E(\bar Y_{strat}^*|y_{obs}) = \frac{1}{N}\sum_{h=1}^H N_h\bar y_{hr} = \bar y_{strat,r} \quad\text{and}\quad Var(\bar Y_{strat}^*|y_{obs}) = \sum_{h=1}^H v_h^2\cdot\frac{n_h - n_{hr}}{n_h^2}\cdot\frac{n_{hr}-1}{n_{hr}}\,\hat\sigma_{hr}^2. $$
It follows that
$$ Var(\bar Y_{strat}^*|y_{obs}) \approx \sum_{h=1}^H v_h^2 f_h\cdot\frac{\hat\sigma_{hr}^2}{n_h}. $$
Just like in Section 3.1, (3) holds. From (5) we get
$$ E(k) \approx \frac{Var(\bar Y_{strat,r}) - Var(\bar Y_{strat})}{E\big(\sum_{h=1}^H v_h^2 f_h\hat\sigma_{hr}^2/n_h\big)} $$
$$ = \frac{\sum_{h=1}^H v_h^2\sigma_h^2 E\big(\frac{1}{n_{hr}} - \frac{1}{N_h}\big) - \sum_{h=1}^H v_h^2\sigma_h^2\big(\frac{1}{n_h} - \frac{1}{N_h}\big)}{\sum_{h=1}^H v_h^2 E\{E(f_h\hat\sigma_{hr}^2/n_h\,|\,n_{hr})\}} $$
$$ = \frac{\sum_{h=1}^H v_h^2\sigma_h^2\big[E\big(\frac{1}{n_{hr}}\big) - \frac{1}{n_h}\big]}{\sum_{h=1}^H v_h^2\sigma_h^2 E(f_h)/n_h} $$
$$ \approx \frac{\sum_{h=1}^H v_h^2\sigma_h^2\,\frac{1}{n_h}\cdot\frac{1 - p_{hr}}{p_{hr}}}{\sum_{h=1}^H v_h^2\sigma_h^2\,\frac{1}{n_h}(1 - p_{hr})} = \frac{\sum_{h=1}^H \frac{v_h^2\sigma_h^2}{n_h}\cdot\frac{E(f_h)}{E(1 - f_h)}}{\sum_{h=1}^H \frac{v_h^2\sigma_h^2}{n_h}\,E(f_h)}. \tag{13} $$
Now, $Var(\bar Y_{hr}) = E\,Var(\bar Y_{hr}|n_{hr}) = \sigma_h^2 E(1/n_{hr})$. Let $\hat V(v_h\bar Y_{hr}) = v_h^2\hat\sigma_{hr}^2/n_{hr}$. Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes $N_h$ are large, by letting
$$ k = \frac{\sum_{h=1}^H f_h\hat V(v_h\bar Y_{hr})}{\sum_{h=1}^H f_h(1 - f_h)\hat V(v_h\bar Y_{hr})} = \frac{1}{\sum_{h=1}^H a_h(1 - f_h)}, \tag{14} $$
where the weights are $a_h = f_h\hat V(v_h\bar Y_{hr})/\sum_{j=1}^H f_j\hat V(v_j\bar Y_{jr})$. Since $\sum_{h=1}^H a_h = 1$, we see that 1/k is a weighted average of the response rates. If all $f_h = f$, the overall nonresponse rate, we have, as shown in Section 5.1.2, that k = 1/(1−f). As in Section 5.1.2, we note also in expression (14) that a stratum response rate $1 - f_h$ has large weight if the nonresponse rate is large, if the estimated variance of $v_h\bar Y_{hr}$ is large, or both. We note that the overall estimate based on the response sample is given by
$$ \bar Y_{strat,r} = \sum_{h=1}^H v_h\bar Y_{hr}. $$
We obtain formula (12) for k by noting from (13) that we can express E(k) as
$$ E(k) \approx \frac{\sum_{h=1}^H Var(v_h\bar Y_h)\,\frac{E(f_h)}{E(1 - f_h)}}{\sum_{h=1}^H Var(v_h\bar Y_h)\,E(f_h)}. $$
Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes $N_h$ are large, by letting k be given by (12).
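The alternative expression (14) can be computed in the same way as (12), but with the variance estimates based on the response sample sizes $n_{hr}$. The Python sketch below again assumes large stratum sizes and at least one stratum with nonresponse; the function name `k_formula_14` is hypothetical.

```python
def k_formula_14(v_h, sigma2_hr, n_h, n_hr):
    """Overall factor k from formula (14): 1/k is a weighted average of 1 - f_h.

    v_h:       stratum weights N_h / N
    sigma2_hr: observed sample variances per stratum
    n_h:       full sample sizes per stratum
    n_hr:      response sample sizes per stratum (each positive)
    """
    f_h = [(n - nr) / n for n, nr in zip(n_h, n_hr)]   # nonresponse rates
    # estimated variance of v_h * Ybar_hr based on the respondents
    v_hat = [v * v * s2 / nr for v, s2, nr in zip(v_h, sigma2_hr, n_hr)]
    numerator = sum(f * vh for f, vh in zip(f_h, v_hat))
    denominator = sum(f * (1.0 - f) * vh for f, vh in zip(f_h, v_hat))
    if denominator <= 0:
        raise ValueError("need partial nonresponse in at least one stratum")
    return numerator / denominator
```

Comparing this value with the one from formula (12) on the same data gives a quick impression of how much the two weighting schemes differ in practice.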
5.2. Logistic regression with binary explanatory variable. Estimating log(odds ratio)
The model is as follows:
$Y_1,\dots,Y_n$ are independent 0/1-variables.
Explanatory 0/1-variable x with fixed known values $x_1,\dots,x_n$.
Class probabilities: $\pi_1 = P(Y_i = 1|x_i = 1)$ and $\pi_0 = P(Y_i = 1|x_i = 0)$.
Response variables: $R_1,\dots,R_n$ with MAR (missing at random) model: $P(R_i = 1|x_i = 1) = p_{1r}$ and $P(R_i = 1|x_i = 0) = p_{0r}$.
We can reparametrize the model in a logit version:
$$ \log\frac{P(Y = 1|x)}{P(Y = 0|x)} = \alpha + \beta x, $$
giving us the following 1-1 relationships:
$$ \alpha = \log\frac{\pi_0}{1 - \pi_0} \iff \pi_0 = \frac{1}{1 + e^{-\alpha}}, $$
$$ \beta = \log\frac{\pi_1/(1 - \pi_1)}{\pi_0/(1 - \pi_0)} = \log(\text{odds ratio}), \quad\text{and}\quad \pi_1 = \frac{1}{1 + e^{-(\alpha + \beta)}}. $$
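The 1-1 relationship between the class probabilities and the logit parameters can be checked numerically. The two Python helpers below are a small sketch with hypothetical names; they assume probabilities strictly between 0 and 1.

```python
from math import exp, log


def to_logit(pi0, pi1):
    """Map class probabilities (pi0, pi1) to logit parameters (alpha, beta)."""
    alpha = log(pi0 / (1.0 - pi0))
    # beta is the log of the odds ratio between x = 1 and x = 0
    beta = log((pi1 / (1.0 - pi1)) / (pi0 / (1.0 - pi0)))
    return alpha, beta


def from_logit(alpha, beta):
    """Map logit parameters (alpha, beta) back to (pi0, pi1)."""
    pi0 = 1.0 / (1.0 + exp(-alpha))
    pi1 = 1.0 / (1.0 + exp(-(alpha + beta)))
    return pi0, pi1
```

Applying `from_logit` to the output of `to_logit` should return the original probabilities up to rounding error.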
The aim is to estimate β. Let s = (1,…,n) denote the full sample with strata $s_1 = \{i \in s : x_i = 1\}$ and $s_0 = \{i \in s : x_i = 0\}$. The sizes of $s_1$ and $s_0$ are denoted by $n_1$ and $n_0$. We note that =∑n = and