
Non-Bayesian multiple imputation


Discussion Papers No. 421, May 2005 Statistics Norway, Research Department

Jan F. Bjørnstad

Non-Bayesian Multiple Imputation

Abstract:

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data. Rubin’s combination formula requires that the imputation method is “proper” which essentially means that the imputations are random draws from a posterior distribution in a Bayesian framework. In national statistical institutes (NSI’s) like Statistics Norway, the methods used for imputing for nonresponse are typically non-Bayesian, e.g., some kind of stratified hot-deck. Hence, Rubin’s method of multiple imputation is not valid and cannot be applied in NSI’s. This paper deals with the problem of deriving an alternative combination formula that can be applied for imputation methods typically used in NSI’s and suggests an approach for studying this problem. Alternative combination formulas are derived for certain response mechanisms and hot-deck type imputation methods.

Keywords: Multiple imputation, survey sampling, nonresponse, hot-deck imputation

JEL classification: C42, C13, C15

Address: Jan F. Bjørnstad, Statistics Norway, Division for Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo, Norway. E-mail: [email protected]


Discussion Papers comprise research papers intended for international journals or books. A preprint of a Discussion Paper may be longer and more elaborate than a standard journal article, as it may include intermediate calculations and background material etc.

Abstracts with downloadable Discussion Papers in PDF are available on the Internet:

http://www.ssb.no

http://ideas.repec.org/s/ssb/dispap.html

For printed Discussion Papers contact:

Statistics Norway

Sales- and subscription service NO-2225 Kongsvinger

Telephone: +47 62 88 55 00


1. Introduction

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data, developed by Rubin (1987). The basic idea is to create m imputed values for each missing value and to combine the m completed data sets by Rubin’s combination formula for variance estimation. For the estimator to be valid, the imputations must display an appropriate level of variability. In Rubin’s terminology, the imputation method is required to be “proper”. In national statistical institutes (NSI’s) the methods used for imputing for nonresponse very seldom, if ever, satisfy the requirement of being “proper”. However, the idea of creating multiple imputations to measure the imputation uncertainty, and of using it for variance estimation and for computing confidence intervals, is still of interest. The problem is then that Rubin’s combination formula is no longer valid with the usual nonproper imputations used by NSI’s. The reason is that nonproper imputations display too little variability, so the between-imputation component must be given a larger weight in the variance estimate. The problem is then to determine what this weight should be to give valid statistical inference, and also for what kinds of nonresponse mechanisms and estimation problems it is possible to determine a simple combination formula that does not depend on unknown parameters. This paper suggests an approach for studying this problem.

In Section 2 an approach for determining the combination of the imputed completed data sets is suggested. Section 3 has two applications with random nonresponse, (i) estimating a population average from simple random samples using hot-deck imputation and (ii) estimating a regression coefficient using residual regression imputation. Section 4 deals with the general problem of multiple imputation for stratified samples. In Section 5 we apply the theory in Section 4 to stratified samples with random nonresponse within strata, covering (i) estimation of population average using stratified hot-deck imputation and (ii) estimation of log(odds ratios) in logistic regression with missingness both for the dependent variable and the explanatory variable. Section 6 takes up the problem of using the same combination rule for all estimation problems with a given imputation method and data and response model.


2. An approach for determining an alternative combination formula for variance estimation in multiple imputation

Let $s = (1,\ldots,n)$ denote the full sample, with $y = (y_1,\ldots,y_n)$ denoting the full sample data, values of the random variables $Y_1,\ldots,Y_n$. The objective is to estimate some parameter $\theta$. Now, let $y_{obs}$ be the observed part of $y$, with $s_r$ being the response sample of size $n_r$,

$$y_{obs} = (y_i : i \in s_r).$$

Let $\hat\theta$ be the estimator based on the full sample data $y$, with $\mathrm{Var}(\hat\theta)$ estimated by $\hat V(y)$. For $i \in s - s_r$ we impute by some method $y_i^*$ and let $y^*$ denote the complete data $(y_{obs}, y_i^*, i \in s - s_r)$. Based on $y^*$, we have $\hat\theta^* = \hat\theta(y^*)$ and $\hat V^* = \hat V(y^*)$.

Multiple imputation of m repeated imputations leads to m completed data sets with m estimates $\hat\theta_i^*$, $i = 1,\ldots,m$, and related variance estimates $\hat V_i^*$, $i = 1,\ldots,m$. The combined estimate is given by

$$\bar\theta^* = \sum_{i=1}^m \hat\theta_i^* / m.$$

The within-imputation variance is defined as

$$\bar V^* = \sum_{i=1}^m \hat V_i^* / m$$

and the between-imputation component is

$$B^* = \frac{1}{m-1}\sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2.$$

The total estimated variance of $\bar\theta^*$ is then proposed to be

$$W = \bar V^* + \left(k + \frac{1}{m}\right) B^*. \tag{1}$$

That is, we need to determine k such that

$$E(W) = \mathrm{Var}(\bar\theta^*). \tag{2}$$

Rubin (1987) has shown that k = 1 can be used with proper imputations, which essentially means drawing imputed values from a posterior distribution in a Bayesian framework.
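The combination rule (1) can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function name `combine` and its arguments are hypothetical, and the factor k has to be supplied by the user (k = 1 corresponds to Rubin's rule for proper imputations).

```python
from statistics import mean, variance


def combine(estimates, variances, k=1.0):
    """Combine m completed-data results following formula (1).

    estimates -- the m point estimates, one per completed data set
    variances -- the m completed-data variance estimates
    k         -- weight on the between-imputation component
                 (k = 1 gives Rubin's rule; any other value is the user's choice)
    Returns (combined estimate, total variance W).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates and as many variances")
    theta_bar = mean(estimates)   # combined estimate
    v_bar = mean(variances)       # within-imputation variance
    b = variance(estimates)       # between-imputation variance, divisor m - 1
    w = v_bar + (k + 1.0 / m) * b
    return theta_bar, w
```

For example, `combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], k=2.0)` weights the between-imputation variance by 2 + 1/3 instead of Rubin's 1 + 1/3.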

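In general, one has to determine the terms in (2). One way to try and do this is to use double expectation, conditioning on $y_{obs}$, that is,

$$E(W) = E\{E(W \mid Y_{obs})\}$$

$$\mathrm{Var}(\bar\theta^*) = E\{\mathrm{Var}(\bar\theta^* \mid Y_{obs})\} + \mathrm{Var}\{E(\bar\theta^* \mid Y_{obs})\}.$$

Typically,

$$E(\bar V^*) \approx \mathrm{Var}(\hat\theta) \tag{3}$$

and $E(B^* \mid y_{obs}) = \mathrm{Var}(\hat\theta^* \mid y_{obs})$. Hence, approximately

$$E(W) = \mathrm{Var}(\hat\theta) + \left(E(k) + \frac{1}{m}\right) E\,\mathrm{Var}(\hat\theta^* \mid Y_{obs}). \tag{4}$$

Moreover,

$$\mathrm{Var}(\bar\theta^* \mid y_{obs}) = \mathrm{Var}(\hat\theta^* \mid y_{obs})/m \quad\text{and}\quad E(\bar\theta^* \mid y_{obs}) = E(\hat\theta^* \mid y_{obs}).$$

This implies that

$$\mathrm{Var}(\bar\theta^*) = \frac{1}{m} E\{\mathrm{Var}(\hat\theta^* \mid Y_{obs})\} + \mathrm{Var}\{E(\hat\theta^* \mid Y_{obs})\}.$$

From (3) and (4), the equation (2) becomes

$$\mathrm{Var}(\hat\theta) + E(k)\, E\,\mathrm{Var}(\hat\theta^* \mid Y_{obs}) = \mathrm{Var}\{E(\hat\theta^* \mid Y_{obs})\},$$

which gives the following general expression for E(k):

$$E(k) = \frac{\mathrm{Var}\,E(\hat\theta^* \mid Y_{obs}) - \mathrm{Var}(\hat\theta)}{E\,\mathrm{Var}(\hat\theta^* \mid Y_{obs})}. \tag{5}$$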

For this to be of interest, k must be, at least approximately, determined independent of unknown parameters. In addition, one needs to check that (3) holds.

To illustrate how (5) can be used we shall in the next section consider two special cases with random nonresponse.

3. Two applications to simple random samples and random nonresponse

3.1. Estimating population average with hot-deck imputation

Consider a simple random sample from a finite population of size N, where the aim is to estimate the population average µ of some variable y. We shall assume completely random nonresponse, in Rubin’s terminology MCAR (missing completely at random). We note that MCAR means that the response indicators $R_1,\ldots,R_N$ are independent with the same response probability $p_r = P(R_i = 1)$. The imputation method is the hot-deck method, where $y_i^*$ is drawn at random from $y_{obs}$, and the estimate is the sample mean. Let $\bar y_r$ be the observed sample mean and

$$\hat\sigma_r^2 = \frac{1}{n_r - 1}\sum_{i \in s_r} (y_i - \bar y_r)^2$$

the observed sample variance. Then $\bar Y^*$ is the imputation-based sample mean for the completed sample, and the combined estimator is given by

$$\bar{\bar Y}^* = \sum_{i=1}^m \bar Y_i^* / m.$$

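Let $\bar Y_s$ denote the sample mean based on a full sample. Then

$$\mathrm{Var}(\bar Y_s) = \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right), \quad\text{with}\quad \sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (y_i - \mu)^2$$

being the population variance. We have further that

$$E(\bar Y^* \mid y_{obs}) = \bar y_r \quad\text{and}\quad \mathrm{Var}(\bar Y^* \mid y_{obs}) = \frac{n - n_r}{n^2}\cdot\frac{n_r - 1}{n_r}\,\hat\sigma_r^2,$$

using that

$$E(Y_i^* \mid y_{obs}) = \bar y_r \quad\text{and}\quad \mathrm{Var}(Y_i^* \mid y_{obs}) = \frac{n_r - 1}{n_r}\,\hat\sigma_r^2.$$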

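In this case,

$$\hat V^* = \hat\sigma_*^2\left(\frac{1}{n} - \frac{1}{N}\right) \quad\text{where}\quad \hat\sigma_*^2 = \frac{1}{n-1}\left(\sum_{i \in s_r} (y_i - \bar Y^*)^2 + \sum_{i \in s - s_r} (y_i^* - \bar Y^*)^2\right).$$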

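It can be shown that

$$E(\hat\sigma_*^2 \mid y_{obs}) = \hat\sigma_r^2\left(1 - \frac{1}{n_r}\right)\left(1 + \frac{n_r}{n(n-1)}\right) \approx \hat\sigma_r^2$$

and (3) holds. We find, from (5),

$$E(k) = \frac{\mathrm{Var}(\bar Y_r) - \mathrm{Var}(\bar Y_s)}{E\,\mathrm{Var}(\bar Y^* \mid Y_{obs})} = \frac{\sigma^2 E\left(\frac{1}{n_r} - \frac{1}{N}\right) - \sigma^2\left(\frac{1}{n} - \frac{1}{N}\right)}{E\left\{\frac{n - n_r}{n^2}\cdot\frac{n_r - 1}{n_r}\,\hat\sigma_r^2\right\}}$$

$$= \frac{E(1/n_r) - 1/n}{E\left\{\frac{(n - n_r)(n_r - 1)}{n^2 n_r}\right\}} \approx \frac{(1 - p_r)/p_r}{1 - p_r} = \frac{1}{p_r},$$

which is satisfied approximately by letting

$$k = \frac{1}{1 - f}$$

where $f = (n - n_r)/n$ is the rate of nonresponse.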
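To see how close the approximation E(k) ≈ 1/p_r is for a finite sample, the ratio of expectations above can be evaluated numerically. The sketch below is illustrative only and not part of the paper: it assumes that n_r follows a binomial distribution conditioned on n_r ≥ 1 (so that 1/n_r is defined), and the function name `expected_k` is hypothetical.

```python
from math import exp, lgamma, log


def expected_k(n, p):
    """Evaluate [E(1/n_r) - 1/n] / E[(n - n_r)(n_r - 1) / (n^2 n_r)].

    n_r is taken as Binomial(n, p) conditioned on n_r >= 1 (an assumption of
    this sketch). sigma^2 and the 1/N terms cancel, so they do not appear.
    Requires 0 < p < 1.
    """
    log_p, log_q = log(p), log(1.0 - p)
    weights = {}
    for r in range(1, n + 1):
        # Binomial probability computed on the log scale to avoid overflow.
        log_pmf = (lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
                   + r * log_p + (n - r) * log_q)
        weights[r] = exp(log_pmf)
    total = sum(weights.values())
    mean_inv = sum(w / r for r, w in weights.items()) / total
    denom = sum(w * (n - r) * (r - 1) / (n * n * r)
                for r, w in weights.items()) / total
    return (mean_inv - 1.0 / n) / denom
```

With a response probability of 0.7, the returned value should lie close to 1/0.7 and move closer as n grows, which is what the choice k = 1/(1 - f) relies on.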

3.2. Estimating regression coefficient with residual imputation

We shall assume completely random nonresponse as in Section 3.1. We consider a ratio model, i.e., regression through the origin:

$$Y_i = \beta x_i + \varepsilon_i, \quad\text{with}\quad \mathrm{Var}(\varepsilon_i) = \sigma^2 x_i;\ i = 1,\ldots,n.$$

It is assumed that all $x_i$’s are known, also in the nonresponse sample. The full data estimator of β is given by

$$\hat\beta = \sum_{i=1}^n Y_i \Big/ \sum_{i=1}^n x_i.$$

The unbiased estimator of $\sigma^2$ is given by

$$\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n \frac{1}{x_i}(y_i - \hat\beta x_i)^2.$$

We shall consider residual regression imputation:

Let $\hat\beta_r$ be the $\hat\beta$-estimate based on the observed sample $s_r$. Define the standardized residuals

$$e_i = (y_i - \hat\beta_r x_i)/\sqrt{x_i}, \quad\text{for } i \in s_r.$$

For $i \in s - s_r$: Draw the value of $e_i^*$ at random from the set of observed residuals $e_i$, $i \in s_r$, and the imputed y-value is given by

$$y_i^* = \hat\beta_r x_i + e_i^*\sqrt{x_i}.$$
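The residual regression imputation step described above can be sketched as follows. This is a hedged illustration, not the author's implementation: the function name `residual_impute` is hypothetical, residuals are assumed to be drawn with replacement, and all x-values are assumed to be strictly positive.

```python
import math
import random


def residual_impute(x_obs, y_obs, x_mis, rng=None):
    """Impute y for nonrespondents in the ratio model Y_i = beta*x_i + eps_i,
    Var(eps_i) = sigma^2 * x_i, by drawing standardized residuals at random
    (with replacement, an assumption of this sketch) from the respondents."""
    rng = rng or random.Random()
    beta_r = sum(y_obs) / sum(x_obs)             # ratio estimate from respondents
    resid = [(y - beta_r * x) / math.sqrt(x)     # standardized residuals e_i
             for x, y in zip(x_obs, y_obs)]
    # Imputed value: fitted value plus a rescaled, randomly drawn residual.
    return [beta_r * x + rng.choice(resid) * math.sqrt(x) for x in x_mis]
```

Passing a seeded generator, for instance `rng=random.Random(1)`, makes a set of imputations reproducible; calling the function m times with different seeds gives m completed data sets.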

Let $X = \sum_{i=1}^n x_i$, $X_r = \sum_{i \in s_r} x_i$ and $X_{nr} = \sum_{i \in s - s_r} x_i$. All considerations from now on are conditional on $n_r$ and $X_r$, and we aim to determine k directly from (5). Define the proportion of the x-total in the nonresponse group to be $f_X = X_{nr}/X$. We now have

$$\hat\beta^* = \left(\sum_{i \in s_r} y_i + \sum_{i \in s - s_r} y_i^*\right) \Big/ X$$

$$\hat\sigma_*^2 = \frac{1}{n-1}\left(\sum_{i \in s_r} \frac{1}{x_i}(y_i - \hat\beta^* x_i)^2 + \sum_{i \in s - s_r} \frac{1}{x_i}(y_i^* - \hat\beta^* x_i)^2\right).$$

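In order to determine k from (5) we need to check the validity of (3) and derive the following quantities: $\mathrm{Var}(\hat\beta^* \mid y_{obs})$, $E(\hat\beta^* \mid y_{obs})$ and $\mathrm{Var}(\hat\beta)$. We note that $\mathrm{Var}(\hat\beta) = \sigma^2/X$. Consider (3), which is equivalent to

$$E(\hat\sigma_*^2) \approx \sigma^2.$$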

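Let

$$\hat\beta_{nr}^* = \sum_{i \in s - s_r} y_i^* / X_{nr}, \quad\text{and}\quad \hat\sigma_{nr}^{*2} = \frac{1}{n_{nr} - 1}\sum_{i \in s - s_r} \frac{1}{x_i}(y_i^* - \hat\beta_{nr}^* x_i)^2.$$

Here, $n_{nr} = n - n_r$. Then, after some algebra, one can express $\hat\sigma_*^2$ in the following way, with $\hat\sigma_r^2$ denoting the analogous estimator computed from the response sample $s_r$:

$$\hat\sigma_*^2 = \frac{1}{n-1}\left((n_r - 1)\hat\sigma_r^2 + (n_{nr} - 1)\hat\sigma_{nr}^{*2} + \frac{X_r X_{nr}}{X}(\hat\beta_r - \hat\beta_{nr}^*)^2\right).$$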

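In this case,

$$E(Y_i^* \mid y_{obs}) = \hat\beta_r x_i + \bar e\sqrt{x_i}, \quad\text{where } \bar e = \sum_{i \in s_r} e_i / n_r,$$

$$\mathrm{Var}(Y_i^* \mid y_{obs}) = x_i s_e^2, \quad\text{where } s_e^2 = \frac{1}{n_r}\sum_{i \in s_r}(e_i - \bar e)^2.$$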

Using this, it can be shown that $E(\hat\sigma_*^2)$ equals $\sigma^2$ multiplied by a factor that differs from 1 only through correction terms involving f, n, $n_r$ and constants $c_1, c_2, c_3$ that lie in the interval (0,1).

Hence, $E(\hat\sigma_*^2) \approx \sigma^2$ and (3) follows, at least for moderate and large $n_r$.

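Next, we look at $\mathrm{Var}(\hat\beta^* \mid y_{obs})$ and $E(\hat\beta^* \mid y_{obs})$: We see that $\hat\beta^* = (\hat\beta_r X_r + \hat\beta_{nr}^* X_{nr})/X$, and

$$E(\hat\beta_{nr}^* \mid y_{obs}) = \hat\beta_r + \bar e\,\frac{\sum_{i \in s - s_r}\sqrt{x_i}}{X_{nr}}$$

$$\mathrm{Var}(\hat\beta_{nr}^* \mid y_{obs}) = s_e^2/X_{nr}.$$

This gives us

$$E(\hat\beta^* \mid y_{obs}) = \hat\beta_r + \bar e\,\frac{\sum_{i \in s - s_r}\sqrt{x_i}}{X}$$

$$\mathrm{Var}(\hat\beta^* \mid y_{obs}) = \frac{X_{nr}}{X^2}\,s_e^2.$$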

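Next, we need to find $E\,\mathrm{Var}(\hat\beta^* \mid y_{obs})$ and $\mathrm{Var}\,E(\hat\beta^* \mid y_{obs})$:

$$\mathrm{Var}\,E(\hat\beta^* \mid y_{obs}) = \mathrm{Var}(\hat\beta_r) + \frac{\left(\sum_{i \in s - s_r}\sqrt{x_i}\right)^2}{X^2}\,\mathrm{Var}(\bar e) + 2\,\frac{\sum_{i \in s - s_r}\sqrt{x_i}}{X}\,\mathrm{Cov}(\hat\beta_r, \bar e).$$

Using the Cauchy-Schwarz inequality,

$$\left(\sum_{i=1}^n a_i b_i\right)^2 \le \sum_{i=1}^n a_i^2 \sum_{i=1}^n b_i^2$$

with $a_i = \sqrt{x_i}$ and $b_i = 1$, we see that

$$\left(\sum_{i=1}^n \sqrt{x_i}\right)^2 \le nX. \tag{6}$$

Now, after some algebra we find that $\mathrm{Cov}(\hat\beta_r, \bar e) = 0$ and

$$\mathrm{Var}(\bar e) = \frac{\sigma^2}{n_r}\left(1 - \frac{\left(\sum_{i \in s_r}\sqrt{x_i}\right)^2}{n_r X_r}\right) = (1 - d_1)\frac{\sigma^2}{n_r}, \quad 0 \le d_1 \le 1.$$

Moreover, from (6),

$$\frac{\left(\sum_{i \in s - s_r}\sqrt{x_i}\right)^2}{X^2} = d_2\,\frac{n_{nr} X_{nr}}{X^2}, \quad 0 \le d_2 \le 1.$$

Hence,

$$\mathrm{Var}\,E(\hat\beta^* \mid y_{obs}) = \frac{\sigma^2}{X_r} + d_2(1 - d_1)\,\frac{n_{nr} X_{nr}}{X^2}\cdot\frac{\sigma^2}{n_r}.$$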

Next we find that

$$E(s_e^2) = \frac{n_r - 1}{n_r}\,\sigma^2 - \mathrm{Var}(\bar e) = \frac{\sigma^2}{n_r}(n_r - 2 + d_1)$$

which gives us

$$E\,\mathrm{Var}(\hat\beta^* \mid y_{obs}) = \frac{X_{nr}}{X^2}\cdot\frac{\sigma^2}{n_r}(n_r - 2 + d_1).$$

From (5),

$$k = \frac{\dfrac{\sigma^2}{X_r} + d_2(1 - d_1)\dfrac{n_{nr} X_{nr}}{X^2 n_r}\,\sigma^2 - \dfrac{\sigma^2}{X}}{\dfrac{X_{nr}}{X^2}\cdot\dfrac{\sigma^2}{n_r}(n_r - 2 + d_1)} = \frac{\dfrac{X n_r}{X_r} + d_2(1 - d_1)\,n_{nr}}{n_r - 2 + d_1} \approx \frac{X}{X_r} + d_2(1 - d_1)\frac{n_{nr}}{n_r}.$$

We note that if all $x_i = 1$, then $d_1 = d_2 = 1$. Now, with $f_X = X_{nr}/X$ being the proportion of the x-total in the nonresponse group and $f = n_{nr}/n$ the rate of nonresponse, we finally get, since typically $(1 - d_1)d_2 \approx 0$,

$$k \approx \frac{1}{1 - f_X} + d_2(1 - d_1)\frac{f}{1 - f} \approx \frac{1}{1 - f_X}$$

for usual x-values and nonresponse rates.
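The size of the neglected term can be checked numerically for a given set of x-values. The sketch below is an illustration under stated assumptions: the function name `k_ratio_model` is hypothetical, $d_1$ and $d_2$ are computed from their defining relations in this section, and `k` is the expression obtained from (5) before the final approximation.

```python
import math


def k_ratio_model(x_resp, x_nonresp):
    """Return (k, approx) for residual imputation in the ratio model.

    k      -- [X*n_r/X_r + d2*(1 - d1)*n_nr] / (n_r - 2 + d1)
    approx -- the simple rule 1 / (1 - f_X)
    Requires at least two respondents, one nonrespondent, and all x > 0.
    """
    n_r, n_nr = len(x_resp), len(x_nonresp)
    x_r, x_nr = sum(x_resp), sum(x_nonresp)
    x_tot = x_r + x_nr
    # d1 and d2 lie in [0, 1] by the Cauchy-Schwarz inequality (6).
    d1 = sum(math.sqrt(x) for x in x_resp) ** 2 / (n_r * x_r)
    d2 = sum(math.sqrt(x) for x in x_nonresp) ** 2 / (n_nr * x_nr)
    k = (x_tot * n_r / x_r + d2 * (1.0 - d1) * n_nr) / (n_r - 2.0 + d1)
    approx = 1.0 / (1.0 - x_nr / x_tot)
    return k, approx
```

With all $x_i = 1$, 600 respondents and 400 nonrespondents, the two values differ only in the third decimal, whereas with 6 respondents and 4 nonrespondents the difference is noticeable, in line with the large-$n_r$ nature of the approximation.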


4. Multiple imputation for stratified samples

4.1. Separate combinations

One way to combine the m completed data sets is to do it separately for each stratum, that is, to determine a factor $k_h$ for each stratum h. The general setup is then as follows: The sample s is divided into H sample strata, $s_1,\ldots,s_H$. Let $y_h$ be the planned full data from subsample $s_h$ of size $n_h$. It is assumed that $y_1,\ldots,y_H$ are independent.

The observed part of $y_h$ is denoted by $y_{h,obs}$ with $s_{hr}$ being the response sample from $s_h$ of size $n_{hr}$. The estimator based on the full sample data is the sum of independent terms:

$$\hat\theta = \sum_{h=1}^H \hat\theta_h \quad\text{where } \hat\theta_h \text{ is based on } y_h.$$

$\mathrm{Var}(\hat\theta) = \sum_{h=1}^H \mathrm{Var}(\hat\theta_h)$ is estimated by $\hat V(\hat\theta) = \sum_{h=1}^H \hat V_h(y_h)$, where $\hat V_h(y_h)$ is the variance estimate of $\hat\theta_h$ based on $y_h$. For $i \in s_h - s_{hr}$ we impute by some method $y_i^*$ based on $y_{h,obs}$ and let $y_h^*$ denote the complete data $(y_{h,obs}, y_i^*, i \in s_h - s_{hr})$. Based on $y_h^*$, we have $\hat\theta_h^* = \hat\theta_h(y_h^*)$ and $\hat V_h^* = \hat V_h(y_h^*)$. Then the imputation-based estimator is given by $\hat\theta^* = \sum_{h=1}^H \hat\theta_h^*$ and $\hat V^* = \sum_{h=1}^H \hat V_h^*$. Multiple imputation of m repeated imputations leads to m completed data sets with m estimates for each stratum h, $\hat\theta_{h,i}^*$, $i = 1,\ldots,m$, and related variance estimates $\hat V_{h,i}^*$, $i = 1,\ldots,m$. The total estimates and related variances are $\hat\theta_i^* = \sum_{h=1}^H \hat\theta_{h,i}^*$ and $\hat V_i^* = \sum_{h=1}^H \hat V_{h,i}^*$, for $i = 1,\ldots,m$. The combined estimate for stratum h is given by

$$\bar\theta_h^* = \sum_{i=1}^m \hat\theta_{h,i}^* / m.$$

The within-imputation variance for stratum h is

$$\bar V_h^* = \sum_{i=1}^m \hat V_{h,i}^* / m$$

and the between-imputation component is

$$B_h^* = \frac{1}{m-1}\sum_{i=1}^m (\hat\theta_{h,i}^* - \bar\theta_h^*)^2.$$

Following the same idea as in Section 2, formula (1), the total estimated variance of $\bar\theta_h^*$ is then proposed to be

$$W_h = \bar V_h^* + \left(k_h + \frac{1}{m}\right) B_h^*.$$

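The combined total estimate is given by

$$\bar\theta^* = \sum_{i=1}^m \hat\theta_i^* / m = \sum_{h=1}^H \bar\theta_h^*.$$

It follows that the total estimated variance of $\bar\theta^*$ can be expressed as

$$W_{sep} = \sum_{h=1}^H W_h = \bar V^* + \sum_{h=1}^H \left(k_h + \frac{1}{m}\right) B_h^* \tag{7}$$

where

$$\bar V^* = \sum_{i=1}^m \hat V_i^* / m = \sum_{h=1}^H \bar V_h^*.$$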

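Provided (3) holds for each stratum h,

$$E(\bar V_h^*) \approx \mathrm{Var}(\hat\theta_h) \tag{8}$$

we have from (5) that $k_h$ must satisfy

$$E(k_h) = \frac{\mathrm{Var}\,E(\hat\theta_h^* \mid Y_{h,obs}) - \mathrm{Var}(\hat\theta_h)}{E\,\mathrm{Var}(\hat\theta_h^* \mid Y_{h,obs})}. \tag{9}$$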

The combination formula (7) is an alternative to the usual combination formula (1), especially useful when we get simple expressions for $k_h$, but not for k. The next section develops an expression for k in this situation.
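The stratum-wise rule (7) can be sketched in the same way as formula (1). The code below is a minimal illustration, not taken from the paper: the function name `combine_separate` and the nested-list layout of its arguments are assumptions, and the factors $k_h$ must be supplied by the user.

```python
from statistics import mean, variance


def combine_separate(estimates, variances, k):
    """Stratum-wise combination following formula (7).

    estimates[h][i] -- estimate for stratum h from imputation i
    variances[h][i] -- matching completed-data variance estimate
    k[h]            -- between-imputation weight for stratum h
    Returns (combined total estimate, W_sep). Needs m >= 2 imputations.
    """
    m = len(estimates[0])
    theta_bar = sum(mean(e_h) for e_h in estimates)   # sum of stratum means
    v_bar = sum(mean(v_h) for v_h in variances)       # within-imputation part
    between = sum((k_h + 1.0 / m) * variance(e_h)     # weighted between part
                  for k_h, e_h in zip(k, estimates))
    return theta_bar, v_bar + between
```

A stratum without imputation variability contributes nothing to the between-imputation part, whatever its $k_h$.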

4.2. An overall combination formula

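Now let W be given by (1). We shall determine the between-imputation factor k. Since $E(W) = E(W_{sep})$ we have

$$E\left\{\sum_{h=1}^H \left(k_h + \frac{1}{m}\right) B_h^*\right\} = E\left\{\left(k + \frac{1}{m}\right) B^*\right\}. \tag{10}$$

Here,

$$B^* = \frac{1}{m-1}\sum_{i=1}^m (\hat\theta_i^* - \bar\theta^*)^2 = \frac{1}{m-1}\sum_{i=1}^m \left\{\sum_{h=1}^H (\hat\theta_{h,i}^* - \bar\theta_h^*)\right\}^2.$$

We note that

$$E(B^* \mid y_{obs}) = E\left(\sum_{h=1}^H B_h^* \,\Big|\, y_{obs}\right).$$

This follows from the fact that $E(B^* \mid y_{obs}) = \mathrm{Var}(\hat\theta^* \mid y_{obs}) = \sum_{h=1}^H \mathrm{Var}(\hat\theta_h^* \mid y_{obs})$ and $E(B_h^* \mid y_{obs}) = \mathrm{Var}(\hat\theta_h^* \mid y_{obs})$.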

Hence, the identity (10) becomes

$$E\left\{\sum_{h=1}^H k_h E(B_h^* \mid Y_{obs})\right\} = E\{k\,E(B^* \mid Y_{obs})\}.$$

This gives us a solution for k if we want to use the usual combination formula (1):

$$k = \frac{\sum_{h=1}^H k_h E(B_h^* \mid y_{obs})}{E(B^* \mid y_{obs})} = \frac{\sum_{h=1}^H k_h \mathrm{Var}(\hat\theta_h^* \mid y_{obs})}{\mathrm{Var}(\hat\theta^* \mid y_{obs})} = \sum_{h=1}^H k_h\,\frac{\mathrm{Var}(\hat\theta_h^* \mid y_{obs})}{\mathrm{Var}(\hat\theta^* \mid y_{obs})}, \tag{11}$$

a weighted average of $k_h$. We get a simple expression for k only when all $k_h$ are equal, say $k_h = k_0$. Then $k = k_0$.
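Formula (11) is a variance-weighted average, which the following short sketch makes explicit. The function name `overall_k` is hypothetical, and the conditional variances $\mathrm{Var}(\hat\theta_h^* \mid y_{obs})$ must be supplied or estimated by the user, for example by the between-imputation variances $B_h^*$ (an assumption of this sketch, motivated by $E(B_h^* \mid y_{obs}) = \mathrm{Var}(\hat\theta_h^* \mid y_{obs})$).

```python
def overall_k(k_h, cond_var_h):
    """Overall factor k as the variance-weighted average in (11).

    k_h        -- stratum factors k_h
    cond_var_h -- values standing in for Var(theta_h* | y_obs), one per stratum
    """
    total = sum(cond_var_h)
    if total <= 0:
        raise ValueError("conditional variances must sum to a positive value")
    return sum(k * v for k, v in zip(k_h, cond_var_h)) / total
```

When all $k_h$ are equal the weights are irrelevant and the common value is returned, which matches the remark that $k = k_0$ in that case.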

5. Four applications to stratified samples and random nonresponse within strata

5.1. Estimating population average from stratified sample with stratified hot-deck imputation

Consider stratified simple random samples from a finite population of size N, with H strata of sizes $N_h$, $h = 1,\ldots,H$. The aim is to estimate the population average µ of some variable y. We assume completely random nonresponse within each stratum, typically denoted as MAR (missing at random). This means that the response indicators in stratum h, $R_{h,1},\ldots,R_{h,N_h}$, are independent with the same response probability $p_{hr} = P(R_{h,i} = 1)$. The imputation method is stratified hot-deck. Let $y_{h,obs}$ be the observed part from the response sample $s_{hr}$ of size $n_{hr}$ from stratum h. Then an imputed value $y_i^*$ in stratum h is drawn at random from $y_{h,obs}$.

The estimator based on the full sample data is the usual stratified weighted average

$$\bar Y_{strat} = \frac{1}{N}\sum_{h=1}^H N_h \bar y_h = \sum_{h=1}^H v_h \bar y_h.$$

Here, $v_h = N_h/N$ and $\bar y_h = \sum_{i \in s_h} y_i / n_h$, where $s_h$ is the sample from stratum h and $n_h = |s_h|$. Then

$$\mathrm{Var}(\bar Y_{strat}) = \sum_{h=1}^H v_h^2 \sigma_h^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right), \quad\text{with}\quad \sigma_h^2 = \frac{1}{N_h - 1}\sum_{i \in U_h} (y_i - \mu_h)^2$$

being the population variance in stratum h. Here $U_h$ is stratum population h and $\mu_h$ is the average in $U_h$.

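Let $\bar y_{hr}$ be the observed sample mean from stratum h and

$$\hat\sigma_{hr}^2 = \frac{1}{n_{hr} - 1}\sum_{i \in s_{hr}} (y_i - \bar y_{hr})^2$$

the observed sample variance. The imputation-based estimator is given by

$$\bar Y_{strat}^* = \frac{1}{N}\sum_{h=1}^H N_h \bar Y_h^* \quad\text{where}\quad \bar Y_h^* = \frac{1}{n_h}\left(\sum_{i \in s_{hr}} y_i + \sum_{i \in s_h - s_{hr}} y_i^*\right) = \frac{1}{n_h}\left(n_{hr}\bar y_{hr} + \sum_{i \in s_h - s_{hr}} y_i^*\right).$$

Let the m imputation replicates of $\bar Y_{strat}^*$ be denoted by $\bar Y_{strat,i}^*$ for $i = 1,\ldots,m$. The combined estimator is given by

$$\bar{\bar Y}_{strat}^* = \frac{1}{m}\sum_{i=1}^m \bar Y_{strat,i}^*.$$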

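5.1.1. Separate strata combinations

It follows from Section 3.1 that

$$k_h = \frac{1}{1 - f_h}$$

where $f_h = (n_h - n_{hr})/n_h$ is the rate of nonresponse in stratum h. The combination formula for the variance estimate of $\bar{\bar Y}_{strat}^*$ becomes, from (7),

$$W = \bar V^* + \sum_{h=1}^H \left(\frac{1}{1 - f_h} + \frac{1}{m}\right) B_h^*.$$

Here, $\bar V^* = \sum_{h=1}^H \bar V_h^*$ and $\bar V_h^*$ is the average of the m values of the imputation-based variance estimate

$$\hat V_h^* = v_h^2\,\hat\sigma_{*h}^2\left(\frac{1}{n_h} - \frac{1}{N_h}\right) \quad\text{where}\quad \hat\sigma_{*h}^2 = \frac{1}{n_h - 1}\left(\sum_{i \in s_{hr}} (y_i - \bar Y_h^*)^2 + \sum_{i \in s_h - s_{hr}} (y_i^* - \bar Y_h^*)^2\right).$$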

5.1.2. Overall combination formula. Determination of k in (1)

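From (11) we need to determine $\mathrm{Var}(v_h \bar Y_h^* \mid y_{obs})$ and $\mathrm{Var}(\bar Y_{strat}^* \mid y_{obs}) = \sum_{h=1}^H \mathrm{Var}(v_h \bar Y_h^* \mid y_{obs})$. Then we have that

$$k = \sum_{h=1}^H \frac{1}{1 - f_h}\cdot\frac{\mathrm{Var}(v_h \bar Y_h^* \mid y_{obs})}{\mathrm{Var}(\bar Y_{strat}^* \mid y_{obs})}.$$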

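Now, for $i \in s_h - s_{hr}$:

$$E(Y_i^* \mid y_{h,obs}) = \bar y_{hr} \quad\text{and}\quad \mathrm{Var}(Y_i^* \mid y_{h,obs}) = \frac{n_{hr} - 1}{n_{hr}}\,\hat\sigma_{hr}^2.$$

This gives the following results:

$$E(\bar Y_h^* \mid y_{h,obs}) = \bar y_{hr} \quad\text{and}\quad \mathrm{Var}(\bar Y_h^* \mid y_{h,obs}) = \frac{n_h - n_{hr}}{n_h^2}\cdot\frac{n_{hr} - 1}{n_{hr}}\,\hat\sigma_{hr}^2 \approx f_h\,\frac{\hat\sigma_{hr}^2}{n_h}.$$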

Hence we can determine k as

$$k = \sum_{h=1}^H \frac{1}{1 - f_h}\cdot\frac{f_h v_h^2 \hat\sigma_{hr}^2 / n_h}{\sum_{k=1}^H f_k v_k^2 \hat\sigma_{kr}^2 / n_k}.$$

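If the stratum sizes $N_h$ are large then we can let $\hat V(v_h \bar Y_h) = v_h^2 \hat\sigma_{hr}^2 / n_h$. Let also

$$b_h = f_h \hat V(v_h \bar Y_h) \Big/ \sum_{k=1}^H f_k \hat V(v_k \bar Y_k).$$

Then

$$k = \frac{\sum_{h=1}^H \dfrac{f_h}{1 - f_h}\,\hat V(v_h \bar Y_h)}{\sum_{h=1}^H f_h \hat V(v_h \bar Y_h)} = \sum_{h=1}^H b_h \cdot \frac{1}{1 - f_h}. \tag{12}$$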

Since $\sum_{h=1}^H b_h = 1$, we see that k is a weighted average of the inverses of the response rates. If all $f_h = f$, the overall nonresponse rate, we get, as for simple random samples, that $k = 1/(1 - f)$. Otherwise, a stratum response rate $1 - f_h$ has large weight if the nonresponse rate is large and/or the estimated variance of $v_h \bar Y_h$ is large.
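Formula (12) only needs stratum-level summaries, so it is straightforward to evaluate. The sketch below is a hedged illustration: the function name `k_stratified_hotdeck` and the argument layout are hypothetical, and the large-stratum simplification $\hat V(v_h \bar Y_h) = v_h^2 \hat\sigma_{hr}^2 / n_h$ is used as stated above.

```python
def k_stratified_hotdeck(n_h, n_hr, v_h, s2_hr):
    """Overall k from formula (12) for stratified hot-deck imputation.

    n_h, n_hr -- sample and respondent counts per stratum (n_hr >= 1)
    v_h       -- stratum weights N_h / N
    s2_hr     -- observed sample variances among respondents
    Assumes large strata, so finite-population corrections are ignored.
    """
    f = [(n - r) / n for n, r in zip(n_h, n_hr)]   # nonresponse rates f_h
    var_hat = [v * v * s2 / n for v, s2, n in zip(v_h, s2_hr, n_h)]
    denom = sum(fh * vh for fh, vh in zip(f, var_hat))
    if denom <= 0:
        raise ValueError("no nonresponse: k is not defined by (12)")
    return sum(fh / (1.0 - fh) * vh for fh, vh in zip(f, var_hat)) / denom
```

With equal nonresponse rates in all strata the function returns $1/(1 - f)$; with unequal rates the result lies between the smallest and largest $1/(1 - f_h)$.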

5.1.3. An alternative expression for k in (1)

By directly applying (5) we can get an alternative expression for k. Given $y_{obs}$, the imputed sample means $\bar Y_h^*$ are independent, which implies that

$$E(\bar Y_{strat}^* \mid y_{obs}) = \frac{1}{N}\sum_{h=1}^H N_h \bar y_{hr} = \bar y_{strat,r}$$

and

$$\mathrm{Var}(\bar Y_{strat}^* \mid y_{obs}) = \sum_{h=1}^H v_h^2 \cdot \frac{n_h - n_{hr}}{n_h^2} \cdot \frac{n_{hr} - 1}{n_{hr}}\,\hat\sigma_{hr}^2.$$

It follows that

$$\mathrm{Var}(\bar Y_{strat}^* \mid y_{obs}) \approx \sum_{h=1}^H v_h^2 \cdot \frac{f_h}{n_h}\,\hat\sigma_{hr}^2.$$

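Just like in Section 3.1, (3) holds. From (5) we get

$$E(k) = \frac{\mathrm{Var}(\bar Y_{strat,r}) - \mathrm{Var}(\bar Y_{strat})}{E\left\{\sum_{h=1}^H v_h^2 f_h \hat\sigma_{hr}^2 / n_h\right\}}$$

$$= \frac{\sum_{h=1}^H v_h^2 \sigma_h^2 E\left(\frac{1}{n_{hr}} - \frac{1}{N_h}\right) - \sum_{h=1}^H v_h^2 \sigma_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right)}{\sum_{h=1}^H v_h^2 E\left\{\frac{f_h}{n_h} E(\hat\sigma_{hr}^2 \mid n_{hr})\right\}}$$

$$= \frac{\sum_{h=1}^H v_h^2 \sigma_h^2 \left[E\left(\frac{1}{n_{hr}}\right) - \frac{1}{n_h}\right]}{\sum_{h=1}^H v_h^2 \sigma_h^2 E(f_h)/n_h}$$

$$\approx \frac{\sum_{h=1}^H v_h^2 \sigma_h^2\,\frac{1}{n_h}\cdot\frac{1 - p_{hr}}{p_{hr}}}{\sum_{h=1}^H v_h^2 \sigma_h^2\,\frac{1}{n_h}(1 - p_{hr})}$$

$$\approx \frac{\sum_{h=1}^H v_h^2 \sigma_h^2 E\left(\frac{1}{n_{hr}}\right) E(f_h)}{\sum_{h=1}^H v_h^2 \sigma_h^2 E\left(\frac{1}{n_{hr}}\right) E(f_h)\,E(1 - f_h)}. \tag{13}$$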

Now, $\mathrm{Var}(\bar Y_{hr}) = E\,\mathrm{Var}(\bar Y_{hr} \mid n_{hr}) = \sigma_h^2 E(1/n_{hr})$. Let $\hat V(v_h \bar Y_{hr}) = v_h^2 \hat\sigma_{hr}^2 / n_{hr}$. Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes $N_h$ are large, by letting

$$\frac{1}{k} = \frac{\sum_{h=1}^H f_h (1 - f_h)\,\hat V(v_h \bar Y_{hr})}{\sum_{h=1}^H f_h \hat V(v_h \bar Y_{hr})} = \sum_{h=1}^H a_h (1 - f_h) \tag{14}$$

where the weights are $a_h = f_h \hat V(v_h \bar Y_{hr}) / \sum_{k=1}^H f_k \hat V(v_k \bar Y_{kr})$. Since $\sum_{h=1}^H a_h = 1$, we see that 1/k is a weighted average of the response rates. If all $f_h = f$, the overall nonresponse rate, we have, as shown in Section 5.1.2, that $k = 1/(1 - f)$. As seen in Section 5.1.2, we note also in expression (14) that a stratum response rate $1 - f_h$ has large weight if the nonresponse rate is large and/or the estimated variance of $v_h \bar Y_{hr}$ is large. We note that the estimate of the total based on the response sample is given by

$$\bar Y_{strat,r} = \sum_{h=1}^H v_h \bar Y_{hr}.$$

We obtain formula (12) for k by noting from (13) that we can express E(k) as

$$E(k) \approx \frac{\sum_{h=1}^H \mathrm{Var}(v_h \bar Y_h)\,E(f_h)/E(1 - f_h)}{\sum_{h=1}^H \mathrm{Var}(v_h \bar Y_h)\,E(f_h)}.$$

Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes $N_h$ are large, by letting k be given by (12).
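Formulas (12) and (14) can be compared numerically. The sketch below is illustrative only: the function names `k_from_12` and `k_from_14` are hypothetical, and both use the plug-in variance estimates given in the text, $v_h^2 \hat\sigma_{hr}^2 / n_h$ for (12) and $v_h^2 \hat\sigma_{hr}^2 / n_{hr}$ for (14). With these plug-in estimates the two functions return the same number up to rounding, because $n_{hr} = n_h(1 - f_h)$.

```python
def k_from_12(n_h, n_hr, v_h, s2_hr):
    """Overall k from (12): b_h-weighted average of 1 / (1 - f_h)."""
    f = [(n - r) / n for n, r in zip(n_h, n_hr)]
    w = [fh * v * v * s2 / n for fh, v, s2, n in zip(f, v_h, s2_hr, n_h)]
    return sum(wh / (1.0 - fh) for wh, fh in zip(w, f)) / sum(w)


def k_from_14(n_h, n_hr, v_h, s2_hr):
    """Overall k from (14): 1/k is the a_h-weighted average of 1 - f_h."""
    f = [(n - r) / n for n, r in zip(n_h, n_hr)]
    a = [fh * v * v * s2 / r for fh, v, s2, r in zip(f, v_h, s2_hr, n_hr)]
    inv_k = sum(ah * (1.0 - fh) for ah, fh in zip(a, f)) / sum(a)
    return 1.0 / inv_k
```

Both functions assume at least one stratum with nonresponse and at least one respondent in every stratum; otherwise the weights are undefined.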

5.2. Logistic regression with binary explanatory variable. Estimating log(odds ratio)

The model is as follows:

$Y_1,\ldots,Y_n$ are independent 0/1-variables.

Explanatory 0/1-variable x with fixed known values $x_1,\ldots,x_n$.

Class probabilities: $\pi_1 = P(Y_i = 1 \mid x_i = 1)$ and $\pi_0 = P(Y_i = 1 \mid x_i = 0)$.

Response variables: $R_1,\ldots,R_n$ with MAR (missing at random) model: $P(R_i = 1 \mid x_i = 1) = p_{1r}$ and $P(R_i = 1 \mid x_i = 0) = p_{0r}$.

We can reparametrize the model in a logit version:

$$\log\frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = \alpha + \beta x$$

giving us the following 1-1 relationships:

$$\alpha = \log\frac{\pi_0}{1 - \pi_0} \iff \pi_0 = \frac{1}{1 + e^{-\alpha}}$$

$$\beta = \log\frac{\pi_1/(1 - \pi_1)}{\pi_0/(1 - \pi_0)} = \log(\text{odds ratio}), \quad\text{and}\quad \pi_1 = \frac{1}{1 + e^{-(\alpha + \beta)}}.$$
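The 1-1 relationships between the class probabilities and the logit parameters can be written as two small conversion functions. This is a minimal sketch; the function names `to_logit` and `from_logit` are hypothetical, and both probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def to_logit(pi0, pi1):
    """Map class probabilities (pi0, pi1) to the logit parameters (alpha, beta)."""
    alpha = math.log(pi0 / (1.0 - pi0))
    beta = math.log(pi1 / (1.0 - pi1)) - alpha   # log(odds ratio)
    return alpha, beta


def from_logit(alpha, beta):
    """Inverse map from (alpha, beta) back to (pi0, pi1)."""
    pi0 = 1.0 / (1.0 + math.exp(-alpha))
    pi1 = 1.0 / (1.0 + math.exp(-(alpha + beta)))
    return pi0, pi1
```

For instance, class probabilities of 0.5 and 0.8 correspond to α = 0 and β = log 4, since the odds are 1 and 4 respectively.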

The aim is to estimate β. Let $s = (1,\ldots,n)$ denote the full sample with strata $s_1 = \{i \in s : x_i = 1\}$ and $s_0 = \{i \in s : x_i = 0\}$. The sizes of $s_1$ and $s_0$ are denoted by $n_1$ and $n_0$.
