
Discussion Papers No. 383, August 2004 Statistics Norway

Jan F. Bjørnstad and Elinor Ytterstad

Two-Stage Sampling from a Prediction Point of View

Abstract:

This paper considers the problem of estimating the population total in two-stage cluster sampling when cluster sizes are unknown, making use of a population model arising basically from a variance component model. The problem can be considered as one of predicting the unobserved part Z of the total, and the concept of predictive likelihood is studied. Prediction intervals and a predictor for the population total are derived for the normal case, based on predictive likelihood. The predictor obtained from the predictive likelihood is shown to be approximately uniformly optimal for large sample size and large number of clusters, in the sense of uniformly minimizing the mean square error in a partially linear class of model-unbiased predictors. Three prediction intervals for Z based on three similar predictive likelihoods are studied. For a small number n_0 of sampled clusters they differ significantly; however, for large n_0 the three intervals are practically identical. Model-based and design-based coverage properties of the prediction intervals are studied in a comprehensive simulation study. Roughly, the simulation study indicates that for large sample sizes the coverage measures achieve approximately the nominal level 1 − α, and are slightly less than 1 − α for moderately large sample sizes. For small sample sizes the coverage measures are about 95% of the nominal level.

Keywords: Survey sampling, population model, predictive likelihood, optimal predictor, prediction intervals, simulation

JEL classification: C42, C13, C15

Address: Jan F. Bjørnstad, Statistics Norway, Division for Statistical Methods and Standards.

E-mail: jab@ssb.no

Elinor Ytterstad, University of Tromsø, Department of Mathematics and Statistics, N-9037 Tromsø, E-mail: Elinor.Ytterstad@matnat.uit.no


Discussion Papers comprise research papers intended for international journals or books. A preprint of a Discussion Paper may be longer and more elaborate than a standard journal article, as it may include intermediate calculations and background material etc.

Abstracts with downloadable Discussion Papers in PDF are available on the Internet:

http://www.ssb.no

http://ideas.repec.org/s/ssb/dispap.html

For printed Discussion Papers contact:

Statistics Norway

Sales- and subscription service NO-2225 Kongsvinger

Telephone: +47 62 88 55 00 Telefax: +47 62 88 55 95

E-mail: Salg-abonnement@ssb.no


1. Introduction

Two-stage surveys are used in sampling from finite populations of, say, N primary units or clusters, where cluster i consists of m_i units. N is assumed known. As mentioned by Kelly and Cumberland (1990) and Valliant, Dorfman and Royall (2000, ch. 8.9), it often happens that the m_i's are unknown before sampling, and this is the case we consider in this paper. Let y_ij be the value of the variable of interest for unit j of the i'th cluster. The problem is to estimate the total

t = Σ_{i=1}^{N} Σ_{j=1}^{m_i} y_ij.

An example is considered in Thomsen, Tesfu and Binder (1986) and Thomsen and Tesfu (1988), with t being the size of a particular population. The clusters are certain administrative units, the units are households and yij is the number of persons in household j of the i'th administrative unit.

We assume that, before sampling, other measures of the sizes of the clusters are available to us. Let x_1, ..., x_N be these measures, with X = Σ_{i=1}^{N} x_i. Kelly and Cumberland (1990) consider a case where the clusters are blocks of dwelling units and x_i is the number of units in block i from a previous census.

The sampling plan is as follows: At stage 1 a sample s of size n_0 of the clusters (1, ..., N) is selected according to some sampling design. At stage 2 we select, for each i ∈ s, a sample s_i of size n_i of units, using possibly a different sampling design than at stage 1. The designs are assumed to be non-informative, i.e., they do not depend on the y_ij's and the m_i's. E.g., in Thomsen and Tesfu (1988) the two-stage sampling plan is to use pps-sampling at stage 1 (letting selection probabilities of clusters be proportional to the x_i's) and simple random sampling (srs) at stage 2. This is a common two-stage sampling plan, as also mentioned by Kelly and Cumberland (1990). Usually, the second-stage sample sizes are the same, leading to approximately equal selection probabilities for all units provided the ratios x_i/m_i are not too different. When the m_i's are known, one often-used sampling plan is to let the first-stage selection probabilities be proportional to m_i, and then use srs with the same sample sizes at stage 2, yielding equal selection probabilities for all units. As mentioned by Valliant et al. (2000, ch. 8.1), equal sample sizes at stage 2 have many advantages, and this is probably the most common allocation of sample units in practice.

The total sample size is n = Σ_{i∈s} n_i, and our data now consist of y(s) = (y_ij : j ∈ s_i, i ∈ s) and the vector m(s) = (m_i)_{i∈s}, where s = {s, s_i : i ∈ s}. Let y = (y(s), m(s)). For the pps-srs sampling plan mentioned above, a commonly used design-unbiased estimator of t is the Horvitz-Thompson estimator (see, e.g., Thomsen et al., 1986, Kelly and Cumberland, 1990, and Särndal, Swensson and Wretman, 1992)

t̂_HT = (X/n_0) Σ_{i∈s} m_iȳ_i/x_i   (1)

where ȳ_i = Σ_{j∈s_i} y_ij/n_i.
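As a small numerical illustration of (1), the sketch below computes t̂_HT = (X/n_0) Σ_{i∈s} m_iȳ_i/x_i for a hypothetical pps-srs sample; the function name, the size measures and all sample values are invented for illustration.

```python
# Minimal sketch of the Horvitz-Thompson estimator (1) for the pps-srs plan:
# t_hat = (X / n0) * sum over sampled clusters of m_i * ybar_i / x_i.
# All inputs below are hypothetical illustration data.

def horvitz_thompson(X, sample):
    """X: total of the size measures x_i over all N clusters.
    sample: list of (x_i, m_i, ys) where ys holds the stage-2 srs values y_ij."""
    n0 = len(sample)
    total = 0.0
    for x_i, m_i, ys in sample:
        ybar_i = sum(ys) / len(ys)        # within-cluster sample mean
        total += m_i * ybar_i / x_i       # estimated cluster total / size measure
    return X * total / n0

X = 1000.0                                # sum of x_i over all N clusters
sample = [
    (50.0, 48, [3.0, 4.0, 5.0]),          # (x_i, m_i, stage-2 sample)
    (80.0, 85, [2.0, 2.0, 4.0]),
]
print(horvitz_thompson(X, sample))
```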

In this paper a population model is adopted, regarding m_i, y_ij as realized values of random variables M_i, Y_ij, for j = 1, ..., M_i and i = 1, ..., N. The M_i's are assumed independent of all Y_ij, and furthermore:

E(M_i) = βx_i, V(M_i) = σ²v(x_i), Cov(M_i, M_j) = 0
E(Y_ij) = µ, V(Y_ij) = τ²
Cov(Y_ij, Y_ik) = ρτ², if j ≠ k   (2)
Cov(Y_ij, Y_lk) = 0, if i ≠ l.

Since the variance of a cluster total is nonnegative, we must have ρ ≥ −1/(max m_i − 1), as also noted by Kelly and Cumberland (1990). It is therefore a minor restriction to assume a nonnegative ρ. Also, usually v(x) = x^g with 0 ≤ g ≤ 2. In fact, it is typically assumed that v(x) = x (see, e.g., Royall, 1986, Kelly and Cumberland, 1990, and Valliant et al., 2000, ch. 8.9).

A more general model is to let ρ and τ vary with the clusters, having cluster parameters ρ_i, τ_i. However, we then have the problem of estimating these parameters. Without further assumptions we are only able to estimate (1−ρ_i)τ_i². As noted by Valliant et al. (2000, ch. 8.1), it is often sensible to adopt model (2), especially after suitable stratification, which also may allow µ to be different for different parts of the population.

The model (2) for the Yij's arises naturally from expressing Yij in the following way:

Y_ij = µ_i + ε_ij

where all µ_i, ε_ij are independent with

E(µ_i) = µ, V(µ_i) = τ_b² and E(ε_ij) = 0, V(ε_ij) = τ_w².   (3)

Here, V(µ_i) = τ_b² expresses the variability between the clusters, and V(ε_ij) = τ_w² expresses the variability within the clusters. Then τ² = τ_b² + τ_w², and the intraclass correlation is given by

ρ = τ_b²/τ²,

the proportion of the total variability due to the variability between the clusters.
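The representation (3) can be checked by simulation: drawing cluster effects with variance τ_b² and unit errors with variance τ_w², the empirical correlation between two units of the same cluster should approach ρ = τ_b²/(τ_b² + τ_w²). A minimal sketch, with illustrative parameter values:

```python
# Sketch of the variance-component representation (3): Y_ij = mu_i + eps_ij,
# with V(mu_i) = tau_b^2 (between clusters) and V(eps_ij) = tau_w^2 (within).
# The parameter values are illustrative.
import random

random.seed(1)
mu, tau_b, tau_w = 10.0, 2.0, 1.0
rho_true = tau_b**2 / (tau_b**2 + tau_w**2)       # = 0.8

n_clusters, n_per = 2000, 2
pairs = []
for _ in range(n_clusters):
    mu_i = random.gauss(mu, tau_b)                # cluster effect
    pairs.append([mu_i + random.gauss(0.0, tau_w) for _ in range(n_per)])

# Empirical correlation between the two units of each cluster
m1 = sum(p[0] for p in pairs) / n_clusters
m2 = sum(p[1] for p in pairs) / n_clusters
cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n_clusters
v1 = sum((p[0] - m1)**2 for p in pairs) / n_clusters
v2 = sum((p[1] - m2)**2 for p in pairs) / n_clusters
rho_hat = cov / (v1 * v2)**0.5
print(round(rho_true, 3), round(rho_hat, 2))
```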

The total t is now a realized value of a random variable T, where T can be expressed as

T = Σ_{i∈s} Σ_{j∈s_i} Y_ij + Z

with

Z = Σ_{i∈s} Σ_{j∉s_i} Y_ij + Σ_{i∉s} Σ_{j=1}^{M_i} Y_ij.   (4)

Expressing T in this form, we see that the problem can be formulated as one of predicting the unobserved value z of the random variable Z. It is often clarifying to write a predictor T̂ of T in the form

T̂ = Σ_{i∈s} Σ_{j∈s_i} Y_ij + Ẑ   (5)

where Ẑ then implicitly is a predictor of Z. Considering the modified Horvitz-Thompson estimate T̂_HT given by (1) in the form (5), we can use the following expression, with X_s = Σ_{i∈s} x_i:

T̂_HT = Σ_{i∈s} Σ_{j∈s_i} Y_ij + Σ_{i∈s} (X_s m_i/(n_0 n_i x_i) − 1) n_iȲ_i + ((X − X_s)/n_0) Σ_{i∈s} m_iȲ_i/x_i.

The last term predicts Σ_{i∉s} Σ_{j=1}^{M_i} Y_ij, while the second term predicts Σ_{i∈s} Σ_{j∉s_i} Y_ij. From this point of view, T̂_HT does not look like a reasonable predictor.

Modeling the population in survey sampling has been and still is somewhat controversial, although most statisticians seem to agree on using modeling in developing statistical methods, while evaluation is done with respect to the sampling design. An important aspect of this issue is that the likelihood principle in a sense makes it necessary to model the population. Without a model the only stochastic elements are the samples s = {s, s_i : i ∈ s}, and the likelihood function is then flat (see, e.g., Cassel, Särndal and Wretman, 1977). This means that from the likelihood principle point of view the data contain no information about the unobserved y_ij's and m_i's. To make inference we therefore need to relate the data to the unobserved values somehow, and the most natural way of doing so is to formulate a model (see also remarks by Berger and Wolpert, 1988, p. 114, and Bjørnstad, 1996).

The random variables observed are Y(s), M(s) and s, where s now is ancillary. The likelihood principle implies that inference should depend only on the actual s observed and not on the sampling design.

This is usually called the prediction approach to survey sampling and will be adopted in this paper.

Hence, theoretical considerations are conditional on given s. The prediction approach aims at choosing a predictor that is good for the actual s obtained, and it has made significant contributions to a better understanding of several problems in survey sampling, some of which are mentioned in Thomsen and Tesfu (1988) and Valliant et al. (2000). It also enables one to use more conventional statistical methods, although the problem is not to make inferences about the parameter θ but rather to predict Z. Hence, θ basically plays the role of a nuisance parameter.

To predict Z we shall use predictive likelihood based methods, a non-Bayesian likelihood approach to prediction problems in general. One can argue that in the context of a population model, survey sampling provides one of the more natural "prediction" problems in statistics. Predictive likelihood can therefore serve as a basis for essentially all problems of this kind in survey sampling. Some major references to the general theory of predictive likelihood are Hinkley (1979), Mathiasen (1979) and Butler (1986). A review of some of the suggested likelihoods is given in Bjørnstad (1990, 1998).

Predictive likelihood is discussed from the perspective of the likelihood principle for prediction in Bjørnstad (1996). Bolfarine and Zacks (1992) consider methods based on predictive likelihood in survey sampling.

Section 2 introduces the concept of predictive likelihood and shows how predictors and prediction intervals can be constructed from a predictive likelihood, and in Section 3 a predictive likelihood is derived for the normal model. Considering a predictive likelihood for Z directly does not work, mainly because Z is a sum of a stochastic number of random variables. Therefore, predictor and prediction interval will be obtained from a joint predictive likelihood for Z and the vector M(s̄) = (M_i)_{i∉s}. In Section 3.3, optimality theory for a class ℓ of predictors linear in the Y_ij's, but not simultaneously in both the Y_ij's and the M_i's, under the distribution-free model (2) is developed.

In Section 4, three prediction intervals for Z based on similar predictive likelihoods are studied, and a comprehensive simulation study for estimating confidence levels, both model-based and design-based, is undertaken. The prediction intervals are evaluated by four different measures: the model-based coverage C_m, the design-based coverage C_d, the unconditional coverage C (expected design-based coverage), and the conditional coverage given the data.

2. Predictive likelihood

We shall here give a brief general introduction to the concept of predictive likelihood. For a more complete exposition we refer to Bjørnstad (1990, 1998). Let Y = y be the data. The problem is to predict the unobserved or future value z of a random variable Z, usually by a predictor and a confidence interval for Z. It is assumed that (Y, Z) has a probability density or mass function (pdf) f_θ(y, z). In general we let f_θ(⋅) and f_θ(⋅|⋅) denote the pdf and conditional pdf of the enclosed variables. The likelihood basis in prediction is the generalized joint likelihood for the two unknown quantities, z and θ. In Bjørnstad (1996) it is shown that the joint likelihood function is given by l_y(z, θ) = f_θ(y, z).

With this likelihood, the corresponding likelihood principle is implied by the sufficiency principle for prediction and the conditionality principle, generalizing the fundamental result by Birnbaum (1962) for parametric likelihood. The aim is to develop a partial likelihood for z, L(z|y), by eliminating θ from l_y. Any such likelihood is called a predictive likelihood and gives rise to one particular prediction method.

Different ways of eliminating θ give rise to different L. The two main types of suggestions are the conditional predictive likelihood L_c, essentially suggested by Hinkley (1979), and the profile predictive likelihood L_p, first considered by Mathiasen (1979). Let R = r(Y, Z) denote a minimal sufficient statistic for (Y, Z). Then

L_c(z|y) = f_θ(y, z)/f_θ(r(y, z))   (6)

L_p(z|y) = max_θ f_θ(y, z) = f_θ̂_z(y, z).   (7)

Typically, Lc and Lp are quite similar when sufficiency provides a genuine reduction and the dimension of θ is small.

In linear models, Lp will ignore the number of parameters and can be misleadingly precise. A modification of Lp, Lmp, that adjusts for this was suggested by Butler (1986, rejoinder), see also Bjørnstad (1990). If Y,Z are independent, Y consisting of n independent observations and Z being an m-dimensional vector of independent variables, then Lmp is given by

L_mp(z|y) = L_p(z|y) ⋅ |H_z H_z'|^{1/2}/|I_z(θ̂_z)|^{1/2}.   (8)

Here, I_z(θ) = {I_ij^z(θ)} is the "observed" information matrix based on (y, z), i.e., I_ij^z(θ) = −∂² log f_θ(y, z)/∂θ_i∂θ_j. H_z = H_z(θ̂_z), and H_z(θ) is the k × (n+m) matrix of second-order partial derivatives of log f_θ(y, z) with respect to the k-dimensional θ and (y, z).

We shall assume that any L considered is normalized as a probability distribution in Z. The mean and variance of L are then called the predictive expectation and the predictive variance of Z, denoted by Ep(Z) and Vp(Z). Ep(Z) is then a natural predictor for z, called the mean predictor. L(z|y) also gives us an idea on how likely different z-values are in light of the data, and can be used to construct prediction intervals for z. An interval (ay, by) is a (1-α) predictive interval based on L(z|y) if

∫_{a_y}^{b_y} L(z|y) dz = 1 − α.

A simplified (1−α) predictive interval is of the form

E_p(Z) ± u_{α/2} √V_p(Z)   (9)

where u_{α/2} is the upper α/2-point in the actual (exact or approximate) conditional distribution, given y, of (Z − E_θ(Z|y))/√V_θ(Z|y).
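A minimal sketch of the simplified interval (9), taking u_{α/2} from the standard normal as the approximate conditional distribution for large samples; the predictive mean and variance used below are hypothetical inputs.

```python
# Sketch of the simplified (1 - alpha) predictive interval (9):
# E_p(Z) +/- u_{alpha/2} * sqrt(V_p(Z)), with u taken from the standard
# normal distribution. The Ep/Vp values are hypothetical.
from statistics import NormalDist
from math import sqrt

def predictive_interval(Ep, Vp, alpha=0.05):
    u = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point
    half = u * sqrt(Vp)
    return Ep - half, Ep + half

lo, hi = predictive_interval(Ep=5000.0, Vp=250_000.0)  # predictive sd = 500
print(round(lo, 1), round(hi, 1))
```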

3. Predictive likelihood and predictor in two-stage sampling

3.1 Predictive likelihood for mixtures

In two-stage sampling, Z is given by (4) and is the sum of two mixtures. Therefore, instead of considering a predictive likelihood for Z directly, we look at a joint predictive likelihood for Z and M(s̄). It has the following form:

L(z, m(s̄)|y) = L_m(s̄)(z|y) L(m(s̄)|y).   (10)

L_m(s̄)(z|y) is a predictive likelihood for z conditional on M(s̄) = m(s̄), i.e., based on f_θ(y, z|m(s̄)). Since f_θ(y, z|m(s̄)) = f_{µ,τ,ρ}(y(s), z|m(s), m(s̄)) f_{β,σ}(m(s)), L_m(s̄)(z|y) is, in fact, based on f_{µ,τ,ρ}(y(s), z|m(s), m(s̄)). L(m(s̄)|y) is a predictive likelihood for m(s̄) based on f_θ(y, m(s̄)). The predictive likelihood for Z is given by the marginal in (10); e.g., in the case of a continuous model for M_i,

L(z|y) = ∫ L(z, m(s̄)|y) dm(s̄).   (11)

Then E_p and V_p follow the usual rules for double expectation, i.e.,

E_p(Z) = E_p{E_p(Z|M(s̄))}   (12)
V_p(Z) = E_p{V_p(Z|M(s̄))} + V_p{E_p(Z|M(s̄))}.

In (12), E_p(Z|m(s̄)) and V_p(Z|m(s̄)) are the predictive mean and variance for Z from L_m(s̄)(z|y). In principle we can derive L(z|y) as the marginal likelihood in (11). The advantage of (12) is that we are able to obtain E_p(Z) and V_p(Z) without actually deriving L(z|y).

Under the model (2) we can factorize f_θ(y, z, m(s̄)) = f_{µ,τ,ρ}(y(s), z|m(s), m(s̄)) f_{β,σ}(m(s), m(s̄)), and it is readily seen that applying L_p, given by (7), to the terms on the right-hand side in (10) in fact gives us L_p(z, m(s̄)|y) = max_θ f_θ(y, z, m(s̄)), i.e.,

L_p(z, m(s̄)|y) = L_m(s̄),p(z|y) L_p(m(s̄)|y).   (13)

It follows that E_p(Z) and V_p(Z) based on L_p(z, m(s̄)|y) can be derived by (12). We note that L_c, given by (6), has the same property, i.e., L_c(z, m(s̄)|y) = L_m(s̄),c(z|y) L_c(m(s̄)|y).

3.2 Normal model

It is now assumed that model (2) holds with Y_ij and M_i normally distributed. We shall first consider the second likelihood in (10), L(m(s̄)|y), using the profile predictive likelihood L_p. Let t_ν^{(k)}(Σ) denote the k-dimensional multivariate t-distribution with ν degrees of freedom and variance-covariance matrix Σ, i.e., t_ν^{(k)}(Σ) is the distribution of √ν·U/W where U ~ N_k(0, Σ) and W² has a chi-square distribution with ν degrees of freedom. Let M(s̄) and X(s̄) be the vectors (M_i : i ∉ s) and (x_i : i ∉ s). Then L_p(m(s̄)|y) leads to a multivariate t-distribution, such that

[M(s̄) − β̂X(s̄)]/σ̂ ~ t_{n_0}^{(N−n_0)}(V),

where the maximum likelihood estimators (MLE) are, with W_s = Σ_{i∈s} x_i²/v(x_i),

β̂ = W_s^{−1} Σ_{i∈s} x_i m_i/v(x_i),

the best unbiased estimator uniformly minimizing the variance, and

σ̂² = (1/n_0) Σ_{i∈s} (m_i − β̂x_i)²/v(x_i).

V = (v_ij) with v_ii = v(x_i) + x_i²/W_s and v_ij = x_ix_j/W_s for i ≠ j. It follows that E_p(M_i) = β̂x_i,

V_p(M_i) = (n_0/(n_0−2)) σ̂² (v(x_i) + x_i²/W_s), and the predictive covariances are given by Cov_p(M_i, M_j) = (n_0/(n_0−2)) σ̂² x_ix_j/W_s for i ≠ j. This implies that

E_p(Σ_{i∉s} M_i) = β̂X_s̄   (14)

V_p(Σ_{i∉s} M_i) = (n_0/(n_0−2)) σ̂² (v_s̄ + X_s̄²/W_s)

where X_s̄ = Σ_{i∉s} x_i and v_s̄ = Σ_{i∉s} v(x_i).

L_c and L_mp (for M_i/√v(x_i), i ∉ s), given by (8), lead to moments similar to (14), with n_0 − 2 replaced by n_0 − 5 and n_0 − 4, respectively.
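The estimators β̂ and σ̂² and the predictive moments in (14) are simple to compute once v(⋅) is fixed. The sketch below assumes v(x) = x and uses invented (x_i, m_i) pairs for the sampled clusters and invented x_i for the non-sampled ones.

```python
# Sketch of the MLEs and the predictive moments (14) for the non-sampled
# cluster sizes, assuming v(x) = x. All data below are hypothetical.

def m_moments(sampled, x_out):
    """sampled: list of (x_i, m_i) for i in s; x_out: x_i for i not in s."""
    n0 = len(sampled)
    v = lambda x: x                              # assumed variance function
    Ws = sum(x * x / v(x) for x, _ in sampled)   # W_s = sum x_i^2 / v(x_i)
    beta = sum(x * m / v(x) for x, m in sampled) / Ws
    sigma2 = sum((m - beta * x)**2 / v(x) for x, m in sampled) / n0
    Xbar = sum(x_out)                            # X_sbar
    vbar = sum(v(x) for x in x_out)              # v_sbar
    Ep = beta * Xbar                             # E_p of the non-sampled total size
    Vp = n0 / (n0 - 2) * sigma2 * (vbar + Xbar**2 / Ws)
    return beta, sigma2, Ep, Vp

sampled = [(10.0, 21), (20.0, 39), (30.0, 62), (40.0, 83)]
x_out = [15.0, 25.0]
beta, sigma2, Ep, Vp = m_moments(sampled, x_out)
print(round(beta, 3), round(Ep, 1))
```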

Let us now consider the first term in (10), L_m(s̄)(z|y), based on f_{µ,τ,ρ}(y(s), z|m(s), m(s̄)). For this likelihood we will restrict attention to L_p, i.e., deriving L_m(s̄),p(z|y). The MLE µ̂, τ̂², ρ̂ can be expressed in the following way, with SSE = Σ_{i∈s} Σ_{j∈s_i} (y_ij − ȳ_i)²:

µ̂ = [Σ_{i∈s} n_iȳ_i/(1−ρ̂+n_iρ̂)] / [Σ_{i∈s} n_i/(1−ρ̂+n_iρ̂)]   (15)

τ̂² = (1/n)[SSE/(1−ρ̂) + Σ_{i∈s} n_i(ȳ_i − µ̂)²/(1−ρ̂+n_iρ̂)]

and ρ̂ is found numerically, maximizing

−(n/2) log τ̂² − (1/2) Σ_{i∈s} log(1+(n_i−1)ρ̂) − ((n−n_0)/2) log(1−ρ̂).

When n_i = c for all i ∈ s, then µ̂ = ȳ = Σ_{i∈s} ȳ_i/n_0, τ̂² = SS/n, and ρ̂ = max(0, 1 − (c/(c−1))·SSE/SS), where SS = Σ_{i∈s} Σ_{j∈s_i} (y_ij − ȳ)².
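Since ρ̂ has no closed form for unequal n_i, (15) is naturally evaluated by profiling: for each candidate ρ compute µ̂(ρ) and τ̂²(ρ) and maximize the profiled log-likelihood numerically. A sketch over a grid follows; the data are invented, and since the clusters have equal size c = 2 the result should agree with the closed form max(0, 1 − (c/(c−1))SSE/SS) up to grid resolution.

```python
# Sketch of the profile maximization for rho in (15): for each candidate rho,
# compute mu_hat(rho) and tau2_hat(rho), then evaluate the profiled
# log-likelihood term and keep the maximizing rho. Data are hypothetical.
from math import log

def profile_mle(clusters, grid_steps=2000):
    """clusters: list of stage-2 samples [y_i1, ..., y_in_i]."""
    n0 = len(clusters)
    n = sum(len(c) for c in clusters)
    ybars = [sum(c) / len(c) for c in clusters]
    SSE = sum((y - yb)**2 for c, yb in zip(clusters, ybars) for y in c)

    best = None
    for k in range(grid_steps):
        rho = k / grid_steps * 0.999              # grid over [0, 0.999)
        w = [len(c) / (1 - rho + len(c) * rho) for c in clusters]
        mu = sum(wi * yb for wi, yb in zip(w, ybars)) / sum(w)
        tau2 = (SSE / (1 - rho)
                + sum(wi * (yb - mu)**2 for wi, yb in zip(w, ybars))) / n
        # profiled log-likelihood (up to an additive constant)
        ll = (-(n / 2) * log(tau2)
              - 0.5 * sum(log(1 + (len(c) - 1) * rho) for c in clusters)
              - ((n - n0) / 2) * log(1 - rho))
        if best is None or ll > best[0]:
            best = (ll, rho, mu, tau2)
    return best[1], best[2], best[3]

clusters = [[1.0, 2.0], [5.0, 6.0], [9.0, 10.0]]
rho, mu, tau2 = profile_mle(clusters)
print(round(rho, 2), round(mu, 2))
```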

Consider first the case when ρ and τ are known. Then µ̂ is given by (15) with ρ replacing ρ̂. In this case L_m(s̄),p(z|y) is such that Z is normally distributed with predictive mean and predictive variance

E_p(Z|m(s̄)) = Σ_{i∈s} (m_i − n_i)[µ̂ + (n_iρ/(1−ρ+n_iρ))(ȳ_i − µ̂)] + µ̂ Σ_{i∉s} m_i   (16)

V_p(Z|m(s̄)) = V(Z|y, m(s̄)) + [τ²/Σ_{i∈s}(n_i/(1−ρ+n_iρ))]·[Σ_{i∈s} (m_i−n_i)(1−ρ)/(1−ρ+n_iρ) + Σ_{i∉s} m_i]².   (17)

Here, V(Z|⋅) denotes the usual variance in the conditional distribution of Z. When ρ, τ are unknown, L_m(s̄),p(z|y) will for large n_0 be approximately such that Z is normally distributed, with E_p(Z|m(s̄)) and V_p(Z|m(s̄)) given by (16) and (17) with ρ̂, τ̂² replacing ρ, τ². Recall that ȳ_i = Σ_{j∈s_i} y_ij/n_i and θ̂ = (µ̂, τ̂, ρ̂, β̂, σ̂), the MLE of θ = (µ, τ, ρ, β, σ). Then the conditional expected value of Z given the data, estimated at θ̂, is equal to

Ê_θ̂(Z|y) = Σ_{i∈s} (m_i − n_i)[µ̂ + (n_iρ̂/(1−ρ̂+n_iρ̂))(ȳ_i − µ̂)] + µ̂ Σ_{i∉s} β̂x_i.   (18)

Let V_θ̂(Z|y) denote the estimated conditional variance of Z given the data. It now follows from (12)-(14) and (16)-(18) that, approximately, L_p(z, m(s̄)|y) has E_p(Z) = Ê_θ̂(Z|y) and

V_p(Z) = V_θ̂(Z|y) + h(2).   (19)

Here,

V_θ(Z|y) = τ²(1−ρ) Σ_{i∈s} (m_i − n_i)[1 + (m_i − n_i)ρ/(1−ρ+n_iρ)] + Σ_{i∉s} {τ²[βx_i + ρσ²v(x_i) + ρβx_i(βx_i − 1)] + µ²σ²v(x_i)},

and h(k) collects the additional variance contributions arising from the predictive distribution of M(s̄) and the estimation of the parameters; it involves σ̂²(v_s̄ + X_s̄²/W_s), µ̂², ρ̂τ̂² and W_s, and depends on k only through factors of the form n_0/(n_0 − k).

The predictive likelihood

L_p,c(z, m(s̄)|y) = L_m(s̄),p(z|y) L_c(m(s̄)|y)

leads to the same E_p(Z), while V_p(Z) equals (19) with h(5) instead of h(2). With

L_p,mp(z, m(s̄)|y) = L_m(s̄),p(z|y) L_mp(m(s̄)|y)

we get the same E_p(Z), and V_p(Z) equal to (19) with h(4).

Let ŵ_i = n_iρ̂/(1−ρ̂+n_iρ̂). Writing the predictor Ẑ_0 = Ê_θ̂(Z|y), given by (18), as

Ẑ_0 = Σ_{i∈s} Σ_{j∉s_i} [ŵ_iȳ_i + (1−ŵ_i)µ̂] + µ̂ Σ_{i∉s} β̂x_i   (20)

we see from (4) that predicting Z by Ẑ_0 means that for i ∉ s each unobserved Y_ij is predicted by µ̂ and M_i is predicted by β̂x_i. For i ∈ s, j ∉ s_i, Y_ij is predicted by ŵ_iȳ_i + (1−ŵ_i)µ̂. This predictor shrinks the natural estimate ȳ_i towards µ̂. Using the representation (3) of the model, we note that

(1−ρ)/(1−ρ+n_iρ) = Var(Ȳ_i|µ_i)/(Var(Ȳ_i|µ_i) + Var(µ_i)).

Hence, for i ∈ s, the smaller Var(µ_i) is compared to Var(Ȳ_i|µ_i), the more weight we put on µ̂ to predict Y_ij for j ∉ s_i. In other words, the smaller the variability between the clusters is compared to the variability within the clusters, the more ȳ_i shrinks towards µ̂.
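A sketch of the predictor (20) follows; the cluster sizes, sample means and parameter estimates fed to it are hypothetical illustration values.

```python
# Sketch of the predictor Z_hat_0 in (20): within sampled clusters the
# unobserved units are predicted by the shrinkage value
# w_i*ybar_i + (1 - w_i)*mu_hat with w_i = n_i*rho/(1 - rho + n_i*rho);
# for non-sampled clusters M_i is predicted by beta_hat*x_i and each Y_ij
# by mu_hat. All inputs are hypothetical.

def z_hat0(sampled, x_out, mu, rho, beta):
    """sampled: list of (m_i, n_i, ybar_i) for i in s; x_out: x_i, i not in s."""
    z = 0.0
    for m_i, n_i, ybar in sampled:
        w = n_i * rho / (1 - rho + n_i * rho)         # shrinkage weight
        z += (m_i - n_i) * (w * ybar + (1 - w) * mu)  # unobserved units, i in s
    z += mu * beta * sum(x_out)                       # non-sampled clusters
    return z

sampled = [(50, 5, 4.0), (80, 5, 2.0)]
print(round(z_hat0(sampled, x_out=[30.0, 40.0], mu=3.0, rho=0.5, beta=1.2), 2))
```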

3.3 Some optimality considerations

All three predictive likelihoods for the model (2), with normally distributed Y_ij and M_i, give the same predictor for the population total T,

T̂_0 = Σ_{i∈s} Σ_{j∈s_i} y_ij + Ẑ_0

with Ẑ_0 given by (20).

The optimality considerations are conditional on s = {s, s_i : i ∈ s}, and E_θ(·) is used to denote E_θ(·|s). Let

ℓ = {T̂ : T̂ = Σ_{i∈s} Σ_{j∈s_i} a_ij Y_ij}

be a class of "partially" linear predictors, where each a_ij is a function of M(s). We shall restrict attention to the class of model-unbiased predictors in ℓ, i.e.,

ℓ_u = {T̂ ∈ ℓ : E_θ(T̂ − T) = 0, ∀θ}.

We shall now consider the distribution-free model (2). The parameter estimates of (β, σ²) are still valid, β̂ now being the best linear unbiased (BLU) estimator and (n_0/(n_0−1))σ̂² still unbiased. Regarding the MLE µ̂, given by (15), it is readily seen that, with ρ replacing ρ̂, µ̂ is the BLU estimator, as also noted by Kelly and Cumberland (1990). What remains is to derive alternative estimators for ρ and τ². Here one can use an ANOVA approach, as in Valliant et al. (2000, ch. 8.3) or Kelly and Cumberland (1990). When n_i = c for all i ∈ s, these two ANOVA approaches yield the same estimators ρ̂_av, τ̂_av² satisfying

τ̂_av² ρ̂_av = (1/c)[(SS − SSE)/(n_0 − 1) − SSE/(n − n_0)]

τ̂_av² (1 − ρ̂_av) = SSE/(n − n_0).

It follows that, approximately (for large n_0, with (n_0−1)/n_0 ≈ 1), τ̂_av² = SS/n and ρ̂_av = 1 − (c/(c−1))·SSE/SS; the same as the MLE in the normal model.
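A sketch of the equal-size ANOVA estimators, solving the two moment equations above for ρ̂_av and τ̂_av²; the data are invented, and with only n_0 = 3 clusters the estimates need not be close to the normal-model MLE.

```python
# Sketch of the equal-size ANOVA estimators: tau2*(1 - rho) is estimated by
# SSE/(n - n0), and tau2*rho by (1/c)*((SS - SSE)/(n0 - 1) - SSE/(n - n0)).
# Illustration data are hypothetical.

def anova_estimates(clusters):
    n0 = len(clusters)
    c = len(clusters[0])                        # equal stage-2 sizes assumed
    n = n0 * c
    ybar = sum(sum(cl) for cl in clusters) / n
    SSE = sum((y - sum(cl) / c)**2 for cl in clusters for y in cl)
    SS = sum((y - ybar)**2 for cl in clusters for y in cl)
    within = SSE / (n - n0)                     # estimates tau2*(1 - rho)
    between = ((SS - SSE) / (n0 - 1) - within) / c   # estimates tau2*rho
    tau2 = within + between
    rho = between / tau2
    return rho, tau2

clusters = [[1.0, 2.0], [5.0, 6.0], [9.0, 10.0]]
rho, tau2 = anova_estimates(clusters)
print(round(rho, 3), round(tau2, 2))
```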

With these new parameter estimates, T̂_0 is clearly a reasonable predictor also for this distribution-free model; e.g., Kelly and Cumberland (1990) suggest using this predictor (see also Valliant et al., 2000, ch. 8.9). The optimal predictor at θ, T̂_θ, in ℓ_u is defined to be the predictor in ℓ_u that minimizes E_θ(T̂ − T)² for T̂ ∈ ℓ_u. If T̂_θ does not depend on θ, it is uniformly optimal.

By using that E(T̂ − T) = E{E(T̂ − T|M)}, with M = (M_1, ..., M_N), we see that T̂ ∈ ℓ_u requires

E_β(Σ_{i∈s} Σ_{j∈s_i} a_ij) = βX, ∀β.

We note the following result.

Lemma 1. The optimal predictor T̂_θ must be a member of the class

ℓ_u^0 = {T̂ ∈ ℓ_u : T̂ = Σ_{i∈s} b_iȲ_i; b_i is a function of M(s), i ∈ s, and E_β(Σ_{i∈s} b_i) = βX, ∀β}.   (21)

Proof. Using the rule V(T̂ − T) = E{V(T̂ − T|M)} + V{E(T̂ − T|M)}, we see that, with ā_i = Σ_{j∈s_i} a_ij/n_i, for T̂ ∈ ℓ_u,

E_θ(T̂ − T)² = V_θ(T̂ − T)
= τ²(1−ρ) Σ_{i∈s} E_θ[Σ_{j∈s_i} a_ij²] + ρτ² Σ_{i∈s} E_θ[(n_iā_i)²] − 2ρτ² Σ_{i∈s} E_θ(M_i n_iā_i) + µ² V(Σ_{i∈s} n_iā_i − Σ_{i∈s} M_i) + ψ.   (22)

Here, ψ is a function of the parameters only. Since Σ_{j∈s_i} a_ij² ≥ n_iā_i², it follows that T̂_θ must have a_ij = ā_i for all j ∈ s_i, and b_i = n_iā_i. ♦

We restrict attention to the class L_u of model-unbiased predictors in ℓ_u where each a_ij is a linear function of M(s). We note that T̂_HT, given by (1), is a member of L_u. Then, from Lemma 1, it is sufficient to consider the class

L_u^0 = {T̂ ∈ ℓ_u : T̂ = Σ_{i∈s} b_iȲ_i and b_i = c_i + Σ_{j∈s} c_ij M_j}.

From (21),

T̂ ∈ L_u^0 ⇔ E_β(Σ_{i∈s} b_i) = βX, ∀β ⇔ Σ_{i∈s} c_i = 0 and Σ_{i∈s} Σ_{j∈s} c_ij x_j = X.   (23)

We note that T̂_0 can be expressed as Σ_{i∈s} b_i^0 Ȳ_i, where b_i^0 is linear in M(s). T̂_0 satisfies (23) with c_i = 0 and hence is model-unbiased when ρ is known (e.g., when ρ = 0), and approximately model-unbiased otherwise.

Lemma 2. The optimal predictor T̂_θ in L_u^0 minimizes, with respect to c = (c_i, i ∈ s; c_ij, i ∈ s, j ∈ s), subject to condition (23),

Q(c) = τ² Σ_{i∈s} φ_i^{−1}[V(b_i) + E(b_i)²] − 2ρτ² Σ_{i∈s} E(b_iM_i) + µ² V(Σ_{i∈s} (b_i − M_i))

where φ_i = n_i/(1−ρ+n_iρ).

Proof. For T̂ ∈ L_u^0, we see from (22), using (21),

E_θ(T̂ − T)² = τ²(1−ρ) Σ_{i∈s} E_θ(b_i²/n_i) + ρτ² Σ_{i∈s} [V_θ(b_i) + E_θ(b_i)²] − 2ρτ² Σ_{i∈s} E_θ(M_ib_i) + µ² V(Σ_{i∈s} (b_i − M_i)) + ψ
= τ² Σ_{i∈s} ((1−ρ)/n_i + ρ)[V_θ(b_i) + E_θ(b_i)²] − 2ρτ² Σ_{i∈s} E_θ(M_ib_i) + µ² V(Σ_{i∈s} (b_i − M_i)) + ψ.

The result follows since (1−ρ)/n_i + ρ = 1/φ_i. ♦

Let now φ_s = Σ_{i∈s} φ_i, α = τ²/(τ² + φ_sµ²) and m̂_i = (1−α)m_i + αβ̂x_i. Then the following result holds.

Theorem. The optimal predictor at θ in L_u is given by

T̂_θ = Σ_{i∈s} [m̂_i(1−w_i)µ̂_ρ + m_iw_iȳ_i] + µ̂_ρ Σ_{i∉s} β̂x_i   (24)

i.e., T̂_θ = Σ_{i∈s} Σ_{j∈s_i} y_ij + Ẑ_θ, where

Ẑ_θ = Σ_{i∈s} [m̂_i(1−w_i)µ̂_ρ + m_iw_iȳ_i − n_iȳ_i] + µ̂_ρ Σ_{i∉s} β̂x_i.

Here w_i = ρφ_i and µ̂_ρ = Σ_{i∈s} φ_iȳ_i/φ_s.

Remarks. (I) The optimal predictor at θ depends only on ρ and the coefficient of variation τ/µ, and is hence uniformly optimal in (µ, β, σ) if ρ and τ/µ are assumed known.

(II) The expression for T̂_θ means that for i ∈ s, Σ_{j=1}^{m_i} y_ij is estimated by m̂_i(1−w_i)µ̂_ρ + m_iw_iȳ_i, and for i ∉ s, Σ_{j=1}^{m_i} y_ij is estimated by µ̂_ρβ̂x_i, i.e., m_i is estimated by β̂x_i and each y_ij by µ̂_ρ.

(III) Let µ̂_i = (1−w_i)µ̂_ρ + w_iȳ_i. Then an alternative expression to (24) is:

T̂_θ = Σ_{i∈s} m̂_iµ̂_i + µ̂_ρ Σ_{i∉s} β̂x_i + R
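A sketch of the optimal predictor (24), treating ρ, τ²/µ² and β as known, as in Remark (I); the function name and all inputs are hypothetical illustration values.

```python
# Sketch of the optimal predictor (24): phi_i = n_i/(1 - rho + n_i*rho),
# mu_hat_rho is the phi-weighted mean of the ybar_i, alpha = tau2/(tau2 +
# phi_s*mu2), m_hat_i = (1 - alpha)*m_i + alpha*beta*x_i, and w_i = rho*phi_i.
# rho, tau2, mu2 and beta are treated as known; all inputs are hypothetical.

def t_theta(sampled, x_out, rho, tau2, mu2, beta):
    """sampled: list of (x_i, m_i, n_i, ybar_i) for i in s."""
    phi = [n / (1 - rho + n * rho) for _, _, n, _ in sampled]
    phi_s = sum(phi)
    mu_rho = sum(p * yb for p, (_, _, _, yb) in zip(phi, sampled)) / phi_s
    alpha = tau2 / (tau2 + phi_s * mu2)
    t = 0.0
    for p, (x, m, n, yb) in zip(phi, sampled):
        w = rho * p                                   # w_i = rho * phi_i
        m_hat = (1 - alpha) * m + alpha * beta * x
        t += m_hat * (1 - w) * mu_rho + m * w * yb    # predicted cluster total
    t += mu_rho * beta * sum(x_out)                   # non-sampled clusters
    return t

sampled = [(50.0, 48, 4, 3.0), (80.0, 85, 4, 2.0)]
print(round(t_theta(sampled, x_out=[70.0], rho=0.5, tau2=1.0, mu2=9.0, beta=1.0), 2))
```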
