A calibrated imputation method for secondary data analysis of survey data


Da Silva, Damião N and Li-Chun Zhang

"This is the accepted, peer reviewed version of the following article in The Scandinavian Journal of Statistics, which has been published in final form at https://doi.org/10.1111/sjos.12435

This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. It may contain minor differences from the original journal’s pdf-version.

The final authenticated version is available at:

Da Silva, D.N., Zhang, L‐C. A calibrated imputation method for secondary data analysis of survey data. Scandinavian Journal of Statistics. 2019; 1– 17.

https://doi.org/10.1111/sjos.12435


Damião Nóbrega Da Silva*¹ | Li-Chun Zhang²,³,⁴

¹ Departamento de Estatística, Universidade Federal do Rio Grande do Norte, Natal, RN, Brazil

² Southampton Statistical Sciences Research Institute, University of Southampton, Southampton, Hampshire, UK

³ Statistisk sentralbyrå, Oslo, Norway

⁴ University of Oslo, Oslo, Norway

Correspondence

*Damião Nóbrega Da Silva, Departamento de Estatística, Centro de Ciências Exatas e da Terra, Universidade Federal do Rio Grande do Norte, Natal, RN, 59078-970 Brazil. Email: [email protected]

Funding Information

This research was supported by the Brazilian Research Council (CNPq), Grant Number: 211518/2013-1

Summary

In practical survey sampling, missing data are unavoidable due to nonresponse, observations rejected by editing, disclosure control or outlier suppression. We propose a calibrated imputation approach so that valid point and variance estimates of the population (or domain) totals can be computed by the secondary users using simple complete-sample formulae. This is especially helpful for variance estimation, which generally requires additional information and tools that are unavailable to the secondary users. Our approach is natural for continuous variables, where the estimation may be based on either reweighting or imputation, possibly including their outlier-robust extensions. We also propose a multivariate procedure to accommodate the estimation of the covariance matrix between estimated population totals, which facilitates variance estimation of the ratios or differences among the estimated totals. We illustrate the proposed approach using simulation data in supplementary materials that are available online.

KEYWORDS:

analysis of incomplete data, item nonresponse, missing data, variance estimation

1 INTRODUCTION

In the preparation of survey data for use by secondary analysts, some or all of the sample units are usually assigned estimation weights that can be applied to all the survey variables. In addition to these weights, imputed values may be needed for the units that are subject to item missingness. It is often possible to choose the imputed values for each survey variable so that, together with the observed and retained values of this variable, the corresponding population total can be estimated by weighting as if the sample were completely observed. However, applying standard complete-sample variance estimator formulae to the same imputed sample would generally cause bias (see, e.g., Wolter, 2007, pp. 419, 421).

Variance estimation in the presence of imputed data needs to appropriately account for the underlying data generation mechanism. Some common techniques are Fay’s reverse framework (e.g. Fay, 1991, 1992; Shao & Steel, 1999; Kim & Rao, 2009), two-phase sampling (Särndal, 1992; Deville & Särndal, 1994b) and replication methods (Rao & Shao, 1992; Rao, 1996; Shao & Sitter, 1996; Chen, Rao, & Sitter, 2000). Choosing and applying any of these methods may be difficult for secondary users who are non-specialists, and even impossible if the relevant information about the sampling design and data processing is lacking.

The need for easier secondary analyses, in which different users of the imputed data obtain the same results by simple estimation methods, has long been recognized (Kalton & Kasprzyk, 1982). Early work from this estimation perspective for imputed data was done by Lanke (1983), Sedransk (1985) and Kim (2001a). Some later works considered the use of constrained or calibrated imputation for point estimation in different situations (Chambers & Ren, 2004; Beaumont, 2005; Chauvet, Deville, & Haziza, 2011; Gelein, Haziza, & Causeur, 2014). Multiple imputation (Rubin, 1978a, 1987b, 1996c; Rubin & Schenker, 1986) and fractional imputation (Kim & Fuller, 2004; Fuller & Kim, 2005) are two methods based on multiply imputed values. For instance, provided the multiple imputation procedure is proper or the congeniality condition (Meng, 1994) holds, one can compute valid point and variance estimates by combining the results obtained from applying standard complete-sample formulae to each imputed sample.

In this paper, we propose a calibrated imputation approach that allows for valid point and variance estimation of the population and domain totals (or means), by applying simple complete-sample formulae to the imputed sample. Although this accommodates a more limited scope than multiple or fractional imputation, secondary users can achieve the intended analyses based on a single imputed dataset using standard software. Moreover, we provide a multivariate procedure for the estimation of a vector of totals (or means) and the associated covariance matrix, using simple complete-sample formulae. This allows one to estimate the variance of ratios or differences of the estimated totals. Finally, the proposed approach also benefits the data producer, who avoids disseminating multiply imputed datasets and remains free to choose a suitable inference outlook and to apply different missing data treatments from one variable to another.

The rest of the paper is organized as follows. The proposed approach is outlined in Section 2. We explain in Section 3 how it can be applied in some common situations of reweighting and imputation-based estimation, as well as domain estimation and estimation under stratified multistage sampling. The multivariate calibrated imputation procedure is described in Section 4. Some concluding remarks and future research topics are given in Section 5.

2 CALIBRATION OF A SINGLE VARIABLE

2.1 Estimation Setup

Consider a finite population of $N$ units denoted by $U = \{1, 2, \dots, N\}$ and let $Y_N = (y_1, \dots, y_N)$, where $y_k$ is the value of a survey variable $y$ for the $k$th unit, $k \in U$. Let $A$ be a sample from $U$ selected by a probability sampling design $p(I_N)$, where $I_N = (i_1, \dots, i_N)$ and $i_k = 1$ if the $k$th unit is selected into the sample $A$ and $i_k = 0$ otherwise. Let $R_N = (r_1, \dots, r_N)$, where $r_k = 1$ if $y_k$ is observed and $r_k = 0$ if $y_k$ is unobserved or rejected during data processing ($k \in U$). Let $A_r = \{k : k \in A, r_k = 1\}$ be the set of units for which the observations are to be preserved and $A_m = \{k : k \in A, r_k = 0\}$ be the set for which imputation is needed. We assume the missing information of the variable $y$ for the units in $A_m$ is filled in by imputation and denote the corresponding imputed values by $\{y^*_k : k \in A_m\}$. We assume, in addition, that the imputed dataset will be accompanied by a set of survey weights $\{w_k : k \in A\}$, for instance the inverse of the inclusion probabilities (Horvitz & Thompson, 1952), or weights suitably calibrated to auxiliary population totals (Deville & Särndal, 1992a).

Suppose it is of interest to estimate the population total of the variable $y$, that is, $t_y = \sum_{k \in U} y_k$. When it comes to complete-sample estimation of $t_y$ using the imputed data $\{(w_k, y^*_k) : k \in A\}$, where $y^*_k$ is the value for unit $k$ in the imputed full sample $A$ with $y^*_k = y_k$ for $k \in A_r$, a natural and simple choice for the imputed estimator is

$$\hat t_{yI} = \sum_{k \in A} w_k y^*_k. \tag{1}$$

Statistical properties of $\hat t_{yI}$ are studied by adopting an inference approach for the imputed data, which is usually specified by explicitly postulating a model for the distribution of the response indicators or a superpopulation model for the values of the variables of interest in the population (Haziza, 2009, pp. 222-223). The properties of $\hat t_{yI}$ are then evaluated with respect to the joint distribution of the sampling design and the assumed model, allowing the unconditional variance of the imputed estimator to be decomposed into variance components which, when estimated, lead to the estimated variance of $\hat t_{yI}$.


Here we consider instead the estimation of the variance of the imputed estimator $\hat t_{yI}$ by means of the complete-sample estimator

$$\hat v_F(\hat t_{yI}) \equiv \frac{n}{n-1} \sum_{k \in A} (u^*_k - \bar u^*)^2, \tag{2}$$

where $u^*_k = w_k y^*_k$ ($k \in A$) and $\bar u^* = \sum_{k \in A} u^*_k / n = \hat t_{yI}/n$, which amounts to the with-replacement pps sampling variance formula and, hence, may be computed more easily by secondary users using standard software. For example, when $w_k = N/n$, then $\hat t_{yI} = N \bar y_I$ and $\mathrm{Var}(\hat t_{yI}) = N^2 s^2_{yI}/n$, where $\bar y_I$ and $s^2_{yI}$ are the sample mean and variance of the imputed variable.

Clearly, naive application of estimators (1) and (2) would generally lead to incorrect inference. In order for these estimators to yield valid estimates, the imputed values need to be created in a controlled manner, as discussed in the next section.
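As a minimal sketch of what a secondary user would compute, the following Python snippet evaluates the complete-sample formulae (1) and (2) on a toy imputed sample. The function names and the data are hypothetical and serve only as an illustration; the equal-weight case $w_k = N/n$ is used so that the result can be checked against $N^2 s^2_{yI}/n$.

```python
def total_estimate(w, y_star):
    """Imputed estimator (1): weighted sum over the full imputed sample."""
    return sum(wk * yk for wk, yk in zip(w, y_star))


def variance_estimate(w, y_star):
    """Complete-sample formula (2): n/(n-1) * sum_k (u_k - u_bar)^2, with u_k = w_k * y_k."""
    u = [wk * yk for wk, yk in zip(w, y_star)]
    n = len(u)
    u_bar = sum(u) / n
    return n / (n - 1) * sum((uk - u_bar) ** 2 for uk in u)


# Toy data (illustrative only): N = 100, n = 5, equal weights w_k = N/n.
N = 100
y_star = [2.0, 4.0, 6.0, 8.0, 10.0]
w = [N / len(y_star)] * len(y_star)

t_hat = total_estimate(w, y_star)
v_hat = variance_estimate(w, y_star)

# Cross-check with N^2 * s^2 / n, where s^2 is the sample variance of y_star.
n = len(y_star)
y_bar = sum(y_star) / n
s2 = sum((yk - y_bar) ** 2 for yk in y_star) / (n - 1)
v_check = N ** 2 * s2 / n
```

Both routes give the same number here, which is the sense in which (2) reduces to the familiar equal-weight formula.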

2.2 The calibrated imputation approach

The main goal of the following calibration method for the imputed data is to provide imputed values $y^*_k$ so that the complete-sample estimators (1) and (2) satisfy

$$\hat t_{yI} \equiv \sum_{k \in A} w_k y^*_k = \hat t_{y0}, \qquad \hat v_F(\hat t_{yI}) \equiv \frac{n}{n-1} \sum_{k \in A} (w_k y^*_k - \hat t_{yI}/n)^2 = \hat v_{y0}, \tag{3}$$

where $\hat t_{y0}$ and $\hat v_{y0}$ are valid target estimates of the population total and of its variance, respectively. The method requires the data producer to choose and calculate such targets for the variable specified, as well as to calibrate the imputed values to attain the conditions in (3). These targets should incorporate all the aspects of the sampling design, response mechanism and inference approach for the imputed estimator. However, as a benefit of the calibration method, the suitability of these target estimates is a matter of concern only for the data producer and not for the secondary users, who are no longer exposed to the theoretical and computational complications involved.

The calibration method can be described by the following two-step algorithm.

Calibration algorithm:

Step 1 (Imputation): Using a standard imputation procedure, obtain a set of initial imputed values $\{\tilde y_k : k \in A_m\}$. For each $\tilde y_k$, obtain a corresponding adjusted imputed value $\hat y_k$ so that

$$\sum_{k \in A_m} w_k \hat y_k = \hat t_{y0} - \sum_{k \in A_r} w_k y_k. \tag{4}$$

Step 2 (Calibration): For each $k \in A_r$, set $u^*_k = w_k y_k$. For $k \in A_m$, obtain an imputed value $u^*_k$ by a minimal adjustment to $\hat u_k = w_k \hat y_k$, where $\hat y_k$ is computed in Step 1, so that

$$\sum_{k \in A_m} u^*_k = \hat t_{y0} - \sum_{k \in A_r} w_k y_k \tag{5}$$

and

$$\sum_{k \in A_m} u^{*2}_k = \frac{n-1}{n} \hat v_{y0} + \frac{1}{n} \hat t^2_{y0} - \sum_{k \in A_r} w^2_k y^2_k, \tag{6}$$

where $\hat t_{y0}$ and $\hat v_{y0}$ are the targets in (3). Take $y^*_k = u^*_k / w_k$ for $k \in A$.

The algorithm begins in Step 1 by choosing an imputation scheme to provide preliminary imputed values $\hat y_k$, for $k \in A_m$, such that applying (1) with these values yields $\hat t_{y0}$. Provided the initial imputed values $\tilde y_k$ ($k \in A_m$) already yield $\hat t_{y0}$ by (1), one can simply take $\hat y_k = \tilde y_k$, for $k \in A_m$. An example is given in Section 3.1. Otherwise, the $\tilde y_k$ values need to be adjusted. One simple ratio adjustment of the initial imputed values is

$$\hat y_k = \frac{\hat t_{y0} - \sum_{\ell \in A_r} w_\ell y_\ell}{\sum_{\ell \in A_m} w_\ell \tilde y_\ell} \, \tilde y_k \quad (k \in A_m), \tag{7}$$

which is a special case of the reverse calibration approach of Chambers & Ren (2004), originally proposed for the estimation of $t_y$ in the presence of survey outliers. Then, in Step 2, the calibration of the imputed values is made. Optimal imputed values that are calibrated to (5) and (6) can be computed in closed form by applying Theorem 1 below. The proof of this theorem is given in the Appendix.

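Theorem 1. Consider initial values $\hat a_k$ and $d_k > 0$ for all $k$ in a non-empty set $D \subset A$. Suppose $\sum_{k \in D} d_k \hat a_k = t_1$ for some fixed constant $t_1$ and $\sum_{k \in D} d_k (\hat a_k - t_1/t_0)^2 > 0$, where $t_0 = \sum_{k \in D} d_k > 0$. Let $t_2 > t_1^2/t_0$ be a fixed constant. Then, the adjusted $a_k$ values that minimize $\Delta = \sum_{k \in D} d_k (a_k - \hat a_k)^2$ subject to the constraints

$$\sum_{k \in D} d_k a_k = t_1, \qquad \sum_{k \in D} d_k a^2_k = t_2, \tag{8}$$

are given by

$$a_k = t_1/t_0 + \beta (\hat a_k - t_1/t_0), \tag{9}$$

where

$$\beta = \left( \frac{t_2 - t_1^2/t_0}{\hat t_2 - t_1^2/t_0} \right)^{1/2}$$

and $\hat t_2 = \sum_{k \in D} d_k \hat a^2_k$.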
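The closed form (9) is straightforward to code. The sketch below is an illustrative Python rendering of Theorem 1; the function name, the tolerance and the toy numbers are assumptions made for this example and are not part of the paper.

```python
import math


def theorem1_adjust(a_hat, d, t1, t2):
    """Closed-form solution (9): shrink or stretch a_hat around t1/t0 by the factor beta."""
    t0 = sum(d)
    centre = t1 / t0
    # Theorem 1 presupposes that the starting values already meet the first constraint in (8).
    if abs(sum(dk * ak for dk, ak in zip(d, a_hat)) - t1) > 1e-8 * max(1.0, abs(t1)):
        raise ValueError("sum of d_k * a_hat_k must equal t1")
    num = t2 - t1 ** 2 / t0                                  # requires t2 > t1^2 / t0
    den = sum(dk * (ak - centre) ** 2 for dk, ak in zip(d, a_hat))  # equals t2_hat - t1^2 / t0
    if num <= 0 or den <= 0:
        raise ValueError("conditions of Theorem 1 are not met")
    beta = math.sqrt(num / den)
    return [centre + beta * (ak - centre) for ak in a_hat], beta


# Toy values (illustrative only): sum(a_hat) = 12 = t1 and the target second moment is t2 = 92.
a_hat = [1.0, 2.0, 3.0, 6.0]
d = [1.0] * len(a_hat)
a, beta = theorem1_adjust(a_hat, d, t1=12.0, t2=92.0)
```

In this toy case the adjusted values keep their total while their weighted sum of squares moves from 50 to the requested 92, which is all that the two constraints in (8) ask for.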


The optimal calibrated imputed values $y^*_k$ of Step 2 are obtained as follows. First, we take the values of the calibration conditions $t_1$ and $t_2$ of (8) as the right-hand sides of (5) and (6), namely

$$t_1 = \hat t_{y0} - \sum_{k \in A_r} w_k y_k \tag{10}$$

and

$$t_2 = (n-1)\hat v_{y0}/n + \hat t^2_{y0}/n - \sum_{k \in A_r} w^2_k y^2_k.$$

Then, we set $D = A_m$, $d_k = 1$ and $\hat a_k = \hat u_k = w_k \hat y_k$ for all $k \in A_m$, where the $\hat y_k$ ($k \in A_m$) are obtained in Step 1, so that $t_0 = m$, the size of $A_m$. Thus, it follows from (9) that the $u^*_k$ values of Step 2 are

$$u^*_k \equiv a_k = t_1/m + \hat\beta (\hat u_k - t_1/m) \quad (k \in A_m), \tag{11}$$

where $t_1$ is defined in (10) and

$$\hat\beta = \left\{ \frac{(n-1)\hat v_{y0}/n - \sum_{k \in A_r} (u_k - \hat t_{y0}/n)^2 - m (\hat t_{y0}/n - t_1/m)^2}{\sum_{k \in A_m} (\hat u_k - t_1/m)^2} \right\}^{1/2}.$$

The resulting calibrated imputed variable is

$$y^*_k = \begin{cases} y_k, & k \in A_r, \\ u^*_k / w_k, & k \in A_m. \end{cases} \tag{12}$$
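To make the two-step algorithm concrete, here is a hedged Python sketch that chains the ratio adjustment (7) with the closed form (11) and returns the calibrated variable (12). The function `calibrate`, its argument layout and the toy sample are hypothetical choices for illustration; the targets 300 and 2640 are arbitrary numbers, not estimates produced by any particular method.

```python
import math


def calibrate(w, y, r, y_tilde, t_target, v_target):
    """Two-step calibrated imputation for one variable.

    w, y, r : weights, values (ignored where r[k] == 0) and response indicators, all of length n.
    y_tilde : dict {index of missing unit: initial imputed value}.
    """
    n = len(w)
    resp = [k for k in range(n) if r[k] == 1]
    miss = [k for k in range(n) if r[k] == 0]
    m = len(miss)
    t1 = t_target - sum(w[k] * y[k] for k in resp)                     # (10)

    # Step 1: ratio adjustment (7), so that the imputed total equals t1.
    scale = t1 / sum(w[k] * y_tilde[k] for k in miss)
    u_hat = {k: w[k] * scale * y_tilde[k] for k in miss}

    # Step 2: closed form (11) from Theorem 1 with D = A_m and d_k = 1.
    t2 = (n - 1) * v_target / n + t_target ** 2 / n - sum((w[k] * y[k]) ** 2 for k in resp)
    num = t2 - t1 ** 2 / m
    den = sum((u - t1 / m) ** 2 for u in u_hat.values())
    if num <= 0 or den <= 0:
        raise ValueError("targets are not attainable for this sample")
    beta = math.sqrt(num / den)

    y_star = list(y)
    for k in miss:
        y_star[k] = (t1 / m + beta * (u_hat[k] - t1 / m)) / w[k]       # (11) and (12)
    return y_star


# Toy sample (illustrative only): n = 6 units, the last two are missing.
w = [10.0] * 6
y = [3.0, 5.0, 4.0, 8.0, None, None]
r = [1, 1, 1, 1, 0, 0]
y_star = calibrate(w, y, r, {4: 4.0, 5: 6.0}, t_target=300.0, v_target=2640.0)
```

Applying formulae (1) and (2) to `y_star` with the weights `w` reproduces the two targets, while the four observed values are left untouched.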

Remark 1. The calibrated imputation method in (12) does not modify the observed values for units in the respondent set ($A_r$). The values that are actually modified are the calibrated $y^*_k = u^*_k / w_k$ values ($k \in A_m$), where the $u^*_k$ values minimize the squared distance to the imputed values $\hat u_k = w_k \hat y_k$ ($k \in A_m$) obtained in Step 1, that is, $\Delta = \sum_{k \in A_m} (u^*_k - \hat u_k)^2$. The resulting $u^*_k$ values are obtained analytically as the “best” linear predictor of $u_k$ based on the $\hat u_k$ ($k \in A_m$), where the slope $\hat\beta$ of the regression line, given in (11), dictates how the empirical variance of the $u^*_k$ relates to that of the $\hat u_k$ ($k \in A_m$). In practice, unless the $\hat y_k$ values are created to have greater empirical variance over $A_m$ than $A_r$, one may expect $\hat\beta > 1$. This is because the formula (2) is ostensibly aimed at a variance of the order $n^{-1}$, whereas the target $\hat v_{y0}$ is generally aimed at a variance of the order $r^{-1}$, where $r$ is the size of $A_r$. Thus, in order for the two to be equal to each other, the imputed $y^*_k$ values will need to have greater variation over $A_m$ than the observed $y_k$ over $A_r$.

Remark 2. Given the set of missing units $A_m$, the application of Theorem 1 to obtain the optimal solution (11) requires that

$$\hat v_{y0} > \frac{n}{n-1} \left\{ \sum_{k \in A_r} (u_k - \hat t_{y0}/n)^2 + m (t_1/m - \hat t_{y0}/n)^2 \right\} \tag{13}$$

and

$$\sum_{k \in A_m} (\hat u_k - t_1/m)^2 > 0. \tag{14}$$

Comparing (13) to (2), it is readily seen that, for the solution to the optimization problem in Step 2 to exist, the target estimate $\hat v_{y0}$ needs to be larger than the full-sample variance estimate (2) that would have been obtained had the missing values been imputed by the common value $t_1/m$. The second condition (14) demands that the sampling weights and the imputation scheme are such that the $\hat u_k = w_k \hat y_k$ values are different from $t_1/m$ for at least one $k \in A_m$. This is not the case when mean imputation is used at Step 1 to fill in the missing values of an equal probability sample. In such a situation, the proposed approach could still be applied by adding some initial zero-mean noise to each mean imputed value. The calibration constraints ensure that this added variability will not affect the variance of the imputed estimator.
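A data producer may want to check conditions (13) and (14) before attempting Step 2. The following Python sketch does so on toy values; the function name, the tolerance and the numbers are assumptions for illustration. The perturbed values are chosen to sum to the same total as the mean-imputed ones, so that the Step 1 constraint (4) is still met.

```python
def feasibility(u_r, u_hat_m, t_target, v_target, tol=1e-12):
    """Return (condition (13) holds, condition (14) holds).

    u_r     : weighted observed values w_k * y_k for k in A_r.
    u_hat_m : weighted Step 1 imputations w_k * y_hat_k for k in A_m.
    """
    n = len(u_r) + len(u_hat_m)
    m = len(u_hat_m)
    t1 = t_target - sum(u_r)
    centre = t_target / n
    lower_bound = n / (n - 1) * (
        sum((u - centre) ** 2 for u in u_r) + m * (t1 / m - centre) ** 2
    )
    cond13 = v_target > lower_bound
    cond14 = sum((u - t1 / m) ** 2 for u in u_hat_m) > tol
    return cond13, cond14


# Toy values (illustrative only): equal weights of 10 and respondent mean 5.
u_r = [30.0, 50.0, 40.0, 80.0]
mean_imputed = [50.0, 50.0]   # mean imputation: every u_hat_k equals t1/m
perturbed = [45.0, 55.0]      # zero-sum noise added, total unchanged

check_mean = feasibility(u_r, mean_imputed, t_target=300.0, v_target=2640.0)
check_noise = feasibility(u_r, perturbed, t_target=300.0, v_target=2640.0)
check_low_target = feasibility(u_r, perturbed, t_target=300.0, v_target=1500.0)
```

With these numbers, plain mean imputation fails (14), the perturbed values pass both conditions, and a variance target of 1500 fails (13) because it lies below the bound of 1680 implied by the observed data.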

3 SOME APPLICATIONS

We explain below how the two-step approach and Theorem 1 proposed in Section 2 can be applied in some general situations, which comprise reweighting and imputation-based estimation, as well as domain estimation and estimation under stratified multistage sampling.

3.1 Ratio imputation

Suppose that, in addition to the survey variable $y$, there is an auxiliary variable $x$ which is not affected by nonresponse. Assume a population ratio model $\xi$ for the pairs $\{(x_k, y_k) : k \in U\}$, under which

$$E_\xi(y_k \mid x_k) = \beta_0 x_k, \qquad \mathrm{Var}_\xi(y_k \mid x_k) = \sigma^2 x_k,$$

for some unknown parameters $\beta_0$ and $\sigma^2$. By ratio imputation under the model $\xi$, the missing $y_k$ values are imputed as

$$\tilde y_k = \hat\beta_{0r} x_k \quad (k \in A_m),$$

where $\hat\beta_{0r} = \sum_{k \in A_r} w_k y_k / \sum_{k \in A_r} w_k x_k$, $w_k = 1/\pi_k$, and $\pi_k$ is the sample inclusion probability, for $k \in A$. The resulting imputed estimator of the population total $t_y$ is

$$\hat t_{y0} = \sum_{k \in A_r} w_k y_k + \sum_{k \in A_m} w_k \tilde y_k = \hat\beta_{0r} \hat t_x,$$

where $\hat t_x = \sum_{k \in A} w_k x_k$ is the Horvitz-Thompson estimator (Horvitz & Thompson, 1952) of the population total $t_x = \sum_{k \in U} x_k$. Mean imputation is a special case of ratio imputation with $x_k = 1$ for all $k \in A$, by which the imputed estimator $\hat t_{y0}$ reduces to $\hat t_{y0} = N \bar y_r$. Under the conditions of Theorem 1 of Kim & Rao (2009), a design and model consistent estimator of the variance of $\hat t_{y0}$ can be expressed as

$$\hat v_{y0} = \hat v_1 + \hat v_2, \tag{15}$$

where

$$\hat v_1 = \sum_{k \in A} \sum_{\ell \in A} \frac{\pi_{k\ell} - \pi_k \pi_\ell}{\pi_{k\ell}} \, w_k \hat\eta_k w_\ell \hat\eta_\ell, \qquad \hat v_2 = \left( \frac{\hat t_x}{\hat t_{xr}} \right)^2 \sum_{k \in A_r} w_k \hat e^2_k,$$

$$\hat\eta_k = \hat\beta_{0r} x_k + \frac{\hat t_x}{\hat t_{xr}} r_k \hat e_k \quad (k \in A),$$

$\hat t_{xr} = \sum_{k \in A_r} w_k x_k$ and $\hat e_k = y_k - \hat\beta_{0r} x_k$.

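However, to compute $\hat v_1$, the secondary user needs to have access to the matrix of the second-order inclusion probabilities $\{\pi_{k\ell} : k, \ell \in A\}$, which are almost never disseminated together with the imputed sample. The proposed approach avoids this complication. To calibrate the ratio imputed values $\tilde y_k = \hat\beta_{0r} x_k$ ($k \in A_m$), we notice that $\hat y_k = \tilde y_k$ already satisfies Step 1, since $t_1 = \hat\beta_{0r} \hat t_{xm}$ in Theorem 1, with $\hat t_{xm} = \sum_{k \in A_m} w_k x_k$. For Step 2, by (12) and (11), the calibrated imputed values are

$$y^*_k = u^*_k / w_k, \tag{16}$$

where $u^*_k = w_k y_k$ ($k \in A_r$), and for $k \in A_m$,

$$u^*_k = \frac{\hat\beta_{0r} \hat t_{xm}}{m} + \hat\beta \hat\beta_{0r} \left( w_k x_k - \frac{\hat t_{xm}}{m} \right) \quad (k \in A_m)$$

and

$$\hat\beta^2 = \frac{\dfrac{n-1}{n} (\hat v_1 + \hat v_2) - \sum_{k \in A_r} \left( w_k y_k - \dfrac{\hat\beta_{0r} \hat t_x}{n} \right)^2 - m \hat\beta^2_{0r} \left( \dfrac{\hat t_x}{n} - \dfrac{\hat t_{xm}}{m} \right)^2}{\hat\beta^2_{0r} \sum_{k \in A_m} \left( w_k x_k - \dfrac{\hat t_{xm}}{m} \right)^2}.$$

In the case of mean imputation and simple random sampling without replacement, (15) reduces to

$$\hat v_{y0} = N^2 \left( \frac{1}{r} - \frac{1}{N} \right) s^2_{yr}, \tag{17}$$

where $s^2_{yr} = \sum_{k \in A_r} (y_k - \bar y_r)^2 / (r - 1)$ and $\bar y_r$ is the observed respondent mean.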
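The mean-imputation case lends itself to a small numerical check. The Python sketch below is illustrative only, with hypothetical function names and toy data: it performs ratio imputation with $x_k = 1$, computes the target (17), and compares it with what formula (2) would return if applied naively to the mean-imputed sample.

```python
def ratio_impute(w, x, y, r):
    """Ratio imputation: replace missing y_k by beta_0r * x_k; returns (beta_0r, imputed y)."""
    resp = [k for k in range(len(w)) if r[k] == 1]
    beta_0r = sum(w[k] * y[k] for k in resp) / sum(w[k] * x[k] for k in resp)
    return beta_0r, [y[k] if r[k] == 1 else beta_0r * x[k] for k in range(len(w))]


def target_variance_srs_mean(N, y_resp):
    """Target (17): N^2 * (1/r - 1/N) * s^2_yr under SRSWOR with mean imputation."""
    r_size = len(y_resp)
    y_bar = sum(y_resp) / r_size
    s2 = sum((yk - y_bar) ** 2 for yk in y_resp) / (r_size - 1)
    return N ** 2 * (1 / r_size - 1 / N) * s2


def complete_sample_variance(w, y_full):
    """Formula (2) applied to a fully imputed sample."""
    u = [wk * yk for wk, yk in zip(w, y_full)]
    n = len(u)
    u_bar = sum(u) / n
    return n / (n - 1) * sum((uk - u_bar) ** 2 for uk in u)


# Toy data (illustrative only): N = 60, n = 6, four respondents, x_k = 1 (mean imputation).
N = 60
w = [10.0] * 6
x = [1.0] * 6
y = [3.0, 5.0, 4.0, 8.0, None, None]
r = [1, 1, 1, 1, 0, 0]

beta_0r, y_imp = ratio_impute(w, x, y, r)
t_hat = sum(wk * yk for wk, yk in zip(w, y_imp))      # equals beta_0r * t_x_hat = N * y_bar_r
v_target = target_variance_srs_mean(N, [3.0, 5.0, 4.0, 8.0])
v_naive = complete_sample_variance(w, y_imp)
```

In this example the naive value is well below the target, which illustrates why the imputed values need to be spread out by Step 2 before the complete-sample formula can be trusted.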

3.2 Domain estimation

As a realistic setting for domain total estimation, in addition to the population total, consider a domain population partition $U = U_1 \cup \dots \cup U_D$. Let the population total of domain $U_d$ be

$$t_{dy} = \sum_{k \in U_d} y_k = \sum_{k \in U} \delta_{kd} y_k,$$

where the domain indicator $\delta_{kd}$, with $\delta_{kd} = 1$ if $k \in U_d$ and $\delta_{kd} = 0$ otherwise, is observed for all units in the sample $A$ ($d = 1, \dots, D$). Let $\hat t_{dy}$ be the target domain total estimator and $\hat v_{dy}$ its variance estimate. Domain estimation can be handled by separate calibration for each domain by the producer and application of the domain complete-data formulae by the secondary users, yielding $\hat t_{dyI} = \hat t_{dy}$ and $\hat v_F(\hat t_{dyI}) = \hat v_{dy}$, as explained in Section 2.

However, one is still interested in estimating the population total, in addition to the domain totals. Directly applying the complete-sample formula (1) to the domain-calibrated imputed sample would correctly estimate the population total. One can combine the domain variance estimates, as if the sampling were stratified by the domains. However, the resulting variance estimate is incorrect even when the domain total estimators are independent of each other, due to the additional term

$$v_b = \frac{n}{n-1} \sum_{d=1}^{D} n_d (\hat t_{dyI}/n_d - \hat t_{y0}/n)^2 = \frac{n^2}{n-1} V_n(\hat t_{dyI}/n_d),$$

where $V_n(\hat t_{dyI}/n_d)$ is the variance of $\hat t_{dyI}/n_d$ with respect to the empirical sample domain distribution function $(n_1/n, \dots, n_D/n)$, since

$$V_n(\hat t_{dyI}/n_d) = \sum_{d=1}^{D} \frac{n_d}{n} (\hat t_{dyI}/n_d - \hat t_{y0}/n)^2 \quad \text{and} \quad \hat t_{y0}/n = E_n(\hat t_{dyI}/n_d) = \sum_{d=1}^{D} \frac{n_d}{n} (\hat t_{dyI}/n_d).$$

We propose to introduce a domain estimation effect factor, denoted by $\gamma$, and use

$$\hat v_F(\hat t_{yI}) = \gamma^2 \frac{n}{n-1} \sum_{k \in A} (w_k y^*_k - \hat t_{y0}/n)^2 = \hat v_{y0}. \tag{18}$$

The factor $\gamma$ can be calculated after domain-calibrated imputation, and disseminated together with the imputed sample.
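Solving (18) for $\gamma$ is a one-line computation once the domain-calibrated values are available. The Python sketch below is a hypothetical illustration: the function name and the numbers are invented, and it assumes that the calibrated values already sum to $\hat t_{y0}$, so that the sample mean of the $u^*_k$ equals $\hat t_{y0}/n$.

```python
import math


def domain_effect_factor(u_star, v_target):
    """Solve (18) for gamma, given u*_k = w_k * y*_k over the whole sample A."""
    n = len(u_star)
    centre = sum(u_star) / n   # equals t_hat_y0 / n when the total is calibrated
    unscaled = n / (n - 1) * sum((u - centre) ** 2 for u in u_star)
    return math.sqrt(v_target / unscaled)


# Toy values (illustrative only): the unscaled formula gives 2640, the population target is 2112.
u_star = [30.0, 50.0, 40.0, 80.0, 30.0, 70.0]
gamma = domain_effect_factor(u_star, v_target=2112.0)
```

A secondary user would then multiply the output of the usual complete-sample variance formula by `gamma ** 2` when estimating the variance of the population total.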

In the separate domain calibration above, $\hat v_F(\hat t_{dyI})$ is built on the squared errors around $\hat t_{dyI}/n_d$. Consider using another complete-sample formula $\hat v_F(\hat t_{dyI})$, built around $\hat t_{yI}/n$ instead, where

$$\hat v_F(\hat t_{dyI}) = \frac{n_d}{n_d - 1} \sum_{k \in A} \delta_{kd} (w_k y^*_{kd} - \hat t_{y0}/n)^2.$$

We need to extend the calibration constraints as follows:

$$\begin{cases} \hat t_{dyI} = \sum_{k \in A} \delta_{kd} w_k y^*_k = \hat t_{dy} & \text{for } d = 1, \dots, D, \\ \hat v_F(\hat t_{dyI}) = \dfrac{n_d}{n_d - 1} \sum_{k \in A} \delta_{kd} (w_k y^*_{kd} - \hat t_{y0}/n)^2 = \hat v_{dy} & \text{for } d = 1, \dots, D, \\ \hat v_F(\hat t_{yI}) = \gamma^2 \dfrac{n}{n-1} \sum_{k \in A} (w_k y^*_k - \hat t_{y0}/n)^2 = \hat v_{y0}. \end{cases} \tag{19}$$

In other words, we use $\delta_{kd}$ to identify the relevant observations for domain estimation, including the special case of $U_d = U$ and $\delta_{kd} \equiv 1$, and use $\hat t_{y0}/n$ in all the ultimate variance estimators, including domain variance estimation. We refer to (19) as the centred domain calibration approach.

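Minimum adjustments of $\{\hat y_k ; k \in A_m\}$ from Step 2 of the proposed approach can be achieved by Theorem 1 as well. To fix ideas, suppose that $1/n$ and $1/n_d$ are negligible. Let $\{u^*_k ; k \in A_{md}\}$ be the calibrated imputations in domain $d$, given by

$$u^*_k = t_{1d}/m_d + \beta_d (\hat u_k - t_{1d}/m_d),$$

where $m_d$ is the size of $A_{md}$ and $t_{1d} = \hat t_{dy} - \sum_{k \in A_{rd}} w_k y_k$ is the constrained total of $u^*_k = w_k y^*_k$ in $A_{md}$. However, instead of choosing $\beta_d$ such that

$$\beta_d^2 \sum_{k \in A_{md}} \left( \hat u_k - \frac{t_{1d}}{m_d} \right)^2 = \hat v_{dy} - \sum_{k \in A_{rd}} \left( u_k - \frac{\hat t_{dy}}{n_d} \right)^2 - m_d \left( \frac{\hat t_{dy}}{n_d} - \frac{t_{1d}}{m_d} \right)^2,$$

as under separate domain calibration, we should now choose $\beta_d$ such that

$$\beta_d^2 \sum_{k \in A_{md}} \left( \hat u_k - \frac{t_{1d}}{m_d} \right)^2 = \hat v_{dy} - \sum_{k \in A_{rd}} \left( u_k - \frac{\hat t_{y0}}{n} \right)^2 - m_d \left( \frac{\hat t_{y0}}{n} - \frac{t_{1d}}{m_d} \right)^2.$$

This allows us to estimate the domain variance $\hat v_{dy}$ as in (19). The domain estimation effect factor $\gamma$ can be calculated afterwards to satisfy (19). The conditions for the existence of a solution are formally the same as discussed in Section 2.2. Given that domain-specific calibration is feasible, the centred calibration is also feasible as long as $\hat t_{y0}/n$ does not differ too much from $\hat t_{dy}/n_d$ in the different domains.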

In practice one may be interested in multiple sets of (overlapping) domains. For example, a user may want to have estimates by region as well as estimates by industry. Insofar as the need is known in advance, the producer can apply the approach above to the ‘atomic domains’, which arise from crossing region and industry. In addition to the separate atomic-domain calibrated sample, one can supply a domain estimation factor for the population total, a set of domain estimation factors for each of the regions, and another set of factors for each industry.


3.3 Stratified Multistage Sampling

Let the population $U$ be partitioned into $H$ strata of $n$ primary sampling units (PSUs), where a sample $A_h$ of $n_h$ PSUs is selected separately within the $h$th stratum ($h = 1, \dots, H$; $n_1 + \dots + n_H = n$). From each PSU in $A_h$, additional stages of sampling are undertaken until the selection of the ultimate sampling units (USUs). Let $w_i$, $y_i$ and $r_i$ be, respectively, the weight, the $y$-value and the response indicator for the $i$th USU. Let $A_{hk}$ be the set of USUs in the $k$th selected PSU of the $h$th stratum, where $A_{rhk} = \{i : i \in A_{hk}, r_i = 1\}$ and $A_{mhk} = \{i : i \in A_{hk}, r_i = 0\}$.

By setting $y^*_i = y_i$ if $r_i = 1$ and letting $y^*_i$ be the calibrated imputation value if $r_i = 0$, the imputed estimate of the population total $t_y$ can be written as $\hat t_{yI} = \sum_{h=1}^{H} \hat t_{yIh}$, where $\hat t_{yIh} = \sum_{k \in A_h} u^*_{hk}$ and

$$u^*_{hk} = \sum_{i \in A_{hk}} w_i y^*_i = \sum_{i \in A_{rhk}} w_i y_i + \sum_{i \in A_{mhk}} w_i y^*_i.$$

For calibrated imputation that enables (3), we can apply Theorem 1 and the two-step approach directly at the level of USUs, ignoring the clustering structure of the multistage sampling.

Survey data analysis software (such as Stata, R, SAS) commonly uses the stratified ultimate-cluster variance formula for variance estimation. It is therefore convenient if the secondary user can simply input the imputed sample, and let the software carry on as usual. Thus, as another possible full-sample variance estimator, we consider

$$\hat v_F(\hat t_{yI}) = \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{k \in A_h} (u^*_{hk} - \bar u^*_h)^2,$$

where $\bar u^*_h = \sum_{k \in A_h} u^*_{hk} / n_h = \hat t_{yIh}/n_h$. This choice fits naturally with the standard approach of ultimate-cluster variance estimation under stratified multistage sampling (e.g. Skinner, 1989, Section 2.13).
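For readers who want to see the stratified formula in executable form, here is a small Python sketch. The function name, the dictionary layout and the PSU totals are hypothetical; each stratum is assumed to contain at least two sampled PSUs, since otherwise $n_h/(n_h - 1)$ is undefined.

```python
def stratified_ultimate_cluster_variance(psu_totals_by_stratum):
    """Sum over strata of n_h/(n_h - 1) * sum_k (u*_hk - u_bar*_h)^2.

    psu_totals_by_stratum maps a stratum label to the list of weighted PSU totals u*_hk.
    """
    v = 0.0
    for totals in psu_totals_by_stratum.values():
        n_h = len(totals)
        u_bar_h = sum(totals) / n_h
        v += n_h / (n_h - 1) * sum((u - u_bar_h) ** 2 for u in totals)
    return v


# Toy PSU totals (illustrative only): two strata with three and two sampled PSUs.
psu_totals = {"h1": [10.0, 14.0, 12.0], "h2": [20.0, 26.0]}
v_hat = stratified_ultimate_cluster_variance(psu_totals)
t_hat = sum(sum(totals) for totals in psu_totals.values())
```

Only the PSU-level totals enter the formula, which is why the calibration that follows works on PSU totals first and on individual imputed values afterwards.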

Given $\hat t_{y0h}$ for the population total in the $h$th stratum and its associated variance estimate $\hat v_{y0h} = \hat v(\hat t_{y0h})$ ($h = 1, \dots, H$), consider the problem of finding the values $y^*_j$ starting with $\tilde y_j$, $j \in \cup_{k \in A_h} A_{mhk}$, so that

$$\begin{aligned} \hat t_{yIh} &= \sum_{k \in A_h} u^*_{hk} = \hat t_{y0h}, \\ \sum_{k \in A_h} (u^*_{hk} - \bar u^*_h)^2 &= \sum_{k \in A_h} u^{*2}_{hk} - \hat t^2_{y0h}/n_h = (n_h - 1) \hat v_{y0h}/n_h. \end{aligned} \tag{20}$$

We propose to obtain a solution of this problem in two stages. First, the initial imputed PSU totals are adjusted minimally subject to the two constraints above, yielding the adjusted PSU totals $u^*_{hk}$. Second, the initial imputed values $\tilde y_j$ are adjusted, separately within each PSU, to agree with the corresponding calibrated PSU total from the first stage.

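For the first stage, we can apply Theorem 1 within the $h$th stratum similarly as in Section 2. Let $A_{h0} = \{k \in A_h : \#(A_{mhk}) = 0\}$, $\hat u_{hk} = u_{hk}$ for $k \in A_{h0}$ and

$$\hat u_{hk} = \tilde u_{hk} \, \frac{\hat t_{y0h} - \sum_{k \in A_{h0}} u_{hk}}{\sum_{\ell \in A_h \setminus A_{h0}} \tilde u_{h\ell}}$$

for $k \in A_h \setminus A_{h0}$, where $\tilde u_{hk}$ is the initial imputed total of PSU $k$. Then, take $D = D_h = A_h \setminus A_{h0}$, $d_k = d_{hk} = 1$, $\hat a_k = \hat a_{hk} = \hat u_{hk}$, $t_0 = \sum_{k \in A_h \setminus A_{h0}} d_k \equiv m_h$, $t_1 = t_{1h} = \hat t_{y0h} - \sum_{k \in A_{h0}} u_{hk}$ and $t_2 = t_{2h} = (n_h - 1) \hat v_{y0h}/n_h + \hat t^2_{y0h}/n_h - \sum_{k \in A_{h0}} u^2_{hk}$, in agreement with (20). For each $h = 1, \dots, H$, the optimal solution that minimizes the squared distance $\Delta_h = \sum_{k \in A_h} (u^*_{hk} - \hat u_{hk})^2 = \sum_{k \in A_h \setminus A_{h0}} (u^*_{hk} - \hat u_{hk})^2$ subject to (20) is given by $u^*_{hk} = u_{hk}$ for $k \in A_{h0}$, and

$$u^*_{hk} = \frac{t_{1h}}{m_h} + \hat\beta_h \left( \hat u_{hk} - \frac{t_{1h}}{m_h} \right) \tag{21}$$

for $k \in A_h \setminus A_{h0}$, where

$$\hat\beta_h = \left\{ \frac{(n_h - 1) \hat v_{y0h}/n_h - \sum_{k \in A_{h0}} (u_{hk} - \hat t_{y0h}/n_h)^2 - m_h (\hat t_{y0h}/n_h - t_{1h}/m_h)^2}{\sum_{k \in A_h \setminus A_{h0}} (\hat u_{hk} - t_{1h}/m_h)^2} \right\}^{1/2}.$$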

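Having thus obtained $u^*_{hk}$, we adjust the $\tilde y_i$’s separately within each PSU so that $u^*_{hk} = \sum_{i \in A_{rhk}} w_i y_i + \sum_{i \in A_{mhk}} w_i y^*_i$, which is a single constraint. For given $h$ and $k \in A_h \setminus A_{h0}$, the values $y^*_i$ that minimize the distance $\sum_{i \in A_{mhk}} (y^*_i - \tilde y_i)^2 / 2$ subject to $\sum_{i \in A_{mhk}} w_i y^*_i = u^*_{hk} - \sum_{i \in A_{rhk}} w_i y_i \equiv u_{hk0}$ are

$$y^*_i = \tilde y_i \left\{ 1 + \left( \frac{w_i}{\tilde y_i} \right) \frac{u_{hk0} - \sum_{j \in A_{mhk}} w_j \tilde y_j}{\sum_{j \in A_{mhk}} w^2_j} \right\} \quad (i \in A_{mhk}). \tag{22}$$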
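Expanding the braces in (22) gives the additive form $y^*_i = \tilde y_i + w_i \lambda$ with $\lambda = (u_{hk0} - \sum_j w_j \tilde y_j) / \sum_j w_j^2$, which avoids dividing by $\tilde y_i$. The Python sketch below uses this equivalent form; the function name and the toy PSU are illustrative assumptions.

```python
def adjust_within_psu(w, y_tilde, u_hk0):
    """Within-PSU adjustment (22) in additive form: y*_i = y_tilde_i + w_i * lam.

    w, y_tilde : weights and initial imputed values of the missing USUs in one PSU.
    u_hk0      : calibrated PSU total minus the weighted total of the observed USUs.
    """
    lam = (u_hk0 - sum(wi * yi for wi, yi in zip(w, y_tilde))) / sum(wi ** 2 for wi in w)
    return [yi + wi * lam for wi, yi in zip(w, y_tilde)]


# Toy PSU (illustrative only): two missing USUs whose weighted total must equal 15.
w = [2.0, 1.0]
y_tilde = [3.0, 4.0]
y_star = adjust_within_psu(w, y_tilde, u_hk0=15.0)
```

Units with larger weights absorb a larger share of the correction, as expected from minimizing an unweighted squared distance under a weighted linear constraint.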

4 CALIBRATION OF MULTIPLE VARIABLES

Let $\boldsymbol{y}_k = (y_{k1}, \dots, y_{kp})^\top$ denote a $p$-dimensional vector of values for the $k$th unit and $\boldsymbol{u}^*_k = w_k \boldsymbol{y}^*_k$, where $\boldsymbol{y}^*_k = (y^*_{k1}, \dots, y^*_{kp})^\top$ denotes the calibrated imputed values, with the restriction that $y^*_{k\ell} = y_{k\ell}$ if $y_{k\ell}$ is observed and fixed ($k \in A$ and $\ell = 1, \dots, p$).

Following the basic algorithm of Section 2, consider the problem of finding the $\boldsymbol{u}^*_k$ satisfying

$$\sum_{k \in A} \boldsymbol{u}^*_k = \hat{\boldsymbol{t}}_0, \qquad \frac{n}{n-1} \sum_{k \in A} (\boldsymbol{u}^*_k - \hat{\boldsymbol{t}}_0/n)^{\otimes 2} = \hat{\boldsymbol{V}}_0,$$

where $\hat{\boldsymbol{t}}_0$ denotes a $p$-dimensional vector of target estimates for the population total $\boldsymbol{t}_y = \sum_{k \in U} \boldsymbol{y}_k$, $\hat{\boldsymbol{V}}_0$ denotes the target estimated variance-covariance matrix of $\hat{\boldsymbol{t}}_0$ and $\boldsymbol{a}^{\otimes 2} = \boldsymbol{a} \boldsymbol{a}^\top$. Obtaining $\hat{\boldsymbol{V}}_0$ in the presence of multivariate missing data is a difficult issue. See, e.g., Skinner & Rao (2002) and Chauvet & Haziza (2012) for a fully efficient approach in the bivariate case, and Im et al. (2018) and Sang & Kim (2018) for two fractional imputation methods in the multivariate setting. Below we propose a two-phase calibration procedure, where at the first phase the problem is solved for transformed vectors $\boldsymbol{v}^*_k$ ($k \in A$), and at the second phase the results are back-transformed to the $\boldsymbol{u}^*_k$ as required. Assume without loss of generality the weighted values $\hat{\boldsymbol{u}}_k = w_k \hat{\boldsymbol{y}}_k \equiv (\hat u_{k1}, \dots, \hat u_{kp})^\top$