"The worse the passage the more welcome the port."
Thomas Fuller
Writing this cand.scient. thesis has been an exhausting, rewarding and time-consuming process. During this period of hard work I have appreciated the support and inspiration of others. I hereby want to acknowledge my supervisor, Professor Nils Lid Hjort, for his guidance through this nonparametric Bayesian journey; not only for coming up with the original idea for the thesis, but also for invaluable advice at the various stages of the project. I especially appreciate his ability to immediately enter any discussion on the strategic as well as technical issues related to my work, and his preference for suggestions rather than directions (which makes me comfortable with the term 'my work').
I am also grateful for the support of Solveig, my loving, caring wife. She has shown a genuine interest in what I'm doing, pushing me forward with her strong ambitions on my behalf. She has overlooked my superficiality in domestic matters and has been left with the lion's share of the responsibility of looking after our eight-month-old daughter Ingvild, whose enthusiasm about my return at the end of the day made me forget all unsolved problems.
Contents
List of Figures
List of Tables
1 Introduction
1.1 Introduction and summary
1.2 Kernel density estimation
1.2.1 Estimation of the density function
1.2.2 Estimation of the density function derivative
1.2.3 Scaling of kernels
1.3 Bayesian histogram and the Dirichlet process
1.3.1 Bayesian histogram
1.3.2 Smoothed Dirichlet processes
1.4 Abbreviations and conventions
2 The local likelihood
2.1 Motivation and justification of the local likelihood
2.1.1 Hazard rate motivation
2.1.2 Kernel weighted likelihood motivation
2.2 The generalized local likelihood used as data model in Bayesian calculations
2.2.1 The $\bar K(0) = 1$ suggestion
2.2.2 The variance corrected suggestion
2.2.3 The 'general kernel - local constant' argument
2.2.4 A "true" local likelihood
2.3 The modified local likelihood of this paper
2.3.1 The final decision on local likelihood
2.3.2 The TEU kernels
3 The local constant model: Basic construction
3.1 Local Bayesian calculations
3.2 The joint prior and posterior distributions
3.3 Further correspondence between the histogram and the LC estimator
3.4 A closer look at the confidence parameter c and window width h
3.4.1 Confidence measured in data point equivalents
3.4.2 Confidence related to the density function
3.4.3 Pros & cons of the different views
3.5 Generalization to other kernels
3.5.1 Confidence measured in data point equivalents
3.5.2 Confidence related to the density function
3.6 The generalized construction
4 The local constant model: Empirical Bayes and hierarchical Bayes
4.1 Empirical Bayes: Automatic estimation of the confidence parameter
4.1.1 Moment estimation
4.1.2 Maximum likelihood estimation
4.2 Some plausible confidence functions
4.3 Modelling of confidence with various functional forms
4.3.1 Direct fitting to the weight function
4.3.2 Generalized method of moments
4.3.3 Joint likelihood
4.4 Hierarchical Bayes: Prior guess at parametric family
4.4.1 Distribution of data
4.4.2 Simplifications
4.5 A few examples
5 The 'prior guess times a local constant' model
5.1 Local calculations
5.2 Estimation of empirical confidence
5.3 Correspondence with the local constant model
6 MISE comparison between the kernel-, LC- and PGLC estimators
6.1 The MISE measure
6.2 The simulation algorithm
6.3 Precision of the simulated MISE values
6.4 Simulation results
6.5 Asymptotics
7 Extensions and improvements
7.1 The local log-linear model
7.2 A justification of wide windows in the LC estimator
7.3 A model involving local level, slope and curvature
7.4 Concluding comments and future work
A Kernel estimation of density derivative
B A derivation of the local likelihood
C Further smoothing of the gamma process
D Approximate marginal distributions of data
E Optimization
Bibliography
List of Figures
2.1 Three kernels, triangular and narrow and wide uniform.
2.2 Some TEU kernels, degree = 1, 2, 4, 6, 8, 12, 20, 50, $\infty$.
2.3 Kernel density estimates with different TEU kernels.
3.1 Density estimate, uniform kernel, N(0,1) guess, specified confidence.
3.2 Priors and posteriors from a uniform kernel.
3.3 Realizations of gamma- and Dirichlet processes.
3.4 Density estimate, Epanechnikov kernel, specified guess and confidence.
3.5 Posteriors from an Epanechnikov kernel.
4.1 Weight functions from pointwise empirical Bayes procedures.
4.2 Density estimates based on pointwise empirical Bayes weights, specified guess.
4.3 Some confidence functions for parametric normal- and mixed confidence.
4.4 Eight confidence functions, constant and parametric normal confidence.
4.5 Density estimate, generalized method of moments, specified guess.
4.6 Density estimate, family guess, empirical confidence.
4.7 10 estimates drawn from posterior distribution, parametric family.
4.8 Constant empirical confidence in 10 000 draws.
4.9 The six normal mixture test densities.
4.10 Prior guess normal family, true density Gaussian (#1).
4.11 Prior guess normal family, true density moderately kurtotic (#B).
4.12 Prior guess normal family, true density skewed unimodal (#2).
4.13 Prior guess normal family, true density bimodal (#6).
4.14 Prior guess normal family, true density skewed bimodal (#8).
4.15 Prior guess normal family, true density separated bimodal (#7).
5.1 Density estimate, LC and PGLC, normal family guess, empirical confidence.
7.1 Bias with the log-linear local model.
List of Tables
1.1 Abbreviations and conventions.
2.1 Properties of some TEU kernels.
4.1 Description of the six normal mixture test densities.
6.1 Confidence interval for MISE, case #1, n = 15.
6.2 Hypothetical standard deviation estimates for differences between MISE estimates, case #1, n = 15.
6.3 Correct standard deviation estimates for differences between MISE estimates, case #1, n = 15.
6.4 MISE simulations for the Gaussian distribution (#1).
6.5 MISE simulations for the moderately kurtotic distribution (#B).
6.6 MISE simulations for the skewed unimodal distribution (#2).
6.7 MISE simulations for the bimodal distribution (#6).
6.8 MISE simulations for the skewed bimodal distribution (#8).
6.9 MISE simulations for the separated bimodal distribution (#7).
Introduction
The purpose of this chapter is to give a brief introduction to some aspects of nonparametric density estimation. Section 1.1 summarizes the main ideas of this paper, section 1.2 reviews the properties of the familiar kernel density estimator, and section 1.3 takes a look at the Bayesian histogram. Finally, section 1.4 lists the general notation used in this paper.
1.1 Introduction and summary
The origin of this cand.scient. thesis is an article written by Nils Lid Hjort, 'Bayesian approaches to non- and semiparametric density estimation' (Hjort 1996a), which presents three different approaches to estimation of densities. We will be concerned with the last of Hjort's suggestions, utilizing a local likelihood construction.
The Bayesian approach treats model parameters as random variables with prior probability distributions. When the parameter space is infinite dimensional it can be hard to construct reasonable prior distributions; the choices are less obvious and we must expect a broad range of possible solutions. Where subjective reasoning is needed we make an effort to defend our choices as plausible.
Construction of the estimators will follow the scheme outlined in Hjort (1994b) which deals with Bayesian nonparametric regression. We repeat it in a version adapted to density estimation:
i. Specify a global start density $f_0(x)$. This is our prior guess of the density $f$.
ii. Specify a local parametric model $f(t; \theta)$ to be used for $t$ close to $x$, i.e. for $t$ in $C(x) = [x - h/2, x + h/2]$ for a suitable bandwidth $h$.
iii. For each given $x$ place a prior distribution $\pi$ on the parameters in the local model, in accordance with the prior guess and prior beliefs.
iv. Do the local posterior calculations, employing the local likelihood $L_n(x; \theta)$. Our Bayes estimate is the local posterior mean.
v. Obtain estimates of the hyperparameters in $\pi$ by employing empirical Bayes tools.
vi. Introduce a hierarchical approach to account for uncertainty in the prior guess by using a background parameter $\xi$ with a background prior. The prior guess is then $f_0(x; \xi)$.
One main reason for using Bayesian methodology is that it provides tools for incorporating prior information, such as previous research or expert opinions, in the model. In (i) we assume that such information is available in the form of a completely specified prior guess $f_0$. As local models demanded by (ii) we will mainly be concerned with the 'local constant' (LC) and the 'prior guess times a local constant' (PGLC) models, but we will also discuss the log-linear, linear and quadratic local models. The local model plays the role of 'an adequate local description of data', and contains the parameters on which the Bayesian calculations are carried out. The LC estimator and the PGLC estimator are scrutinized in chapters 3 and 5, respectively, whereas chapter 7 touches on more complicated local models.
What is most important when specifying a prior distribution in (iii) is to have a broad enough support for $\pi$ to be guaranteed to include all possible values of the parameters. A characteristic of the general Bayesian programme is that once a parameter region is ruled out by the prior distribution, the posterior distribution puts no probability mass in this region, no matter how strongly data point to it as the region of interest. In our one-parametric models we use gamma distributions to allow density values in $[0, \infty)$. The prior guess $f_0(x)$ typically enters the arena as the expectation in the distribution $\pi$ for each given $x$.
In (iv) the central ingredient in going from prior to posterior distribution is the local likelihood. We immediately admit that we have no traditional well-defined likelihood except in special cases, but we reach a reasonable construction that does the job and can be regarded as an approximation. This key issue is addressed in chapter 2 where we adapt the local likelihood of Hjort & Jones (1996) to our Bayesian programme. For a given $x$ the joint distribution of the local $\theta$ and data is

$$\pi(\theta) L_n(x; \theta), \tag{1.1}$$

and the posterior distribution is then

$$\theta \mid \text{data} \sim \frac{\pi(\theta) L_n(x; \theta)}{\int \pi(\theta) L_n(x; \theta)\, d\theta}.$$

Our estimate is calculated as

$$\hat f(x) = E\{f(x; \theta) \mid \text{data}\},$$

where the expectation is taken over the posterior distribution of the parameters.
In an ideal situation steps (i)-(iv) are sufficient to come up with an estimate. However, the strong assumptions of a fully specified prior guess in (i) and a fully specified prior distribution in (iii) limit the scope of our methods considerably. That is why two additional steps are added to our programme; for the LC model this is outlined in detail in chapter 4.
The centre of the prior distribution is obtained from $f_0$, but we still lack a specification of spread, which expresses the confidence in our prior guess. In (v) we will try to assess confidence automatically in an empirical Bayes routine. Empirical confidence is then related to the agreement between the prior guess and data.
The last step (vi) allows guesses at parametric families $f_0(x; \xi)$, which is the flexibility we need. The price to be paid is the introduction of a background prior $\pi_0$. For each global parameter $\xi$ an estimator $\hat f(x; \xi)$ arises in the above manner. Finally, averaging over $\xi$ gives the estimator

$$\hat f(x) = E\{\hat f(x; \xi) \mid \text{data}\} = \int \hat f(x; \xi)\, \pi_0(\xi \mid \text{data})\, d\xi.$$
Our final estimator will need as input a parametric family prior guess and a bandwidth h.
In addition some tailoring is required for the prior and posterior distributions of the global parameters in the parametric family. We will carry this out for the normal family.
Asymptotic results of the traditional form are harder to obtain for our density estimators, as our estimators involve both local and global parameters. Typically, asymptotic behaviour is not as interesting in Bayesian models as data then completely dominate the prior information.
However, this is not a general truth in nonparametric Bayesian analysis where the picture is more complicated and we are not even guaranteed consistency, see Diaconis & Freedman (1986a,b). In chapter 6 we do a finite n comparison of the kernel estimator, the parametric normal estimator and some of our Bayesian estimators in six special cases. We also show that the Bayesian estimators behave well asymptotically.
Other related work
A milestone in the terrain of Bayesian nonparametric density estimation is the paper by Thomas S. Ferguson, 'A Bayesian analysis of some nonparametric problems' (Ferguson 1973).
As mentioned above Hjort suggests two additional schemes, one based on a nonparametric Dirichlet prior built around a parametric model, the other based on orthogonal Hermite expansions. Escobar & West (1995) use mixtures of Dirichlet processes in their estimation scheme. Hartigan (1996) and Andreev & Arjas (1996) address the issue of Bayesian histograms. Petrone (1996) approximates the Dirichlet process with Bernstein polynomials to obtain smoother distributions than the Dirichlet, which is discrete with probability 1.¹
Verdinelli & Wasserman (1996) start with a parametric model and describe departures from this with Gaussian processes as a central ingredient, a scheme similar to Hjort's expansions.
Fosen (1996) investigates some estimators originating from a local likelihood in a frequentist setting, building on Hjort & Jones (1996) and Loader (1996).
1.2 Kernel density estimation
The histogram is the most frequently used nonparametric density estimator in the history of statistics. Nevertheless, the kernel density estimator, with roots back to the 19th century, has better statistical properties. The method is well understood and several good introductory books are available, e.g. Silverman (1986), Scott (1992) and Wand & Jones (1995).
1.2.1 Estimation of the density function
Let $X_1, \ldots, X_n$ be i.i.d. with density function $f$. The kernel density estimate $f_n$ is a smoothed version of the empirical distribution $(1/n)\sum \delta(x_i)$ where $\delta(x_i)$ is unit point mass at $x_i$. Instead of putting probability mass $1/n$ in each $x_i$, $f_n$ places this probability mass close to each $x_i$, resulting in an absolutely continuous distribution,

$$f_n(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\Big(\frac{x_i - x}{h}\Big), \tag{1.2}$$

where $K$ is typically a symmetric unimodal density function centred at the origin. The uniform kernel $K(t) = I(-1/2 \le t \le 1/2)$ gives $nh f_n(x) = \#(x_i \in [x - h/2, x + h/2])$.

¹ This is shown in Ferguson (1973).

Note that (1.2) is a mean of i.i.d. random variables (for each $x$); a Taylor expansion of bias and variance gives the AMISE (Asymptotic Mean Integrated Squared Error),

$$E f_n(x) \approx f(x) + \tfrac{1}{2} h^2 \sigma_K^2 f''(x), \tag{1.3}$$

$$\mathrm{Var}\, f_n(x) \approx \frac{R(K) f(x)}{nh} - \frac{f(x)^2}{n}, \tag{1.4}$$

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \tfrac{1}{4} h^4 \sigma_K^4 R(f''), \tag{1.5}$$
where $\sigma_K^2$ is the variance of a random variable with density function $K$ and $R(g) = \int g(u)^2\, du$ is used as general notation. The asymptotic expression (1.5) holds true under mild regularity conditions, see Scott (1985) for details. Epanechnikov (1969) shows that the AMISE-minimizing kernel $K$ is the Epanechnikov kernel $K(t) = (3/2)(1 - 4t^2)_+$ (or any scaled version of this). The AMISE-minimizing $h$-value and corresponding AMISE are

$$h^* = \left[\frac{R(K)}{\sigma_K^4 R(f'')\, n}\right]^{1/5}, \tag{1.6}$$

$$\mathrm{AMISE}^* = \tfrac{5}{4}\,[\sigma_K R(K)]^{4/5} R(f'')^{1/5} / n^{4/5}. \tag{1.7}$$

Our use of (1.6) will be as bandwidth selector when comparing the Bayesian estimators to the kernel estimator. If we were in a position to use the AMISE-optimal bandwidth we would get the limiting distribution

$$\sqrt{nh}\,(f_n(x) - f(x)) \to_D N\Big(\tfrac{1}{2} f''(x) \sqrt{R(K)/R(f'')},\; R(K) f(x)\Big), \tag{1.8}$$

which is essentially the Lindeberg extension of the central limit theorem, see Hjort (1980) for details. We keep the bias term because $E\sqrt{nh}\,(f_n(x) - f(x))$ is approximately $\sigma_K^2 f''(x) \sqrt{nh^5}/2$ and $nh^5$ is the constant $R(K)/[\sigma_K^4 R(f'')]$. It is not at all clear that choosing $h \propto n^{-1/5}$ is optimal in the Bayes world (though it is likely to be a good starting point); West (1991) suggests a rate of $n^{-1/2}$, based on a coherence argument in a formal Bayesian framework.
Hjort (1996a) gives a different argument for the same rate. This gives smaller bandwidths for large n. As the number of data points increases, we can afford to take a smaller portion of them into account. The rationale for doing so is that we seldom trust the local model to be a completely correct description.
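As a hedged illustration of (1.2) and (1.6), the following Python sketch evaluates the kernel estimate with kernels scaled to $[-1/2, 1/2]$ and computes the AMISE-optimal bandwidth. It assumes, purely for illustration, that the true density is standard normal, for which $R(f'') = 3/(8\sqrt{\pi})$; the function names, the toy data set and the midpoint-rule integrator are hypothetical choices and not part of the thesis.

```python
import math

def uniform_kernel(t):
    """Uniform kernel on [-1/2, 1/2]."""
    return 1.0 if -0.5 <= t <= 0.5 else 0.0

def epanechnikov_kernel(t):
    """Epanechnikov kernel scaled to [-1/2, 1/2]: (3/2)(1 - 4 t^2)_+."""
    return max(0.0, 1.5 * (1.0 - 4.0 * t * t))

def kde(x, data, h, kernel):
    """Kernel density estimate (1.2): (1/n) sum (1/h) K((x_i - x)/h)."""
    return sum(kernel((xi - x) / h) for xi in data) / (len(data) * h)

def integrate(g, lo=-0.5, hi=0.5, steps=20000):
    """Midpoint rule on [lo, hi]; adequate for kernels on a bounded support."""
    w = (hi - lo) / steps
    return w * sum(g(lo + (j + 0.5) * w) for j in range(steps))

def amise_bandwidth(kernel, n, r_f2):
    """AMISE-optimal h from (1.6): [R(K) / (sigma_K^4 R(f'') n)]^(1/5)."""
    r_k = integrate(lambda t: kernel(t) ** 2)
    var_k = integrate(lambda t: t * t * kernel(t))
    return (r_k / (var_k ** 2 * r_f2 * n)) ** 0.2

# Assumption: standard normal truth, so R(f'') = 3 / (8 sqrt(pi)).
R_F2_NORMAL = 3.0 / (8.0 * math.sqrt(math.pi))

# Hypothetical toy data, not taken from the thesis.
data = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.35, 0.9, 1.7]
h_star = amise_bandwidth(epanechnikov_kernel, len(data), R_F2_NORMAL)

# For the uniform kernel, n h f_n(x) counts the points in [x - h/2, x + h/2].
count_in_window = len(data) * 1.0 * kde(0.0, data, 1.0, uniform_kernel)
```

With the window $[-1/2, 1/2]$ around $x = 0$ the uniform-kernel count is the number of data points in that interval, which gives a quick sanity check of the scaling convention.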
1.2.2 Estimation of the density function derivative
The kernel method can also be used to estimate derivatives. The straightforward solution is to use $f_n^{(r)}$ as an estimate of $f^{(r)}$, i.e. we use the $r$-th derivative of the density estimate to estimate the $r$-th derivative of the density. With a standard normal kernel $\varphi$ we then get $n^{-1}\sum h^{-3}(x_i - x)\varphi(h^{-1}(x_i - x))$ as estimate of $f'$. Using this expression for a general kernel we get

$$g_n(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^3}(x_i - x)\, K\Big(\frac{x_i - x}{h}\Big) \tag{1.9}$$

as an estimator of the $f'$-proportional quantity $\sigma_K^2 f'$ (with a standard normal kernel $\sigma_K^2 = 1$). A Taylor expansion gives the approximate moments of $g_n$,

$$E g_n(x) \approx \sigma_K^2 f'(x) + \tfrac{1}{6} h^2 f'''(x) \int w^4 K(w)\, dw, \tag{1.10}$$

$$\mathrm{Var}\, g_n(x) \approx f(x) \int w^2 K(w)^2\, dw \,/\, (nh^3),$$

see appendix A for a derivation of these results. These are similar to but different from the corresponding results for $f_n'$.
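The relation between (1.9) and the derivative of the kernel estimate can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the standard normal kernel, for which $g_n$ coincides with $f_n'$, and compares $g_n$ with a central finite difference of $f_n$. The data, the evaluation point and the step size `eps` are hypothetical.

```python
import math

def normal_kernel(t):
    """Standard normal density, for which sigma_K^2 = 1."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel):
    """Kernel density estimate (1.2)."""
    return sum(kernel((xi - x) / h) for xi in data) / (len(data) * h)

def g_n(x, data, h, kernel):
    """Derivative-type estimate (1.9): (1/n) sum h^-3 (x_i - x) K((x_i - x)/h)."""
    return sum((xi - x) * kernel((xi - x) / h) for xi in data) / (len(data) * h ** 3)

# Hypothetical toy data and settings, chosen only for the check.
data = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.35, 0.9, 1.7]
x, h, eps = 0.5, 0.8, 1e-5

# Central finite difference of f_n as an independent check of g_n.
finite_difference = (kde(x + eps, data, h, normal_kernel)
                     - kde(x - eps, data, h, normal_kernel)) / (2.0 * eps)
direct = g_n(x, data, h, normal_kernel)
```

For kernels other than the normal, $g_n$ is not the derivative of $f_n$, so this agreement is specific to the kernel assumed here.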
Later we will experience that $f_n$ and $g_n$ appear as natural sufficient statistics in our calculations, which is why we prefer $g_n$ to $f_n'$.
1.2.3 Scaling of kernels
The scale of the kernel is not important; by tuning the parameter $h$ we get to any scaled version of the kernel $K$. However, we will scale kernels to have support on $[-1/2, 1/2]$, which can be done since we in this paper will restrict our attention to kernels on a bounded support. This is comfortable when we give the kernel an interpretation in a Bayesian context.
Consider the uniform kernels $K_1(t) = I(|t| < 1/2)$ and $K_2(t) = (1/2) I(|t| < 1)$. Then $K_1(t) = 2K_2(2t)$ and all estimates made with one kernel can be repeated with the other. However, in $K_1([t - x]/h)$ the window width is $h$ whereas in $K_2([t - x]/h)$ the window width is $2h$. In later chapters we will pay much attention to cells (or windows) and then it is convenient to be able to use the terms 'window width', 'cell width', 'band width' and '$h$' interchangeably, without worrying about scale.
Extensions to kernels on unbounded supports are possible. Then we need another definition of "window". Above we implicitly use 'the smallest interval containing all (100%) of the probability mass of $K$'. With infinite supports we can change to e.g. 'the smallest interval containing 95% of the probability mass of $K$', or some other fraction.
For the rest of the paper 'scaling' will have the interpretation we give it in this section, whereas 'normalization' will mean multiplying the kernel with a constant. Then 'normalized' kernels do not necessarily integrate to 1.
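The claim that estimates made with one uniform kernel can be repeated with the other may be illustrated as follows. This is a minimal sketch with hypothetical data: since $K_1(t) = 2K_2(2t)$, the estimate with $K_1$ and bandwidth $h$ should equal the estimate with $K_2$ and bandwidth $h/2$.

```python
def k1(t):
    """Uniform kernel K_1 with support (-1/2, 1/2)."""
    return 1.0 if abs(t) < 0.5 else 0.0

def k2(t):
    """Uniform kernel K_2 with support (-1, 1)."""
    return 0.5 if abs(t) < 1.0 else 0.0

def kde(x, data, h, kernel):
    """Kernel density estimate (1.2)."""
    return sum(kernel((xi - x) / h) for xi in data) / (len(data) * h)

# Hypothetical toy data and evaluation grid.
data = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.35, 0.9, 1.7]
h = 1.0
grid = [-2.0 + 0.13 * j for j in range(31)]

with_k1 = [kde(x, data, h, k1) for x in grid]        # window width h
with_k2 = [kde(x, data, h / 2.0, k2) for x in grid]  # same window, bandwidth h/2
```

Both lists describe the same window of width $h$, which is why the convention of scaling every kernel to $[-1/2, 1/2]$ lets 'bandwidth' and 'window width' be used interchangeably.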
1.3 Bayesian histogram and the Dirichlet process
The histogram is the simplest nonparametric density estimator. One could argue how nonparametric a specific histogram is, but embedded in a sequence of histograms where the cell width declines to zero as the number of observations grows it is fair to regard it as nonparametric.
Scott (1992, page 47) says that there is no difference in the roles of a histogram as a data summary and as a density estimator. We claim that in the Bayesian world this no longer holds true. When making a Bayesian density estimate we should put available prior information into the model. Still we may want to use a frequentist histogram as a pure data summary.
Hartigan (1996) and Andreev & Arjas (1996) deal with Bayesian histograms.
1.3.1 Bayesian histogram
We have the data set $x_1, \ldots, x_n$, i.i.d. from $f$. Our task is to estimate $f$ with a Bayesian histogram. Let $C(x_1') \cup \cdots \cup C(x_m')$ be a partition of the sample space, where $C(x_i') = [x_i' - h_i/2, x_i' + h_i/2]$. For each of these cells employ a local constant model $f(t; \theta) = \theta_i$ for $t \in C(x_i')$, where $\theta = (\theta_1, \ldots, \theta_m)$. It is reasonable and convenient to place a Dirichlet prior on the local constants because of the restrictions

$$\theta_i \ge 0, \qquad \sum_{i=1}^m h_i \theta_i = 1,$$

where $h_i$ is the length of cell $i$.² Our initial guess is $f = f_0$. If we let

$$\int_{C(x_i')} f_0(x)\, dx = P_0(x_i') = p_i$$

we can set up a prior distribution based on prior beliefs by letting

$$h\theta \sim \mathrm{Dir}(ap_1, \ldots, ap_m) = \mathrm{Dir}(ap),$$

where $h\theta = (h_1\theta_1, \ldots, h_m\theta_m)$. This gives correct expectation $E h_i\theta_i = p_i$, $\mathrm{Var}(h_i\theta_i) = p_i(1 - p_i)/(a + 1)$ and $\mathrm{Cov}(h_i\theta_i, h_j\theta_j) = -p_i p_j/(a + 1)$ for $i \ne j$. Then $a$ is a measure of strength of belief in our initial guess $f_0$. What is relevant to the histogram is the number of observations in each cell, with likelihood

$$Y \mid \theta \sim \mathrm{multin}(n, h\theta),$$

where $Y = (Y_1, \ldots, Y_m)$ and $Y_i$ is the number of observations in cell $C(x_i')$. The posterior distribution is a new Dirichlet,

$$\mathcal{L}(\theta \mid \text{data}) \propto \mathrm{Dir}(ap + y),$$

assuming all $h_i$'s are equal. As estimator we use posterior expectation,

$$\hat f(x) = E(\theta_i \mid \text{data}) = \frac{a}{a + n}\, P_0(x_i')/h + \frac{n}{a + n}\, f_n(x_i'), \qquad x \in C(x_i'),$$

which is the Bayes estimator under quadratic loss. Here $f_n$ is the uniform kernel density estimate (recall that $Y_i = nh f_n(x_i')$). Note that our prior beliefs correspond to $a$ observations with $ap_0$ expected successes. As a density estimate the Bayesian histogram suffers from the fact that it depends on the partition. To avoid this binning of data it is convenient to introduce a stochastic process.
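The posterior-mean formula for the Bayesian histogram can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation used in the thesis: the prior guess $f_0$ is taken to be standard normal, the sample space is truncated to eight unit-width half-open cells on $[-4, 4)$, and the data set and the value of `a` are hypothetical.

```python
import math

def normal_cdf(t):
    """Standard normal distribution function, used here as the prior guess F_0."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def bayes_histogram(data, edges, a):
    """Posterior-mean cell heights for equal-width cells with a Dir(a p) prior on cell masses."""
    n = len(data)
    h = edges[1] - edges[0]
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        p_i = normal_cdf(right) - normal_cdf(left)         # prior mass P_0 of the cell
        y_i = sum(1 for xi in data if left <= xi < right)  # observed cell count
        prior_part = a / (a + n) * p_i / h                 # weight a/(a+n) on the prior guess
        data_part = n / (a + n) * y_i / (n * h)            # weight n/(a+n) on the histogram
        heights.append(prior_part + data_part)
    return heights

# Hypothetical toy data; cells are an assumption made for the example.
data = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.35, 0.9, 1.7]
edges = [-4.0 + j for j in range(9)]
heights = bayes_histogram(data, edges, a=4.0)
total_mass = sum(heights) * (edges[1] - edges[0])
```

Setting `a = 0` recovers the ordinary histogram, while a large `a` pulls every cell height towards the prior guess; the total mass falls slightly short of one only because the normal prior guess is truncated to $[-4, 4)$ in this sketch.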
² With infinite supports $h_1$ and/or $h_m$ are infinite, $\theta_1$ and/or $\theta_m$ are zero, but $h_1\theta_1$ and $h_m\theta_m$ still represent probability mass.

1.3.2 Smoothed Dirichlet processes
In the histogram setting we place a Dirichlet prior on a certain mesh of the distribution function $F(t) = \Pr\{X \le t\}$,

$$(F[C(x_1')], \ldots, F[C(x_m')]) \sim \mathrm{Dir}(aF_0[C(x_1')], \ldots, aF_0[C(x_m')])$$

for each partition $C(x_1') \cup \cdots \cup C(x_m')$. Using the Dirichlet process introduced by Ferguson (1973), we avoid discretising the sample space into cells. Instead we place a prior on the distribution function itself,

$$F \sim \mathrm{DP}(aF_0).$$

The definition is that for each partition $B_1 \cup \cdots \cup B_r$ of the sample space, $(F(B_1), \ldots, F(B_r))$ is Dirichlet with parameters $(aF_0(B_1), \ldots, aF_0(B_r))$. The process is easily updated given data,

$$F \mid \text{data} \sim \mathrm{DP}(aF_0 + nF_n),$$

where $F_n$ is the empirical distribution function. This is proved in Ferguson (1973). Then for each $x$ and cell $C(x)$, the local $h\theta = F[C(x)]$ given data is a beta,

$$B\big(ap_0(x) + nh f_n(x),\; a[1 - p_0(x)] + n[1 - h f_n(x)]\big),$$

giving Bayes estimator

$$\hat f(x) = \frac{a}{a + n}\, p_0(x)/h + \frac{n}{a + n}\, f_n(x),$$

valid for all $x$ simultaneously. We are relieved not to be restricted to operate with bins, but still the results are valid for uniform kernels and the local constant model only. This smoothing of the Dirichlet process gives an absolutely continuous distribution as estimate, but the density function itself inherits the discontinuities from $f_n$.
1.4 Abbreviations and conventions
Throughout this paper we use some symbols and abbreviations repeatedly. They will usually be explained the first time we run into them; in table 1.1 we give an overview of the most important ones.
Symbol / abbreviation        Meaning
$k_0$                        $K(0)$ for a kernel $K$
$R(g)$                       $\int g(u)^2\, du$
$R_K$                        $R(K)$
(symbol illegible in source) $nh/R_K$
$\sigma_K^2$                 $\mathrm{Var}\, Y$ where $Y \sim K$
(symbol illegible in source) expression of the form $\phi(z)\{1 + z^4\}$ in a standardized argument $z$ (normalizing factor illegible in source); $\phi$ is standard normal
$f_n(x)$                     $\frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\big(\frac{x_i - x}{h}\big)$, density estimate
$f_{n,h}^K$                  $f_n$ produced from kernel $K$ and bandwidth $h$
$g_n(x)$                     $\frac{1}{n}\sum_{i=1}^n \frac{1}{h^3}(x_i - x) K\big(\frac{x_i - x}{h}\big)$, $\propto$ derivative estimate
$X =_D Y$                    $X$ and $Y$ follow the same distribution
i.i.d.                       independent and identically distributed
U                            Uniform distribution (or density, kernel)
Dir                          Dirichlet distribution
DP                           Dirichlet process
G                            Gamma distribution
GP                           Gamma process with independent increments
ML                           Maximum Likelihood
LC                           Local Constant model
LC-ML                        LC model with a ML prior guess
PGLC                         Prior Guess times a Local Constant model
PGLC-ML                      PGLC model with a ML prior guess
ISE                          Integrated Squared Error
MISE                         Mean (expectation) of ISE
AMISE                        Asymptotic MISE
Table 1.1: Abbreviations and conventions
The local likelihood
There have recently been several efforts to derive and justify a local likelihood for density estimation. Hjort (1996a) bases his argument on a hazard rate parallel, Hjort & Jones (1996) use a locally weighted Kullback-Leibler distance measure whereas Loader (1996) derives a different global likelihood than the ordinary and localizes this.
In a Bayesian context the problem is more complex than in a frequentist application.
The likelihood brings data information into the model and is therefore required to carry information properly; in particular it is important to get a "correct" balance between prior- and data information. Our local likelihood will serve as data model and must fit into the Bayesian framework, whereas the (pragmatic) frequentist is content with a likelihood that produces good (frequentist) estimates via maximization.
We will use the likelihood in Hjort (1996a) with a slight modification. We argue that this modification is reasonable when using small bandwidths (trying to be nonparametric), whereas Hjort's construction is preferred when using larger bandwidths (semi- or fully parametric approach). In Hjort (1996b) there is a parallel result for hazard rates that avoids the asymptotic argument that is required in the derivation of the local likelihood for densities.
In section 2.1 we review the essential local likelihood ingredients of Hjort, Hjort & Jones and Loader, whereas sections 2.2 and 2.3 establish the local likelihood of this paper.
2.1 Motivation and justification of the local likelihood
In the i.i.d. situation the full likelihood takes the form

$$\prod_{i=1}^n f(x_i; \theta),$$

which can be used if we have a global data model. To get a local likelihood the straightforward modification

$$\prod_{x_i \in C(x)} f(x_i; \theta)$$

does not work. This modified likelihood is no useful data model and gives meaningless estimates in a frequentist approach. To see this let $f(t; \theta)$ be the constant $\theta$ over $C(x)$; then the modification gives $\theta^{Y(x)}$ as local likelihood with a uniform kernel, where $Y(x)$ is the number of observations in $C(x)$. This expression increases without bound in $\theta$ whenever $Y(x) > 0$, so it cannot be maximized sensibly. A better suggestion is to use information about $x_i$'s to the right of $x - h/2$ and what happens to these during $C(x)$, as in Hjort (1996a) (or alternatively the reverse argument).
2.1.1 Hazard rate motivation
For $t \in C(x)$ the appropriate distribution is the conditional distribution given $X > x - h/2$. For $X > x + h/2$ we just record the probability of this survival given $X > x - h/2$. We ignore $x_i$'s less than $x - h/2$ since they give no information about our conditional distribution. Define the (parametric) survival function $S(s; \theta) = \Pr_\theta(X > s)$. Then the local likelihood components take the form

$$\begin{cases} \dfrac{f(t; \theta)}{S(x - h/2; \theta)}, & t \in C(x), \\[2ex] \dfrac{S(x + h/2; \theta)}{S(x - h/2; \theta)}, & t > x + h/2. \end{cases}$$

The local likelihood at $x$ is

$$L_{0,n}(x; \theta) = \prod_{x_i \in C(x)} \frac{f(x_i; \theta)}{S(x - h/2; \theta)} \prod_{x_i > x + h/2} \frac{S(x + h/2; \theta)}{S(x - h/2; \theta)}$$

$$= \Big\{\prod_{x_i \in C(x)} f(x_i; \theta)\Big\} \exp\Big(-n \int_{C(x)} \Big\{\log[S(t; \theta)]\, dF_n(t) + S_n(t) \frac{f(t; \theta)}{S(t; \theta)}\, dt\Big\}\Big). \tag{2.1}$$

A complete derivation of (2.1) is found in appendix B. We are not yet satisfied due to the global parametric component $S(t; \theta)$. The $f(t; \theta)$ component is a local description to be used in $C(x)$ and is fully acceptable. We localize the parametric description by replacing the (global) parametric $S(t; \theta)$ with the nonparametric estimator $S_n(t) = (1/n) \sum I(x_i > t)$,

$$\Big\{\prod_{x_i \in C(x)} f(x_i; \theta)\Big\} \exp\Big(-n \int_{C(x)} f(t; \theta)\, dt\Big),$$

discarding a factor independent of the parameter $\theta$. This should not alter things very much as $S_n$ is a good estimator with squared error of order $O(n^{-1})$. By letting $\bar K(z) = I(z \in [-1/2, 1/2])$ so that $\bar K((x_i - x)/h) = I(x_i \in C(x))$ (i.e. $\bar K$ is the uniform kernel), we arrive at

$$L_n(x; \theta) = \Big\{\prod_{i=1}^n f(x_i; \theta)^{\bar K((x_i - x)/h)}\Big\} \exp\Big(-n \int \bar K\Big(\frac{t - x}{h}\Big) f(t; \theta)\, dt\Big), \tag{2.2}$$

which is the local likelihood to be used in the Bayesian calculations. Note that for large bandwidths $h$

$$L_n(\theta) \approx \Big\{\prod_{i=1}^n f(x_i; \theta)\Big\} e^{-n},$$

which is proportional to the ordinary likelihood.
We want to use (2.2) for other kernels than the uniform, only demanding that $K$ is unimodal and symmetric, and that $\bar K$ is properly normalized, i.e. $\bar K = K/d_K$ for a reasonable $d_K$. For the rest of this paper we restrict attention to kernels supported on $[-1/2, 1/2]$ (see section 1.2.3). It is convenient in the interpretation of general kernels to operate with a bounded support; we return to this later.
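To make (2.2) concrete, the sketch below evaluates the local log-likelihood for the local constant model $f(t; \theta) = \theta$ with the uniform kernel and locates its maximizer on a grid. It is a hedged illustration only: the data, the grid of $\theta$ values and the midpoint-rule integration are hypothetical choices. Under these assumptions the log-likelihood reduces to $Y(x)\log\theta - nh\theta$, so the maximizer should agree with the uniform kernel estimate $f_n(x) = Y(x)/(nh)$.

```python
import math

def uniform_kernel_bar(z):
    """Uniform kernel on [-1/2, 1/2], normalized so that its value at 0 is one."""
    return 1.0 if -0.5 <= z <= 0.5 else 0.0

def local_loglik_constant(theta, x, data, h, kernel_bar, steps=200):
    """Log of (2.2) for the local constant model f(t; theta) = theta."""
    n = len(data)
    # Sum of kernel weights; this multiplies log(theta) in the first factor of (2.2).
    weight_sum = sum(kernel_bar((xi - x) / h) for xi in data)
    # Midpoint rule for the integral of kernel_bar((t - x)/h) dt = h * integral of kernel_bar.
    w = h / steps
    kernel_mass = w * sum(kernel_bar(-0.5 + (j + 0.5) / steps) for j in range(steps))
    return weight_sum * math.log(theta) - n * theta * kernel_mass

# Hypothetical toy data and search grid.
data = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.35, 0.9, 1.7]
x, h = 0.0, 1.0
thetas = [0.005 * k for k in range(1, 401)]

best_theta = max(thetas,
                 key=lambda th: local_loglik_constant(th, x, data, h, uniform_kernel_bar))
f_n = sum(uniform_kernel_bar((xi - x) / h) for xi in data) / (len(data) * h)
```

In contrast to the naive product over $x_i \in C(x)$, the exponential factor penalizes large $\theta$, which is what makes the maximization well defined.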
2.1.2 Kernel weighted likelihood motivation
Let us digress for a second and see how things go in the frequentist world. We can rewrite the local loglikelihood as

$$\log L_n(x; \theta) = \sum_{i=1}^n \bar K\Big(\frac{x_i - x}{h}\Big) \log f(x_i; \theta) - n \int \bar K\Big(\frac{t - x}{h}\Big) f(t; \theta)\, dt$$

$$= n \int \bar K\Big(\frac{t - x}{h}\Big) \big(\log f(t; \theta)\, dF_n(t) - f(t; \theta)\, dt\big).$$

It follows that

$$\frac{1}{n} \log L_n(x; \theta) \to_p \int \bar K\Big(\frac{t - x}{h}\Big) \{f(t) \log f(t; \theta) - f(t; \theta)\}\, dt = H_x(\theta)$$

when $h > 0$ is fixed. Maximization of $H_x(\theta)$ corresponds to minimization of

$$d_x(f, f(\cdot; \theta)) = \int \bar K\Big(\frac{t - x}{h}\Big) \Big\{f(t) \log \frac{f(t)}{f(t; \theta)} - (f(t) - f(t; \theta))\Big\}\, dt$$

$$= \int \bar K\Big(\frac{t - x}{h}\Big) \{f(t) \log f(t) - f(t)\}\, dt - H_x(\theta),$$

which is a locally weighted version of the Kullback-Leibler distance

$$\int f(t) \log \frac{f(t)}{f(t; \theta)}\, dt = \int \Big\{f(t) \log \frac{f(t)}{f(t; \theta)} - (f(t) - f(t; \theta))\Big\}\, dt. \tag{2.3}$$

To the frequentist this argument from Hjort & Jones (1996) gives some assurance of the right to view (2.2) as a local likelihood for a general kernel. They also provide additional justification arguments for the local likelihood, one based on score functions and the other based on a parallel to nonparametric regression. The Bayesian wants to use (2.2) as a model for local data and is not yet satisfied. Loader (1996) argues that the full likelihood actually takes the form

$$\Big\{\prod_{i=1}^n f(x_i; \theta)\Big\} \exp\Big(-n \int f(t; \theta)\, dt\Big), \tag{2.4}$$

where the last constant is usually discarded. We can view (2.2) as a localized version of (2.4); then a general kernel corresponds to localizing with a different kernel than the uniform.
We will return to the issue of general kernels later in this chapter and in sections 3.4 and 3.5 where we complete the discussion in a Bayesian context. Kernel generalization has implications for the cell width $h$ and the balance between prior and data in the Bayesian calculations.
The conclusion to this preliminary discussion is that we have a well-founded local likelihood for uniform kernels whereas there is less support for the general kernel case.
2.2 The generalized local likelihood used as data model in Bayesian calculations
Hjort's hazard rate argument is the only one that can be expected to result in a true likelihood, but then only for the uniform kernel. With a general kernel there is no distributional interpretation as 'the conditional distribution given $X > x - h/2$'. Locally weighting the Kullback-Leibler distance and the global likelihood more or less incidentally give a true data model for the uniform kernel, but then in the pure forms (2.3) and (2.4) only. In a frequentist application the lack of legitimacy of (2.2) as a genuine likelihood is less worrying than in a Bayesian approach where we use the likelihood not only as a measure of information but as a data model. Nevertheless such a likelihood can be motivated and seen to be fruitful.
We want to use the local likelihood for other kernels than the uniform, but as noted above this generalization is not straightforward. First note that in a frequentist approach the normalizing of the kernel makes no difference whereas in the Bayesian calculations this is a crucial issue. If we use the kernel $\bar K' = a\bar K$ in the local likelihood (2.2) we get

$$L_n(x; \theta)' = (L_n(x; \theta))^a$$

so that $\log L_n' = a \log L_n$. When maximizing the likelihood both $L_n'$ and $L_n$ produce the same maximizer, but in a Bayesian calculation we can increase the influence from data by increasing $a$ and vice versa, see (1.1).
2.2.1 The $\bar K(0) = 1$ suggestion
It is intuitively appealing to normalize kernels to have $\bar K(0) = 1$, i.e. work with kernels of the form $\bar K = K/k_0$ where $K$ is unimodal and symmetric. (We use $K(0) = k_0$ as general notation.) We interpret this as giving weight one in central areas and smaller weights just outside. For large $h$ we get a global likelihood proportional to the ordinary likelihood. This is what Hjort suggests (Hjort 1996a). However, this normalization has some unpleasant implications.
In chapter 3 we deal with the local constant model $f(t; \theta) = \theta$. Then the local likelihood takes the form

$$L_n(x; \theta) = \theta^{nh f_n(x)/k_0} \exp(-nh\theta/k_0),$$

where $f_n$ is the kernel density estimate (1.2). This would mean that $nh f_n(x)/k_0$ follows the Poisson distribution $\mathrm{Po}(nh\theta/k_0)$, which is an approximate likelihood. It is a minor inconsistency in that the Poisson distribution is discrete on the non-negative integers while the true distribution of $f_n(x)$ is typically continuous or a mixture of a discrete and continuous distribution on a bounded support. When calculating the moments of $f_n(x)$ from the implicitly suggested distribution we get

$$E(f_n(x) \mid \theta) = \theta \quad \text{and} \quad \mathrm{Var}(f_n(x) \mid \theta) = \frac{k_0 \theta}{nh}.$$

The local likelihood gives correct expectation to $f_n$, but the variance, at least asymptotically, should have been $R_K\theta/nh$, where $R_K = R(K)$.
2.2.2 The variance corrected suggestion
It is a widely accepted fact in kernel density estimation that the performance of the estimator mainly depends on the choice of bandwidth and is less dependent on the choice of kernel. Inspired by this we try to approximate the distribution of $f_n^K$ for a general kernel $K$ with the distribution of $f_n^U$ for the uniform kernel with a proper bandwidth. The result
$$\sqrt{nh}\,(f_n(x) - f(x)) \to_d N(0, R_K f(x))$$
is (1.8) under the local constant model ($f''(x) = 0$) and gives $N(\theta, R_K\theta/nh)$ as approximate distribution for $f_n(x)$. By tailoring the bandwidths $h_U = h_K/R_K$ we get
$$f^U_{n,h_U} \approx_D f^K_{n,h_K},$$
that is, the two density estimators approximately follow the same distribution. Recalling that for the uniform kernel we have the approximation $n h_U f^U_{n,h_U} \sim \mathrm{Po}(n h_U\theta)$, we get the new approximation $n h_U f^K_{n,h_K} \sim \mathrm{Po}(n h_U\theta)$ or
$$n h_K f^K_{n,h_K}/R_K \sim \mathrm{Po}(n h_K\theta/R_K).$$
This approximate likelihood is proportional to
$$\theta^{nh f_n(x)/R_K} \exp(-nh\theta/R_K)$$
and coincides with what we get from the local likelihood when using $\bar K = K/R_K$. Still the expectation of $f_n$ given $\theta$ is $\theta$, but the variance is $R_K\theta/nh$ which is consistent with the true asymptotic variance.

Figure 2.1: Three kernels, triangular and narrow and wide uniform.

2.2.3 The 'general kernel - local constant' argument
In chapter 3 we show that a reasonable estimator of $f$ based on prior guess and data with a local constant model is
$$\hat f(x) = \frac{c f_0(x) + nh f_n(x)/d}{c + nh/d}$$
where $c$ is a measure of strength of belief in the prior guess $f_0$, and $d$ is the normalizing factor of the kernel. We have presented two alternative $d$s, either $d = k_0$ or $d = R_K$. Note that $\hat f(x)$ is a weighted mean of a prior guess and a kernel density estimate.

Consider the three kernels $K_1(t) = I(|t| < 1/2)$, $K_2(t) = 2I(|t| < 1/4)$ and $T(t) = 2(1 - |2t|)_+$, i.e. $K_1$ and $K_2$ are uniform kernels and $T$ is a triangular kernel. Further assume we use these kernels in their given form ($h = 1$) to produce the $f_n$-estimate. The weight put on $f_n$ is proportional to the quantity $h/d$. Figure 2.1 displays the three kernels. To keep focus on the issue of normalization of kernels we assume the constant model is an adequate local description of $f$.
By choosing $d = k_0$ the $h/d$-quantity takes the values 1 and 1/2 for the kernels $K_1$ and $K_2$, respectively. This is reasonable since $K_2$ on average includes one half of the data points included by $K_1$. The corresponding value for the triangular kernel is 1/2, which gives exactly the same weight as with the narrow uniform kernel. Estimates produced with the $K_2$ and $T$ kernels then get the same weights, whereas intuition tells us the weight put on $T$ should be somewhere between that of $K_1$ and that of $K_2$.
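The $h/d$ values in this comparison can be checked numerically. The following is a minimal sketch, assuming $h = 1$ and a simple midpoint rule for $R_K = \int K(u)^2\,du$; the helper names `integrate` and `weights` are introduced here for illustration only. It computes $h/d$ for both normalizers, $d = K(0)$ and $d = R_K$.

```python
def integrate(fn, lo=-0.5, hi=0.5, steps=20_000):
    """Midpoint rule on [lo, hi]; adequate for the piecewise-smooth kernels here."""
    width = (hi - lo) / steps
    return sum(fn(lo + (i + 0.5) * width) for i in range(steps)) * width

# The three kernels of this section, all supported on [-1/2, 1/2].
def k1(t):   # wide uniform kernel K1
    return 1.0 if abs(t) < 0.5 else 0.0

def k2(t):   # narrow uniform kernel K2
    return 2.0 if abs(t) < 0.25 else 0.0

def tri(t):  # triangular kernel T
    return max(0.0, 2.0 * (1.0 - abs(2.0 * t)))

def weights(kernel, h=1.0):
    """Return h/d for the two normalizers d = K(0) and d = R(K)."""
    k0 = kernel(0.0)
    rk = integrate(lambda t: kernel(t) ** 2)
    return h / k0, h / rk

for name, kernel in [("K1", k1), ("K2", k2), ("T", tri)]:
    by_k0, by_rk = weights(kernel)
    print(f"{name}: h/k0 = {by_k0:.3f}, h/R_K = {by_rk:.3f}")
```

The two uniform kernels get the same weight under either normalizer, so any difference between the normalizers shows up only for the triangular kernel.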
If we use $d = R_K$ the $h/d$-quantities for the uniform kernels remain the same whereas the corresponding value of the triangular kernel is 3/4, which is more in accordance with intuition. What happens is that $K(0)$ only measures the value of the kernel in one point so that quite different kernels have the same normalizing factor. The $R_K$ quantity utilizes more information about $K$.

The triangular kernel is not smooth around zero and is used to make a point. We could repeat the argument by replacing it with the kernel $1 + \cos(2\pi t)$ on $[-1/2, 1/2]$ which has $k_0 = 2$ and $R_K = 3/2$, giving $h/R_K$-value 2/3.

2.2.4 A "true" local likelihood
We may go one step further and find an exact local likelihood under the local constant model by changing point of view. For a moment we forget the local likelihood and take the nonparametric density estimator $f_n$ to be the information carrying statistic. Then we are interested in the distribution of $f_n$ which will be used to update the prior information. For the local constant model with a uniform kernel this is consistent with Hjort's local likelihood in that this likelihood depends on data only through $f_n$.

Let $Y(x) = \#(x_i \in C(x))$ so that $Y$ follows the binomial distribution $\mathrm{Bin}(n, h\theta)$. Then for $Y > 0$
$$nh f_n \mid Y \;=_D\; \sum_{i=1}^{Y} K(U_i) \sim j_Y(\eta)$$
where the $U_i$s are i.i.d. $\sim U(-1/2, 1/2)$ and $\eta = nh f_n$. When $Y = 0$, $f_n$ takes the value zero. Then the distribution of $nh f_n \mid \theta$ is a mixture of point mass $\Pr(Y = 0)$ at zero and the continuous part
$$\sum_{y=1}^{n} j_y(\eta) \Pr(Y = y), \quad \eta > 0.$$
If we use a uniform kernel, $\Pr(\eta = y \mid Y = y) = 1$ (as $nh f_n(x)$ then equals the number of observations in $C(x)$), the $j_y(\eta)$s are replaced with $I(\eta = y)$s and the discrete distribution of $nh f_n$ is $\Pr(nh f_n = \eta) = \Pr(Y = \eta)$. By approximating the binomial probabilities with Poisson probabilities and summing to $\infty$ instead of $n$ we get back the local likelihood (2.2) for the uniform kernel. With a triangular kernel $K(U_i)$ is uniform on $(0, 2)$ and $j_y(\eta)$ is the density for a sum of $y$ such variables. When $K(U)$ has a normal distribution all $j_y(\eta)$ are normal and the continuous part of the distribution is a normal mixture. This is only achievable if we let $K$ be unbounded, but we can use the normal mixture as an approximation for bounded kernels. This approximation is best when the probability of small $Y$s is small.

We can easily obtain the moments of $f_n$. We have
$$E\,K(U) = \int K(u)\,du = 1, \qquad \mathrm{Var}\,K(U) = \int K(u)^2\,du - 1 = R_K - 1 \tag{2.5}$$
which give
$$E(nh f_n \mid Y) = Y, \qquad \mathrm{Var}(nh f_n \mid Y) = Y(R_K - 1).$$
Then the unconditional moments are
$$E(nh f_n \mid \theta) = nh\theta, \qquad \mathrm{Var}(nh f_n \mid \theta) = nh\theta(R_K - h\theta).$$
With a binomial distribution on $Y$ we then get $E(f_n \mid \theta) = \theta$ and $\mathrm{Var}(f_n \mid \theta) = R_K\theta/nh - \theta^2/n$, which fit in with the two first terms of a Taylor expansion of the variance for a general kernel, see (1.4). By passing to the limit (or using the Poisson approximation) the variance is $R_K\theta/nh$, the correct asymptotic expression from (1.4).
Note that (2.5) shows that $R_K \ge 1$ for all kernels supported on $[-1/2, 1/2]$ since we can then interpret $R_K - 1$ as the variance of a variable $K(U)$ where $U$ is uniform on $[-1/2, 1/2]$.
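The unconditional moments of $nh f_n$ under this construction can be checked by simulation. The sketch below is illustrative only: the values of $n$, $h$ and $\theta$ are hypothetical, the triangular kernel $T(u) = 2(1 - |2u|)_+$ with $R_K = 4/3$ is used as the example kernel, and the function names are introduced here. It draws $Y \sim \mathrm{Bin}(n, h\theta)$, sums $Y$ kernel values at uniform points, and compares the sample moments with $nh\theta$ and $nh\theta(R_K - h\theta)$.

```python
import random

def triangular(u):
    """Triangular kernel T(u) = 2(1 - |2u|)_+ on [-1/2, 1/2]; R(T) = 4/3."""
    return max(0.0, 2.0 * (1.0 - abs(2.0 * u)))

def draw_nh_fn(n, h, theta, kernel, rng):
    """One draw of nh*f_n(x): Y ~ Bin(n, h*theta), then a sum of Y kernel values."""
    y = sum(rng.random() < h * theta for _ in range(n))
    return sum(kernel(rng.uniform(-0.5, 0.5)) for _ in range(y))

def simulate_moments(n, h, theta, kernel, reps, seed=0):
    """Sample mean and sample variance of nh*f_n(x) over `reps` simulated data sets."""
    rng = random.Random(seed)
    draws = [draw_nh_fn(n, h, theta, kernel, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    return mean, var

# Hypothetical settings, chosen only so that h*theta is a valid probability.
n, h, theta, r_k = 50, 0.2, 1.5, 4.0 / 3.0
mean, var = simulate_moments(n, h, theta, triangular, reps=20_000)
print("simulated mean, variance:  ", round(mean, 2), round(var, 2))
print("theoretical mean, variance:", n * h * theta, n * h * theta * (r_k - h * theta))
```

Replacing `triangular` with another kernel on $[-1/2, 1/2]$ only requires changing `r_k` to the corresponding $R_K$.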
This section points out an alternative route to Bayesian nonparametric density estimation. We forget the local likelihood construction and instead rely on the kernel estimates of certain derivatives (including the 0th derivative) as local carriers of information about the local properties of the density. By constructing a local model with parameters connected to these derivatives we obtain the Bayesian density estimate by placing a prior distribution on the parameters and updating this with the distribution of the derivative estimators given the parameters. However, using this approach we are likely to end up with numerical approximations to solutions instead of closed form interpretable expressions. In section 7.3 we outline the main ingredients in a local model with three parameters.
2.3 The modified local likelihood of this paper
We restrict ourselves to working with local likelihoods as in (2.2). Then it all boils down to a question about the normalization of the kernel.
2.3.1 The final decision on local likelihood
Let $\bar K = K/d$. We present three candidates for $d$, namely $d = 1$, $d = R_K$ and $d = k_0$. Note that with unimodal symmetric kernels we have $R_K \le k_0$,
$$R_K = \int K(u)^2\,du \le \int K(0)K(u)\,du = K(0),$$
where equality holds for the uniform kernel. Then we have $1 \le R_K \le k_0$ for the three candidates. In section 2.2.2 we showed that choosing $d = R_K$ preserved the moments of the kernel estimate satisfactorily. When using the local model $f_0(t)\theta$ (prior guess times a local constant) the same choice of $d$ gives correct asymptotic variance of $f_n$; this is shown at the beginning of chapter 5. Can we expect this to hold true even for more complicated local models? The answer is no. A more complicated model involves more statistics and we have only one parameter (the bandwidth) to fine-tune in our approximation. Consider the local model $f(t; \theta, \beta) = \theta\exp(\beta[t - x])$ (local log-linear model, also considered in more detail in section 7.1) which gives the local likelihood
$$L_n(x; \theta, \beta) = \exp\{\log(\theta)\,nh f_n(x)/d + \beta\,nh^3 g_n(x)/d - nh\theta\,\psi(\beta h)/d\}$$
where $\psi(\beta h) = \int K(z)\exp(\beta h z)\,dz$ and $g_n$ is the estimator (1.9) that aims for $\sigma_K^2 f'$. This is clearly an exponential family of distributions where we can obtain the moments of the sufficient statistics $f_n$ and $g_n$ by differentiation of the function $-nh\theta\,\psi(\beta h)/d$, see Bickel & Doksum (1977) for details on exponential families. The moments are found to be
$$E f_n(x) = \theta\,\psi(\beta h), \qquad \mathrm{Var}\,f_n(x) = \frac{\theta d}{nh}\,\psi(\beta h), \qquad E g_n(x) = \frac{\theta}{h^2 d}\,\frac{\partial}{\partial\beta}\psi(\beta h) = \frac{\theta}{hd}\,\psi'(\beta h).$$
The function $\psi(\beta h)$ is also a function of the kernel. A Taylor expansion of the exponential function gives a more manageable expression,
$$\psi(\beta h) \approx 1 + \tfrac{1}{2}\beta^2\sigma_K^2 h^2, \qquad \psi'(\beta h) \approx \beta h\,\sigma_K^2.$$
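We can then approximate the moments from the local likelihood with
$$E f_n(x) \approx \theta(1 + \tfrac{1}{2}\beta^2\sigma_K^2 h^2), \qquad \mathrm{Var}\,f_n(x) \approx \frac{\theta d}{nh}, \qquad E g_n(x) \approx \theta\beta\sigma_K^2/d.$$
From (1.3), (1.4) and (1.10) we get the corresponding correct asymptotic moments in the traditional frequentist framework. Under the local log-linear model the first terms of the moments are
$$E f_n(x) \approx \theta(1 + \tfrac{1}{2}\beta^2\sigma_K^2 h^2), \qquad \mathrm{Var}\,f_n(x) \approx \frac{\theta R_K}{nh}, \qquad E g_n(x) \approx \theta\beta\sigma_K^2.$$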
Again the choice $d = R_K$ gives pleasant results for the (asymptotic) expectation and variance of $f_n$ whereas to match the (asymptotic) expectation of $g_n(x)$ the proper choice is $d = 1$. For the uniform kernel $R_K = k_0 = 1$ and everything is all right. However, for a general kernel we cannot satisfy both of these requirements, but it is fair to say that we should pick a $d$ in $[1, R_K]$ (where $k_0$ is not included, except in the case of a uniform kernel).

We illustrate the consequences of misspecification of a data model in a simple example:
Example 1 Consider a situation with i.i.d. random variables $Y_1, \ldots, Y_n$ which are normally distributed with unknown expectation $\mu$ and known variance, say $\mathrm{Var}\,Y = 1$. We place a conjugate prior on $\mu$, centred at $\mu_0$ with variance 1,
$$\mu \sim N(\mu_0, 1).$$
Then we specify the data model for the sufficient statistic $\bar Y$,
$$\bar Y \mid \mu \sim N(\mu + a, 1/(nb)),$$
which is correct for $a = 0$ and $b = 1$, giving posterior distribution
$$\mu \mid \bar Y \sim N(\mu_1, \tau_1^2)$$
where $\tau_1^{-2} = 1 + nb$. The posterior expectation estimate of $\mu$ is then
$$\hat\mu = \mu_1 = \frac{\mu_0 + nb(\bar Y - a)}{1 + nb}.$$
If the data model has correct expectation with wrong variance, $b \ne 1$, the estimate is
$$\hat\mu = \frac{\mu_0 + nb\bar Y}{1 + nb}$$
which is a weighted average between prior guess and data guess, but puts too much weight on data for $b > 1$ and too much weight on the prior guess for $b < 1$. For moderate misspecifications this is less dramatic than misspecification of the expectation, $a \ne 0$ (and correct $b = 1$), which gives
$$\hat\mu = \hat\mu_{\mathrm{correct}} - a\,\frac{n}{1 + n},$$
that is, we get a biased estimator which is consistent for the value $\mu - a$.
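The two kinds of misspecification in Example 1 can be compared with a few lines of code. This is a minimal sketch of the posterior expectation formula above; the function name `posterior_mean` and the numerical inputs are hypothetical and chosen only for illustration.

```python
def posterior_mean(mu0, ybar, n, a=0.0, b=1.0):
    """Posterior mean of mu with prior N(mu0, 1) and data model ybar | mu ~ N(mu + a, 1/(n b))."""
    return (mu0 + n * b * (ybar - a)) / (1.0 + n * b)

# Hypothetical numbers, chosen only for illustration.
mu0, ybar, n = 0.0, 2.0, 10

correct = posterior_mean(mu0, ybar, n)                # a = 0, b = 1
too_confident = posterior_mean(mu0, ybar, n, b=4.0)   # data variance understated
shifted = posterior_mean(mu0, ybar, n, a=0.5)         # data expectation misspecified

print("correct model:       ", correct)
print("wrong variance (b=4):", too_confident)
print("wrong mean (a=0.5):  ", shifted)
```

With a wrong variance the estimate still lies between the prior guess and the sample mean, only with a different weighting, whereas a wrong expectation moves it by $a\,n/(1+n)$ regardless of the data.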
If the two-parametric local model was used to estimate the derivative of $f$ with the quantity $\theta\beta$ we should worry a lot with $d = k_0$ or $d = R_K$, because then neither of them will do. Choosing $d = 1$ at least gives correct centring of $\theta$ and $\beta$. However, we concentrate on the density itself, estimated by $\theta$. Then the question to be asked is whether we take out the full potential of the two-parametric model, which is far less dramatic. An important conclusion to this discussion is that even though the local log-linear model seems to invite estimation of the derivative this is a dangerous practice.
In this paper we focus on narrow windows in an attempt to be as nonparametric as possible. Based on results in this chapter we prefer to normalize with $R_K$, i.e. $\bar K = K/R_K$.
When dealing with the one-parameter models presented we then get correct centring and spread. When using a more complicated model such as the two-parametric one above this still holds true for $f_n$, but is violated for statistics such as $g_n$. If we want to use semi- or fully parametric models, normalizing with $k_0$ as in Hjort (1996a) may be preferred.
Realizing that we can't have the cake and eat it too, we turn to the TEU (Triangular, Epanechnikov, Uniform) sequence of kernels where importance of the choices mentioned above diminishes with the degree of the kernel.
2.3.2 The TEU kernels
The TEU kernel of degree $p$ is defined by
$$K_p(t) = \frac{p+1}{p}\,(1 - |2t|^p)_+ .$$
Then the triangular kernel is TEU 1, the Epanechnikov kernel is TEU 2 and the uniform kernel is TEU $\infty$. The quantities $R_K = (2p+2)/(2p+1)$ and $k_0 = (p+1)/p$ are of special relevance to the discussion above. Further it is of interest to measure the efficiency of these kernels.

From (1.7) we see that the minimum AMISE is proportional to $[\sigma_K R_K]^{4/5}$; this quantity is then a useful (frequentist) measure of (asymptotic) efficiency. The kernel density variance is $\sigma_K^2 = (p+1)/[12(p+3)]$.

Figure 2.2: Some TEU kernels, degree = 1, 2, 4, 6, 8, 12, 20, 50, ∞.

degree   $K(0)$   $R_K$   $\sigma_K^2$   $(\sigma_K R_K)^{4/5}$   eff.
1        2.000    1.333   0.042          0.353                    1.011
2        1.500    1.200   0.050          0.349                    1.000
4        1.250    1.111   0.060          0.352                    1.008
6        1.167    1.077   0.065          0.355                    1.017
8        1.125    1.059   0.068          0.358                    1.024
12       1.083    1.040   0.072          0.361                    1.033
20       1.050    1.024   0.076          0.364                    1.042
50       1.020    1.010   0.080          0.367                    1.052
∞        1.000    1.000   0.083          0.370                    1.060

Table 2.1: Properties of some TEU kernels; eff. is the ratio of a kernel's $(\sigma_K R_K)^{4/5}$-value to the corresponding value of the optimal Epanechnikov kernel.

Figure 2.2 displays a few of these kernels. In table 2.1 we investigate the properties of some of the kernels. The main objection to the uniform kernel is not its lack of AMISE-efficiency; for the frequently used normal kernel the eff. value is 1.041. Our focus here is on the appearance of the estimate, and with a lot of discontinuities the estimate based on a uniform kernel is less appealing than a continuous estimate.
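The entries of table 2.1 follow from the closed-form expressions for $k_0$, $R_K$ and $\sigma_K^2$ given in this section. The sketch below recomputes them; the function names are introduced here for illustration, and the uniform kernel is handled as the limit $p \to \infty$.

```python
import math

def teu_properties(p):
    """Closed-form K(0), R_K and kernel variance for the TEU kernel of degree p."""
    if math.isinf(p):                       # uniform kernel as the limit p -> infinity
        return 1.0, 1.0, 1.0 / 12.0
    k0 = (p + 1) / p
    r_k = (2 * p + 2) / (2 * p + 1)
    var_k = (p + 1) / (12 * (p + 3))
    return k0, r_k, var_k

def amise_factor(p):
    """(sigma_K * R_K)^(4/5), the kernel-dependent factor of the minimum AMISE."""
    _, r_k, var_k = teu_properties(p)
    return (math.sqrt(var_k) * r_k) ** 0.8

reference = amise_factor(2)                 # Epanechnikov kernel, eff. = 1 by definition
for p in [1, 2, 4, 6, 8, 12, 20, 50, math.inf]:
    k0, r_k, var_k = teu_properties(p)
    print(f"{p:>4} {k0:.3f} {r_k:.3f} {var_k:.3f} "
          f"{amise_factor(p):.3f} {amise_factor(p) / reference:.3f}")
```

Each printed row gives the degree followed by the five table columns in the same order as table 2.1.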
Example 2 In figure 2.3 we have made some kernel density estimates based on different TEU kernels with AMISE-optimal bandwidths. The true density is a standardized¹ gamma with shape 3 from which we have drawn 60 i.i.d. variables. The same data set is used for all estimates. We see that TEU kernels of high degree produce rugged estimates with a lot of modes whereas the low degree kernels give smoother estimates. Nevertheless one might argue that the visual impressions are approximately the same.

¹ Expectation = 0, variance = 1.

Figure 2.3: Kernel density estimates with different TEU kernels of degree p (panels: p = 1, 2, 4, 6, 8, 12, 20, 50, ∞), based on the same 60 i.i.d. standardized gammas with shape 3.

Another interesting feature about figure 2.3 is that there appears to be significantly more variance present in the estimates for large $p$. This is not the case; what gives the impression of variability stems from the kernel construction. Statistically the increased roughness is interpreted as decreasing correlation between neighbouring $x$s when $p$ increases. This points out the limitation of the eye when it comes to estimation of variability.
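An experiment in the spirit of Example 2 can be sketched as follows. This is not the computation behind figure 2.3: the random seed, the evaluation grid and the common bandwidth `h` are hypothetical choices (the figure uses AMISE-optimal bandwidths, which are not reproduced here), and the TEU kernel is taken in the form $K_p(t) = \frac{p+1}{p}(1 - |2t|^p)_+$. The total variation of each estimate over the grid serves as a crude numerical stand-in for visual ruggedness.

```python
import math
import random

def teu_kernel(t, p):
    """TEU kernel of degree p on [-1/2, 1/2]; p = math.inf gives the uniform kernel."""
    if abs(t) >= 0.5:
        return 0.0
    if math.isinf(p):
        return 1.0
    return (p + 1) / p * (1.0 - abs(2.0 * t) ** p)

def kde(x, data, h, p):
    """Kernel density estimate f_n(x) = (nh)^{-1} sum_i K((x_i - x)/h)."""
    return sum(teu_kernel((xi - x) / h, p) for xi in data) / (len(data) * h)

def roughness(values):
    """Total variation of the estimate over the grid, a crude measure of ruggedness."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

rng = random.Random(1)
# 60 draws from a gamma with shape 3, standardized to expectation 0 and variance 1.
data = [(rng.gammavariate(3.0, 1.0) - 3.0) / math.sqrt(3.0) for _ in range(60)]

h = 1.5                                          # hypothetical common bandwidth
grid = [-4.0 + 0.01 * i for i in range(1201)]    # covers [-4, 8]
for p in [1, 2, 8, math.inf]:
    estimate = [kde(x, data, h, p) for x in grid]
    area = sum(estimate) * 0.01                  # Riemann sum, should be close to 1
    print(f"p = {p}: integral ~ {area:.3f}, total variation = {roughness(estimate):.3f}")
```

Because the same bandwidth is used for every degree, the comparison isolates the effect of the kernel shape rather than that of the bandwidth.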
Chapter 3
The local constant model: Basic construction
In this chapter we deal with the simplest local model: the local constant. In section 3.1 we present Hjort's local constant construction (Hjort 1996a) with some more details. In section 3.2 we fit this local description into a joint distribution for all x simultaneously before we in section 3.3 investigate similarities between the Bayesian histogram and the local constant model. Then we introduce general kernels; in the first three sections we use the uniform kernel.
Sections 3.4 and 3.5 discuss central elements such as the confidence parameter, bandwidth and general kernels. These elements are not specific to the local constant model, but at this point we are ready for a thorough discussion of the related themes. In section 3.6 we present the general kernel construction.
3.1 Local Bayesian calculations
Assume we have the i.i.d. random variables $x_1, \ldots, x_n$ from the density $f$ and we want to make a Bayesian estimate of $f$ at the point $x$, using the local (non-smoothed) likelihood (2.2). This likelihood needs a local model $f(t; \theta)$, which in this chapter is a constant,
$$f(t; \theta) = \theta \quad\text{for } t \in C(x) = [x - h/2, x + h/2]. \tag{3.1}$$
Although meaningless as a global description, this makes sense locally.

The Bayesian model needs prior information as input. In our case we have to specify a prior guess of the estimand at $x$, say $f_0(x)$. Then we place a gamma prior $\pi$ on the local level $\theta$ in accordance with the prior guess,
$$\theta \sim G(c f_0(x), c), \tag{3.2}$$
which gives
$$E\theta = f_0(x) \quad\text{and}\quad \mathrm{Var}\,\theta = \frac{f_0(x)}{c},$$
where $c$ measures our confidence in $f_0$. The gamma prior is convenient in that it gives variables in $[0, \infty)$. However, $\Pr(h\theta > 1) > 0$, which means that the probability mass in $C(x)$ exceeds 1 with positive probability. This will be commented on below.
With the local constant model (3.1) the local likelihood takes the form
$$L_n(x; \theta) = \Big\{\prod_{i=1}^{n} \theta^{K((x_i - x)/h)}\Big\} \exp\Big\{-n\int K\Big(\frac{t - x}{h}\Big)\theta\,dt\Big\} = \theta^{nh f_n(x)}\exp(-nh\theta) \tag{3.3}$$
where $K$ is the uniform kernel on $[-1/2, 1/2]$, $K(t) = I(|t| < 1/2)$. This shows that $nh f_n(x)$ is sufficient¹ for $\theta$. For the uniform kernel this statistic equals $\#(x_i \in C(x))$. Since the probability function of $nh f_n(x)$ (given $\theta$) is proportional (in $\theta$) to $L_n(x; \theta)$, $(nh f_n \mid \theta)$ must follow a Poisson distribution with moments
$$E(nh f_n(x) \mid \theta) = \mathrm{Var}(nh f_n(x) \mid \theta) = nh\theta.$$
The fact that $\#(x_i \in C(x))$ is an integer between 0 and $n$, inclusive, is mildly inconsistent with the Poisson distribution. This inconsistency is due to an approximation in the derivation of the local likelihood, cf. section 2.1.1. We will return to this later.

Local data and local $\theta$ have joint distribution
$$L_n(x; \theta)\pi(\theta) = \theta^{nh f_n(x)}\exp(-nh\theta)\,\frac{c^{c f_0(x)}}{\Gamma(c f_0(x))}\,\theta^{c f_0(x) - 1}\exp(-c\theta) \propto \theta^{c f_0(x) + nh f_n(x) - 1}\exp(-(c + nh)\theta).$$
The distribution for $\theta$ given local data is then
$$\mathcal{L}(\theta \mid \text{local data}) \propto L_n(x; \theta)\pi(\theta) \propto G(c f_0(x) + nh f_n(x), c + nh). \tag{3.4}$$
This nice closed expression comes as no surprise as the gamma family is conjugate for the Poisson distributions. As in the general case, the posterior distribution depends on data only through the sufficient statistic². We use the term 'local data' in (3.4) to emphasize the local structure of our estimators. For convenience we just write 'data' from now on. This comment applies to corresponding situations later, as we restrict our attention to kernels on bounded supports.
We choose to use posterior expectation as estimate of $f(x)$. In a decision-theoretic framework this estimator is motivated as the minimizer of expected posterior quadratic loss. The estimator takes the form
$$\hat f(x) = E(\theta \mid \text{data}) = \frac{c f_0(x) + nh f_n(x)}{c + nh} = p f_0(x) + (1 - p) f_n(x) \tag{3.5}$$
where
$$p = \frac{c}{nh + c} \quad\text{or}\quad c = nh\,\frac{p}{1 - p}.$$
The posterior variance is $\mathrm{Var}(\theta \mid \text{data}) = \hat f(x)/(c + nh)$.
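The update from prior to posterior in (3.4) and (3.5) amounts to simple arithmetic on the gamma parameters. The following minimal sketch illustrates it for the uniform kernel; the function name `local_posterior` and all numerical inputs are hypothetical and chosen only for illustration.

```python
def local_posterior(prior_guess, c, n, h, count):
    """Gamma posterior for the local level theta under the uniform kernel.

    prior_guess is f_0(x), c the confidence parameter, and count the number of
    observations in C(x), which equals nh*f_n(x) for the uniform kernel.
    """
    shape = c * prior_guess + count
    rate = c + n * h
    mean = shape / rate
    variance = shape / rate ** 2
    return shape, rate, mean, variance

# Hypothetical inputs, chosen only for illustration.
f0, c, n, h, count = 0.30, 5.0, 100, 0.2, 9
shape, rate, mean, variance = local_posterior(f0, c, n, h, count)

f_n = count / (n * h)        # kernel density estimate at x
p = c / (n * h + c)          # weight put on the prior guess
print("posterior mean:          ", mean)
print("weighted-average form:   ", p * f0 + (1 - p) * f_n)
print("posterior variance:      ", variance)
```

The two printed forms of the posterior expectation agree, and letting `c` shrink towards zero moves the estimate towards `f_n`.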
By letting $c \to 0$ we obtain a noninformative reference prior which in the limit is proportional to the improper density $1/\theta$ ($\log\theta$ uniform on $(-\infty, \infty)$). Use of such a prior gives $\hat f = f_n$, the classical frequentist estimator. The variance is $f_n(x)/nh$, which is an estimate of the variance of $f_n$ based on the leading term in the Taylor expansion, as $R(K) = 1$ for the uniform kernel.

¹ Strictly speaking, $nh f_n(x)$ is not sufficient for $\theta$ as $L_n$ is only an approximate likelihood. However, $nh f_n(x)$ plays the role of a sufficient statistic in this approximation. We will permit ourselves to use the term 'sufficient' in these situations.
² This is a direct consequence of the factorization theorem for sufficient statistics, see Bickel & Doksum (1977).