Inge Helland
Department of Mathematics, University of Oslo, Box 1053 Blindern, N-0316 Oslo, Norway.
(ingeh@math.uio.no)

Abstract
We look at prediction in regression models under mean square loss for the random $x$ case with many explanatory variables. Model reduction is done by conditioning upon only a small number of linear combinations of the original variables. For each dimension a simple theoretical condition on the selection matrix is motivated from the mean square error. The corresponding reduced model will then essentially be the population model considered earlier for the chemometricians' partial least squares algorithm. Estimation of the selection matrix under this model is briefly discussed, and analogous results for the case with multivariate response and for the classification case are formulated. Finally, it is shown that an assumption of multinormality may be weakened to assuming an elliptically symmetric distribution, and that some of the results are valid without any distributional assumption at all.

KEYWORDS AND PHRASES: classification; expected squared prediction error; invariant space; model reduction; partial least squares regression; prediction; random $x$ regression; regression analysis.

1 Introduction.

The regression model in its usual form
$$y = X\beta + e, \qquad (1)$$
where $X$ is $n \times p$ and $e$ is $N(0, \sigma^2 I)$, is one of the most successful statistical models known from an applied point of view; yet its very form is defective in one respect, since any model that is conditioned upon a set of variables like the $x$-variables here necessarily contains no information about the distribution of these variables themselves. Even in the common situation where these are observed random variables, not fixed quantities, it is standard practice to
take all information from the model conditioned upon all the $x$'s as in (1). As an example of a conflict arising from this, it is very difficult to interpret the squared multiple correlation $R^2$ in any reasonable way without taking the distribution of the $x$-variables into account (see Helland, 1987 and references there). Some arguments for the ordinary conditioning can be given when the direction of prediction is from $x$ to $y$, but other forms of conditioning are possible, and may also be useful, as will be seen below.
An even more well-known problem is implied by the situation when $p$ is large - say of the same order as $n$ or even larger. Then $\beta$ cannot (or can hardly) be estimated by least squares because of collinearity. As a consequence, the standard regression method cannot be used directly to predict new $y$-variables from a new set of $x$-variables. This is in fact one of the great paradoxes of statistics: An increase in information in terms of an increase in the number of explanatory variables may typically in this sense make prediction more difficult, not easier.

There are lots of statistical methods whose object is to improve upon this situation: subset selection, ridge regression, shrinkage methods, principal component regression, partial least squares regression and so on, and a lot has been written on the pros and cons of the various methods. In this paper we will not concentrate on methods, but on models. It is known that the ordinary regression methods usually function well when the number of explanatory variables is not too large compared to the number of observations. It seems also to be generally accepted that, roughly speaking, a large data set requires a more complicated model than a small data set. Taking the consequences of this way of thinking, a natural question is: With a given data size, how can a regression model be reduced in an optimal or near optimal way from the point of view of prediction? In general there are two ways to achieve a model reduction: through a change of conditioning and through parameter restriction, and in the simplest case these are equivalent, as will be shown below. The most important task, though, is to find the best reduced model, or at least some nearly best model, and this is not a trivial task in general.

Traditionally, statisticians are accustomed to keeping the same single model all the way from the initial model building to the final data analysis, but informally, model reduction has been used in all branches of applied statistics, both in estimation and in prediction problems. It is easy to find examples where it may pay to reduce the number of parameters in models when the data set is limited; a systematic likelihood-based theory for this has recently been given by Hjort (1998). Here we will present some main ideas for a general approach aimed at prediction in regression models, first for the case with multinormal observations. Inspiration for the theory comes from methods developed in chemometry, but we emphasize again that we will primarily discuss models, not methods, and that the arguments used here to reduce the model are motivated by the expected squared prediction error.
When reading the paper, it may be useful to have the analogue to variable selection in mind. As is well known, this term denotes the methods where one starts with the class of all possible regression models with subsets of the original $x$-variables as explanatory variables (or some large subclass of this class), and then uses data to choose between the models. In the present paper we look at the class of regression models with a set of $k$ linear combinations of the original $x$'s as explanatory variables and limit the class to those which theoretically seem to give the best predictions. This choice will depend upon unknown parameters, and the estimation of these parameters corresponds to the use of data to select model in the variable selection case. Also, there is a final choice of the size of the model to be made. We will give some hints below on possibilities for developing simple criteria for this, but from what is known up to now, crossvalidation seems to be the best available tool. This is also the method usually employed in chemometrical models.

Since the initial class of models in our approach is considerably larger than what we have in the variable selection case, one should expect to find better predictions with this approach. For the case of chemometric methods, this expectation is also confirmed by simulation studies (see, e.g., Frank and Friedman, 1993). Much work remains to be done in evaluating specific predictions, however.

In the next three sections we discuss model reduction in multiple regression models assuming multinormality, and then the reduced model is presented in Section 5. In Section 6 we look at parameter estimation in the reduced model. Section 7 considers the corresponding situation when there are several response variables, and in Section 8 we look at classification problems. In Section 9 we generalize the basic results to other distributions than the multinormal distribution, and discuss some consequences of the general results obtained here.
2 Reduction of regression models by choice of conditioning.

In this Section we will make the ideal assumption that $(x', y)$ has a multinormal distribution with zero expectation and joint covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \sigma_{xy} \\ \sigma_{xy}' & \sigma_{yy} \end{pmatrix}. \qquad (2)$$
(In fact, the most central results below may be generalized to observations that are not multinormal; see Section 9.) We assume that our sample consists of $n$ independent observations from this population, and we want to predict $y_0$ from $x_0$, sampled from the same population, i.e., having the same joint distribution. This model will be called the basic model below. The expectations are put equal to zero in order to simplify notation; in practice this essentially means that we will do regression on centered variables. A model including expectations could have been used at the expense of a more cumbersome notation. If we condition the basic model upon all the $x$-variables in all the $n$ samples, we get a regression model of the form (1) with $\beta = \Sigma_{xx}^{-1}\sigma_{xy}$ and $\sigma^2 = \sigma_{yy} - \sigma_{xy}'\Sigma_{xx}^{-1}\sigma_{xy}$, but the basic model as it stands contains more information.

As in the introduction we assume the dimension $p$ of $x$ to be fairly large, so that the regression estimator from (1) will be nonexistent or unstable. A simple solution is then to pick out $k$ variables, say the first $k$, and do regression upon them. Let $x_1 = (x^{(1)}, \ldots, x^{(k)})'$ and $x_2 = (x^{(k+1)}, \ldots, x^{(p)})'$, with $\beta_1$ and $\beta_2$ the corresponding parts of $\beta$.

Lemma 1. The regression model obtained from the basic multinormal model by conditioning upon all variables and then putting $\beta_2 = 0$ has the same form as the model obtained by just conditioning upon $x_1$ in the basic model. This form is
$$y = \gamma' x_1 + \tilde e, \quad \tilde e \sim N(0, \tilde\sigma^2).$$

Proof.
Simple calculation shows that in each case we get a model for each unit of the form $y = \gamma' x_1 + \tilde e$, where $x_1 = (x^{(1)}, \ldots, x^{(k)})'$ and $\tilde e \sim N(0, \tilde\sigma^2)$. The relationship between the parameters here and the parameters in the original model is in general different for the two cases, but this doesn't matter if the new equation is to be used to develop predictors, say by least squares. An interesting possibility, which is related to what we do later, is to adjust the parameters of the restricted model so that they fit with the $z$-conditioned model.
Obviously the same result holds if some other set of regression variables than the first $k$ is kept in the model. There exist many methods aiming at picking the optimal set of variables, i.e., the best subset regression model to use, but here we want to look at a considerably larger set of models for seeking one that is good for prediction purposes: Let $R$ be a $k \times p$ matrix of full rank $k$, and consider the new variables $z = Rx$, a general set of linear combinations of the original regression variables. Note that subset selection is a special case of this, and that regression upon a $z$ of this form also can be related to several well-known methods like principal component regression. As in Lemma 1, concentrating upon such a smaller-dimensional set of variables can either be interpreted in terms of a model reduction or in terms of a special choice of conditioning in the model. Let $U$ be any $(p-k) \times p$ matrix such that $RU' = 0$ and such that $(R', U')$ has full rank $p$.
Lemma 2. For the multinormal case, the regression model obtained by
(a) conditioning upon $X$ and formally assuming $U\beta = 0$ in the basic model,
and the model obtained from the basic model by
(b) conditioning upon $z = Rx$ only for each unit,
have the same form
$$y = Z\gamma + \tilde e, \quad \tilde e \sim N(0, \tilde\sigma^2 I),$$
with $Z = XR'$.

Proof.
Same as for Lemma 1.

Assume now that $R$ is fixed, and that regression is done under the restricted model formulated in Lemma 2, i.e.,
$$\hat\gamma = (RX'XR')^{-1}RX'y. \qquad (3)$$
Under the assumption that $RX'XR'$ has full rank (which it will almost surely if $k < n$) the expectation and covariance matrix of $\hat\gamma$ are
$$E(\hat\gamma \mid X) = (RX'XR')^{-1}RX'X\beta, \qquad (4)$$
$$V(\hat\gamma \mid X) = \sigma^2 (RX'XR')^{-1}, \qquad (5)$$
where $\sigma^2 = \sigma_{yy} - \sigma_{xy}'\Sigma_{xx}^{-1}\sigma_{xy}$.

This evaluation is done under the basic model conditioned upon the full matrix $X$, which is a common procedure in statistics. Both under the restricted model in Lemma 2(a) and under the conditioned model in Lemma 2(b) we will have $\beta = R'\gamma$ for some $\gamma$, and hence $E(\hat\gamma \mid X) = \gamma$, respectively $E(\hat\gamma \mid XR') = \gamma$. Under the restricted model there is no change in $V(\hat\gamma \mid X)$; under the conditioned model of Lemma 2(b) we get the same formula, but with $\sigma^2$ replaced by $\tilde\sigma^2 = \sigma_{yy} - \sigma_{xy}'R'(R\Sigma_{xx}R')^{-1}R\sigma_{xy}$, which in general is larger than or equal to $\sigma^2$.

Now to the question of how the matrix $R$ can be chosen in the best possible way when the purpose is to get good predictions. As can be expected, the optimal choice will depend upon the parameters of the model, but we will not be too concerned about this problem now. In the next Sections we will look at other conditions while having in mind the fact that parameters need to be estimated, and after that we will turn to the estimation problem itself. For now we will just formulate the following simple condition, in practice to be looked upon as an unachievable ideal goal:

Condition 1: $\beta \in \mathrm{span}(R')$.
Here $\mathrm{span}(R')$ means the $k$-dimensional space spanned by the columns of $R'$. Since the mean square prediction error is uniquely determined by the expectation and variance of $\hat\gamma$, it follows from the formulae (4)-(5) that conditions of interest must depend only upon this space, not on the whole matrix $R$. It may be instructive to notice what Condition 1 means when $R$ is a simple variable selection matrix: It just means that all `correct' variables have been selected: All variables $x^{(j)}$ that have been left out, have $\beta_j = 0$.

The following results are not unexpected, but fundamental:

Theorem 1.
(a) If Condition 1 holds, then $E(\hat\gamma \mid X) = \gamma$.
(b) Assuming that $\Sigma_{xx}$ is invertible, we have that $\tilde\sigma^2 = \sigma^2$ if and only if Condition 1 holds.
(c) Assuming that $\Sigma_{xx}$ is invertible, $R'(R\Sigma_{xx}R')^{-1}R\sigma_{xy}$ is equal to $\beta = \Sigma_{xx}^{-1}\sigma_{xy}$ if and only if Condition 1 holds.

Proof.
(a) We have already noted that $\beta = R'\gamma$ implies unbiasedness.
(b) We use the well-known general identity
$$R'(R\Sigma_{xx}R')^{-1}R = \Sigma_{xx}^{-1} - \Sigma_{xx}^{-1}U'(U\Sigma_{xx}^{-1}U')^{-1}U\Sigma_{xx}^{-1} \qquad (6)$$
(which can be proved by multiplying both sides by $\Sigma_{xx}R'$ and by $U'$ and noting that these two matrices combine to a matrix of full rank). Multiplying this identity by $\sigma_{xy}'$ from the left and by $\sigma_{xy}$ from the right, we see that the formulae for $\sigma^2$ and for $\tilde\sigma^2$ give the same value if and only if $U\Sigma_{xx}^{-1}\sigma_{xy} = U\beta = 0$, which is equivalent to $\beta \in \mathrm{span}(R')$, i.e., Condition 1.
(c) Similar, using (6).
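As a quick plausibility check of the identity (6), the following numerical sketch (assuming only numpy; the matrix $\Sigma_{xx}$ and the dimensions are arbitrary illustrative choices, not from the paper) compares the two sides for a random positive definite $\Sigma_{xx}$ and matrices $R$, $U$ with $RU' = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 2

# a random positive definite Sigma_xx (illustrative)
A = rng.normal(size=(p, p))
Sxx = A @ A.T + p * np.eye(p)

# rows of R and U taken from one orthonormal basis, so that
# RU' = 0 and (R', U') has full rank p
basis, _ = np.linalg.qr(rng.normal(size=(p, p)))
R, U = basis[:, :k].T, basis[:, k:].T

lhs = R.T @ np.linalg.solve(R @ Sxx @ R.T, R)
Sinv = np.linalg.inv(Sxx)
rhs = Sinv - Sinv @ U.T @ np.linalg.solve(U @ Sinv @ U.T, U @ Sinv)
print(np.allclose(lhs, rhs))  # True, up to rounding error
```

3 The mean square prediction error.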
In the evaluation of predictors we will take as a point of departure the expected squared prediction error $PRE = E(y_0 - \hat y_0)^2$. An important question is under what conditioning this expectation should be evaluated. Our answer is related to the main object here, namely to develop a general theory of prediction which functions well for the whole target population: At least for $(x_0', y_0)$ the expectation should be with respect to the unconditional basic model. A not quite so strong argument can be given for taking the expectation over the distribution of the matrix $X$ of explanatory variables in the calibration set; one argument is that this gives an explicit formula which is easy to discuss. Alternatively, we can assume that the sample size $n$ is so large that $n^{-1}RX'XR'$ can be approximated by $R\Sigma_{xx}R'$ in the last part of the argument below. (This convergence is much better than the convergence of $n^{-1}X'X$ against $\Sigma_{xx}$ if the dimension $k$ is much smaller than $p$.)

Theorem 2. Let $R$ be a fixed $k \times p$ matrix of full rank $k$ and assume $k < n - 1$, and let the estimated regression vector $\hat\gamma$ be given by equation (3). Then
$$PRE = \tilde\sigma^2\left(1 + \frac{k}{n-k-1}\right), \qquad (7)$$
where $\tilde\sigma^2 = \sigma^2 + \sigma_{xy}'(\Sigma_{xx}^{-1} - R'(R\Sigma_{xx}R')^{-1}R)\sigma_{xy}$.

Corollary 1. The expected squared prediction error is $\sigma^2(1 + k/(n-k-1))$ if and only if Condition 1 holds; in all other cases $PRE$ is larger than this.

Proof.
Condition upon $Z = XR'$. In this conditioned model the residual variance is $\tilde\sigma^2 = \sigma_{yy} - \sigma_{xy}'R'(R\Sigma_{xx}R')^{-1}R\sigma_{xy}$. Since $\Sigma_{zz} = R\Sigma_{xx}R'$, and since $\sigma_{zy} = R\sigma_{xy}$, the formula in the Theorem for $\tilde\sigma^2$ follows.

In the same conditioned model the regression vector is $\gamma = (R\Sigma_{xx}R')^{-1}R\sigma_{xy}$ with least squares estimator $\hat\gamma = (RX'XR')^{-1}RX'y$. Expanding the square in $PRE = E(y_0 - \hat\gamma'Rx_0)^2$ and conditioning upon $\hat\gamma$, we find
$$PRE = \sigma_{yy} - 2\hat\gamma'R\sigma_{xy} + \hat\gamma'(R\Sigma_{xx}R')\hat\gamma.$$
Taking the conditional expectation of this, given $Z$, gives
$$PRE = \sigma_{yy} - 2\gamma'R\sigma_{xy} + \mathrm{tr}[(R\Sigma_{xx}R')(\gamma\gamma' + (RX'XR')^{-1}\tilde\sigma^2)] = \tilde\sigma^2(1 + \mathrm{tr}[(R\Sigma_{xx}R')(RX'XR')^{-1}]). \qquad (8)$$
Taking the expectation over $Z$ and using a well-known result from multivariate analysis (Anderson 1984, Lemma 7.7.1) then gives (7).

Thus to obtain good predictions one should first try to achieve a situation where we can be as confident as possible that Condition 1 is satisfied at least in some approximate sense, and at the same time we should try to keep the dimension $k$ as small as possible if this can be done without a substantial increase in $\tilde\sigma^2$.
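To illustrate the trade-off in formula (7), the following sketch (a hypothetical numerical example; the decreasing sequence of residual variances is an arbitrary assumption, not from the paper) evaluates $PRE_k = \tilde\sigma_k^2(1 + k/(n-k-1))$ for increasing dimension $k$; the resulting curve is typically convex with an interior minimum, as discussed in the next section:

```python
import numpy as np

n = 30                      # sample size (illustrative)
sigma2 = 1.0                # sigma^2, the minimal residual variance
# hypothetical values of sigma_tilde_k^2, decreasing towards sigma^2
sigma_tilde2 = sigma2 + 4.0 * 0.4 ** np.arange(11)

for k, s2 in enumerate(sigma_tilde2):
    pre = s2 * (1.0 + k / (n - k - 1))   # formula (7)
    print(f"k={k:2d}  PRE={pre:.3f}")
```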
4 Stepwise model reduction.

The regression vector $\beta$ is an unknown quantity, so in whatever way $R$ is determined, from data or in other ways, it is impossible to guarantee that the ideal Condition 1 holds. This problem is extra acute since it is impossible to estimate $\beta$ accurately when the number of variables is large. It is therefore important to look at the behaviour of the mean square prediction error also when $\beta \notin \mathrm{span}(R')$. In this Section we will find a new reasonable condition to impose upon $R$, a condition which applies to situations where the model dimension is increased stepwise and which is not so sensitive to the value of $\beta$.

Thus we start with a simple model of dimension $k = 1$, and increase the dimension stepwise. At each step we then have $R = R_k$ as a matrix of rank $k$, and at the next step $R_{k+1} = (R_k', d_{k+1})'$ for some vector $d_{k+1}$. Then in general $\tilde\sigma^2 = \tilde\sigma_k^2$ in (7) will decrease or stay constant, while the factor $1 + k/(n-k-1)$ will increase. The typical net effect will be that $PRE = PRE_k$ will be a convex function of $k$ with a certain minimum. The aim in the end is to get a low minimum of $PRE_k$, and to achieve this, it is important to have the initial decrease in $\tilde\sigma_k^2$ at each step as large as possible. For the discussion which follows, it is useful to have in mind a hypothetical plot of $PRE_k$ as a function of $k$ with the ideal curve corresponding to $\tilde\sigma^2 = \sigma^2$ at the bottom of the plot. (See Fig. 1.) In the plot of $PRE_k$ against $k$ there are two fixed points: $PRE_0 = \sigma_{yy}$ and $PRE_p = \sigma^2(1 + p/(n-p-1))$ if $p < n - 1$. (If $p \ge n - 1$ the right hand end of the curve will essentially tend to infinity.) To get a minimum which is as low as possible between these two points, it is essential that the decrease $\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2$ is large for small $k$.

It turns out that the optimal condition of this kind again is sensitive to the value of $\beta$. However, by being satisfied with a decrease which is as large as possible or nearly so under most circumstances, we get a condition which, while still depending upon unknown parameters, involves parameters which are more easy to estimate accurately.

The following result gives the formula for the decrease in the variance and the mathematically optimal value for this.

Theorem 3.
(a) Let $Q_k = \Sigma_{xx} - \Sigma_{xx}R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k\Sigma_{xx}$. The decrease in $\tilde\sigma^2$ when going from $R_k$ to $R_{k+1} = (R_k', d_{k+1})'$ (assuming $d_{k+1} \notin \mathrm{span}(R_k')$) is always nonnegative and is given by
$$\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2 = \frac{(\beta'Q_kd_{k+1})^2}{d_{k+1}'Q_kd_{k+1}}. \qquad (9)$$
(b) The largest possible value of this decrease is $\beta'Q_k\beta$, and this is achieved if and only if $d_{k+1} = \mathrm{const}\cdot\beta$ plus arbitrary values in $\mathrm{span}(R_k')$. It may be convenient to replace $\beta$ here by $Q_k\beta$, given by $Q_k\beta = \sigma_{xy} - \Sigma_{xx}R_k'v_k$ with $v_k = (R_k\Sigma_{xx}R_k')^{-1}R_k\sigma_{xy}$.
(c) With this choice of $d_{k+1}$, $\tilde\sigma_{k+1}^2$ will always achieve its smallest possible value $\sigma^2$, so the expected mean square error at step $k+1$ is then as small as possible.

Proof.
We have $\tilde\sigma_k^2 = \sigma_{yy} - \sigma_{xy}'R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k\sigma_{xy}$ with $R_k = (d_1, \ldots, d_k)'$, and $\tilde\sigma_{k+1}^2$ is given by the same formula with $R_{k+1} = (R_k', d_{k+1})'$. By a straightforward calculation (see Appendix 1) we find (9). Since $Q_kR_k' = 0$, we can replace $d_{k+1}$ by $d$ in (9), and we can without loss of generality let $d$ be in the space orthogonal to $\mathrm{span}(R_k')$. Then by Schwarz' inequality we see that (9) is maximized for $d = \mathrm{const}\cdot Q_k\beta$, and the value of the difference is then $\beta'Q_k\beta$.

For the last assertion, insert $d_{k+1} = \mathrm{const}\cdot\beta$ plus a term in $\mathrm{span}(R_k')$ into the formula for $\tilde\sigma_{k+1}^2$ in the beginning of this proof, use (9) with $d = \beta$, and the same formula for $\tilde\sigma_k^2$, to prove $\tilde\sigma_{k+1}^2 = \sigma^2$.

This Theorem holds also for $k = 0$ with $\tilde\sigma_0^2 = \sigma_{yy}$ and $Q_0 = \Sigma_{xx}$, and then it shows that the smallest possible $\tilde\sigma_1^2$ is just the smallest globally, namely $\tilde\sigma^2 = \sigma^2$, that is, we get full reduction in one step, and this is obtained by taking $\mathrm{span}(R_1') = \mathrm{span}(\beta)$, i.e., a version of Condition 1 again.

The interesting case, however, is when Condition 1 does not hold exactly, and we at a later step have some more or less arbitrary selection matrix $R = R_k$. What conditions should one then impose on $d_{k+1}$ in the next step in order that the reduction in $\tilde\sigma^2$ should be as large as possible? One may expect that it is not very crucial to have $\mathrm{span}(R_{k+1}')$ determined in an exactly optimal way as long as (i) one goes in a direction which leads to a definite reduction in $\tilde\sigma^2$, (ii) the determination of the direction depends on parameters which are easy to estimate relatively accurately.

On this background we will normalize by $d_{k+1}'d_{k+1} = 1$ and concentrate on the numerator of (9). This numerator then gets its largest value when $d_{k+1}$ is chosen along $Q_k\beta$, a vector which has the form $\sigma_{xy} - \Sigma_{xx}w_k$, where $w_k$ belongs to $\mathrm{span}(R_k')$. This leads to introducing a new condition:

Condition P: $\mathrm{span}(R_{k+1}') = \mathrm{span}(R_k', Q_k\beta) = \mathrm{span}(R_k', \sigma_{xy} - \Sigma_{xx}R_k'v_k)$.

For the formulation of the next result we also need the invariance condition which will play a main role in the next section:

Condition 2: $\mathrm{span}(\Sigma_{xx}R_k') = \mathrm{span}(R_k')$.

If we disregard complications that may arise from a possible increase of the denominator in (9), it turns out that a choice of $\mathrm{span}(R_{k+1}')$ which does not satisfy Condition P will give a smaller decrease in the expected squared prediction error.

Theorem 4. Assume that $\beta \notin \mathrm{span}(R_k')$. Suppose that we have found $d_{k+1}$ and hence $R_{k+1} = (R_k', d_{k+1})'$ in such a way that Condition 2 does not hold. Assume further that
$$Q_k - PQ_kP \text{ is nonnegative definite}, \qquad (10)$$
where $P$ is the projection operator upon the space spanned by $d_{k+1}$ and $Q_k\beta$. Then one can always find another vector $d_{k+1}^*$, and hence $R_{k+1}^* = (R_k', d_{k+1}^*)'$, such that
(i) Condition P holds for $R_{k+1}^*$.
(ii) We have
$$\frac{(\beta'Q_kd_{k+1}^*)^2}{d_{k+1}^{*\prime}Q_kd_{k+1}^*} \ge \frac{(\beta'Q_kd_{k+1})^2}{d_{k+1}'Q_kd_{k+1}} \qquad (11)$$
for all $d_{k+1}$. The reduction in $\tilde\sigma^2$ and therefore in $PRE$ is therefore larger than or equal to what was obtained by the unstarred space extension.

Proof. From the formula for $Q_k\beta$ we have
$$\beta'Q_kd = (Q_k\beta)'d = (\sigma_{xy} - \Sigma_{xx}w_k)'d,$$
where $w_k \in \mathrm{span}(R_k')$. Let $L$ be the space spanned by $d_{k+1}$ and $Q_k\beta$, so that $P = P_L$. Without loss of generality we can assume that $d_{k+1}$ is perpendicular to $\mathrm{span}(R_k')$, and then by assumption it follows that $d_{k+1}^*$, the projection of $Q_k\beta$ onto $L$, will be nonzero. With this the numerators will be the same on both sides of (11), while the denominator on the left hand side will be less than or equal to the denominator on the right hand side.

As already stated, Theorem 4 is valid also for $k = 0$, when Condition P implies $\mathrm{span}(R_1') = \mathrm{span}(\sigma_{xy})$. Since the dimension of $\mathrm{span}(R_k')$ is always assumed to be $k$, we therefore get a unique solution at each step from this:

Condition P': $\mathrm{span}(R_k') = \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$.

Remarks.
1. The technical condition (10) is really needed; it is not in general true that $Q - PQP$ is nonnegative definite when $Q$ is positive definite and $P$ is a projection. This is the reason why Condition P leads to the solution $\sigma_{xy}$ in the first step instead of the theoretically optimal choice $\beta$.
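Condition P' can be made concrete in a few lines: given population parameters $\Sigma_{xx}$ and $\sigma_{xy}$, the sketch below (numpy, with arbitrary illustrative parameters; the function name is chosen here for illustration only) builds the basis $(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$ whose span is the space required by Condition P':

```python
import numpy as np

def krylov_basis(Sxx, sxy, k):
    """Columns sxy, Sxx sxy, ..., Sxx^(k-1) sxy (Condition P')."""
    cols, v = [], sxy.copy()
    for _ in range(k):
        cols.append(v)
        v = Sxx @ v
    return np.column_stack(cols)   # a p x k candidate for R'

# illustrative parameters
rng = np.random.default_rng(1)
p, k = 5, 3
A = rng.normal(size=(p, p)); Sxx = A @ A.T + p * np.eye(p)
sxy = rng.normal(size=p)
Rt = krylov_basis(Sxx, sxy, k)
print(np.linalg.matrix_rank(Rt))  # k, unless the space has already become invariant
```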
2. The vectors $\Sigma_{xx}^j\sigma_{xy}$ appearing in Condition P' are relatively easy to estimate accurately, in contrast to $\beta$, which is difficult to estimate for large $p$.

3. If covariances are replaced by estimated covariances in the above formulation of Condition P', we get the partial least squares regression algorithm, as shown by Helland (1988). An early discussion of PLSR can be found in Wold et al. (1984); it is now used routinely throughout the chemometrical literature.

4. The choice made here in Condition P is not unique; another alternative to the hard-to-estimate $d = \mathrm{const}\cdot\beta$, which maximizes (9), is to maximize a criterion where the numerator and the denominator of (9) are balanced against each other by some fixed weight. The sample version of this leads to continuum regression (Stone and Brooks, 1990), which Sundberg (1993) has shown is closely related to ridge regression. Yet another alternative is to let the $d_j$ be eigenvectors of $\Sigma_{xx}$. We will show later that this leads to a population model which is equivalent to the one resulting from Condition P.

5. When the assumption $\beta \notin \mathrm{span}(R_k')$ in Theorem 4 does not hold, we see from Condition P' that $\mathrm{span}(R_{k+1}') = \mathrm{span}(R_k')$. This case will be discussed in the next section.

5 The reduced model with relevant components.

Via Condition P we have now arrived at a sequence of models which are `nearly optimal' in terms of expected squared prediction error at each step, and which is formulated in terms of linear combinations of the original variables with coefficients that are easily estimable from data. This model is conditional with respect to just this rather peculiar set of linear combinations, however, and in this sense bears no relation to the original regression model, which was conditional upon all the $x$-variables. To find a connection, we must replace the conditioning by a restriction of the parameters, which is done by using Lemma 2.
Theorem 5.
(a) In the sense formulated in Lemma 2, the $z$-conditioned regression model satisfying Condition P has the same form as the ordinary $x$-conditioned regression model under the following equivalent sets of parameter restrictions:
(i) $\beta \in \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$.
(ii) $\Sigma_{xx}^k\sigma_{xy} \in \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$.
(iii) There exists a set of $k$ eigenvectors $v_1, \ldots, v_k$ of $\Sigma_{xx}$ such that $\sigma_{xy}$ belongs to the space spanned by these eigenvectors.
(b) If the condition (a)(i), (a)(ii) or (a)(iii) holds for dimension $k$, but not for $k - 1$, there is a unique invariant space under $\Sigma_{xx}$ of dimension $k$ containing $\beta$.
(c) If the conditions in (a) are satisfied, we can let $\mathrm{span}(R_k')$ be the linear space mentioned in (a)(i), equivalently the linear space mentioned in (a)(ii) or equivalently the space spanned by the eigenvectors in (a)(iii).

Proof.
Let $U$ be a $p \times (p-k)$ matrix of full column rank whose columns are orthogonal to $\mathrm{span}(R_k') = \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$. Then $U'\beta = 0$ is equivalent to $\beta \in \mathrm{span}(R_k')$. Multiplying the resulting linear relation by $\Sigma_{xx}$, we see that this is equivalent to the statement that condition (a)(ii) holds either for this $k$ or some lower values of $k$. The equivalence between (a)(i) and (a)(ii) is easy to see by a similar reasoning, and the equivalence to (a)(iii) was proved in Helland (1990). The rest of the Theorem follows from Helland (1990) and Næs and Helland (1993), where also other equivalent sets of conditions are given.

One way to state these conditions together is that $\mathrm{span}(R_k')$ should be a $k$-dimensional invariant space under $\Sigma_{xx}$ containing $\beta$:

Condition 1: $\beta \in \mathrm{span}(R_k')$.
Condition 2: $\mathrm{span}(\Sigma_{xx}R_k') = \mathrm{span}(R_k')$.

It is important for understanding that $\beta = R_k'\gamma$ now will be a parameter vector in the reduced model, in general not equal to the true regression vector of the original model.

As pointed out by von Rosen (1994), there is a substantial mathematical theory of invariant spaces, and the concept has also been used to some extent in linear model theory (see, e.g., Kruskal, 1968). It is obvious that there always is an invariant space containing $\beta$ of dimension $p$, namely the whole space. Theorem 5(a) expresses equivalent conditions implying that there exist invariant spaces (containing $\beta$) of smaller dimensions $k < p$. In all cases this imposes restrictions upon the parameter space. The restrictions imposed are naturally nested in $k$.

Theorem 5 was arrived at by taking Condition P as a point of departure, a condition that aims at making (9) large for small $k$. Another possible point of departure is to let the columns of $R_k'$ be eigenvectors of $\Sigma_{xx}$: $R_k' = (v_1, \ldots, v_k)$. Inserting this successively into (9), we find $\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2 = (v_{k+1}'\sigma_{xy})^2/\lambda_{k+1}$, where $\lambda_{k+1}$ is the eigenvalue corresponding to $v_{k+1}$. Again this reduction should be taken as large as possible for small $k$, but since $\Sigma_{xx}$ is hard to estimate, it is difficult to say what succession of eigenvectors this leads to. In any case Theorem 5 shows that this leads to exactly the same reduced model as Condition P. Note also that the arguments here are essentially algebraic, and the results (except for the reference to Lemma 2) do not depend on multinormality. The question of nonnormality will be taken up in a broader setting in Section 9.

6 Estimation of the matrix R.
Theorem 5 gives formulae for the desired invariant space of dimension $k$ and conditions for its existence, everything expressed in terms of the unknown model parameters. For the use of the present ideas in practice, this space must be determined in some approximate way, either by using data or by other means. Heuristically one should expect that it is not too important to determine this space very accurately: If Condition 1 is satisfied exactly (with $\beta$ as the true regression parameter), we do not need further conditions; if it only holds approximately, then (exact or approximate) validity of Condition 2 will help minimizing the effect of this. If Condition 2 is only approximately satisfied, the result is only small additional terms in the expected squared prediction error $PRE$. A crucial point is the determination of the dimension $k$. If it is too large, Theorem 2 shows that $PRE$ will increase; if it is too small, the requirement needed to get an approximate invariant space of dimension $k$ may be too severe.

To begin with, we will fix $k$ and look at the determination of $\mathrm{span}(R')$. An obvious solution is to use the space in Theorem 5(a)(ii) with covariances replaced by estimated covariances, i.e., take $\hat R' = (s, Ss, \ldots, S^{k-1}s)$, where $S = X'X$ and $s = X'y$. As shown in Helland (1990), this is equivalent to the well-known, but still somewhat controversial partial least squares method PLSR proposed by chemometricians. Numerous publications, in particular in the two chemometrical journals, have shown that the method functions reasonably well, but critique has been raised by statisticians, for instance in Frank and Friedman (1993). From the present point of view a critical remark against PLSR is the following: The assumption that there exists an invariant space containing $\beta$ of dimension $k$ implies a restriction of the model parameters, as expressed by Theorem 5(a)(ii). The parameter estimates resulting from the above PLSR-formula for $\hat R$ do not satisfy the corresponding restriction (with probability 1) when $k < p$. This may imply a loss in efficiency.

Nevertheless, asymptotic developments in Helland and Almøy (1994) and simulations in Almøy (1996) seem to indicate that PLSR functions well under the restricted model when $k$ is small. Less is known in precise terms about its prediction ability under the original basic model as formulated above.
Principal component regression is another much used and reasonably well functioning method. In our setting it is given by the estimates $\hat R$ connected to the eigenvectors of $S$. There are several ways to determine which eigenvectors of $S$ to include in $\hat R$, the two most common solutions being those with the largest eigenvalues or those with the largest values of a $t$-statistic connected to $\hat\gamma$.

A full discussion of various estimation procedures is beyond the scope of the present paper. We will limit ourself to listing some mathematical results that may be of relevance in this connection. Most of the results are relatively easy to prove, though some proofs, while based on relatively simple ideas, require more details. We will assume a sequence of data sets $(X_n, y_n)$ (size $n$) with $\hat R_n$ being a sequence of estimators of $R$ (all of the same dimension $k$), and we take $\hat\gamma_n = (\hat R_nX_n'X_n\hat R_n')^{-1}\hat R_nX_n'y_n$. For a vector $v$ and a vector space $L$, let $d(v, L)$ be the distance from $v$ to $L$, i.e., the length of $v - P_Lv$, where $P_Lv$ is the projection of $v$ on $L$. Finally, the expected squared prediction error is $PRE_n = E(y_0 - \hat\gamma_n'\hat R_nx_0)^2$.

1. If $\hat R_n$ is determined from an independent training set or if $\hat R_n$ converges in probability to a constant matrix $R$, then the expected squared prediction error $PRE_n$ converges to $\sigma^2$ if and only if
$$d(\beta, \mathrm{span}(\hat R_n')) \to 0. \qquad (12)$$
If this condition does not hold, every subsequence limit of $PRE_n$ will be larger or equal to $\sigma^2$, with strict inequality for at least one subsequence limit.

2. If Condition 1 holds for the limiting matrix $R$, so that $\beta = R'\gamma$ for some $\gamma$, then, as in Helland and Almøy (1994), as $n \to \infty$
$$PRE_n = \sigma^2\left(1 + \frac{k}{n}\right) + \frac{1}{n}E(W'QW) + o\left(\frac{1}{n}\right), \qquad (13)$$
where $Q = \Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx}$ and $W$ is the limit in distribution of $\sqrt{n}(\hat R_n'\hat\gamma_n - \beta)$.

3. As a possible alternative to crossvalidation for determining model dimension, one might hope to draw upon the estimated residual variance
$$\hat{\tilde\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^n(y_i - \hat\gamma'\hat Rx_i)^2, \qquad (14)$$
where $\hat\gamma = (\hat RX'X\hat R')^{-1}\hat RX'y$. It is easy to prove that this estimator is asymptotically unbiased in the sense that $E(\hat{\tilde\sigma}^2) \to \tilde\sigma^2$ as $n \to \infty$, with $\tilde\sigma^2 = \sigma^2 + \sigma_{xy}'(\Sigma_{xx}^{-1} - R'(R\Sigma_{xx}R')^{-1}R)\sigma_{xy}$. A serious difficulty though is that the rate of convergence of $E(\hat{\tilde\sigma}^2)$ towards $\tilde\sigma^2$, although of order $O(1/n)$, will in general depend upon the dimension $k$. To get a criterion which is comparable across dimensions, this dependence
may be corrected for. This issue will not be pursued here. Note that for model choice situations where criteria like Mallow's $C_p$ and the Akaike criterion are used, no estimation of a subspace is needed, so this problem does not occur.

4. One obvious candidate for an estimator $\hat R$ is the maximum likelihood estimator under the restricted model, which is developed in Helland (1992). Unfortunately, this estimator requires heavy computation, and the empirical results in a prediction setting are not too convincing (Almøy, 1996). An alternative candidate is the conditional maximum likelihood estimator, where one conditions upon $X\hat R'$ in the restricted model. Using essentially the same computations as in Helland (1992), we get that this estimator is found by minimizing
$$\left(\hat\sigma_{yy} - s'R'(RSR')^{-1}Rs\right)\cdot\frac{|RR'|}{|RSR'|}. \qquad (15)$$
Note that the product of two factors is to be minimized in (15); the first factor is minimized if and only if $\hat\beta_{LS} = S^{-1}s \in \mathrm{span}(R')$, and the second factor is minimized iff $R'$ is spanned by the eigenvectors of $S$ with the largest eigenvalues. In this way the resulting predictor will have a relation both to ordinary regression and the most common form of principal component regression.

As in Helland (1992) the minimization here can be done stepwise, first in one dimension and then by successively increasing the dimension, if we impose the constraint that the resulting estimated subspaces should be nested within each other. The minimization for each dimension can also be done for instance as in Helland (1992).

A simpler solution is achieved by assuming that $\Sigma_{xx} = \Sigma_0$ is known, when we maximize the likelihood by minimizing
$$\left(\hat\sigma_{yy} - \sum_{j=1}^k\frac{(v_{i_j}'s)^2}{\lambda_{i_j}}\right)\left(\prod_{j=1}^k\lambda_{i_j}\right)^{-1} \qquad (16)$$
over the choice of $k$ eigenvectors $v_{i_1}, \ldots, v_{i_k}$ of $\Sigma_0$, with an obvious notation for eigenvectors and eigenvalues. One should expect this to be a reasonable solution also when $\Sigma_{xx}$ is unknown and eigenvectors from $S$ are used. A further simplification is to approximate the first parenthesis in (16) by a product, so that no minimization over subsets is needed. Unfortunately, simulations indicate that the corresponding predictor seems to behave roughly like the principal component predictor based upon selecting components by a $t$-test (T. Almøy, private communication), a predictor which in most cases is known to be inferior to the ordinary principal component predictor.

5. Other estimators of $R$ can be found by taking into account invariance properties of the model. This is presently being investigated.
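Since crossvalidation is the tool actually recommended above for choosing the dimension, a minimal leave-one-out sketch may be useful; it reuses the hypothetical krylov_predictor from the earlier sketch (both names and the setup are illustrative assumptions):

```python
import numpy as np

def loo_press(X, y, k):
    """Leave-one-out sum of squared prediction errors for dimension k."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta = krylov_predictor(X[mask], y[mask], k)
        press += (y[i] - X[i] @ beta) ** 2
    return press

# choose the dimension with the smallest PRESS value, e.g.
# best_k = min(range(1, 6), key=lambda k: loo_press(X, y, k))
```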
7 The case with several response variables.

A multivariate extension of (1) is
$$Y = XB + E, \qquad (17)$$
where $Y$ is $n \times q$, $B$ is a $p \times q$ parameter matrix, and where the rows of $E$ are independent multivariate normal $(0, \Sigma)$. Again we also have interest in the marginal $x$-distribution, which is assumed multinormal $(0, \Sigma_{xx})$. (For most purposes, multinormality may be dispensed with; see Section 9.) If then (17) is looked upon as representing independent conditional distributions, the joint distribution will also be multinormal, and $B = \Sigma_{xx}^{-1}\Sigma_{xy}$.

Take $(x_0', y_0')$ from the same population, so that $y_0 = B'x_0 + e_0$ with $e_0 \sim N(0, \Sigma)$. For an estimator $\hat B$ one gets a predictor $\hat B'x_0$, and when evaluating this, it is natural to weight the dependent variables by the inverse error covariance matrix. This gives
$$PRE = E((y_0 - \hat B'x_0)'\Sigma^{-1}(y_0 - \hat B'x_0)) = \mathrm{tr}(\Sigma^{-1}\Sigma_{yy}) - 2E(\mathrm{tr}(\Sigma^{-1}\hat B'\Sigma_{xy})) + E(\mathrm{tr}(\Sigma^{-1}\hat B'\Sigma_{xx}\hat B)), \qquad (18)$$
where $\Sigma_{yy} = B'\Sigma_{xx}B + \Sigma$.

Consider now some fixed $k \times p$ matrix $R$ of full rank and the estimator
$$\hat B = R'(RX'XR')^{-1}RX'Y.$$
Using this for prediction gives a similar predictor for each $y$-variable as used in Section 2, but with the same $R$ for each variable. Taking conditional expectation of $\hat B$, given $X$, we get
$$E(\hat B \mid X) = R'(RX'XR')^{-1}RX'XB.$$
In particular, $E(\hat B \mid X) = B$ if

Condition 1: $\mathrm{span}(B) \subseteq \mathrm{span}(R')$.

As in the case with one response variable, this condition also minimizes $PRE$. To look further upon the expected squared prediction error when the condition does not hold, we will use the approximation $RX'XR' \approx nR\Sigma_{xx}R'$, and we will condition upon $XR'$, as in the proof of Theorem 2. Then $E(\hat B \mid XR') = R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx}B$. Using (17) in the formula for $\hat B$, we get
$$\hat B - E(\hat B \mid XR') = R'(RX'XR')^{-1}RX'E. \qquad (19)$$
To insert the expressions above into (18) we need that the rows of $E$ satisfy $V(e_i) = \Sigma$, and we get as the analogue of the residual variance in Theorem 2
$$\tilde\Sigma = \Sigma + B'(\Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx})B.$$
Then the reduced-model residuals are independent with covariance matrix $\tilde\Sigma$, and writing (18) out in terms of the $e_i$'s, taking first conditional and then unconditional expectation in equation (18), gives
$$PRE = \left(q + \mathrm{tr}\left(\Sigma^{-1}B'(\Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx})B\right)\right)\left(1 + \frac{k}{n-k-1}\right). \qquad (20)$$

This is the multivariate generalization of formula (7). Again it is natural to determine the space $\mathrm{span}(R_k')$ successively: $R_{k+1} = (R_k', d_{k+1})'$. Very much of the previous development is exactly as before. Let $Q_k = \Sigma_{xx} - \Sigma_{xx}R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k\Sigma_{xx}$ as before, and let $PRE_k$ be the first factor on the right hand side of formula (20) when $R = R_k$ is inserted. The generalization of formula (9) is then
$$PRE_k - PRE_{k+1} = \frac{d_{k+1}'Q_kB\Sigma^{-1}B'Q_kd_{k+1}}{d_{k+1}'Q_kd_{k+1}}. \qquad (21)$$
(See Appendix 1.) For the formulation of the following Theorem we refer to the population PLS2 algorithm defined and discussed briefly in Appendix 2.

Theorem 6. Assume $\Sigma = I$.
(a) The population PLS2 algorithm maximizes at each step the numerator of (21) if $\Sigma = I$. This algorithm terminates at $k = m$ if and only if Condition 1 then holds.
(b) If this algorithm terminates at step $m$, then also

Condition 2: $\mathrm{span}(\Sigma_{xx}R_m') = \mathrm{span}(R_m')$

holds, and we have that $R_m' = (w_1, \ldots, w_m)$ spans a minimal space satisfying both Condition 1 and Condition 2.
(c) Alternatively, this space can be characterized as the smallest space spanned by eigenvectors of $\Sigma_{xx}$ which contains $\mathrm{span}(B)$.
(d) The parameter restrictions formulated in (b) or (c) above constitute the restrictions generalizing those of Theorem 5 to the multivariate case.
Proof.
(a) In the notation of Appendix 2, the numerator of (21) is maximized if and only if $d_{k+1} = w_{k+1}$ is an eigenvector of $Q_kBB'Q_k$ with maximal eigenvalue, and this is just the way the algorithm is defined in the appendix. The algorithm terminates at step $m$ if and only if $Q_mB = 0$, which means $\mathrm{span}(B) \subseteq \mathrm{span}(R_m')$, i.e., Condition 1.
(b) Since $Q_kB$ determines the next vector $w_{k+1}$ in $R_{k+1}' = (R_k', w_{k+1})$, it is clear that
$$\mathrm{span}(R_{k+1}') \subseteq \mathrm{span}(B, \Sigma_{xx}R_k').$$
When the algorithm stops at step $m$, then $\mathrm{span}(R_{m+1}')$ may be replaced by $\mathrm{span}(R_m')$ here. Since we know that $\mathrm{span}(B) \subseteq \mathrm{span}(R_m')$, it follows that $\mathrm{span}(\Sigma_{xx}R_m') \subseteq \mathrm{span}(R_m')$. Since the matrices on both sides have the same rank, equality follows. From the proof in (a) it follows that the space $\mathrm{span}(R_m')$ is minimal with the property of containing $\mathrm{span}(B)$ among all nontrivial sequences of spaces satisfying Condition P at each step.
(c) A space of dimension $k$ satisfies Condition 2 if and only if it is spanned by $k$ eigenvectors of $\Sigma_{xx}$.
(d) Lemma 2 can immediately be generalized to the multivariate case. The previous restriction $\beta = R'\gamma$ now reads $B = R'\Gamma$, where $\mathrm{span}(R')$ is spanned by eigenvectors of $\Sigma_{xx}$. This gives clearly the same restriction as in (c).

So far for the population algorithm corresponding to a restricted population model. The natural sample estimate of $R$ arising from this - again see Appendix 2 - then gives the PLS2-predictor from chemometry. There are some variants of these estimators/predictors (see Holcomb et al. (1997) and references there), and at least some of them seem to perform poorly compared to other predictors proposed by the statistical community (Breiman and Friedman, 1996). There is probably scope both for improvements and for better comparisons, and one possible point of departure may be the present reduced model.

Note, however, that the above formulation already points at one weakness of the PLS2 algorithm. To get the relation sketched between the algorithm in its population form and the reduction in mean square error, we had to assume that the residual covariance matrix $\Sigma$ was the identity. The case where the residuals between different dependent variables are correlated, is one case where we expect to get some gain from a joint prediction. It seems likely that a modified algorithm, where first a preliminary estimate $\hat\Sigma$ is found, and then a maximization of $w'X'Y\hat\Sigma^{-1}Y'Xw$ is done in each
step instead of that of $w'X'YY'Xw$, will have the possibility of leading to better predictions.

Both in the chemometric literature and in the statistical literature (see the references above) there has been some discussion on when it pays to do single variable prediction and when it pays to include several $y$-variables simultaneously in the prediction. The present model formulation may throw some light upon this question. The multivariate prediction means parsimony in the sense that the same $R$ is used for all variables. On the other hand, if the dimension of the invariant space has to be increased much to ensure that all columns of $B$ belong to this space, the net gain may be negative.

8 Classification.

The simplest classification method is linear discriminant analysis, where we assume $p$ classification variables $x$ that are observed in each of 2 classes, $x \sim N(\mu_1, \Sigma)$ in the first class and $x \sim N(\mu_2, \Sigma)$ in the second class. Again the model parameters are estimated by a training set, say $n$ observations from each of the two classes, and again the estimation is difficult if the number of variables $p$ is large compared to $n$. In an interesting recent article Friedman (1997) argues that this problem is less in classification than in regression, but the problem may nevertheless be serious in many applications.

So assume that we reduce the number of variables to $k$ by letting $R$ be some fixed $k \times p$ matrix and taking $z = Rx$ and $\nu_i = R\mu_i$. Then of course $z \sim N(\nu_i, R\Sigma R')$ ($i = 1, 2$), and standard linear discriminant analysis can be done with the new variables $z$.

Concentrate on the simple symmetric case with equal prior probability for the two classes and equal cost. Then (see for instance Ripley, 1996) the asymptotic probability of misclassification for the classification based upon the $z$'s will be
$$P_R = \Phi\left(-\frac{1}{2}\Delta_R\right), \qquad (22)$$
where $\Phi$ is the cumulative standard normal distribution function, and where $\Delta_R$ is positive with $\Delta_R^2 = (\mu_1 - \mu_2)'R'(R\Sigma R')^{-1}R(\mu_1 - \mu_2)$. Again we are relatively confident with the formulae resulting from asymptotic calculation when the dimension of the variables involved is only $k < p$.

The probability of misclassification is small iff $\Delta_R$ is not too small, and this is the objective for our conditions for the (theoretically) best possible choice of $R$.

Condition 1: $\Sigma^{-1}(\mu_1 - \mu_2) \in \mathrm{span}(R')$.
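For a numerical feeling of (22), the sketch below (numpy with arbitrary illustrative parameters; $\Phi$ is evaluated via the error function to stay within the standard library) compares the misclassification probability of a reduced rule $z = Rx$ with the one based on all of $x$:

```python
import numpy as np
from math import erf, sqrt

def mahalanobis2(mu_diff, Sigma):
    return mu_diff @ np.linalg.solve(Sigma, mu_diff)

def misclass_prob(delta2):
    """Formula (22): Phi(-Delta/2), with Phi via the error function."""
    return 0.5 * (1.0 + erf(-sqrt(delta2) / (2.0 * sqrt(2.0))))

rng = np.random.default_rng(3)
p, k = 6, 2
A = rng.normal(size=(p, p)); Sigma = A @ A.T + p * np.eye(p)
mu_diff = rng.normal(size=p)
R = rng.normal(size=(k, p))

delta2_full = mahalanobis2(mu_diff, Sigma)
delta2_R = mahalanobis2(R @ mu_diff, R @ Sigma @ R.T)
# reduction can never decrease the error probability (Theorem 7 below)
print(misclass_prob(delta2_R) >= misclass_prob(delta2_full))  # True
```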
Theorem 7. We have $\Delta_R \le \Delta$, where $\Delta^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$, with equality if and only if Condition 1 holds. Hence this condition minimizes the probability of misclassification for given $k$, and this minimum is (asymptotically to the lowest order) the same for all values of $k$.

Proof.
Multiplying the identity (6) (with $\Sigma_{xx}$ replaced by $\Sigma$) from the left by $(\mu_1 - \mu_2)'$ and from the right by $(\mu_1 - \mu_2)$, we see that the minimum is reached for $U\Sigma^{-1}(\mu_1 - \mu_2) = 0$, which is equivalent to Condition 1.

A similar discussion as in the regression case involving stepwise model selection can be made. Instead we take a simpler approach showing a different property of the reduced model. We then introduce at once

Condition 2: $\mathrm{span}(\Sigma R') = \mathrm{span}(R')$.

Theorem 8. Assume a general $R$ is such that Condition 1 may or may not hold, and put $\mu_1 - \mu_2 = R'\alpha + v$, where $Rv = 0$. Then either Condition 2 holds and
$$\Delta_R^2 = \inf_v \Delta^2,$$
hence $P_R = \sup_v P$ for the probability of misclassification, or Condition 2 does not hold and $\Delta_R^2 > \inf_v \Delta^2$.

Proof.
When $\mu_1 - \mu_2 = R'\alpha + v$ with $Rv = 0$, we find
$$\Delta^2 = \alpha'R\Sigma^{-1}R'\alpha + v'\Sigma^{-1}v + 2\alpha'R\Sigma^{-1}v.$$
The last two terms on the right hand side here depend upon $v$. If Condition 2 holds, they vanish when $v = 0$, and the first term then equals $\Delta_R^2$. Assume that Condition 2 does not hold. Then $v$ can be chosen such that the cross term is nonzero while the quadratic term in $v$ - call it $c$ - is positive. For such a $v$ replace $v$ by $tv$, where $t$ is some scalar. Minimization over $t$ leads to that the sum of these terms is negative for the minimizing $t$.

We hope to discuss estimators of $R$ and corresponding classification procedures elsewhere.

Remark.
A similar discussion of classification error using a stepwise increase in dimension can be done in exactly the same way as in the regression case. The formula (22) then plays the role that formula (7) played there, for a matrix $R'$
with a small number of columns. When $p$ is large, sample variation will cause the classification error to be larger than what is given by (22), again parallel to the prediction error in the regression case. It is therefore again important to keep the number of linear combinations of variables on which classifications should be based, low. Again a restricted model, stating that only a small number of eigenvectors of $\Sigma$ contribute in the classification, seems to be useful.

On the other hand, the property used in Theorem 8 above, that Condition 2 in general implies an equivalence between the orthogonalities $Rv = 0$ and $R\Sigma^{-1}v = 0$, has useful consequences in the regression case, also. (It can be used to derive alternative expressions for the expected squared prediction error.) Furthermore, a minimax property of prediction error - analogous to that for classification error given in Theorem 8 - can be formulated for the regression case.

9 General theory and discussion.
For definiteness we will return to the situation of predicting a scalar variable $y$ from a vector $x$ of $p$ variables, but generalizations to cover the situation in Section 8 should be fairly straightforward to make. We will also assume quadratic loss (and will assume that $y$ has finite variance), so the general task is to minimize
$$PRE = E(y_0 - \hat y_0)^2, \qquad (23)$$
where $\hat y_0$ is a function of $x_0$ and of a training set $(x_i', y_i)$; $i = 1, \ldots, n$, of independent observations, such that all $(x_i', y_i)$; $i = 0, 1, \ldots, n$, have the same, more or less unknown distribution.

The ordinary linear or nonlinear regression procedure is to go via the conditional expectation $E(y|x)$ and use as $\hat y_0$ an estimate $\hat E(y_0|x_0)$ of this from the training set. By adding and subtracting $E(y_0|x_0)$ to $\hat y_0$ in the resulting equation (23), we find that it is equal to
$$PRE = E[\mathrm{Var}(y|x)] + E[E(y_0|x_0) - \hat E(y_0|x_0)]^2. \qquad (24)$$
The first term here is unrelated to the training set, so it is the second term that one should try to reduce in order to give good predictions. To get a feeling for this last term: in the linear regression case it is $E[(\hat\beta - \beta)'x_0x_0'(\hat\beta - \beta)]$. The reduction of this term will be a problem if the number of parameters is large, which it generally will be if the number of variables in $x$ is large. So we may choose to reduce the model by conditioning on a smaller vector variable $z = z(x)$ (say of dimension $k$) instead of on the whole vector $x$. Note that in this general setting the analogue of choosing new variables to
condition on is in general something essentially different from equating some specific parameters in the model to zero.

An important question is how to find a sensible new vector $z$ to condition upon. Theoretically, the first requirement to be satisfied is that the change from $x$ to $z$ should not increase substantially the first term in (24). It is easy to find a simple theoretical condition for no such increase to take place: Look at the version of the well-known identity $\mathrm{Var}(y) = E[\mathrm{Var}(y|x)] + \mathrm{Var}[E(y|x)]$, conditioned upon $z$, and take the expectation of this to get
$$E[\mathrm{Var}(y|z)] = E[\mathrm{Var}(y|x)] + E[\mathrm{Var}(E(y|x)|z)], \qquad (25)$$
so by conditioning upon $z$ instead of upon $x$ in (24) we find
$$PRE = E[\mathrm{Var}(y|z)] + E[E(y_0|z_0) - \hat E(y_0|z_0)]^2 = E[\mathrm{Var}(y|x)] + E[\mathrm{Var}(E(y|x)|z)] + E[E(y_0|z_0) - \hat E(y_0|z_0)]^2. \qquad (26)$$
Hence the term independent of estimation in (24) does not increase when going from $x$ to $z$ if and only if $\mathrm{Var}(E(y|x)|z) = 0$, i.e., if $E(y|x)$ is a function of $z$ (almost surely), which then necessarily must be $E(y|z)$. This leads to

Condition 1: $E(y|x) = E(y|z)$ (a.s.).

Theorem 9.
(a) The first term in (24) is constant in going from $x$ to $z$ if Condition 1 holds. In all other cases this term will increase.
(b) For the case of a linear function $z = Rx$ and of multinormal observations (or more generally, assuming that all conditional expectations are linear; see below), the Condition 1 here is equivalent to the previous Condition 1.

Proof. (a) is already proved. For (b), use Theorem 1, either (b) or (c) there; the generalization to other cases with linear conditional expectation is straightforward, using Theorem 10 below.

To find further conditions we will limit ourselves to linear functions $z = Rx$. Furthermore, we will assume that the conditional expectation of $y$, given $x$, is linear in $x$; i.e., $E(y|x) = \beta'x$. Then, from (25), the increase in conditional variance of $y$ when going from conditioning on $x$ to conditioning on $z$, and hence the increase in the non-estimative part of $PRE$, will be
$$E[\mathrm{Var}(y|z)] - E[\mathrm{Var}(y|x)] = E[\mathrm{Var}(\beta'x|z)]. \qquad (27)$$
In the multinormal case this expression equals
$$\beta'(\Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx})\beta.$$
This was the basic result needed for the mean square error calculations of Section 2, and it is important, though perhaps surprising to some, that no distributional assumptions at all (except for finite variance) are needed for a closely related result to be valid.

Theorem 10. Assume that $x$ has a finite covariance matrix $\Sigma_{xx}$. Then we have the following:
(a) If the conditional expectation of $x$, given $Rx$, is linear: $E(x|Rx) = ARx$, then
$$E[\mathrm{Var}(\beta'x|Rx)] = \beta'Q\beta,$$
where $Q = \Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx}$.
(b) In general
$$E[\mathrm{Var}(\beta'x|Rx)] = \beta'\tilde Q\beta,$$
where $\tilde Q = Q - QMQ$ for a nonnegative definite matrix $M$ which is such that $\tilde Q$ is nonnegative definite. So the difference in variance, hence the approximate difference in mean square error when we disregard estimation error, is given by this expression.
(c) If the conditional expectation of $x$, given $z = Rx$, is linear, then we have $E(x|z) = \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}z$. Furthermore, if $E(y|x) = \beta'x$, we have $E(y|z) = \gamma'z$ with $\gamma = (R\Sigma_{xx}R')^{-1}R\sigma_{xy}$, and $\mathrm{Var}(y|z) = \tilde\sigma^2 = \sigma^2 + \beta'Q\beta$. The model restriction $\beta = R'\gamma$, where $(R', U')$ has full rank, leads to the same conditional expectation $E(y|z)$.

Proof.
In case (a) we find, by multiplying $E(x|Rx) = ARx$ from the right by $x'R'$ and then taking expectation, that $A = \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}$. In general, given $R$, choose $U$ such that $RU' = 0$ and such that $(R', U')$ has full rank $p$. Then from equation (6) we have $Q\Sigma_{xx}^{-1} = I - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R$, and by the same equation
$$x = \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}Rx + Q\Sigma_{xx}^{-1}x, \qquad (28)$$
so
$$E[\mathrm{Var}(\beta'x|Rx)] = E[\mathrm{Var}(\beta'Q\Sigma_{xx}^{-1}x|Rx)] = \beta'Q\Sigma_{xx}^{-1}\left(\Sigma_{xx} - E[E(x|Rx)E(x|Rx)']\right)\Sigma_{xx}^{-1}Q\beta = \beta'(Q - QMQ)\beta, \qquad (29)$$
where $M = \Sigma_{xx}^{-1}E[E(x|Rx)E(x|Rx)']\Sigma_{xx}^{-1} - R'(R\Sigma_{xx}R')^{-1}R$, and equation (6) has been used again. The matrix $M$ is nonnegative definite since $E(x|Rx)$ predicts $x$ at least as well as the linear regression upon $Rx$. Nonnegative definiteness of $\tilde Q$ follows since the expected variance calculated above must be nonnegative.

The proof of (c) is found by first noting that $E(y|x) = \beta'x$, and then taking the conditional expectation of this given $z = Rx$. The formula for $\mathrm{Var}(y|z)$ follows from the same result used to prove (25). The formula for $E(y|z)$ under the restriction $\beta = R'\gamma$ follows from equation (6).
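Theorem 10(a) is easy to check by simulation in the multinormal case, where the conditional expectation of $x$ given $Rx$ is automatically linear. A small Monte Carlo sketch (numpy; all parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, N = 5, 2, 200_000
A = rng.normal(size=(p, p)); Sxx = A @ A.T + p * np.eye(p)
beta = rng.normal(size=p)
R = rng.normal(size=(k, p))

# theoretical value beta' Q beta, with Q from Theorem 10(a)
Q = Sxx - Sxx @ R.T @ np.linalg.solve(R @ Sxx @ R.T, R @ Sxx)
theory = beta @ Q @ beta

# Monte Carlo: residual variance of beta'x after regressing on z = Rx
x = rng.multivariate_normal(np.zeros(p), Sxx, size=N)
z, t = x @ R.T, x @ beta
coef, *_ = np.linalg.lstsq(z, t, rcond=None)
mc = np.var(t - z @ coef)
print(theory, mc)   # should agree to Monte Carlo accuracy
```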
The consequence of this is that all essential results of the Sections 2-5 are valid if we assume that the conditional expectation of $x$, given $Rx$, is linear. The class of distributions for which this is valid includes the elliptical (or elliptically symmetric) distributions; see Devlin et al. (1976) and references there, in particular Kelker (1970). In addition, Theorems 3 and 4 hold with some - admittedly nontrivial - modifications essentially without any distributional assumptions at all on the variables, assuming only finite moments and $E(y|x) = \beta'x$. The detailed proof of this will be omitted, but the main idea is that these proofs are directly based on the formula for the expected squared prediction error, which by Theorem 10(b) is asymptotically valid in general if we only replace the matrix $Q$ by $\tilde Q$, and on the fact that this matrix also has a vanishing product with $R'$ and that it is small when $Q$ is small. As a consequence, the two basic conditions, Condition 1 and the invariance condition on $\mathrm{span}(R')$ in the reduced model, are of some relevance for any linear model with random $x$'s, whatever the distribution of these $x$'s, and whatever the conditional distribution of $y$, given these $x$'s. The discussion on estimation in Section 6 is also quite generally valid, but the maximum likelihood estimates discussed there are of course distribution dependent.

One way to put this, is that the chemometricians in some sense seem to have been on the right track when they have used the term `soft models' in connection to their PLS regression. The general feeling among statisticians still is that chemometricians are imprecise in some of their terminology, but on the other hand it seems to be easier to develop new and fruitful ideas on an intuitive level than if full rigor is demanded at each step. Recent issues of the chemometrical journals contain ideas that go far beyond what has been discussed in this paper. (This has also recently reached statistical journals; see for instance Durand and Sabatier (1997) and references there.)

The way of thinking of statistical models that is promoted in this paper, points further than specific chemometric methods, however. We make explicit the fact that in the case with many unknown parameters it is useful to have at least two different statistical models under consideration at the same time: The `correct' model that adequately describes reality in all details and the `simplified', reduced model which
can actually be used for estimation and prediction. In the present paper we introduce for the linear model case specific theoretical conditions to assure that the reduced model functions as well as possible for prediction purposes. In this way we get a nested sequence of reduced models, and the order can be found by cross-validation or in other ways.

The possible danger of overfitting of models that may result from this way of thinking, needs to be further analyzed. If the order of the model is found by cross-validation, one will probably be reasonably safe, but in the multinormal case there also seems to be some possibility of using estimated prediction error found from the ordinary regression mean square, a possibility that should be investigated further.

There may be some intuitive arguments to the effect that because irrelevant information is thrown away, the estimates from the reduced model may have certain robustness properties. Exact results in this direction may be difficult, but will be welcome.

Practical algorithms for computing estimates are not touched upon at all in this paper. It is well known from partial least squares regression that the best formulas or algorithms for theoretical understanding are usually not the best for numerical computations.

As a final point, once the ideas behind this paper are accepted, there seems to be potential for extending in several directions. Logistic regression and other loglinear models (with or without link functions) with many explanatory variables is an immediate possibility, likewise multivariable models with many parameters. An interesting challenge is the situation where the parameters are not in linear form, but where one nevertheless perhaps may give a similar theory to the present theory if the parameters are related by some group symmetry.

Appendix 1. Proof of (9) and related formulae.

Let $S = \Sigma_{xx}^{1/2}R_k'$, $c = \Sigma_{xx}^{1/2}d_{k+1}$ and $P = S(S'S)^{-1}S'$. Then $(I - P)c$ - if nonzero - is orthogonal to $\mathrm{span}(S)$, and therefore
$$(S, c)\left((S, c)'(S, c)\right)^{-1}(S, c)' = P + \frac{(I - P)cc'(I - P)}{c'(I - P)c}.$$
Multiplying this equation from the left and from the right by $\Sigma_{xx}^{-1/2}$ then gives
$$R_{k+1}'(R_{k+1}\Sigma_{xx}R_{k+1}')^{-1}R_{k+1} = R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k + \frac{\Sigma_{xx}^{-1}Q_kd_{k+1}d_{k+1}'Q_k\Sigma_{xx}^{-1}}{d_{k+1}'Q_kd_{k+1}},$$
and multiplying this from the left by $\sigma_{xy}'$ and from the right by $\sigma_{xy}$ gives (9). The multivariate formula (21) follows in the same way.
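Formula (9) and the projection computation above can also be verified numerically. The sketch below (numpy; all parameters are arbitrary illustrative assumptions) compares the direct decrease $\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2$ with the right hand side of (9):

```python
import numpy as np

def sigma_tilde2(Rt, Sxx, sxy, syy):
    # residual variance s_yy - s_xy' R'(R Sxx R')^{-1} R s_xy
    Rs = Rt.T @ sxy
    return syy - Rs @ np.linalg.solve(Rt.T @ Sxx @ Rt, Rs)

rng = np.random.default_rng(5)
p, k = 6, 2
A = rng.normal(size=(p, p)); Sxx = A @ A.T + p * np.eye(p)
sxy = rng.normal(size=p); syy = 10.0
beta = np.linalg.solve(Sxx, sxy)

Rt = rng.normal(size=(p, k))   # columns of R_k'
d = rng.normal(size=p)         # the new direction d_{k+1}
Q = Sxx - Sxx @ Rt @ np.linalg.solve(Rt.T @ Sxx @ Rt, Rt.T @ Sxx)

direct = (sigma_tilde2(Rt, Sxx, sxy, syy)
          - sigma_tilde2(np.column_stack([Rt, d]), Sxx, sxy, syy))
formula = (beta @ Q @ d) ** 2 / (d @ Q @ d)   # right hand side of (9)
print(np.isclose(direct, formula))            # True
```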