Inge Helland
Department of Mathematics, University of Oslo, Box 1053 Blindern, N-0316 Oslo, Norway.
(ingeh@math.uio.no)

Abstract
We look at prediction in regression models under mean square loss for the random $x$ case with many explanatory variables. Model reduction is done by conditioning upon only a small number of linear combinations of the original variables. For each dimension a simple theoretical condition on the selection matrix is motivated from the mean square error. The corresponding reduced model will then essentially be the population model considered earlier for the chemometricians' partial least squares algorithm. Estimation of the selection matrix under this model is briefly discussed, and analogous results for the case with multivariate response and for the classification case are formulated. Finally, it is shown that an assumption of multinormality may be weakened to assuming an elliptically symmetric distribution, and that some of the results are valid without any distributional assumption at all.

KEYWORDS AND PHRASES: classification; expected squared prediction error; invariant space; model reduction; partial least squares regression; prediction; random $x$ regression; regression analysis.

1 Introduction.

The regression model in its usual form
$$y = X\beta + e, \qquad (1)$$
where $X$ is $n \times p$ and $e$ is $N(0, \sigma^2 I)$, is one of the most successful statistical models known from an applied point of view; yet its very form is defective in one respect, since any model that is conditioned upon a set of variables like the $x$-variables here necessarily contains no information about the distribution of these variables themselves. Even in the common situation where these are observed random variables, not fixed quantities, it is standard practice to
take all information from the model conditioned upon all the $x$'s as in (1). As an example of a conflict arising from this, it is very difficult to interpret the squared multiple correlation $R^2$ in any reasonable way without taking the distribution of the $x$-variables into account (see Helland, 1987 and references there). Some arguments for the ordinary conditioning can be given when the direction of prediction is from $x$ to $y$, but other forms of conditioning are possible, and may also be useful, as will be seen below.
An even more well-known problem is implied by the situation when $p$ is large - say of the same order as $n$ or even larger. Then $\beta$ cannot (or can hardly) be estimated by least squares because of collinearity. As a consequence, the standard regression method cannot be used directly to predict new $y$-variables from a new set of $x$-variables. This is in fact one of the great paradoxes of statistics: An increase in information in terms of an increase in the number of explanatory variables may typically in this sense make prediction more difficult, not easier.

There are lots of statistical methods whose object is to improve upon this situation: subset selection, ridge regression, shrinkage methods, principal component regression, partial least squares regression and so on, and a lot has been written on the pros and cons of the various methods. In this paper we will not concentrate on methods, but on models. It is known that the ordinary regression methods usually function well when the number of explanatory variables is not too large compared to the number of observations. It seems also to be generally accepted that, roughly speaking, a large data set requires a more complicated model than a small data set. Taking the consequences of this way of thinking, a natural question is: With a given data size, how can a regression model be reduced in an optimal or near optimal way from the point of view of prediction? In general there are two ways to achieve a model reduction: through a change of conditioning and through parameter restriction, and in the simplest case these are equivalent, as will be shown below. The most important task, though, is to find the best reduced model, or at least some nearly best model, and this is not a trivial task in general.

Traditionally, statisticians are accustomed to keeping the same single model all the way from the initial model building to the final data analysis, but informally, model reduction has been used in all branches of applied statistics, both in estimation and in prediction problems. It is easy to find examples where it may pay to reduce the number of parameters in models when the data set is limited; a systematic likelihood-based theory for this has recently been given by Hjort (1998). Here we will present some main ideas for a general approach aimed at prediction in regression models, first for the case with multinormal observations. Inspiration for the theory comes from methods developed in chemometry, but we emphasize again that we will primarily discuss models, not methods, and that the arguments used here to reduce the model are motivated by the expected squared prediction error.
When reading the paper, it may be useful to have the analogue to variable selection in mind. As is well known, this term denotes the methods where one starts with the class of all possible regression models with subsets of the original $x$-variables as explanatory variables (or some large subclass of this class), and then uses data to choose between the models. In the present paper we look at the class of regression models with a set of $k$ linear combinations of the original $x$'s as explanatory variables and limit the class to those which theoretically seem to give the best predictions. This choice will depend upon unknown parameters, and the estimation of these parameters corresponds to the use of data to select model in the variable selection case. Also, there is a final choice of the size of the model to be made. We will give some hints below on possibilities for developing simple criteria for this, but from what is known up to now, crossvalidation seems to be the best available tool. This is also the method usually employed in chemometrical models.

Since the initial class of models in our approach is considerably larger than what we have in the variable selection case, one should expect to find better predictions with this approach. For the case of chemometric methods, this expectation is also confirmed by simulation studies (see, e.g., Frank and Friedman, 1993). Much work remains to be done in evaluating specific predictions, however.

In the next three sections we discuss model reduction in multiple regression models assuming multinormality, and then the reduced model is presented in Section 5. In Section 6 we look at parameter estimation in the reduced model. Section 7 considers the corresponding situation when there are several response variables, and in Section 8 we look at classification problems. In Section 9 we generalize the basic results to other distributions than the multinormal distribution, and discuss some consequences of the general results obtained here.
2 Reduction of regression models by choice of conditioning.

In this Section we will make the ideal assumption that $(x', y)$ has a multinormal distribution with zero expectation and joint covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \sigma_{xy} \\ \sigma_{xy}' & \sigma_{yy} \end{pmatrix}. \qquad (2)$$
(In fact, the most central results below may be generalized to observations that are not multinormal; see Section 9.) We assume that our sample consists of $n$ independent observations from this population, and we want to predict $y_0$ from $x_0$, sampled from the same population, i.e., having the same joint distribution. This model will be called the basic model below. The expectations are put equal to zero in order to simplify notation; in practice this essentially means that we will do regression on centered variables. A model including expectations could have been used at the expense of a more cumbersome notation. If we condition the basic model upon all the $x$-variables in all the $n$ samples, we get a regression model of the form (1) with $\beta = \Sigma_{xx}^{-1}\sigma_{xy}$ and $\sigma^2 = \sigma_{yy} - \sigma_{xy}'\Sigma_{xx}^{-1}\sigma_{xy}$, but the basic model as it stands contains more information.

As in the introduction we assume the dimension $p$ of $x$ to be fairly large, so that the regression estimator from (1) will be nonexistent or unstable. A simple solution is then to pick out $k$ variables, say the first $k$, and do regression upon them. Let $x_1 = (x^{(1)}, \ldots, x^{(k)})'$ and $x_2 = (x^{(k+1)}, \ldots, x^{(p)})'$, with $\beta_1$ and $\beta_2$ the corresponding parts of $\beta$.

Lemma 1. The regression model obtained from the basic multinormal model by conditioning upon all variables and then putting $\beta_2 = 0$ has the same form as the model obtained by just conditioning upon $x_1$ in the basic model. This form is
$$y = \gamma' x_1 + \tilde e, \quad \tilde e \sim N(0, \tilde\sigma^2).$$

Proof.
Simple calculation shows that in each case we get a model for each unit of the form $y = \gamma' x_1 + \tilde e$, where $x_1 = (x^{(1)}, \ldots, x^{(k)})'$ and $\tilde e \sim N(0, \tilde\sigma^2)$. The relationship between the parameters here and the parameters in the original model is in general different for the two cases, but this doesn't matter if the new equation is to be used to develop predictors, say by least squares. An interesting possibility, which is related to what we do later, is to adjust the parameters of the restricted model so that they fit with the $z$-conditioned model.
Obviously the same result holds if some other set of regression variables than the first $k$ is kept in the model. There exist many methods aiming at picking the optimal set of variables, i.e., the best subset regression model to use, but here we want to look at a considerably larger set of models for seeking one that is good for prediction purposes: Let $R$ be a $k \times p$ matrix of full rank $k$, and consider the new variables $z = Rx$, a general set of linear combinations of the original regression variables. Note that subset selection is a special case of this, and that regression upon a $z$ of this form also can be related to several well-known methods like principal component regression. As in Lemma 1, concentrating upon such a smaller-dimensional set of variables can either be interpreted in terms of a model reduction or in terms of a special choice of conditioning in the model. Let $U$ be any $(p-k) \times p$ matrix such that $RU' = 0$ and such that $(R', U')$ has full rank $p$.
Lemma 2. For the multinormal case, the regression model obtained by
(a) conditioning upon $X$ and formally assuming $U\beta = 0$ in the basic model,
and the model obtained from the basic model by
(b) conditioning upon $z = Rx$ only for each unit,
have the same form
$$y = Z\gamma + \tilde e, \quad \tilde e \sim N(0, \tilde\sigma^2 I),$$
with $Z = XR'$.

Proof.
Same as for Lemma 1.

Assume now that $R$ is fixed, and that regression is done under the restricted model formulated in Lemma 2, i.e.,
$$\hat\gamma = (RX'XR')^{-1}RX'y. \qquad (3)$$
Under the assumption that $RX'XR'$ has full rank (which it will almost surely if $k < n$) the expectation and covariance matrix of $\hat\gamma$ are
$$E(\hat\gamma \mid X) = (RX'XR')^{-1}RX'X\beta, \qquad (4)$$
$$V(\hat\gamma \mid X) = \sigma^2 (RX'XR')^{-1}, \qquad (5)$$
where $\sigma^2 = \sigma_{yy} - \sigma_{xy}'\Sigma_{xx}^{-1}\sigma_{xy}$.

This evaluation is done under the basic model conditioned upon the full matrix $X$, which is a common procedure in statistics. Both under the restricted model in Lemma 2(a) and under the conditioned model in Lemma 2(b) we will have $\beta = R'\gamma$ for some $\gamma$, and hence $E(\hat\gamma \mid X) = \gamma$, respectively $E(\hat\gamma \mid XR') = \gamma$. Under the restricted model there is no change in $V(\hat\gamma \mid X)$; under the conditioned model of Lemma 2(b) we get the same formula, but with $\sigma^2$ replaced by $\tilde\sigma^2 = \sigma_{yy} - \sigma_{xy}'R'(R\Sigma_{xx}R')^{-1}R\sigma_{xy}$, which in general is larger than or equal to $\sigma^2$.

Now to the question of how the matrix $R$ can be chosen in the best possible way when the purpose is to get good predictions. As can be expected, the optimal choice will depend upon the parameters of the model, but we will not be too concerned about this problem now. In the next Sections we will look at other conditions while having in mind the fact that parameters need to be estimated, and after that we will turn to the estimation problem itself. For now we will just formulate the following simple condition, in practice to be looked upon as an unachievable ideal goal:

Condition 1: $\beta \in \mathrm{span}(R')$.
Here $\mathrm{span}(R')$ means the $k$-dimensional space spanned by the columns of $R'$. Since the mean square prediction error is uniquely determined by the expectation and variance of $\hat\gamma$, it follows from the formulae (4)-(5) that conditions of interest must depend only upon this space, not on the whole matrix $R$. It may be instructive to notice what Condition 1 means when $R$ is a simple variable selection matrix: It just means that all `correct' variables have been selected: All variables $x^{(j)}$ that have been left out, have $\beta_j = 0$.

The following results are not unexpected, but fundamental:

Theorem 1.
(a) If Condition 1 holds, then $E(\hat\gamma \mid X) = \gamma$.
(b) Assuming that $\Sigma_{xx}$ is invertible, we have that $\tilde\sigma^2 = \sigma^2$ if and only if Condition 1 holds.
(c) Assuming that $\Sigma_{xx}$ is invertible, $R'(R\Sigma_{xx}R')^{-1}R\sigma_{xy}$ is equal to $\beta = \Sigma_{xx}^{-1}\sigma_{xy}$ if and only if Condition 1 holds.

Proof.
(a) We have already noted that $\beta = R'\gamma$ implies unbiasedness.
(b) We use the well-known general identity
$$R'(R\Sigma_{xx}R')^{-1}R = \Sigma_{xx}^{-1} - \Sigma_{xx}^{-1}U'(U\Sigma_{xx}^{-1}U')^{-1}U\Sigma_{xx}^{-1} \qquad (6)$$
(which can be proved by multiplying both sides by $\Sigma_{xx}R'$ and by $U'$ and noting that these two matrices combine to a matrix of full rank). Multiplying this identity by $\sigma_{xy}'$ from the left and by $\sigma_{xy}$ from the right, we see that the formulae for $\sigma^2$ and for $\tilde\sigma^2$ give the same value if and only if $U\Sigma_{xx}^{-1}\sigma_{xy} = U\beta = 0$, which is equivalent to $\beta \in \mathrm{span}(R')$, i.e., Condition 1.
(c) Similar, using (6).
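As a quick plausibility check of the identity (6), the following numerical sketch (assuming only numpy; the matrix $\Sigma_{xx}$ and the dimensions are arbitrary illustrative choices, not from the paper) compares the two sides for a random positive definite $\Sigma_{xx}$ and matrices $R$, $U$ with $RU' = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 2

# a random positive definite Sigma_xx (illustrative)
A = rng.normal(size=(p, p))
Sxx = A @ A.T + p * np.eye(p)

# rows of R and U taken from one orthonormal basis, so that
# RU' = 0 and (R', U') has full rank p
basis, _ = np.linalg.qr(rng.normal(size=(p, p)))
R, U = basis[:, :k].T, basis[:, k:].T

lhs = R.T @ np.linalg.solve(R @ Sxx @ R.T, R)
Sinv = np.linalg.inv(Sxx)
rhs = Sinv - Sinv @ U.T @ np.linalg.solve(U @ Sinv @ U.T, U @ Sinv)
print(np.allclose(lhs, rhs))  # True, up to rounding error
```

3 The mean square prediction error.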
In the evaluation of predictors we will take as a point of departure the expected squared prediction error $PRE = E(y_0 - \hat y_0)^2$. An important question is under what conditioning this expectation should be evaluated. Our answer is related to the main object here, namely to develop a general theory of prediction which functions well for the whole target population: At least for $(x_0', y_0)$ the expectation should be with respect to the unconditional basic model. A not quite so strong argument can be given for taking the expectation over the distribution of the matrix $X$ of explanatory variables in the calibration set; one argument is that this gives an explicit formula which is easy to discuss. Alternatively, we can assume that the sample size $n$ is so large that $n^{-1}RX'XR'$ can be approximated by $R\Sigma_{xx}R'$ in the last part of the argument below. (This convergence is much better than the convergence of $n^{-1}X'X$ against $\Sigma_{xx}$ if the dimension $k$ is much smaller than $p$.)

Theorem 2. Let $R$ be a fixed $k \times p$ matrix of full rank $k$ and assume $k < n - 1$, and let the estimated regression vector $\hat\gamma$ be given by equation (3). Then
$$PRE = \tilde\sigma^2\left(1 + \frac{k}{n-k-1}\right), \qquad (7)$$
where $\tilde\sigma^2 = \sigma^2 + \sigma_{xy}'(\Sigma_{xx}^{-1} - R'(R\Sigma_{xx}R')^{-1}R)\sigma_{xy}$.

Corollary 1. The expected squared prediction error is $\sigma^2(1 + k/(n-k-1))$ if and only if Condition 1 holds; in all other cases $PRE$ is larger than this.

Proof.
Condition upon $Z = XR'$. In this conditioned model the residual variance is $\tilde\sigma^2 = \sigma_{yy} - \sigma_{xy}'R'(R\Sigma_{xx}R')^{-1}R\sigma_{xy}$. Since $\Sigma_{zz} = R\Sigma_{xx}R'$, and since $\sigma_{zy} = R\sigma_{xy}$, the formula in the Theorem for $\tilde\sigma^2$ follows.

In the same conditioned model the regression vector is $\gamma = (R\Sigma_{xx}R')^{-1}R\sigma_{xy}$ with least squares estimator $\hat\gamma = (RX'XR')^{-1}RX'y$. Expanding the square in $PRE = E(y_0 - \hat\gamma'Rx_0)^2$ and conditioning upon $\hat\gamma$, we find
$$PRE = \sigma_{yy} - 2\hat\gamma'R\sigma_{xy} + \hat\gamma'(R\Sigma_{xx}R')\hat\gamma.$$
Taking the conditional expectation of this, given $Z$, gives
$$PRE = \sigma_{yy} - 2\gamma'R\sigma_{xy} + \mathrm{tr}[(R\Sigma_{xx}R')(\gamma\gamma' + (RX'XR')^{-1}\tilde\sigma^2)] = \tilde\sigma^2(1 + \mathrm{tr}[(R\Sigma_{xx}R')(RX'XR')^{-1}]). \qquad (8)$$
Taking the expectation over $Z$ and using a well-known result from multivariate analysis (Anderson 1984, Lemma 7.7.1) then gives (7).

Thus to obtain good predictions one should first try to achieve a situation where we can be as confident as possible that Condition 1 is satisfied at least in some approximate sense, and at the same time we should try to keep the dimension $k$ as small as possible if this can be done without a substantial increase in $\tilde\sigma^2$.
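To illustrate the trade-off in formula (7), the following sketch (a hypothetical numerical example; the decreasing sequence of residual variances is an arbitrary assumption, not from the paper) evaluates $PRE_k = \tilde\sigma_k^2(1 + k/(n-k-1))$ for increasing dimension $k$; the resulting curve is typically convex with an interior minimum, as discussed in the next section:

```python
import numpy as np

n = 30                      # sample size (illustrative)
sigma2 = 1.0                # sigma^2, the minimal residual variance
# hypothetical values of sigma_tilde_k^2, decreasing towards sigma^2
sigma_tilde2 = sigma2 + 4.0 * 0.4 ** np.arange(11)

for k, s2 in enumerate(sigma_tilde2):
    pre = s2 * (1.0 + k / (n - k - 1))   # formula (7)
    print(f"k={k:2d}  PRE={pre:.3f}")
```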
4 Stepwise model reduction.

The regression vector $\beta$ is an unknown quantity, so in whatever way $R$ is determined, from data or in other ways, it is impossible to guarantee that the ideal Condition 1 holds. This problem is extra acute since it is impossible to estimate $\beta$ accurately when the number of variables is large. It is therefore important to look at the behaviour of the mean square prediction error also when $\beta \notin \mathrm{span}(R')$. In this Section we will find a new reasonable condition to impose upon $R$, a condition which applies to situations where the model dimension is increased stepwise and which is not so sensitive to the value of $\beta$.

Thus we start with a simple model of dimension $k = 1$, and increase the dimension stepwise. At each step we then have $R = R_k$ as a matrix of rank $k$, and at the next step $R_{k+1} = (R_k', d_{k+1})'$ for some vector $d_{k+1}$. Then in general $\tilde\sigma^2 = \tilde\sigma_k^2$ in (7) will decrease or stay constant, while the factor $1 + k/(n-k-1)$ will increase. The typical net effect will be that $PRE = PRE_k$ will be a convex function of $k$ with a certain minimum. The aim in the end is to get a low minimum of $PRE_k$, and to achieve this, it is important to have the initial decrease in $\tilde\sigma_k^2$ at each step as large as possible. For the discussion which follows, it is useful to have in mind a hypothetical plot of $PRE_k$ as a function of $k$ with the ideal curve corresponding to $\tilde\sigma^2 = \sigma^2$ at the bottom of the plot. (See Fig. 1.) In the plot of $PRE_k$ against $k$ there are two fixed points: $PRE_0 = \sigma_{yy}$ and $PRE_p = \sigma^2(1 + p/(n-p-1))$ if $p < n - 1$. (If $p \ge n - 1$ the right hand end of the curve will essentially tend to infinity.) To get a minimum which is as low as possible between these two points, it is essential that the decrease $\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2$ is large for small $k$.

It turns out that the optimal condition of this kind again is sensitive to the value of $\beta$. However, by being satisfied with a decrease which is as large as possible or nearly so under most circumstances, we get a condition which, while still depending upon unknown parameters, involves parameters which are more easy to estimate accurately.

The following result gives the formula for the decrease in the variance and the mathematically optimal value for this.

Theorem 3.
(a) Let $Q_k = \Sigma_{xx} - \Sigma_{xx}R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k\Sigma_{xx}$. The decrease in $\tilde\sigma^2$ when going from $R_k$ to $R_{k+1} = (R_k', d_{k+1})'$ (assuming $d_{k+1} \notin \mathrm{span}(R_k')$) is always nonnegative and is given by
$$\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2 = \frac{(\beta'Q_kd_{k+1})^2}{d_{k+1}'Q_kd_{k+1}}. \qquad (9)$$
(b) The largest possible value of this decrease is $\beta'Q_k\beta$, and this is achieved if and only if $d_{k+1} = \mathrm{const}\cdot\beta$ plus arbitrary values in $\mathrm{span}(R_k')$. It may be convenient to replace $\beta$ here by $Q_k\beta$, given by $Q_k\beta = \sigma_{xy} - \Sigma_{xx}R_k'v_k$ with $v_k = (R_k\Sigma_{xx}R_k')^{-1}R_k\sigma_{xy}$.
(c) With this choice of $d_{k+1}$, $\tilde\sigma_{k+1}^2$ will always achieve its smallest possible value $\sigma^2$, so the expected mean square error at step $k+1$ is then as small as possible.

Proof.
We have $\tilde\sigma_k^2 = \sigma_{yy} - \sigma_{xy}'R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k\sigma_{xy}$ with $R_k = (d_1, \ldots, d_k)'$, and $\tilde\sigma_{k+1}^2$ is given by the same formula with $R_{k+1} = (R_k', d_{k+1})'$. By a straightforward calculation (see Appendix 1) we find (9). Since $Q_kR_k' = 0$, we can replace $d_{k+1}$ by $d$ in (9), and we can without loss of generality let $d$ be in the space orthogonal to $\mathrm{span}(R_k')$. Then by Schwarz' inequality we see that (9) is maximized for $d = \mathrm{const}\cdot Q_k\beta$, and the value of the difference is then $\beta'Q_k\beta$.

For the last assertion, insert $d_{k+1} = \mathrm{const}\cdot\beta$ plus a term in $\mathrm{span}(R_k')$ into the formula for $\tilde\sigma_{k+1}^2$ in the beginning of this proof, use (9) with $d = \beta$, and the same formula for $\tilde\sigma_k^2$, to prove $\tilde\sigma_{k+1}^2 = \sigma^2$.

This Theorem holds also for $k = 0$ with $\tilde\sigma_0^2 = \sigma_{yy}$ and $Q_0 = \Sigma_{xx}$, and then it shows that the smallest possible $\tilde\sigma_1^2$ is just the smallest globally, namely $\tilde\sigma^2 = \sigma^2$, that is, we get full reduction in one step, and this is obtained by taking $\mathrm{span}(R_1') = \mathrm{span}(\beta)$, i.e., a version of Condition 1 again.

The interesting case, however, is when Condition 1 does not hold exactly, and we at a later step have some more or less arbitrary selection matrix $R = R_k$. What conditions should one then impose on $d_{k+1}$ in the next step in order that the reduction in $\tilde\sigma^2$ should be as large as possible? One may expect that it is not very crucial to have $\mathrm{span}(R_{k+1}')$ determined in an exactly optimal way as long as (i) one goes in a direction which leads to a definite reduction in $\tilde\sigma^2$, (ii) the determination of the direction depends on parameters which are easy to estimate relatively accurately.

On this background we will normalize by $d_{k+1}'d_{k+1} = 1$ and concentrate on the numerator of (9). This numerator then gets its largest value when $d_{k+1}$ is chosen along $Q_k\beta$, a vector which has the form $\sigma_{xy} - \Sigma_{xx}w_k$, where $w_k$ belongs to $\mathrm{span}(R_k')$. This leads to introducing a new condition:

Condition P: $\mathrm{span}(R_{k+1}') = \mathrm{span}(R_k', Q_k\beta) = \mathrm{span}(R_k', \sigma_{xy} - \Sigma_{xx}R_k'v_k)$.

For the formulation of the next result we also need the invariance condition which will play a main role in the next section:

Condition 2: $\mathrm{span}(\Sigma_{xx}R_k') = \mathrm{span}(R_k')$.

If we disregard complications that may arise from a possible increase of the denominator in (9), it turns out that a choice of $\mathrm{span}(R_{k+1}')$ which does not satisfy Condition P will give a smaller decrease in the expected squared prediction error.

Theorem 4. Assume that $\beta \notin \mathrm{span}(R_k')$. Suppose that we have found $d_{k+1}$ and hence $R_{k+1} = (R_k', d_{k+1})'$ in such a way that Condition 2 does not hold. Assume further that
$$Q_k - PQ_kP \text{ is nonnegative definite}, \qquad (10)$$
where $P$ is the projection operator upon the space spanned by $d_{k+1}$ and $Q_k\beta$. Then one can always find another vector $d_{k+1}^*$, and hence $R_{k+1}^* = (R_k', d_{k+1}^*)'$, such that
(i) Condition P holds for $R_{k+1}^*$.
(ii) We have
$$\frac{(\beta'Q_kd_{k+1}^*)^2}{d_{k+1}^{*\prime}Q_kd_{k+1}^*} \ge \frac{(\beta'Q_kd_{k+1})^2}{d_{k+1}'Q_kd_{k+1}} \qquad (11)$$
for all $d_{k+1}$. The reduction in $\tilde\sigma^2$ and therefore in $PRE$ is therefore larger than or equal to what was obtained by the unstarred space extension.

Proof. From the formula for $Q_k\beta$ we have
$$\beta'Q_kd = (Q_k\beta)'d = (\sigma_{xy} - \Sigma_{xx}w_k)'d,$$
where $w_k \in \mathrm{span}(R_k')$. Let $L$ be the space spanned by $d_{k+1}$ and $Q_k\beta$, so that $P = P_L$. Without loss of generality we can assume that $d_{k+1}$ is perpendicular to $\mathrm{span}(R_k')$, and then by assumption it follows that $d_{k+1}^*$, the projection of $Q_k\beta$ onto $L$, will be nonzero. With this the numerators will be the same on both sides of (11), while the denominator on the left hand side will be less than or equal to the denominator on the right hand side.

As already stated, Theorem 4 is valid also for $k = 0$, when Condition P implies $\mathrm{span}(R_1') = \mathrm{span}(\sigma_{xy})$. Since the dimension of $\mathrm{span}(R_k')$ is always assumed to be $k$, we therefore get a unique solution at each step from this:

Condition P': $\mathrm{span}(R_k') = \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$.

Remarks.
1. The technical condition (10) is really needed; it is not in general true that $Q - PQP$ is nonnegative definite when $Q$ is positive definite and $P$ is a projection. This is the reason why Condition P leads to the solution $\sigma_{xy}$ in the first step instead of the theoretically optimal choice $\beta$.
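Condition P' can be made concrete in a few lines: given population parameters $\Sigma_{xx}$ and $\sigma_{xy}$, the sketch below (numpy, with arbitrary illustrative parameters; the function name is chosen here for illustration only) builds the basis $(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$ whose span is the space required by Condition P':

```python
import numpy as np

def krylov_basis(Sxx, sxy, k):
    """Columns sxy, Sxx sxy, ..., Sxx^(k-1) sxy (Condition P')."""
    cols, v = [], sxy.copy()
    for _ in range(k):
        cols.append(v)
        v = Sxx @ v
    return np.column_stack(cols)   # a p x k candidate for R'

# illustrative parameters
rng = np.random.default_rng(1)
p, k = 5, 3
A = rng.normal(size=(p, p)); Sxx = A @ A.T + p * np.eye(p)
sxy = rng.normal(size=p)
Rt = krylov_basis(Sxx, sxy, k)
print(np.linalg.matrix_rank(Rt))  # k, unless the space has already become invariant
```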
2. The vectors $\Sigma_{xx}^j\sigma_{xy}$ appearing in Condition P' are relatively easy to estimate accurately, in contrast to $\beta$, which is difficult to estimate for large $p$.

3. If covariances are replaced by estimated covariances in the above formulation of Condition P', we get the partial least squares regression algorithm, as shown by Helland (1988). An early discussion of PLSR can be found in Wold et al. (1984); it is now used routinely throughout the chemometrical literature.

4. The choice made here in Condition P is not unique; another alternative to the hard-to-estimate $d = \mathrm{const}\cdot\beta$, which maximizes (9), is to maximize a criterion where the numerator and the denominator of (9) are balanced against each other by some fixed weight. The sample version of this leads to continuum regression (Stone and Brooks, 1990), which Sundberg (1993) has shown is closely related to ridge regression. Yet another alternative is to let the $d_j$ be eigenvectors of $\Sigma_{xx}$. We will show later that this leads to a population model which is equivalent to the one resulting from Condition P.

5. When the assumption $\beta \notin \mathrm{span}(R_k')$ in Theorem 4 does not hold, we see from Condition P' that $\mathrm{span}(R_{k+1}') = \mathrm{span}(R_k')$. This case will be discussed in the next section.

5 The reduced model with relevant components.

Via Condition P we have now arrived at a sequence of models which are `nearly optimal' in terms of expected squared prediction error at each step, and which is formulated in terms of linear combinations of the original variables with coefficients that are easily estimable from data. This model is conditional with respect to just this rather peculiar set of linear combinations, however, and in this sense bears no relation to the original regression model, which was conditional upon all the $x$-variables. To find a connection, we must replace the conditioning by a restriction of the parameters, which is done by using Lemma 2.
Theorem 5.
(a) In the sense formulated in Lemma 2, the $z$-conditioned regression model satisfying Condition P has the same form as the ordinary $x$-conditioned regression model under the following equivalent sets of parameter restrictions:
(i) $\beta \in \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$.
(ii) $\Sigma_{xx}^k\sigma_{xy} \in \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$.
(iii) There exists a set of $k$ eigenvectors $v_1, \ldots, v_k$ of $\Sigma_{xx}$ such that $\sigma_{xy}$ belongs to the space spanned by these eigenvectors.
(b) If the condition (a)(i), (a)(ii) or (a)(iii) holds for dimension $k$, but not for $k - 1$, there is a unique invariant space under $\Sigma_{xx}$ of dimension $k$ containing $\beta$.
(c) If the conditions in (a) are satisfied, we can let $\mathrm{span}(R_k')$ be the linear space mentioned in (a)(i), equivalently the linear space mentioned in (a)(ii) or equivalently the space spanned by the eigenvectors in (a)(iii).

Proof.
Let $U$ be a $p \times (p-k)$ matrix of full column rank whose columns are orthogonal to $\mathrm{span}(R_k') = \mathrm{span}(\sigma_{xy}, \Sigma_{xx}\sigma_{xy}, \ldots, \Sigma_{xx}^{k-1}\sigma_{xy})$. Then $U'\beta = 0$ is equivalent to $\beta \in \mathrm{span}(R_k')$. Multiplying the resulting linear relation by $\Sigma_{xx}$, we see that this is equivalent to the statement that condition (a)(ii) holds either for this $k$ or some lower values of $k$. The equivalence between (a)(i) and (a)(ii) is easy to see by a similar reasoning, and the equivalence to (a)(iii) was proved in Helland (1990). The rest of the Theorem follows from Helland (1990) and Næs and Helland (1993), where also other equivalent sets of conditions are given.

One way to state these conditions together is that $\mathrm{span}(R_k')$ should be a $k$-dimensional invariant space under $\Sigma_{xx}$ containing $\beta$:

Condition 1: $\beta \in \mathrm{span}(R_k')$.
Condition 2: $\mathrm{span}(\Sigma_{xx}R_k') = \mathrm{span}(R_k')$.

It is important for understanding that $\beta = R_k'\gamma$ now will be a parameter vector in the reduced model, in general not equal to the true regression vector of the original model.

As pointed out by von Rosen (1994), there is a substantial mathematical theory of invariant spaces, and the concept has also been used to some extent in linear model theory (see, e.g., Kruskal, 1968). It is obvious that there always is an invariant space containing $\beta$ of dimension $p$, namely the whole space. Theorem 5(a) expresses equivalent conditions implying that there exist invariant spaces (containing $\beta$) of smaller dimensions $k < p$. In all cases this imposes restrictions upon the parameter space. The restrictions imposed are naturally nested in $k$.

Theorem 5 was arrived at by taking Condition P as a point of departure, a condition that aims at making (9) large for small $k$. Another possible point of departure is to let the columns of $R_k'$ be eigenvectors of $\Sigma_{xx}$: $R_k' = (v_1, \ldots, v_k)$. Inserting this successively into (9), we find $\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2 = (v_{k+1}'\sigma_{xy})^2/\lambda_{k+1}$, where $\lambda_{k+1}$ is the eigenvalue corresponding to $v_{k+1}$. Again this reduction should be taken as large as possible for small $k$, but since $\Sigma_{xx}$ is hard to estimate, it is difficult to say what succession of eigenvectors this leads to. In any case Theorem 5 shows that this leads to exactly the same reduced model as Condition P. Note also that the arguments here are essentially algebraic, and the results (except for the reference to Lemma 2) do not depend on multinormality. The question of nonnormality will be taken up in a broader setting in Section 9.

6 Estimation of the matrix R.
Theorem 5 gives formulae for the desired invariant space of dimension $k$ and conditions for its existence, everything expressed in terms of the unknown model parameters. For the use of the present ideas in practice, this space must be determined in some approximate way, either by using data or by other means. Heuristically one should expect that it is not too important to determine this space very accurately: If Condition 1 is satisfied exactly (with $\beta$ as the true regression parameter), we do not need further conditions; if it only holds approximately, then (exact or approximate) validity of Condition 2 will help minimizing the effect of this. If Condition 2 is only approximately satisfied, the result is only small additional terms in the expected squared prediction error $PRE$. A crucial point is the determination of the dimension $k$. If it is too large, Theorem 2 shows that $PRE$ will increase; if it is too small, the requirement needed to get an approximate invariant space of dimension $k$ may be too severe.

To begin with, we will fix $k$ and look at the determination of $\mathrm{span}(R')$. An obvious solution is to use the space in Theorem 5(a)(ii) with covariances replaced by estimated covariances, i.e., take $\hat R' = (s, Ss, \ldots, S^{k-1}s)$, where $S = X'X$ and $s = X'y$. As shown in Helland (1990), this is equivalent to the well-known, but still somewhat controversial partial least squares method PLSR proposed by chemometricians. Numerous publications, in particular in the two chemometrical journals, have shown that the method functions reasonably well, but critique has been raised by statisticians, for instance in Frank and Friedman (1993). From the present point of view a critical remark against PLSR is the following: The assumption that there exists an invariant space containing $\beta$ of dimension $k$ implies a restriction of the model parameters, as expressed by Theorem 5(a)(ii). The parameter estimates resulting from the above PLSR-formula for $\hat R$ do not satisfy the corresponding restriction (with probability 1) when $k < p$. This may imply a loss in efficiency.

Nevertheless, asymptotic developments in Helland and Almøy (1994) and simulations in Almøy (1996) seem to indicate that PLSR functions well under the restricted model when $k$ is small. Less is known in precise terms about its prediction ability under the original basic model as formulated above.
Principal component regression is another much used and reasonably well functioning method. In our setting it is given by the estimates $\hat R$ connected to the eigenvectors of $S$. There are several ways to determine which eigenvectors of $S$ to include in $\hat R$, the two most common solutions being those with the largest eigenvalues or those with the largest values of a $t$-statistic connected to $\hat\gamma$.

A full discussion of various estimation procedures is beyond the scope of the present paper. We will limit ourself to listing some mathematical results that may be of relevance in this connection. Most of the results are relatively easy to prove, though some proofs, while based on relatively simple ideas, require more details. We will assume a sequence of data sets $(X_n, y_n)$ (size $n$) with $\hat R_n$ being a sequence of estimators of $R$ (all of the same dimension $k$), and we take $\hat\gamma_n = (\hat R_nX_n'X_n\hat R_n')^{-1}\hat R_nX_n'y_n$. For a vector $v$ and a vector space $L$, let $d(v, L)$ be the distance from $v$ to $L$, i.e., the length of $v - P_Lv$, where $P_Lv$ is the projection of $v$ on $L$. Finally, the expected squared prediction error is $PRE_n = E(y_0 - \hat\gamma_n'\hat R_nx_0)^2$.

1. If $\hat R_n$ is determined from an independent training set or if $\hat R_n$ converges in probability to a constant matrix $R$, then the expected squared prediction error $PRE_n$ converges to $\sigma^2$ if and only if
$$d(\beta, \mathrm{span}(\hat R_n')) \to 0. \qquad (12)$$
If this condition does not hold, every subsequence limit of $PRE_n$ will be larger or equal to $\sigma^2$, with strict inequality for at least one subsequence limit.

2. If Condition 1 holds for the limiting matrix $R$, so that $\beta = R'\gamma$ for some $\gamma$, then, as in Helland and Almøy (1994), as $n \to \infty$
$$PRE_n = \sigma^2\left(1 + \frac{k}{n}\right) + \frac{1}{n}E(W'QW) + o\left(\frac{1}{n}\right), \qquad (13)$$
where $Q = \Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx}$ and $W$ is the limit in distribution of $\sqrt{n}(\hat R_n'\hat\gamma_n - \beta)$.

3. As a possible alternative to crossvalidation for determining model dimension, one might hope to draw upon the estimated residual variance
$$\hat{\tilde\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^n(y_i - \hat\gamma'\hat Rx_i)^2, \qquad (14)$$
where $\hat\gamma = (\hat RX'X\hat R')^{-1}\hat RX'y$. It is easy to prove that this estimator is asymptotically unbiased in the sense that $E(\hat{\tilde\sigma}^2) \to \tilde\sigma^2$ as $n \to \infty$, with $\tilde\sigma^2 = \sigma^2 + \sigma_{xy}'(\Sigma_{xx}^{-1} - R'(R\Sigma_{xx}R')^{-1}R)\sigma_{xy}$. A serious difficulty though is that the rate of convergence of $E(\hat{\tilde\sigma}^2)$ towards $\tilde\sigma^2$, although of order $O(1/n)$, will in general depend upon the dimension $k$. To get a criterion which is comparable across dimensions, this dependence
may be corrected for. This issue will not be pursued here. Note that for model choice situations where criteria like Mallow's $C_p$ and the Akaike criterion are used, no estimation of a subspace is needed, so this problem does not occur.

4. One obvious candidate for an estimator $\hat R$ is the maximum likelihood estimator under the restricted model, which is developed in Helland (1992). Unfortunately, this estimator requires heavy computation, and the empirical results in a prediction setting are not too convincing (Almøy, 1996). An alternative candidate is the conditional maximum likelihood estimator, where one conditions upon $X\hat R'$ in the restricted model. Using essentially the same computations as in Helland (1992), we get that this estimator is found by minimizing
$$\left(\hat\sigma_{yy} - s'R'(RSR')^{-1}Rs\right)\cdot\frac{|RR'|}{|RSR'|}. \qquad (15)$$
Note that the product of two factors is to be minimized in (15); the first factor is minimized if and only if $\hat\beta_{LS} = S^{-1}s \in \mathrm{span}(R')$, and the second factor is minimized iff $R'$ is spanned by the eigenvectors of $S$ with the largest eigenvalues. In this way the resulting predictor will have a relation both to ordinary regression and the most common form of principal component regression.

As in Helland (1992) the minimization here can be done stepwise, first in one dimension and then by successively increasing the dimension, if we impose the constraint that the resulting estimated subspaces should be nested within each other. The minimization for each dimension can also be done for instance as in Helland (1992).

A simpler solution is achieved by assuming that $\Sigma_{xx} = \Sigma_0$ is known, when we maximize the likelihood by minimizing
$$\left(\hat\sigma_{yy} - \sum_{j=1}^k\frac{(v_{i_j}'s)^2}{\lambda_{i_j}}\right)\left(\prod_{j=1}^k\lambda_{i_j}\right)^{-1} \qquad (16)$$
over the choice of $k$ eigenvectors $v_{i_1}, \ldots, v_{i_k}$ of $\Sigma_0$, with an obvious notation for eigenvectors and eigenvalues. One should expect this to be a reasonable solution also when $\Sigma_{xx}$ is unknown and eigenvectors from $S$ are used. A further simplification is to approximate the first parenthesis in (16) by a product, so that no minimization over subsets is needed. Unfortunately, simulations indicate that the corresponding predictor seems to behave roughly like the principal component predictor based upon selecting components by a $t$-test (T. Almøy, private communication), a predictor which in most cases is known to be inferior to the ordinary principal component predictor.

5. Other estimators of $R$ can be found by taking into account invariance properties of the model. This is presently being investigated.
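Since crossvalidation is the tool actually recommended above for choosing the dimension, a minimal leave-one-out sketch may be useful; it reuses the hypothetical krylov_predictor from the earlier sketch (both names and the setup are illustrative assumptions):

```python
import numpy as np

def loo_press(X, y, k):
    """Leave-one-out sum of squared prediction errors for dimension k."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta = krylov_predictor(X[mask], y[mask], k)
        press += (y[i] - X[i] @ beta) ** 2
    return press

# choose the dimension with the smallest PRESS value, e.g.
# best_k = min(range(1, 6), key=lambda k: loo_press(X, y, k))
```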
7 The case with several response variables.

A multivariate extension of (1) is
$$Y = XB + E, \qquad (17)$$
where $Y$ is $n \times q$, $B$ is a $p \times q$ parameter matrix, and where the rows of $E$ are independent multivariate normal $(0, \Sigma)$. Again we also have interest in the marginal $x$-distribution, which is assumed multinormal $(0, \Sigma_{xx})$. (For most purposes, multinormality may be dispensed with; see Section 9.) If then (17) is looked upon as representing independent conditional distributions, the joint distribution will also be multinormal, and $B = \Sigma_{xx}^{-1}\Sigma_{xy}$.

Take $(x_0', y_0')$ from the same population, so that $y_0 = B'x_0 + e_0$ with $e_0 \sim N(0, \Sigma)$. For an estimator $\hat B$ one gets a predictor $\hat B'x_0$, and when evaluating this, it is natural to weight the dependent variables by the inverse error covariance matrix. This gives
$$PRE = E((y_0 - \hat B'x_0)'\Sigma^{-1}(y_0 - \hat B'x_0)) = \mathrm{tr}(\Sigma^{-1}\Sigma_{yy}) - 2E(\mathrm{tr}(\Sigma^{-1}\hat B'\Sigma_{xy})) + E(\mathrm{tr}(\Sigma^{-1}\hat B'\Sigma_{xx}\hat B)), \qquad (18)$$
where $\Sigma_{yy} = B'\Sigma_{xx}B + \Sigma$.

Consider now some fixed $k \times p$ matrix $R$ of full rank and the estimator
$$\hat B = R'(RX'XR')^{-1}RX'Y.$$
Using this for prediction gives a similar predictor for each $y$-variable as used in Section 2, but with the same $R$ for each variable. Taking conditional expectation of $\hat B$, given $X$, we get
$$E(\hat B \mid X) = R'(RX'XR')^{-1}RX'XB.$$
In particular, $E(\hat B \mid X) = B$ if

Condition 1: $\mathrm{span}(B) \subseteq \mathrm{span}(R')$.

As in the case with one response variable, this condition also minimizes $PRE$. To look further upon the expected squared prediction error when the condition does not hold, we will use the approximation $RX'XR' \approx nR\Sigma_{xx}R'$, and we will condition upon $XR'$, as in the proof of Theorem 2. Then $E(\hat B \mid XR') = R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx}B$. Using (17) in the formula for $\hat B$, we get
$$\hat B - E(\hat B \mid XR') = R'(RX'XR')^{-1}RX'E. \qquad (19)$$
To insert the expressions above into (18) we need that the rows of $E$ satisfy $V(e_i) = \Sigma$, and we get as the analogue of the residual variance in Theorem 2
$$\tilde\Sigma = \Sigma + B'(\Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx})B.$$
Then the reduced-model residuals are independent with covariance matrix $\tilde\Sigma$, and writing (18) out in terms of the $e_i$'s, taking first conditional and then unconditional expectation in equation (18), gives
$$PRE = \left(q + \mathrm{tr}\left(\Sigma^{-1}B'(\Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx})B\right)\right)\left(1 + \frac{k}{n-k-1}\right). \qquad (20)$$

This is the multivariate generalization of formula (7). Again it is natural to determine the space $\mathrm{span}(R_k')$ successively: $R_{k+1} = (R_k', d_{k+1})'$. Very much of the previous development is exactly as before. Let $Q_k = \Sigma_{xx} - \Sigma_{xx}R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k\Sigma_{xx}$ as before, and let $PRE_k$ be the first factor on the right hand side of formula (20) when $R = R_k$ is inserted. The generalization of formula (9) is then
$$PRE_k - PRE_{k+1} = \frac{d_{k+1}'Q_kB\Sigma^{-1}B'Q_kd_{k+1}}{d_{k+1}'Q_kd_{k+1}}. \qquad (21)$$
(See Appendix 1.) For the formulation of the following Theorem we refer to the population PLS2 algorithm defined and discussed briefly in Appendix 2.

Theorem 6. Assume $\Sigma = I$.
(a) The population PLS2 algorithm maximizes at each step the numerator of (21) if $\Sigma = I$. This algorithm terminates at $k = m$ if and only if Condition 1 then holds.
(b) If this algorithm terminates at step $m$, then also

Condition 2: $\mathrm{span}(\Sigma_{xx}R_m') = \mathrm{span}(R_m')$

holds, and we have that $R_m' = (w_1, \ldots, w_m)$ spans a minimal space satisfying both Condition 1 and Condition 2.
(c) Alternatively, this space can be characterized as the smallest space spanned by eigenvectors of $\Sigma_{xx}$ which contains $\mathrm{span}(B)$.
(d) The parameter restrictions formulated in (b) or (c) above constitute the restrictions generalizing those of Theorem 5 to the multivariate case.
Proof.
(a) In the notation of Appendix 2, the numerator of (21) is maximized if and only if $d_{k+1} = w_{k+1}$ is an eigenvector of $Q_kBB'Q_k$ with maximal eigenvalue, and this is just the way the algorithm is defined in the appendix. The algorithm terminates at step $m$ if and only if $Q_mB = 0$, which means $\mathrm{span}(B) \subseteq \mathrm{span}(R_m')$, i.e., Condition 1.
(b) Since $Q_kB$ determines the next vector $w_{k+1}$ in $R_{k+1}' = (R_k', w_{k+1})$, it is clear that
$$\mathrm{span}(R_{k+1}') \subseteq \mathrm{span}(B, \Sigma_{xx}R_k').$$
When the algorithm stops at step $m$, then $\mathrm{span}(R_{m+1}')$ may be replaced by $\mathrm{span}(R_m')$ here. Since we know that $\mathrm{span}(B) \subseteq \mathrm{span}(R_m')$, it follows that $\mathrm{span}(\Sigma_{xx}R_m') \subseteq \mathrm{span}(R_m')$. Since the matrices on both sides have the same rank, equality follows. From the proof in (a) it follows that the space $\mathrm{span}(R_m')$ is minimal with the property of containing $\mathrm{span}(B)$ among all nontrivial sequences of spaces satisfying Condition P at each step.
(c) A space of dimension $k$ satisfies Condition 2 if and only if it is spanned by $k$ eigenvectors of $\Sigma_{xx}$.
(d) Lemma 2 can immediately be generalized to the multivariate case. The previous restriction $\beta = R'\gamma$ now reads $B = R'\Gamma$, where $\mathrm{span}(R')$ is spanned by eigenvectors of $\Sigma_{xx}$. This gives clearly the same restriction as in (c).

So far for the population algorithm corresponding to a restricted population model. The natural sample estimate of $R$ arising from this - again see Appendix 2 - then gives the PLS2-predictor from chemometry. There are some variants of these estimators/predictors (see Holcomb et al. (1997) and references there), and at least some of them seem to perform poorly compared to other predictors proposed by the statistical community (Breiman and Friedman, 1996). There is probably scope both for improvements and for better comparisons, and one possible point of departure may be the present reduced model.

Note, however, that the above formulation already points at one weakness of the PLS2 algorithm. To get the relation sketched between the algorithm in its population form and the reduction in mean square error, we had to assume that the residual covariance matrix $\Sigma$ was the identity. The case where the residuals between different dependent variables are correlated, is one case where we expect to get some gain from a joint prediction. It seems likely that a modified algorithm, where first a preliminary estimate $\hat\Sigma$ is found, and then a maximization of $w'X'Y\hat\Sigma^{-1}Y'Xw$ is done in each
step instead of that of $w'X'YY'Xw$, will have the possibility of leading to better predictions.

Both in the chemometric literature and in the statistical literature (see the references above) there has been some discussion on when it pays to do single variable prediction and when it pays to include several $y$-variables simultaneously in the prediction. The present model formulation may throw some light upon this question. The multivariate prediction means parsimony in the sense that the same $R$ is used for all variables. On the other hand, if the dimension of the invariant space has to be increased much to ensure that all columns of $B$ belong to this space, the net gain may be negative.

8 Classification.

The simplest classification method is linear discriminant analysis, where we assume $p$ classification variables $x$ that are observed in each of 2 classes, $x \sim N(\mu_1, \Sigma)$ in the first class and $x \sim N(\mu_2, \Sigma)$ in the second class. Again the model parameters are estimated by a training set, say $n$ observations from each of the two classes, and again the estimation is difficult if the number of variables $p$ is large compared to $n$. In an interesting recent article Friedman (1997) argues that this problem is less in classification than in regression, but the problem may nevertheless be serious in many applications.

So assume that we reduce the number of variables to $k$ by letting $R$ be some fixed $k \times p$ matrix and taking $z = Rx$ and $\nu_i = R\mu_i$. Then of course $z \sim N(\nu_i, R\Sigma R')$ ($i = 1, 2$), and standard linear discriminant analysis can be done with the new variables $z$.

Concentrate on the simple symmetric case with equal prior probability for the two classes and equal cost. Then (see for instance Ripley, 1996) the asymptotic probability of misclassification for the classification based upon the $z$'s will be
$$P_R = \Phi\left(-\frac{1}{2}\Delta_R\right), \qquad (22)$$
where $\Phi$ is the cumulative standard normal distribution function, and where $\Delta_R$ is positive with $\Delta_R^2 = (\mu_1 - \mu_2)'R'(R\Sigma R')^{-1}R(\mu_1 - \mu_2)$. Again we are relatively confident with the formulae resulting from asymptotic calculation when the dimension of the variables involved is only $k < p$.

The probability of misclassification is small iff $\Delta_R$ is not too small, and this is the objective for our conditions for the (theoretically) best possible choice of $R$.

Condition 1: $\Sigma^{-1}(\mu_1 - \mu_2) \in \mathrm{span}(R')$.
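For a numerical feeling of (22), the sketch below (numpy with arbitrary illustrative parameters; $\Phi$ is evaluated via the error function to stay within the standard library) compares the misclassification probability of a reduced rule $z = Rx$ with the one based on all of $x$:

```python
import numpy as np
from math import erf, sqrt

def mahalanobis2(mu_diff, Sigma):
    return mu_diff @ np.linalg.solve(Sigma, mu_diff)

def misclass_prob(delta2):
    """Formula (22): Phi(-Delta/2), with Phi via the error function."""
    return 0.5 * (1.0 + erf(-sqrt(delta2) / (2.0 * sqrt(2.0))))

rng = np.random.default_rng(3)
p, k = 6, 2
A = rng.normal(size=(p, p)); Sigma = A @ A.T + p * np.eye(p)
mu_diff = rng.normal(size=p)
R = rng.normal(size=(k, p))

delta2_full = mahalanobis2(mu_diff, Sigma)
delta2_R = mahalanobis2(R @ mu_diff, R @ Sigma @ R.T)
# reduction can never decrease the error probability (Theorem 7 below)
print(misclass_prob(delta2_R) >= misclass_prob(delta2_full))  # True
```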
Theorem 7. We have $\Delta_R \le \Delta$, where $\Delta^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$, with equality if and only if Condition 1 holds. Hence this condition minimizes the probability of misclassification for given $k$, and this minimum is (asymptotically to the lowest order) the same for all values of $k$.

Proof.
Multiplying the identity (6) (with $\Sigma_{xx}$ replaced by $\Sigma$) from the left by $(\mu_1 - \mu_2)'$ and from the right by $(\mu_1 - \mu_2)$, we see that the minimum is reached for $U\Sigma^{-1}(\mu_1 - \mu_2) = 0$, which is equivalent to Condition 1.

A similar discussion as in the regression case involving stepwise model selection can be made. Instead we take a simpler approach showing a different property of the reduced model. We then introduce at once

Condition 2: $\mathrm{span}(\Sigma R') = \mathrm{span}(R')$.

Theorem 8. Assume a general $R$ is such that Condition 1 may or may not hold, and put $\mu_1 - \mu_2 = R'\alpha + v$, where $Rv = 0$. Then either Condition 2 holds and
$$\Delta_R^2 = \inf_v \Delta^2,$$
hence $P_R = \sup_v P$ for the probability of misclassification, or Condition 2 does not hold and $\Delta_R^2 > \inf_v \Delta^2$.

Proof.
When $\mu_1 - \mu_2 = R'\alpha + v$ with $Rv = 0$, we find
$$\Delta^2 = \alpha'R\Sigma^{-1}R'\alpha + v'\Sigma^{-1}v + 2\alpha'R\Sigma^{-1}v.$$
The last two terms on the right hand side here depend upon $v$. If Condition 2 holds, they vanish when $v = 0$, and the first term then equals $\Delta_R^2$. Assume that Condition 2 does not hold. Then $v$ can be chosen such that the cross term is nonzero while the quadratic term in $v$ - call it $c$ - is positive. For such a $v$ replace $v$ by $tv$, where $t$ is some scalar. Minimization over $t$ leads to that the sum of these terms is negative for the minimizing $t$.

We hope to discuss estimators of $R$ and corresponding classification procedures elsewhere.

Remark.
A similar discussion of classification error using a stepwise increase in dimension can be done in exactly the same way as in the regression case. The formula (22) then plays the role that formula (7) played there, for a matrix $R'$
with a small number of columns. When $p$ is large, sample variation will cause the classification error to be larger than what is given by (22), again parallel to the prediction error in the regression case. It is therefore again important to keep the number of linear combinations of variables on which classifications should be based, low. Again a restricted model, stating that only a small number of eigenvectors of $\Sigma$ contribute in the classification, seems to be useful.

On the other hand, the property used in Theorem 8 above, that Condition 2 in general implies an equivalence between the orthogonalities $Rv = 0$ and $R\Sigma^{-1}v = 0$, has useful consequences in the regression case, also. (It can be used to derive alternative expressions for the expected squared prediction error.) Furthermore, a minimax property of prediction error - analogous to that for classification error given in Theorem 8 - can be formulated for the regression case.

9 General theory and discussion.
For definiteness we will return to the situation of predicting a scalar variable $y$ from a vector $x$ of $p$ variables, but generalizations to cover the situation in Section 8 should be fairly straightforward to make. We will also assume quadratic loss (and will assume that $y$ has finite variance), so the general task is to minimize
$$PRE = E(y_0 - \hat y_0)^2, \qquad (23)$$
where $\hat y_0$ is a function of $x_0$ and of a training set $(x_i', y_i)$; $i = 1, \ldots, n$, of independent observations, such that all $(x_i', y_i)$; $i = 0, 1, \ldots, n$, have the same, more or less unknown distribution.

The ordinary linear or nonlinear regression procedure is to go via the conditional expectation $E(y|x)$ and use as $\hat y_0$ an estimate $\hat E(y_0|x_0)$ of this from the training set. By adding and subtracting $E(y_0|x_0)$ to $\hat y_0$ in the resulting equation (23), we find that it is equal to
$$PRE = E[\mathrm{Var}(y|x)] + E[E(y_0|x_0) - \hat E(y_0|x_0)]^2. \qquad (24)$$
The first term here is unrelated to the training set, so it is the second term that one should try to reduce in order to give good predictions. To get a feeling for this last term: in the linear regression case it is $E[(\hat\beta - \beta)'x_0x_0'(\hat\beta - \beta)]$. The reduction of this term will be a problem if the number of parameters is large, which it generally will be if the number of variables in $x$ is large. So we may choose to reduce the model by conditioning on a smaller vector variable $z = z(x)$ (say of dimension $k$) instead of on the whole vector $x$. Note that in this general setting the analogue of choosing new variables to
condition on is in general something essentially different from equating some specific parameters in the model to zero.

An important question is how to find a sensible new vector $z$ to condition upon. Theoretically, the first requirement to be satisfied is that the change from $x$ to $z$ should not increase substantially the first term in (24). It is easy to find a simple theoretical condition for no such increase to take place: Look at the version of the well-known identity $\mathrm{Var}(y) = E[\mathrm{Var}(y|x)] + \mathrm{Var}[E(y|x)]$, conditioned upon $z$, and take the expectation of this to get
$$E[\mathrm{Var}(y|z)] = E[\mathrm{Var}(y|x)] + E[\mathrm{Var}(E(y|x)|z)], \qquad (25)$$
so by conditioning upon $z$ instead of upon $x$ in (24) we find
$$PRE = E[\mathrm{Var}(y|z)] + E[E(y_0|z_0) - \hat E(y_0|z_0)]^2 = E[\mathrm{Var}(y|x)] + E[\mathrm{Var}(E(y|x)|z)] + E[E(y_0|z_0) - \hat E(y_0|z_0)]^2. \qquad (26)$$
Hence the term independent of estimation in (24) does not increase when going from $x$ to $z$ if and only if $\mathrm{Var}(E(y|x)|z) = 0$, i.e., if $E(y|x)$ is a function of $z$ (almost surely), which then necessarily must be $E(y|z)$. This leads to

Condition 1: $E(y|x) = E(y|z)$ (a.s.).

Theorem 9.
(a) The first term in (24) is constant in going from $x$ to $z$ if Condition 1 holds. In all other cases this term will increase.
(b) For the case of a linear function $z = Rx$ and of multinormal observations (or more generally, assuming that all conditional expectations are linear; see below), the Condition 1 here is equivalent to the previous Condition 1.

Proof. (a) is already proved. For (b), use Theorem 1, either (b) or (c) there; the generalization to other cases with linear conditional expectation is straightforward, using Theorem 10 below.

To find further conditions we will limit ourselves to linear functions $z = Rx$. Furthermore, we will assume that the conditional expectation of $y$, given $x$, is linear in $x$; i.e., $E(y|x) = \beta'x$. Then, from (25), the increase in conditional variance of $y$ when going from conditioning on $x$ to conditioning on $z$, and hence the increase in the non-estimative part of $PRE$, will be
$$E[\mathrm{Var}(y|z)] - E[\mathrm{Var}(y|x)] = E[\mathrm{Var}(\beta'x|z)]. \qquad (27)$$
In the multinormal case this expression equals
$$\beta'(\Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx})\beta.$$
This was the basic result needed for the mean square error calculations of Section 2, and it is important, though perhaps surprising to some, that no distributional assumptions at all (except for finite variance) are needed for a closely related result to be valid.

Theorem 10. Assume that $x$ has a finite covariance matrix $\Sigma_{xx}$. Then we have the following:
(a) If the conditional expectation of $x$, given $Rx$, is linear: $E(x|Rx) = ARx$, then
$$E[\mathrm{Var}(\beta'x|Rx)] = \beta'Q\beta,$$
where $Q = \Sigma_{xx} - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R\Sigma_{xx}$.
(b) In general
$$E[\mathrm{Var}(\beta'x|Rx)] = \beta'\tilde Q\beta,$$
where $\tilde Q = Q - QMQ$ for a nonnegative definite matrix $M$ which is such that $\tilde Q$ is nonnegative definite. So the difference in variance, hence the approximate difference in mean square error when we disregard estimation error, is given by this expression.
(c) If the conditional expectation of $x$, given $z = Rx$, is linear, then we have $E(x|z) = \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}z$. Furthermore, if $E(y|x) = \beta'x$, we have $E(y|z) = \gamma'z$ with $\gamma = (R\Sigma_{xx}R')^{-1}R\sigma_{xy}$, and $\mathrm{Var}(y|z) = \tilde\sigma^2 = \sigma^2 + \beta'Q\beta$. The model restriction $\beta = R'\gamma$, where $(R', U')$ has full rank, leads to the same conditional expectation $E(y|z)$.

Proof.
In case (a) we find, by multiplying $E(x|Rx) = ARx$ from the right by $x'R'$ and then taking expectation, that $A = \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}$. In general, given $R$, choose $U$ such that $RU' = 0$ and such that $(R', U')$ has full rank $p$. Then from equation (6) we have $Q\Sigma_{xx}^{-1} = I - \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}R$, and by the same equation
$$x = \Sigma_{xx}R'(R\Sigma_{xx}R')^{-1}Rx + Q\Sigma_{xx}^{-1}x, \qquad (28)$$
so
$$E[\mathrm{Var}(\beta'x|Rx)] = E[\mathrm{Var}(\beta'Q\Sigma_{xx}^{-1}x|Rx)] = \beta'Q\Sigma_{xx}^{-1}\left(\Sigma_{xx} - E[E(x|Rx)E(x|Rx)']\right)\Sigma_{xx}^{-1}Q\beta = \beta'(Q - QMQ)\beta, \qquad (29)$$
where $M = \Sigma_{xx}^{-1}E[E(x|Rx)E(x|Rx)']\Sigma_{xx}^{-1} - R'(R\Sigma_{xx}R')^{-1}R$, and equation (6) has been used again. The matrix $M$ is nonnegative definite since $E(x|Rx)$ predicts $x$ at least as well as the linear regression upon $Rx$. Nonnegative definiteness of $\tilde Q$ follows since the expected variance calculated above must be nonnegative.

The proof of (c) is found by first noting that $E(y|x) = \beta'x$, and then taking the conditional expectation of this given $z = Rx$. The formula for $\mathrm{Var}(y|z)$ follows from the same result used to prove (25). The formula for $E(y|z)$ under the restriction $\beta = R'\gamma$ follows from equation (6).
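Theorem 10(a) is easy to check by simulation in the multinormal case, where the conditional expectation of $x$ given $Rx$ is automatically linear. A small Monte Carlo sketch (numpy; all parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, N = 5, 2, 200_000
A = rng.normal(size=(p, p)); Sxx = A @ A.T + p * np.eye(p)
beta = rng.normal(size=p)
R = rng.normal(size=(k, p))

# theoretical value beta' Q beta, with Q from Theorem 10(a)
Q = Sxx - Sxx @ R.T @ np.linalg.solve(R @ Sxx @ R.T, R @ Sxx)
theory = beta @ Q @ beta

# Monte Carlo: residual variance of beta'x after regressing on z = Rx
x = rng.multivariate_normal(np.zeros(p), Sxx, size=N)
z, t = x @ R.T, x @ beta
coef, *_ = np.linalg.lstsq(z, t, rcond=None)
mc = np.var(t - z @ coef)
print(theory, mc)   # should agree to Monte Carlo accuracy
```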
The consequence of this is that all essential results of the Sections 2-5 are valid if we assume that the conditional expectation of $x$, given $Rx$, is linear. The class of distributions for which this is valid includes the elliptical (or elliptically symmetric) distributions; see Devlin et al. (1976) and references there, in particular Kelker (1970). In addition, Theorems 3 and 4 hold with some - admittedly nontrivial - modifications essentially without any distributional assumptions at all on the variables, assuming only finite moments and $E(y|x) = \beta'x$. The detailed proof of this will be omitted, but the main idea is that these proofs are directly based on the formula for the expected squared prediction error, which by Theorem 10(b) is asymptotically valid in general if we only replace the matrix $Q$ by $\tilde Q$, and on the fact that this matrix also has a vanishing product with $R'$ and that it is small when $Q$ is small. As a consequence, the two basic conditions, Condition 1 and the invariance condition on $\mathrm{span}(R')$ in the reduced model, are of some relevance for any linear model with random $x$'s, whatever the distribution of these $x$'s, and whatever the conditional distribution of $y$, given these $x$'s. The discussion on estimation in Section 6 is also quite generally valid, but the maximum likelihood estimates discussed there are of course distribution dependent.

One way to put this, is that the chemometricians in some sense seem to have been on the right track when they have used the term `soft models' in connection to their PLS regression. The general feeling among statisticians still is that chemometricians are imprecise in some of their terminology, but on the other hand it seems to be easier to develop new and fruitful ideas on an intuitive level than if full rigor is demanded at each step. Recent issues of the chemometrical journals contain ideas that go far beyond what has been discussed in this paper. (This has also recently reached statistical journals; see for instance Durand and Sabatier (1997) and references there.)

The way of thinking of statistical models that is promoted in this paper, points further than specific chemometric methods, however. We make explicit the fact that in the case with many unknown parameters it is useful to have at least two different statistical models under consideration at the same time: The `correct' model that adequately describes reality in all details and the `simplified', reduced model which
can actually be used for estimation and prediction. In the present paper we introduce for the linear model case specific theoretical conditions to assure that the reduced model functions as well as possible for prediction purposes. In this way we get a nested sequence of reduced models, and the order can be found by cross-validation or in other ways.

The possible danger of overfitting of models that may result from this way of thinking, needs to be further analyzed. If the order of the model is found by cross-validation, one will probably be reasonably safe, but in the multinormal case there also seems to be some possibility of using estimated prediction error found from the ordinary regression mean square, a possibility that should be investigated further.

There may be some intuitive arguments to the effect that because irrelevant information is thrown away, the estimates from the reduced model may have certain robustness properties. Exact results in this direction may be difficult, but will be welcome.

Practical algorithms for computing estimates are not touched upon at all in this paper. It is well known from partial least squares regression that the best formulas or algorithms for theoretical understanding are usually not the best for numerical computations.

As a final point, once the ideas behind this paper are accepted, there seems to be potential for extending in several directions. Logistic regression and other loglinear models (with or without link functions) with many explanatory variables is an immediate possibility, likewise multivariable models with many parameters. An interesting challenge is the situation where the parameters are not in linear form, but where one nevertheless perhaps may give a similar theory to the present theory if the parameters are related by some group symmetry.

Appendix 1. Proof of (9) and related formulae.

Let $S = \Sigma_{xx}^{1/2}R_k'$, $c = \Sigma_{xx}^{1/2}d_{k+1}$ and $P = S(S'S)^{-1}S'$. Then $(I - P)c$ - if nonzero - is orthogonal to $\mathrm{span}(S)$, and therefore
$$(S, c)\left((S, c)'(S, c)\right)^{-1}(S, c)' = P + \frac{(I - P)cc'(I - P)}{c'(I - P)c}.$$
Multiplying this equation from the left and from the right by $\Sigma_{xx}^{-1/2}$ then gives
$$R_{k+1}'(R_{k+1}\Sigma_{xx}R_{k+1}')^{-1}R_{k+1} = R_k'(R_k\Sigma_{xx}R_k')^{-1}R_k + \frac{\Sigma_{xx}^{-1}Q_kd_{k+1}d_{k+1}'Q_k\Sigma_{xx}^{-1}}{d_{k+1}'Q_kd_{k+1}},$$
and multiplying this from the left by $\sigma_{xy}'$ and from the right by $\sigma_{xy}$ gives (9). The multivariate formula (21) follows in the same way.
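Formula (9) and the projection computation above can also be verified numerically. The sketch below (numpy; all parameters are arbitrary illustrative assumptions) compares the direct decrease $\tilde\sigma_k^2 - \tilde\sigma_{k+1}^2$ with the right hand side of (9):

```python
import numpy as np

def sigma_tilde2(Rt, Sxx, sxy, syy):
    # residual variance s_yy - s_xy' R'(R Sxx R')^{-1} R s_xy
    Rs = Rt.T @ sxy
    return syy - Rs @ np.linalg.solve(Rt.T @ Sxx @ Rt, Rs)

rng = np.random.default_rng(5)
p, k = 6, 2
A = rng.normal(size=(p, p)); Sxx = A @ A.T + p * np.eye(p)
sxy = rng.normal(size=p); syy = 10.0
beta = np.linalg.solve(Sxx, sxy)

Rt = rng.normal(size=(p, k))   # columns of R_k'
d = rng.normal(size=p)         # the new direction d_{k+1}
Q = Sxx - Sxx @ Rt @ np.linalg.solve(Rt.T @ Sxx @ Rt, Rt.T @ Sxx)

direct = (sigma_tilde2(Rt, Sxx, sxy, syy)
          - sigma_tilde2(np.column_stack([Rt, d]), Sxx, sxy, syy))
formula = (beta @ Q @ d) ** 2 / (d @ Q @ d)   # right hand side of (9)
print(np.isclose(direct, formula))            # True
```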