Distribution based truncation for variable selection in subspace methods for multivariate regression

Kristian Hovde Liland (1,⋆), Martin Høy (2), Harald Martens (2,3), Solve Sæbø (1)

8th January 2013

1) Norwegian University of Life Sciences, Department of Chemistry, Biotechnology and Food Science, P.O. Box 5003, N-1432 Ås, Norway

2) Nofima, Norwegian Institute of Food, Fisheries and Aquaculture Research, Osloveien 1, N-1430 Ås, Norway

3) Norwegian University of Life Sciences, Department of Mathematical Sciences and Technology, P.O. Box 5003, N-1432 Ås, Norway

(⋆) Corresponding author: kristian.liland@umb.no, tel: +47 64965830

Abstract

Analysis of data containing a vast number of features, but only a limited number of informative ones, requires methods that can separate true signal from noise variables. One class of methods attempting this is the sparse partial least squares methods for regression (sparse PLS). This paper aims at improving the theoretical foundation, speed and robustness of such methods. A general justification of truncation of PLS loading weights is achieved through distribution theory and the central limit theorem. We also introduce a quick plug-in based truncation procedure based on a novel application of theory intended for analysis of variance for experiments without replicates. The result is a versatile and intuitive method that performs component-wise variable selection very efficiently and in a less ad hoc manner than existing methods. Prediction performance is on par with existing methods, while robustness is ensured through a better theoretical foundation.

1 Introduction

One of the major challenges in recent and coming data analysis is the ever increasing number of variables recorded for each sample. The data matrices become wider and wider. Because of instrumental noise, biological noise and other uncontrollable variations in the recorded signal, variables that should have no signal for a given sample, or be equal across samples, almost never show a zero signal in the final centred data set. And differences between two signals that should be zero are seldom zero in practice. Since predictive multivariate methods like partial least squares regression (PLSR) [1] in their basic forms take into account all variables, the sheer number of non-zero noise variables will often overshadow the true signal.

Various forms of variable selection approaches have been proposed in the context of regression. Variable selection can also play a role in finding important variables in explorative studies, with the purpose of stabilizing the regression modelling and improving its predictive ability and interpretability. Sometimes the aim is to find which variables influence a certain process causally, or at least convey the most interesting information, e.g. metabolites, genes, wavenumbers, or molecular weights. Depending on the aim of the study different selection strategies may be favourable and the focus on how many variables to retain may be different.

Based on ideas of component-wise variable selection, sparseness and normally distributed noise we propose to use distribution based truncation to identify all unimportant model parameters that are (or appear to be) non-zero due to random errors, and force these towards zero. In the present PLSR context, this means to zero out small, apparently random elements in all the loading weight vectors. The intention is thereby to drastically reduce the problem of non-zero noise contributions. In the following sections we will look at some related methods intended for the same purpose and motivate a simple, intuitive and flexible strategy for truncation of non-informative variables. Applications to real and simulated data and comparison with other methods will also be presented.

A basic result in statistics is the central limit theorem (CLT). The CLT was first presented by Abraham de Moivre in 1733 and has been formalised and interpreted under varying conditions and degrees of strictness ever since. A simple interpretation is that as the number of observations sampled from a random process increases, the distribution of the mean (and the sum) will approach a normal distribution. More interesting in this context is that many types of random noise are seen as approximately normally distributed, and linear combinations of such will tend even more towards the normal distribution. In this paper we propose to use the CLT to distinguish between variables with expected non-zero loading weights and the noisy variables whose loading weights have zero expectation. We refer to the new modelling principle as Truncation-PLS in the following, and the resulting methods Truncation-PLSR and Truncation-PLS-DA are described in detail in Section 3.

Many approaches have been invented that attempt to find the interesting information in a cloud of variables, the needle in the haystack. One of the oldest and most varied classes of methods for this purpose is variable selection. A large proportion of these methods work univariately, evaluating single variables for inclusion or exclusion. When the number of variables is counted in tens or hundreds of thousands, this strategy will be prone to spurious correlations, hampered by multiple testing problems and vulnerable to low sensitivity or high false discovery rate. Moreover, it can lead to serious misinterpretation: Assume e.g. that the regressor set contains both an "upstream", causally important variable observed with much noise and a "downstream" consequential but unimportant variable observed with little noise, and that the two are strongly intercorrelated. Traditional stepwise variable selection methods will then eliminate the causally important variable to reduce the collinearity.

Subspace-based regression methods such as PCR and PLSR attain an implicit variable selection, not by eliminating individual variables, but by eliminating subspace dimensions, i.e. linear combinations of variables. However, if the number of noisy regressor or regressand variables is very high compared to the number of observations, this basic bilinear approach is not good enough: The combined covariation contributions of the noisy variables prevent the bilinear regression methods from finding a useful initial subspace. Therefore, various variable selection strategies have been developed also for PLSR to improve prediction and to simplify interpretation, but without eliminating interesting variables just to reduce collinearity.

One approach is to reduce small parameters towards zero by a general shrinking/expansion of the PLS loading weight elements according to a chosen exponent (Powered PLS [2, 3]). Another approach is to induce sparseness in the data by forcing contributions close to zero to be true zeros. Examples of such methods are the least absolute shrinkage and selection operator (LASSO) [4] and its spin-off the elastic net [5], both inducing constraints on the $L_1$ norm of the regression vector $\beta$. The latter method also applies ridging by penalizing the $L_2$ norm of $\beta$. For PLSR sparseness was introduced by Martens & Næs (1989, p. 160), who suggested the use of rough statistical significance testing of the elements in each individual loading weight vector, followed by a re-orthogonalization. A similar approach was implemented in terms of the soft-threshold-PLS [6] (ST-PLS) and sparse PLS [7] (sPLS). These methods apply a shrinkage towards zero to the PLS loading weights so that many contributions become zero. The amount of shrinkage can be chosen to remove a certain proportion of the variables or it can be chosen by some other criterion. In addition to giving a multivariate approach to variable selection, these methods can also select different variables in each PLS component that is produced. As these two methods, ST-PLS and sPLS, are very

of this method fits models much faster than the sPLS version. We propose to combine the sparseness ideas with the distributional quality of noise in data, e.g. in PLS loading weights, to sort between noise and signal and thereby weight down or completely truncate what is classified as noise.

In addition to several of the mentioned sparse methods we will include variable selection by the Variable Influence on Projection [8] (VIP) and Selectivity Ratio plot [9] (SR) methods for comparison. These PLS based methods use different criteria for assessing the importance of variables in regression and classification. We will not go into details about how variables are selected by these methods in this paper, but include them as reference standards.

The distribution based truncation approach to variable selection adds to an already long list of methods for variable selection. As described in this article the selection of variables in this approach is motivated from a well established principle in classical statistics. Furthermore, there is only one tuning parameter which needs to be set for variable selection, which makes the method simple and easy to implement. The statistical foundation and the non-complexity of the new method make it appealing and easy to understand. However, the predictive performance of prediction methods is typically very dependent on the properties of the data, and there is no uniformly best method for prediction and variable selection. Therefore, it is important to expand the statistical toolbox, but at the same time it is important to build an understanding of when the various methods work best. In order to do this we compare the predictive performance of the various methods and attempt to interpret the results in light of the multivariate properties of the data.

3 Methods

Distribution assumptions

In the following the Truncation-PLS is based on loading weights from PLS regression, though the concept is applicable also to regular regression coefficients. Further, the approach could similarly be applied to select Y variables, or to PLS scores in order to eliminate non-informative samples, but these aspects are not covered in this paper. When recording output from some kind of spectroscopic/-metric instrument we expect that the absence of a signal results in white (non-informative) noise, while the presence of a signal will produce a systematic deviation from randomness. The same applies to other types of data, e.g. microarrays, but the distribution of the noise varies. When creating vectors of loading weights in PLS, we compute the first left singular vector of the matrix product $X'_{a-1} Y_{a-1}$ (for component number $a$). If a given X-variable is uncorrelated with the response variable(s) (for possibly deflated matrices) the loading weight for this variable will be a sum over $n$ equally distributed random variables, and by the CLT it will therefore represent random normal noise, at least approximately. For X-variables correlated to the response variable the theoretical distributions of each loading weight will also be asymptotically normally distributed, but with non-zero mean. However, as the correlation increases the distributions will be increasingly skewed. As the true correlation between an X-variable and the response approaches 1, the limiting distribution of the corresponding loading weight will be a chi-square distribution with non-zero expectation. In Figure 1 (left) the theoretical distributions of three non-normalized loading weights (sample size $n = 20$) are illustrated; a centred normal distribution for an uncorrelated X-variable, and two skewed distributions for two X-variables with correlation -0.6 and 0.6 with the response, respectively. In this figure the distributions have

noise distribution and 30% are correlated with the response with either the -0.6 or the 0.6 correlation. In a real data application the loading weights of the informative X-variables will follow different skewed distributions. The sample distribution of the weights will therefore represent a mix of several theoretical distributions and not just three as used in Figure 1 (left). An example of a sample distribution of loading weights is given in Figure 1 (right). The main objective in Truncation-PLS is to find lower and upper cut-offs between which it is assumed that the majority of the loading weights represent noise variables. Hence, the problem boils down to finding an estimate of the central normal distribution of loading weights (or at least selected percentiles) in order to distinguish this from the skewed distributions.
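The distributional argument for uninformative variables can be illustrated with a small simulation. The sketch below is not part of the original study: it assumes uniformly distributed (clearly non-normal) noise variables, a hypothetical centred linear-trend response, a single response and $n = 20$ samples, and it uses non-normalized loading weights $w_j = \sum_i x_{ij} y_i$. The function name and the seed are illustrative choices.

```python
import random
from statistics import mean, pstdev

def noise_loading_weights(n=20, p=2000, seed=1):
    """Non-normalized loading weights w_j = sum_i x_ij * y_i for p noise
    variables that are generated independently of the response y."""
    rng = random.Random(seed)
    # Hypothetical centred response: a simple linear trend (assumption).
    y = [i - (n - 1) / 2 for i in range(n)]
    weights = []
    for _ in range(p):
        # Uniform noise, so any normality of w_j comes from the summation.
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        x_bar = mean(x)
        weights.append(sum((xi - x_bar) * yi for xi, yi in zip(x, y)))
    return weights

w = noise_loading_weights()
sd = pstdev(w)
# Share of weights within 1.96 standard deviations of zero; for a centred
# normal distribution this share is about 0.95.
share = sum(abs(v) <= 1.96 * sd for v in w) / len(w)
print(f"mean = {mean(w):.3f}, sd = {sd:.3f}, share within 1.96 sd = {share:.3f}")
```

Even with only 20 samples and uniform noise, the weights of the uninformative variables should be centred near zero with roughly 95% of them inside 1.96 standard deviations, which is the behaviour the truncation relies on.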

Figure 1: Left: Simulated theoretical distributions of loading weights from X variables with no correlation to the response (red curve, 70% centred around 0), and correlation of -0.6 and 0.6, respectively (blue curves, 15% each, centred around -12 and 12, respectively); axes: Value versus Density. Right: Histogram of normalized loading weights (milk protein data) illustrates the distributional character of the non-informative loading weights; axes: Loading weight versus Frequency. The red vertical lines indicate the cut-offs between inliers and outliers.

To conform to the classical CLT the observations would need to be independent, but this is not always true in practice. However, CLT theory also exists for observations having weak dependence, and we will only consider the variables where we do not expect any information to be present, supporting independence of these variables.

Algorithm

The idea presented in Section 2 lays the ground for a wide range of possible implementations for classifying data as noise or signal based on their distribution. In principle, the truncation may be applied to several different model parameter types: to object scores in Y or X, to Y-loading weights and to X-loading weights. In this paper we focus on the truncation of the X-loading weights, called w in the nomenclature of [10]. The main approach will be to make a confidence interval around the median value of a sorted vector, e.g. PLS loading weights, and truncate or down-weight everything that falls inside the interval, see Algorithm 1. The width of the confidence interval will be estimated using theory from Lenth [11]. A second approach will be to make use of a qq-plot, classifying variables close to the straight line going through a chosen pair of quantiles as inliers. Alternatively one could adapt a normal or Student t distribution to the same vector by direct fitting to the selected distribution, but this can be a time consuming and unstable procedure. The variations have in common that outliers are considered true information, while observations within a certain range of the distribution are classified as noise. In the histogram of loading weights in Figure 1 (right) the estimated cut-offs between inliers and outliers are indicated. The general distribution based truncation algorithm is as follows:

• Input candidate loading weight vector $w$ to be truncated.

• Sort $w \Rightarrow w_s$.

• Either compute a confidence interval around the median of $w_s$, or fit a line through quantiles around the median of $w_s$.

• Classify outliers as real, informative contributions and inliers as noise.

• Truncate inliers.
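The hard truncation step of the algorithm can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the interval half-width `margin` is taken as given (its estimation is the subject of Section 3.1), the function name is hypothetical, and no re-normalization of the truncated vector is performed.

```python
from statistics import median

def truncate_inliers(w, margin):
    """Hard truncation: weights inside median(w) +/- margin are treated as
    noise (inliers) and set to zero; outliers are kept unchanged."""
    centre = median(w)
    lower, upper = centre - margin, centre + margin
    return [0.0 if lower <= v <= upper else v for v in w]

# Toy candidate loading weight vector with two clearly informative elements.
w = [0.02, -0.01, 0.45, 0.00, -0.03, -0.38, 0.01]
print(truncate_inliers(w, margin=0.1))
# [0.0, 0.0, 0.45, 0.0, 0.0, -0.38, 0.0]
```

Sorting is not needed for this interval-based variant because the median and the comparison against the limits do not depend on the order of the elements.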

In practice the distribution based truncation can be plugged into the NIPALS [12] algorithm or kernel based algorithms as a component-wise processing of the candidate PLS loading weights to impose sparseness on the variables, or even truncate the scores to impose sparseness on the objects. In this paper we limit the applications to the single response case, but the procedures are equally relevant in multi-response problems, as well as other multivariate methods like LPLS, PCA, ICA and CCA. Truncation of loading weights will be relevant for most applications as it is more likely that some variables do not contribute to a component than that a set of objects does not contribute. When truncating only loading weights, the following computation of scores ensures that loading weights and scores reflect the same information. If scores are truncated, this will not be reflected in the information of the loading weights, meaning that a re-computation of loading weights and scores may be necessary based on the truncation generated from the scores, or loading weights have to be disregarded when analysing the resulting model. As suggested by Martens & Næs, one could also re-orthogonalize the vectors of loading weights if orthogonality is considered important. Re-orthogonalization may introduce a shadowing effect from previous components such that some zero loading weights become non-zero. For the data sets we are using in this paper the changes in regression coefficients are very small with or without re-orthogonalization, and the predictions are equal since the non-orthogonalized and orthogonalized loading weights span the same predictor space.

Instead of applying hard thresholding, where inliers are set to zero and outliers are kept as they are, it could be valuable to shrink according to the probability of being an inlier or outlier. Such a soft shrinkage could be $1 - P(x_j = \text{inlier})$, but estimating this probability would require estimates of the distributions of the outliers. Instead we apply a cumulative distribution function on the observed variables and rescale so that the median is given weight 0 and the largest outlier is given weight 1. As this strategy gives rather poor distinction between inliers and outliers we introduce a parameterized version of these weights to produce weights that are closer to a hard cut-off as illustrated in Figure 2.
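The exact parameterization of the soft weights is not given in the text, so the following sketch shows only one hypothetical way such weights could be built. It assumes a normal CDF with a known noise scale `scale`, and a piecewise power transform around a chosen `cutoff_weight` whose `steepness` parameter moves the weights from the plain rescaled CDF (steepness 1) towards a hard cut-off (large steepness). All names and the transform itself are illustrative assumptions, not the authors' formula.

```python
from statistics import NormalDist, median

def soft_weights(w, scale, cutoff_weight=0.7, steepness=1.0):
    """Hypothetical soft truncation weights in [0, 1]: 0 at the median,
    1 for the most extreme element, steeper around cutoff_weight as
    steepness grows. Assumes scale > 0 and at least one non-median value."""
    centre = median(w)
    nd = NormalDist()
    # CDF-based raw weight: 0 at the median, growing with the distance to it.
    raw = [2.0 * nd.cdf(abs(v - centre) / scale) - 1.0 for v in w]
    top = max(raw)
    scaled = [r / top for r in raw]          # largest outlier gets weight 1
    c = cutoff_weight
    out = []
    for s in scaled:
        if s < c:
            # Push weights below the cut-off towards 0.
            out.append(c * (s / c) ** steepness)
        else:
            # Push weights above the cut-off towards 1.
            out.append(1.0 - (1.0 - c) * ((1.0 - s) / (1.0 - c)) ** steepness)
    return out

w = [-0.3, -0.02, 0.0, 0.01, 0.5]
print(soft_weights(w, scale=0.05, steepness=1.0))  # plain rescaled CDF weights
print(soft_weights(w, scale=0.05, steepness=8.0))  # close to a hard cut-off
```

Multiplying each loading weight by its soft weight would then down-weight inliers gradually instead of setting them exactly to zero.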

Figure 2: Transformation of scaled weights for gradually steeper transition between inliers and outliers (axes: Variable number versus Weight). For this example the weight corresponding to the cut-off between inliers and outliers is set to 0.7.

3.1 Cut-off determination

In order to find cut-offs between inliers and outliers an estimate of the central normal distribution of inliers is needed. Since the distribution is centered in zero the distribution will be fully characterized by an estimate of its variance. In order to distinguish the central noise distribution from the non-central distributions of the informative outliers, a mixture model approach could be adopted. For instance, [13] presented a mixture model approach for sample size determination with false discovery rate control for high-throughput data problems, and a similar approach could be adopted here. However, estimating a set of central and non-central distributions involves iterative procedures (like the EM-algorithm) which would seriously slow down the fitting process of the PLS regression model. Further, only the variance of the central noise distribution is needed, not the properties of the non-central distributions.

A similar problem arises in the analysis of saturated ANOVA models for $2^k$-designs without replicates. Then all degrees of freedom are consumed in the estimation of the effects and no conventional error variance estimate can be computed. Still, all effect estimates have the same variance, but a set of non-important effects have zero expectation. From these a variance estimate for significance testing can be found by the method presented by Lenth [11]. In order to estimate the variance Lenth uses the fact that the standard deviation of a central normal distribution is tightly connected to the median of the absolute value of the random variable. Since the median is rather robust against the influence from outliers, this variance estimate will be only moderately affected by the outliers as long as the majority of the effects (or loading weights in our case) are samples from the central noise distribution. In the setting of this paper the approach of Lenth can be described as follows:

Let $w_1, w_2, \ldots, w_p$ represent the loading weights computed from the $p$ X-variables at step $a$ of the PLS algorithm. Further, define $s_0 = 1.5 \cdot \mathrm{median}\,|w_k|$ for $k = 1, \ldots, p$. It can be shown that $s_0$ is a fairly good estimate of the standard deviation of the normal distribution of the inliers. In order to make it even more robust and less biased Lenth recommends to make the final estimate, the pseudo standard error (PSE), based on a set of inlying values only:

$$\mathrm{PSE} = 1.5 \cdot \operatorname*{median}_{|w_k| < 2.5 \cdot s_0} |w_k|.$$

Lenth argues that if the $w_k$ are realizations of a $N(0, \tau^2)$ random variable $W$, the median of $|W|$ is approximately $0.675\tau$, implying that $1.5 \times \mathrm{median}\,|W| \approx 1.01\tau$. And since $\Pr(|W| > 2.5\tau) \approx 0.01$, the PSE is roughly consistent for 1.5 times the $0.495$th quantile of $|W|$, which is $1.5 \times 0.665\tau \approx \tau$.

The PSE can be combined with a Student t quantile of $d = p/3$ degrees of freedom to give a conservative margin of error (ME) for confidence intervals: $\mathrm{ME} = t_{0.975;d} \cdot \mathrm{PSE}$ (95% confidence). However, in high-throughput data problems the degrees of freedom will usually be large, and percentiles from the standard normal distribution may be used instead. In the PLS algorithm the cut-offs are thus defined by the limits of a $(1 - \alpha)100\%$ confidence interval around the median loading weight with margins of error as described above: $\mathrm{median}(w) \pm \mathrm{ME}$, for some chosen confidence level $(1 - \alpha)$.
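The computation of $s_0$, PSE and the cut-offs can be sketched as follows. This is a minimal illustration rather than the authors' code: it uses the standard normal percentile in place of the Student t quantile (the large-$p$ simplification mentioned above), and the function names and the guard for an empty set of inlying values are assumptions.

```python
from statistics import NormalDist, median

def lenth_pse(values):
    """Pseudo standard error: 1.5 * median of the |values| below 2.5 * s0."""
    abs_vals = [abs(v) for v in values]
    s0 = 1.5 * median(abs_vals)
    inlying = [a for a in abs_vals if a < 2.5 * s0]
    # Guard (assumption): if more than half of the values are exactly zero,
    # s0 is zero and no value is strictly inside the limit.
    return 1.5 * median(inlying) if inlying else 0.0

def lenth_cutoffs(w, alpha=0.05):
    """Lower and upper cut-offs: median(w) -/+ ME, with ME = z * PSE."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # normal percentile
    me = z * lenth_pse(w)
    centre = median(w)
    return centre - me, centre + me

w = [0.01, -0.02, 0.03, -0.04, 0.05, 1.0]
lower, upper = lenth_cutoffs(w)
print(f"PSE = {lenth_pse(w):.3f}, cut-offs = ({lower:.4f}, {upper:.4f})")
outliers = [v for v in w if v < lower or v > upper]
print(outliers)  # only the element 1.0 falls outside the interval
```

In this toy vector the single large element barely influences the PSE because both medians are taken over absolute values dominated by the small elements.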

If there is a large asymmetry in the number of positive and negative outliers, the skewness in the distribution of $w$ may cause ME to be slightly inflated, causing a potential loss of informative outliers detected in the lighter tail. This can be avoided by estimating the margin of error separately for positive and negative loading weights. This is accomplished by first finding $s_0^-$ and $\mathrm{PSE}^-$ using the absolute values of the negative weights and then computing the margin of error $\mathrm{ME}^-$ for the lower tail. Then the same exercise is conducted for the positive loading weights, finding $s_0^+$, $\mathrm{PSE}^+$ and finally $\mathrm{ME}^+$ for the upper tail. Finally, the cut-offs are defined by $\mathrm{ME} = \min(\mathrm{ME}^-, \mathrm{ME}^+)$. The increased flexibility can improve the estimation of boundaries between inliers and outliers when there is asymmetry in the distributions. In the rest of this paper we refer to truncation using Lenth's methods as Lenth.
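The separate treatment of the two tails can be sketched as below. The sketch is self-contained and makes several assumptions that are not stated in the text: normal percentiles are used instead of Student t quantiles, weights that are exactly zero are ignored, both tails are assumed to be non-empty, and the helper names are hypothetical.

```python
from statistics import NormalDist, median

def pse(abs_vals):
    """Pseudo standard error from a list of absolute values."""
    s0 = 1.5 * median(abs_vals)
    inlying = [a for a in abs_vals if a < 2.5 * s0]
    return 1.5 * median(inlying) if inlying else 0.0

def two_sided_margin(w, alpha=0.05):
    """Return (ME, ME_minus, ME_plus) with ME = min(ME_minus, ME_plus)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    neg = [-v for v in w if v < 0]    # absolute values of negative weights
    pos = [v for v in w if v > 0]     # positive weights
    me_minus = z * pse(neg)
    me_plus = z * pse(pos)
    return min(me_minus, me_plus), me_minus, me_plus

# Many positive outliers inflate ME_plus, while ME_minus stays small.
w = [-0.01, -0.02, -0.03, 0.01, 0.02, 0.5, 0.6, 0.7]
me, me_minus, me_plus = two_sided_margin(w)
print(f"ME- = {me_minus:.4f}, ME+ = {me_plus:.4f}, ME = {me:.4f}")
```

Taking the minimum means the tail with fewer outliers determines the interval width, so a crowd of informative weights on one side cannot widen the noise interval.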

3.2 Outlier detection by qq-plots

An alternative to the above strategy is to use a qq-plot (quantile-quantile plot) as basis, extending an interval around the median value of $w_s$ minimising the mean squared error (MSE) to the line going through selected quantiles (qq-line), e.g. the 25-th and 75-th percentile of the Student t distribution or normal distribution, see Figure 3. To favour solutions having many inliers the MSE is weighted with the ratio between the total number of points and the number of non-informative inliers ($n_{tot}/n_{in}$). Alternatively one can favour solutions with few informative outliers with MSEs that are not significantly worse than the minimum MSE. Utilizing functions based on golden section search with parabolic interpolation, or similar, the MSE minimization can be solved quickly as a linear search, or a series of such in cases of asymmetry. Visualisation of the sorted $w$ vector plotted against the final distribution, e.g. Figure 3, can aid in validating and justifying the final truncation.
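A simplified version of the qq-line idea is sketched below. It departs from the description in several labelled ways: it uses standard normal quantiles rather than a Student t distribution, plotting positions $(i + 0.5)/n$, a symmetric interval in rank around the middle of the sorted vector, and an exhaustive search instead of golden section search. The function name is hypothetical and the sketch should not be read as the authors' implementation.

```python
from statistics import NormalDist

def qq_line_inliers(w, lower_p=0.25, upper_p=0.75):
    """Return (sorted weights, first inlier rank, last inlier rank)."""
    n = len(w)
    ws = sorted(w)
    nd = NormalDist()
    # Theoretical standard normal quantile for each rank (assumed positions).
    q = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    # qq-line through the chosen pair of quantiles.
    lo, hi = int(lower_p * n), int(upper_p * n)
    slope = (ws[hi] - ws[lo]) / (q[hi] - q[lo])
    intercept = ws[lo] - slope * q[lo]
    sq_res = [(ws[i] - (intercept + slope * q[i])) ** 2 for i in range(n)]
    best = None
    # Drop k points from each tail; the interval always keeps both points
    # that define the qq-line.
    for k in range(0, min(lo, n - 1 - hi) + 1):
        first, last = k, n - 1 - k
        n_in = last - first + 1
        mse = sum(sq_res[first:last + 1]) / n_in
        score = mse * n / n_in               # weight by n_tot / n_in
        if best is None or score < best[0]:
            best = (score, first, last)
    return ws, best[1], best[2]

# 94 noise-like weights on a normal grid plus six planted outliers.
noise = [NormalDist().inv_cdf((i + 0.5) / 94) for i in range(94)]
w = noise + [-8.0, -7.0, -6.0, 6.0, 7.0, 8.0]
ws, first, last = qq_line_inliers(w)
print(f"inliers: ranks {first}..{last}; outliers below {ws[first]:.2f} or above {ws[last]:.2f}")
```

Elements ranked outside `first..last` would be kept as informative outliers and the rest truncated. Because the weighted MSE also penalizes noise points that deviate from the line, this simple criterion may trim some of the noise tail as well.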

Figure 3: qq-plot of the first vector of loading weights (colon cancer data) against a Student t distribution with 22 pseudo degrees of freedom (axes: Student t distribution (22 pseudo df) versus Sorted loading weights). Small dots indicate outliers while larger dots indicate inliers. The line going through the 20-th and 80-th percentiles is indicated in dot-dashed form.

exactly how many degrees of freedom that are consumed by a PLS component is not trivial, but a rough estimate is the following leverage-based estimate (pseudo degrees of freedom): $\sum_i t_{a,i}^2 / \max_i (t_{a,i}^2)$, where $t_a$ is the $a$-th PLS-score vector and $i$ is the sample number. As the truncation is robust to changes in number of degrees of freedom, we do not need the exact degrees of freedom. Note that the number of degrees of freedom consumed will change after truncation. In the rest of this paper we refer to truncation using qq-plots as qq-line.

Note that for both the Lenth and the qq-line method the number of variables selected as informative may vary from one component to another. Furthermore, the same variable may be selected in several components. Hence, the total number of selected variables may not be set exactly, but can be to some extent controlled by the number of PLS-components and the chosen width of the interval around the median weight.
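The leverage-based pseudo degrees of freedom amount to the sum of squared scores divided by the largest squared score. A minimal sketch, with a hypothetical function name and a toy score vector, could look like this:

```python
def pseudo_degrees_of_freedom(scores):
    """Leverage-based estimate: sum_i t_i^2 / max_i t_i^2 for one PLS-score
    vector. Assumes at least one non-zero score."""
    squared = [t * t for t in scores]
    return sum(squared) / max(squared)

print(pseudo_degrees_of_freedom([1.0, -2.0, 2.0]))  # (1 + 4 + 4) / 4 = 2.25
```

The estimate equals the number of samples when all scores have the same magnitude and approaches 1 when a single sample dominates the component.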

3.3 Reference methods

The truncation procedures are compared to ST-PLS, Elastic net, variable selection by VIP and SR, and PLS without any modifications. This is a small subset of representative methods. For more PLS based variable selection methods we recommend the papers of Mehmood et al. [14] and Roger et al. [15]. To make comparisons fair we optimize each method separately with regard to classification/prediction. The performance of each method is evaluated on test set data or by cross-validation in terms of classification errors for the classification problems and root mean square error of prediction (RMSEP) for the prediction problems. With the Elastic net the optimization is performed over a reasonable grid of ridging values (0.1 to 1, where the value 1 gives the Lasso) and $L_1$ shrinkages (automatically chosen [16]). The shrinkage of ST-PLS is varied over a relevant range (0.05 to 0.95), and the cut-off for VIP is varied from 0.8 to 1.2 [17]. For SR we optimize the cut-off between 0.05 and 0.5, as the cut-off suggested by the authors (0.5) selects too few variables to obtain good predictions on the data sets tested in this paper. Because there are so many models, not all parameter combinations will be reported.

There are several sparse PLS regression methods to choose between, but we found that their resulting variable selections were quite similar, especially when optimizing the sparseness parameter with regard to prediction. We have selected ST-PLS [6] as a common representative, though any of [7, 18, 19] would have been a good alternative.

In addition to the results associated with parameters giving the lowest prediction errors we will present models that have slightly higher prediction errors but give more sparse loading weights and regression coefficients (simplified models). For the data sets where repeated cross-validation is used, the simplified models should have no more than one standard error higher prediction error, while for the data sets where test set prediction is used common additions to the error of 0.001 and 0.01 are used (see the Results section).

4.1 Data sets

The distribution based truncation method for variable selection is compared to the reference methods on both a set of real data sets and simulated data. These data sets represent a wide range of high-dimensional data types with different properties, and the results will be discussed in light of these. In order to summarize the data properties we use the approach of Helland and Almøy [20] and Sæbø et al. [6] who study the eigenvalue structure of the sample covariance matrix of the predictors and the covariance between the principal components and the response. In the following we refer to the latter property as the relevance of a latent component, following the notation of Næs and Helland [21]. We summarize the data structures in eigenvalue-covariance plots. Helland and Almøy [20] conclude in their study that prediction, using PLSR methods at least, is most difficult in cases where there are irrelevant components having large eigenvalues, or contrary, if there are relevant components having small eigenvalues. In these cases we therefore expect that variable selection methods based on latent components will be less favourable.

4.1.1 Simulated data

These are simulated data containing two correlating, informative features and a variable number of uninformative variables as described in [22, 23]. The total number of variables ranges from 100 to 20000, and the number of observations in each of two classes is 100 and 50 for the calibration and validation data, respectively. The simulation study is replicated exactly to be comparable to the papers it has appeared in previously.

4.1.2 Colon cancer data

These are expression levels of 2000 genes on 62 patients as presented by Alon et al. [24]. Among the patients 20 were healthy while 42 had colon cancer. As can be seen from Figure 4 there are several large eigenvalues which indicate several directions in the predictor space of large variance. At the same time these directions appear to be relevant for prediction by having large covariances with the response. Hence, prediction using PLS based methods should be relatively easy, but might require a few components.

4.1.3 Prostate cancer data

These are expression levels of 12600 genes on 102 patients as presented by Singh et al. [25]. Among the samples 52 were tumor specimens and 50 were normal. From Figure 4 we observe a rapid drop in eigenvalues implying strong dependence between the predictor variables. However, some directions of small variability (small eigenvalues) have some of the largest covariances with the response. This is an example of a data set where there are relevant components with small eigenvalues which according to Helland and Almøy [20] is not favourable for PLS prediction. We therefore expect that the PLS-based variable selection methods will not perform well for this data set.

4.1.4 Fish oil data

These are Raman spectra from 45 oil samples extracted from farmed salmon (Salmo salar) [26]. Raman spectroscopy with a UV laser has been conducted. As a fat indicator the iodine value has been chosen as the response for regression. The spectra are pre-processed by asymmetric least squares [27] ($\lambda = 7$, $p = 0.11$ [28]) wrapped in a customized baseline correction [29] to reduce baseline flexibility under a broad cluster of peaks. The spectra have been cut down to 2263 wavelengths to remove artifacts at the ends of the spectra. These data have a structure resembling the colon data with several directions in the predictor space with high variability and high relevance. Prediction should be relatively easy using a few components in the PLS model.

4.1.5 Milk protein data

These are matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) spectra from 45 milk mixtures (x4 spot replicates) of cow, goat and ewe milk [3]. Another set of 45 mixtures from a technical replicate is used as validation data. Spectral values from 5000 m/z to 20000 m/z (6179 variables) are used for predicting the percentage of cow milk in the mixtures, i.e. the degree of adulteration. If the truncation procedure is plugged into canonical PLS (CPLS) [30], the percentage of goat and ewe milk can be used as additional responses to obtain more parsimonious solutions. The eigenvalues for these data imply strong variable dependence with one or two relevant components. Prediction should be quite easy with few components using PLS regression.

Figure 4: Summaries of data properties for the real data sets (panels: Colon cancer data, Prostate cancer data, Fish oil data, Milk protein data; axes: Component versus Scaled eigenvalue and Scaled covariance). Eigenvalues of the sample covariance matrix (scaled by the largest) are marked by the height of bars. Covariances (scaled by the largest) between principal components and the response are marked by red dots.

4.2 Results

4.2.1 Simulated data

Following the proposed simulation scheme of [22] as was done with PLS and sPLS in [23], we obtain the results shown in Figure 5. Choosing two different widths of the confidence intervals of Lenth's method we find classification errors almost identical to what was shown using sPLS and greatly improved compared to the conventional PLS regression. However, the widest Lenth confidence interval (99.9%) gives almost perfect classification regardless of number of uninformative variables. These optimistic results are caused by a simulation procedure that highly favours sparse modelling methods, and so should not be over-interpreted.

Figure 5: Classification error of two class simulated data for PLS, Lenth (95%) and Lenth (99.9%). Two regressor variables are informative for the regressand variable, while the total number of regressor variables is indicated on the first axis as $p$ (second axis: Classification error).

4.2.2 Colon cancer data

Figure 6a shows the average classification error of patients from 200 random 10-fold cross-validations [31]. Linear discriminant analysis with empirical priors is used for the classification. It is evident that one component is not enough to obtain good classification regardless of the PLS method used. Elastic net performs approximately at the same level as the one-component PLS variants. The ST-PLS and qq-line Truncation-PLS have the best combinations of few non-zero variables and low classification error (bottom left corner of the figure). The VIP and SR methods with two and three PLS components have a slightly worse combination of sparseness and error, together with Lenth and Weighted Lenth.

We also observe that choosing a model with slightly higher error than the best model can greatly reduce the number of non-zero variables, especially for Lenth's method. Depending on the aim of the analysis, e.g. variable selection or stable predictions, the choice of truncation type and parameter settings may differ, especially since all the presented models using two and three components lie within a 1% error margin. The most sparse two component models (average number of non-zero variables in parentheses) are ST-PLS (74, simplified model), qq-line (171), Lenth (243) and ST-PLS (294). All of these models have a higher average precision compared to the ordinary two component PLS solution, and are very close to the precision of the three component PLS solution.

4.2.3 Prostate cancer data

Figure 6b shows the average classification error of patients from 100 random 10-fold cross-validations. We observe that the best predictions are found when using 5 component PLS models with variable selection by SR. Following closely is the Elastic net. Both of these methods give very sparse solutions. There is almost a 2% gap down to the rest of the methods. Here variable selection by VIP, qq-line (simplified model), ST-PLS and Lenth give the most sparse solutions while Weighted Lenth gives marginally better classification.

For this data set it seems that the small variation in the discriminating information favours Elastic net and SR while the sparse PLS methods and VIP obtain proportions correctly classified similar to only using PLS with all variables.

Figure 6: Repeated random 10-fold cross-validated classification (subfigures a and b) and test set predictions (subfigures c and d) using varying numbers of PLS components, for Lenth, Weighted Lenth, qq-line, ST-PLS, Elastic net, VIP and SR. The first axis shows the number of non-zero variables; the second axis shows classification error (a, b) or root mean squared error of prediction (c, d). The symbols indicate different variable selection strategies and their numbers of components. Black symbols are associated with the parameters giving the highest precision, while red symbols indicate models using fewer variables while retaining most of their precision. Errors of the full models are shown as dashed lines.

(a) Colon cancer micro-array classification using LDA. Full PLS-DA: 1 comp.: 0.130, 2 comp.: 0.105, 3 comp.: 0.085.

(b) Prostate cancer micro-array data classification using LDA. Full PLS-DA: 5 comp.: 0.078, 10 comp.: 0.0825.

(c) Fish oil Raman data prediction of iodine. Full PLSR: 1 comp.: 2.70, 2 comp.: 1.68, 3 comp.: 1.74.

(d) Milk protein MALDI-TOF data prediction of adulteration. Full PLSR: 1 comp.: 0.103, 2 comp.: 0.074, 3 comp.: 0.078.


In Figure 6c we see the results of test set predictions using the same methods as above. Parameters have been chosen by cross-validation. The best combination of prediction and sparseness is observed for Lenth and ST-PLS. Precisions of these predictions are much better than only using PLS. The RMSEP values from Elastic net are somewhere between the one component PLS models and the two/three component models. As the parameters and simplifications are chosen on the cross-validation results, we observe both reductions and increases in RMSEP when using simplified models.

4.2.5 Milk protein data

In addition to comparison with the reference methods this data set is included both to show how one can obtain parsimonious models by plugging the truncation algorithm into a different NIPALS algorithm, the canonical PLS, and to show how interpretation of spectral data can be made easier by imposing sparseness. The CPLS algorithm differs from the regular PLS in the way that additional sample information (like design variables) may be included as extra response variables to stabilize the extraction of the latent components. This has the typical effect that the number of components is reduced compared to PLS regression. As mentioned in the description of the data the percentage of goat and ewe milk was included as additional responses in the analysis of the cow milk data. In Figure 6d we see the results of test set predictions using the same methods as above. Parameters have been chosen by cross-validation. Here Elastic net is the winner considering the combination of prediction and sparseness. However, prediction-wise the other methods are very close behind. Among the PLS based methods, Lenth has the best combination of prediction and sparseness, having marginally better prediction than Elastic net using less than 1/6 of the variables with the simplified model.

Figure 7 shows the prediction error of PLS and CPLS regression used separately and combined with a pre-chosen truncation (99.9% confidence interval (Lenth's method) with sharp cut-off). We observe that for models using few components truncation has no effect on prediction with PLS, but gives a minor improvement when combined with CPLS. Also, CPLS has much lower prediction error for one and two component models. Looking only at prediction, the best balance between prediction error and complexity is a two component CPLS model with truncation.


[Figure 7: two panels. Left: RMSEP against number of components for PLS, PLS (Lenth), CPLS and CPLS (Lenth). Right: number of non-zero variables against number of components, per component and in total, for PLS (Lenth) and CPLS (Lenth).]

Figure 7: Prediction of cow milk proportions in milk mixtures from MALDI-TOF spectra (left) and the number of non-zero variables per component/in total using truncation (right). The total number of variables was 6179.

In Figure 8 we see the first two vectors of loading weights from PLS and CPLS regression with and without truncation. The contrast is high with a high level of noise in the upper spectra and only a few remaining peaks in the lower spectra. Here the truncated spectra seem to have an advantage when used for interpretation and protein assignment.

[Figure 8: three stacked panels of loading weights, labelled PLS, CPLS and Truncated CPLS. x-axis: m/z (x1000), from 5 to 19; y-axis: loading weight, from −0.2 to 0.1.]

Figure 8: Loading weight vectors from MALDI-TOF spectra of milk (two first components). The top spectra come from ordinary PLS, the middle spectra from CPLS, while the bottom spectra come from truncated CPLS with truncation parameters selected to reflect a typical choice applicable for many types of data.


Through this paper we have formalised some aspects of the family of sparse PLS methods. Firstly we have justified truncation of loading weights through the central limit theorem and the distributions of loading weights with no correlation to the response. Secondly we have proposed a new truncation founded on classical statistical asymptotic principles. This is introduced through a novel application of Lenth's theory for creating confidence intervals in saturated ANOVA models for $2^k$-designs without replicates. The effect is that the user only has to choose a significance level for the confidence interval, resulting in a less ad hoc approach.
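As a rough illustration of how little the user has to specify, the following sketch applies a Lenth-type interval to one vector of loading weights. It uses the pseudo standard error $PSE = 1.5 \cdot \mathrm{median}_{|w_k| < 2.5 s_0} |w_k|$ with $s_0 = 1.5 \cdot \mathrm{median}\,|w_k|$, and zeroes every weight inside $\mathrm{median}(w) \pm ME$ (sharp cut-off). The function name `lenth_truncate` and the toy data are hypothetical, and the normal quantile is used here as a stand-in for the $t$ quantile with $d = p/3$ pseudo degrees of freedom, which is an assumption that is only reasonable when $p$ is large; a $t$ quantile can be passed in through `quantile`.

```python
from statistics import NormalDist, median


def lenth_truncate(weights, alpha=0.001, quantile=None):
    """Zero out loading weights inside a Lenth-type confidence interval.

    Assumption: unless `quantile` is supplied, the t quantile with p/3
    pseudo degrees of freedom is approximated by a normal quantile.
    """
    abs_w = [abs(w) for w in weights]
    # Initial robust scale estimate: s0 = 1.5 * median|w_k|
    s0 = 1.5 * median(abs_w)
    # Pseudo standard error: same estimate restricted to |w_k| < 2.5 * s0
    inliers = [a for a in abs_w if a < 2.5 * s0]
    pse = 1.5 * median(inliers) if inliers else s0
    if quantile is None:
        quantile = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = quantile * pse  # margin of error, ME
    centre = median(weights)
    # Sharp cut-off: weights inside centre +/- ME are set to zero
    return [0.0 if abs(w - centre) <= margin else w for w in weights]


# Toy data (hypothetical): 300 small "noise" weights and two large ones
noise = [0.01 * ((i % 7) - 3) for i in range(300)]
w = noise + [0.5, -0.6]
w_trunc = lenth_truncate(w, alpha=0.001)
kept = [v for v in w_trunc if v != 0.0]
```

In this toy case only the two large weights survive; the single tuning choice is `alpha`, the significance level of the interval.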

Truncation in this paper is achieved using a general and flexible plug-in which can easily be adjusted and implemented also in other projection based methods like PCA [32], ICA [33], PCR, CPLS and PPLS. PLS regression is an iterative algorithm and component wise truncation will inevitably slow down the algorithm, but Lenth's method is extremely quick, i.e. there is a minimal lag compared to just running regular PLSR. The alternative approach based on the qq-line is also quite quick, and appears to give slightly better results in some situations.

With regard to prediction performance the truncation PLS is mostly on par with ST-PLS, sometimes a little better, sometimes a little worse. As with all statistical methods, this is highly data dependent. However, there are few parameters to tune and they have statistical interpretations. For the data sets included in this paper we see that Elastic net sometimes performs significantly better than the sparse PLS methods, while it trails behind when used on other data sets. This is also the case for the variable selection by Selectivity Ratio plots and to some extent the Variable Influence on Prediction method. The Lasso was also tested with the included data sets, but being a special case of the Elastic net it never performed better in practice. But prediction is not the only goal for a statistical method. The truncation methods have also shown consistent good results, are based on intuitive theory, are quite robust to the choice of parameters and are extremely quick.

The performance of the various methods may to some extent be explained by the structure of the data. The PLS-based methods perform relatively better when there are many directions in the predictor space with both a high variance (high eigenvalue) and a high relevance. This was the case for both the colon cancer data and the fish oil data, and here also the PLS-based variable selection methods performed well, with the new truncation method and ST-PLS slightly ahead of the others. For the prostate data these methods performed worse, and this result confirms the expectations based on the data properties that PLS methods have trouble making good predictions for this kind of data where there are directions in the predictor space of low variance, but with high relevance. However, an exception is the SR method based on the 5 component PLS model. This can be explained by the fact that the SR method is adjusted to be more favourable than ordinary PLS when there are variables with low variances, but with high correlations with the response [9]. This is exactly what is the case here according to Figure 4. Apparently the elastic net has a similar behaviour, which can be explained by the fact that this method, like the ordinary least squares, gives higher weight to variables with high correlations to the response, as opposed to the more covariance-focused PLS. The results indicate that in cases where there is a strong correlation structure in the data (prostate cancer data and milk protein data) the elastic net is a good choice of method for variable selection. When choosing a method for analysis and variable selection it may therefore be worthwhile to study the data properties in terms of eigenvalues and component-response covariances.


In most applications, small deviations from orthogonality can be disregarded. However, when orthogonal vectors of loading weights are important, a re-orthogonalization step can be included after the truncation, forcing the current vector of loading weights to be orthogonal to the previous vectors extracted. The down-side to this is that shadow effects from previous loading weights may appear in the re-orthogonalized loading weights, causing zero weights of regressors already used in previous components to become non-zero. For the data sets we have used in this paper, the shadow effects were so small that they were invisible in plots, and only appeared a few times in measurable sizes. The total number of non-zero regression coefficients should not be affected.
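The shadow effect can be seen in a minimal sketch of such a re-orthogonalization step. The code below is a plain Gram-Schmidt projection written for illustration only; the function name `reorthogonalize` and the four-variable vectors are hypothetical, the previous vectors are assumed to be mutually orthogonal, and any rescaling to unit length is left out.

```python
def reorthogonalize(w, previous):
    """Remove from w its projections on the vectors in `previous`.

    Assumption: the vectors in `previous` are mutually orthogonal, so
    subtracting the projections one at a time gives an orthogonal result.
    """
    w = list(w)
    for v in previous:
        vv = sum(x * x for x in v)
        if vv == 0.0:
            continue  # nothing to project on
        coef = sum(a * b for a, b in zip(w, v)) / vv
        w = [a - coef * b for a, b in zip(w, v)]
    return w


# Toy vectors (hypothetical): the earlier component uses variables 0 and 1,
# the truncated current vector has a zero weight for variable 1.
w1 = [1.0, 1.0, 0.0, 0.0]
w2 = [1.0, 0.0, 1.0, 0.0]
w2_orth = reorthogonalize(w2, [w1])
dot = sum(a * b for a, b in zip(w2_orth, w1))
```

Here `w2_orth` is orthogonal to `w1`, but the weight of variable 1, which was zero after truncation, is now non-zero. Variable 3, which is unused in both vectors, stays at zero.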

A note should be made on the different roles of the X loading weights, $w_a$, and the X loadings, $p_a$. It is important to remember that the loading weights contain the covariance information between $X_{a-1}$ and $Y_{a-1}$ (the first eigenvector of the covariance matrix if $Y$ is multi response) and give us the weights that each explanatory variable has when creating scores and loadings. The scores, $t_a$, are just linear combinations of the explanatory variables weighted by the loading weights. The loadings, however, are found by projecting each explanatory variable of $X_{a-1}$ on the scores, $t_a$. Loading weights and loadings can look quite similar when no truncation has been applied, especially for spectroscopic data. With truncation, however, the loading weights obtain a lot of zero holes, while the loadings retain a more continuous shape (at least for spectroscopic data). The upshot is that fully truncated variables are not completely lost, and their role in the system may be interpreted graphically since their loadings are intact. Depending on the application, either loading weights or loadings can be interpreted, having roles similar to the regression coefficients with and without zero holes.
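A small numerical sketch may make the distinction concrete. It assumes the usual NIPALS relations $t_a = X_{a-1} w_a$ and $p_a = X_{a-1}' t_a / (t_a' t_a)$; the function name `scores_and_loadings` and the 3 x 3 matrix are hypothetical and chosen only so that the arithmetic is easy to follow (the toy matrix is not centred).

```python
def scores_and_loadings(X, w):
    """Scores t = X w and loadings p = X' t / (t' t) for one component."""
    # Scores: linear combination of the variables, weighted by w
    t = [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]
    tt = sum(v * v for v in t)
    # Loadings: regression of each column of X on the scores
    p = [sum(row[j] * t_i for row, t_i in zip(X, t)) / tt
         for j in range(len(w))]
    return t, p


# Toy data (hypothetical): three samples, three variables
X = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0]]
w = [1.0, 0.0, 0.0]  # truncated loading weights: two "zero holes"
t, p = scores_and_loadings(X, w)
```

Although two of the three loading weights are exactly zero, all three loadings are non-zero, because every variable is still projected on the scores.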

In some applications it may be interesting to apply truncation without ending up with zeros in the resulting regression coefficients, analogous to focusing on loadings instead of loading weights. This can be justified by the need to remove noise in the computation of PLS components and at the same time producing continuous regression coefficients. From the early days of PLSR we find approximate estimates of regression coefficients that produce the desired effect. Two alternatives have been proposed. Firstly the approximated regression coefficients can simply be estimated by the product of the X and y loadings: $\hat{\beta}^{\dagger} = P q'$. A more elaborate strategy is to produce new approximated X scores, y loadings and regression coefficients by full projection on the X loadings: $T^{\star} = XP(P'P)^{-1}$, $q^{\star} = y'T^{\star}(T^{\star\prime}T^{\star})^{-1}$, and finally: $\hat{\beta}^{\star} = P q^{\star\prime}$. Both strategies will produce regression vectors without zero holes.
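The two approximations can be written down directly once the X loadings $P$ and the y loadings $q$ are available. The sketch below uses NumPy and is only an illustration: the function name `approx_coefficients` is hypothetical, a single response $y$ is assumed, and the synthetic data are constructed so that $X = TP'$ holds exactly, in which case the two estimates coincide.

```python
import numpy as np


def approx_coefficients(X, y, P, q):
    """Return (beta_dagger, beta_star) for X loadings P (p x A), y loadings q (A,)."""
    # First alternative: product of X and y loadings
    beta_dagger = P @ q
    # Second alternative: full projection on the X loadings
    T_star = X @ P @ np.linalg.inv(P.T @ P)  # T* = X P (P'P)^-1
    # q* = y' T* (T*' T*)^-1, computed as a least-squares solution
    q_star = np.linalg.solve(T_star.T @ T_star, T_star.T @ y)
    beta_star = P @ q_star
    return beta_dagger, beta_star


# Synthetic example (hypothetical data): 20 samples, 6 variables, 2 components
rng = np.random.default_rng(0)
T = rng.normal(size=(20, 2))
P = rng.normal(size=(6, 2))
c = np.array([1.5, -0.5])
X = T @ P.T
y = T @ c
beta_dagger, beta_star = approx_coefficients(X, y, P, c)
```

With real data and truncated loading weights, $X$ is not reproduced exactly by the scores and loadings, so the two vectors will generally differ.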

References

[1] Wold, S., Martens, H. & Wold, H. The multivariate calibration problem in chemistry solved by the PLS methods. Lecture notes in mathematics 973, 286–293 (1983).

[2] Indahl, U. A twist to partial least squares regression. Journal of Chemometrics 19, 32–44 (2005).

[3] Liland, K. H., Mevik, B.-H., Rukke, E.-O., Almøy, T. & Isaksson, T. Quantitative whole spectrum analysis with MALDI-TOF MS, Part II: Determining the concentration of milk in mixtures. Chemometrics and Intelligent Laboratory Systems 99, 39–48 (2009).

[4] Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288 (1996).


[5] [...] 67, 301–320 (2005).

[6] Sæbø, S., Almøy, T., Aarøe, J. & Aastveit, A. H. ST-PLS: a multi-directional nearest shrunken centroid type classifier via pls. Journal of Chemometrics 22, 54–62 (2008).

[7] Lê Cao, K., Rossouw, D., Robert-Granié, C. & Besse, P. A sparse pls for variable selection when integrating omics data. Statistical applications in genetics and molecular biology 7 (2008).

[8] Wold, S., Johansson, E. & Cocchi, M. 3D QSAR in drug design: theory, methods and applications (ESCOM Science Publishers B.V., Leiden, The Netherlands, 1993).

[9] Rajalahti, T. et al. Discriminating variable test and selectivity ratio plot: Quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles. Analytical Chemistry 81, 2581–2590 (2009).

[10] Martens, H. & Næs, T. Multivariate calibration (John Wiley and Sons, Chichester, UK, 1989).

[11] Lenth, R. V. Quick and easy analysis of unreplicated factorials. Technometrics 31, 469–473 (1989).

[12] Wold, H. Estimation of principal component and related models by iterative least squares, vol. Multivariate analysis (Academic Press, New York, USA, 1966).

[13] Jørstad, T., Midelfart, H. & Bones, A. A mixture model approach to sample size estimation in two-sample comparative microarray experiments. BMC Bioinformatics 9 (2008).

[14] Mehmood, T., Liland, K. H., Snipen, L. & Sæbø, S. A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems 118, 62–69 (2012).

[15] Roger, J., Palagos, B., Bertrand, D. & Fernandez-Ahumada, E. Covsel: Variable selection for highly multivariate and multi-response calibration: Application to IR spectroscopy. Chemometrics and Intelligent Laboratory Systems 106, 216–223 (2011). URL http://www.sciencedirect.com/science/article/pii/S0169743910001978.

[16] Friedman, J. & Hastie, T. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 (2010).

[17] Chong, I. & Jun, C. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems 78, 103–112 (2005).

[18] Chun, H. & Keleş, S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 3–25 (2010).

[19] Lee, D., Lee, W., Lee, Y. & Pawitan, Y. Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemometrics and Intelligent Laboratory Systems 109, 1–8 (2011).

[20] Helland, I. S. & Almøy, T. Comparison of prediction methods when only a few components are relevant. Journal of the American Statistical Association 89, 583–591 (1994).

[21] Næs, T. & Helland, I. S. Relevant components in regression. Scandinavian Journal of Statistics 20, 239–250 (1993).


[22] [...] dimensional low sample size data. International Journal of Applied Mathematics 39, 48–60 (2009).

[23] Filzmoser, P., Gschwandtner, M. & Todorov, V. Review of sparse methods in regression and classification with application to chemometrics. Journal of Chemometrics 26, 42–51 (2012).

[24] Alon, U. et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. P. Natl. Acad. Sci. 96, 6745–6750 (1996).

[25] Singh, D. et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209 (2002).

[26] Afseth, N. K., Segtnan, V. H. & Wold, J. P. Raman spectra of biological samples: A study of preprocessing methods. Applied Spectroscopy 60, 1358–1367 (2006).

[27] Eilers, P. H. Parametric time warping. Analytical Chemistry 76, 404–411 (2004).

[28] Liland, K. H., Almøy, T. & Mevik, B.-H. Optimal choice of baseline correction for multivariate calibration of spectra. Applied Spectroscopy 64, 1007–1016 (2010).

[29] Liland, K. H., Rukke, E.-O., Olsen, E. F. & Isaksson, T. Customized baseline correction. Chemometrics and Intelligent Laboratory Systems 109, 51–56 (2011).

[30] Indahl, U. G., Liland, K. H. & Næs, T. Canonical partial least squares: a unified pls approach to classification and regression problems. Journal of Chemometrics 23, 495–504 (2009).

[31] Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B Methodological 36, 111–147 (1974).

[32] Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2, 559–572 (1901).

[33] Comon, P. Independent component analysis, A new concept? Signal processing 36, 287–314 (1994).
