• No results found

Time series cluster kernels to exploit informative missingness and incomplete label information

N/A
N/A
Protected

Academic year: 2022

Share "Time series cluster kernels to exploit informative missingness and incomplete label information"

Copied!
12
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

ContentslistsavailableatScienceDirect

Pattern Recognition

journalhomepage:www.elsevier.com/locate/patcog

Time series cluster kernels to exploit informative missingness and incomplete label information

Karl Øyvind Mikalsen

a,b,1,

, Cristina Soguero-Ruiz

c

, Filippo Maria Bianchi

d

, Arthur Revhaug

b

, Robert Jenssen

a,1

aDepartment of Physics and Technology, UiT The Arctic University of Norway, Tromsø, Norway

bDepartment of Gastrointestinal Surgery, University Hospital of North Norway (UNN), Tromsø, Norway

cDepartment of Signal Theory and Comm., Telematics and Computing, Universidad Rey Juan Carlos, Fuenlabrada, Spain

dDepartment of Mathematics and Statistics, UiT, Tromsø, Norway

a rt i c l e i nf o

Article history:

Received 27 November 2018 Revised 11 November 2020 Accepted 8 February 2021 Available online 20 February 2021 Keywords:

Multivariate time series Kernel methods Missing data

Informative missingness Semi-supervised learning

a b s t r a c t

Thetimeseriesclusterkernel(TCK)providesapowerfultoolforanalysingmultivariatetimeseriessub- jecttomissingdata.TCK isdesignedusinganensemblelearningapproachinwhichBayesian mixture modelsformthe basemodels.BecauseoftheBayesian approach,TCK cannaturallydealwithmissing valueswithoutresortingtoimputation andtheensemblestrategyensuresrobustnesstohyperparame- ters,makingitparticularlywellsuitedforunsupervisedlearning.

However, TCK assumes missing at randomand that the underlyingmissingness mechanism is ignor- able,i.e.uninformative,anassumptionthatdoesnotholdinmanyreal-worldapplications,suchase.g.

medicine.Toovercomethislimitation,wepresentakernelcapableofexploitingthepotentiallyrichin- formationinthemissingvaluesandpatterns,aswellastheinformationfromtheobserveddata.Inour approach,wecreatearepresentationofthemissingpattern,whichisincorporatedintomixedmodemix- turemodelsinsuchawaythattheinformationprovidedbythemissingpatternsiseffectivelyexploited.

Moreover,wealsopropose asemi-supervisedkernel,capableoftaking advantageofincompletelabel informationtolearnmoreaccuratesimilarities.

Experimentsonbenchmarkdata,aswellasareal-worldcasestudyofpatientsdescribedbylongitudinal electronichealth recorddatawhopotentiallysufferfromhospital-acquiredinfections,demonstratethe effectivenessoftheproposedmethods.

© 2021TheAuthors.PublishedbyElsevierLtd.

ThisisanopenaccessarticleundertheCCBYlicense(http://creativecommons.org/licenses/by/4.0/)

1. Introduction

Multivariate time series (MTS) frequently occur in a whole range ofpractical applicationssuch asmedicine,biology, andcli- mate studies, to name a few. A challenge that complicates the analysisisthatreal-worldMTSareoftensubjecttolargeamounts ofmissingdata.Traditionally,missingnessmechanismshavebeen categorized into missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [1]. The main difference between these mechanisms consists in whether the missingness is ignorable (MCAR and MAR) or non-ignorable (MNAR)[1–3].Ine.g.medicine,non-ignorablemissingnesscanoc-

Corresponding author.

E-mail address: [email protected] (K. Øyvind Mikalsen).

1 KM, RJ are with the UiT Machine Learning Group: machine-learning.uit.no .

curwhenthemissingpatternsRare relatedto thediseaseunder studyY.In thiscase, the distributionof the missingpatternsfor diseasedpatientsisnotequaltothecorrespondingdistributionfor thecontrolgroup,i.e.p(R

|

Y=1)=p(R

|

Y=0).Hence,themiss- ingness is informative [4–7]. By contrast, uninformative missing- nesswillbereferredtoasignorableintheremainderofthispaper.

Bothignorableandinformativemissingnessoccurinreal-world data.An example frommedicine ofignorable missingnessoccurs e.g. if a clinician orders lab testsfor a patient andthe tests are performed,butbecauseofanerrortheresultsarenotrecorded.On theotherhand,informativemissingnesscouldoccurifitisdecided to not performlabtests becausethe doctorthinksthe patient is ingoodshape.In thelattercase,themissingvaluesandpatterns potentiallycontainrichinformationaboutthediseasesandclinical outcomesforthepatient. Efficientdata-driven approachesaiming toextract knowledge,perform predictivemodelling,etc.,mustbe capableofcapturingthisinformation.

https://doi.org/10.1016/j.patcog.2021.107896

0031-3203/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ )

(2)

Variousmethodshavebeenproposedtohandlemissingdatain MTS [8–11]. One simple approachis to createa complete dataset by discarding the time series with missing data. However, this gives unbiased predictions only if the missingness mechanismis MCAR. As an alternative, a preprocessing step involving imputa- tion of missing values with some estimated value, such as the mean, is common.Other so-called single imputation methods ex- ploitmachine learningbasedmethodssuch asmultilayerpercep- trons, self-organizing maps,k-nearest neighbors, recurrentneural networks and regression-based imputation [12,13]. Alternatively, one can impute missing values using various smoothing and in- terpolationtechniques[12,14].Amongthese,aprominentexample isthelastobservationcarriedforward(LOCF)schemethatimputes the lastnon-missingvalue forthe followingmissingvalues.Limi- tations of imputationmethods are that they introduceadditional bias andthey ignoreuncertaintyassociated withthe missingval- ues.

Multiple imputation [15] resolves this problem, to some ex- tent,by estimatingthemissingvaluesmultipletimesandthereby creating multiplecomplete datasets. Thereafter,e.g. a classifier is trainedonalldatasetsandtheresultsarecombinedtoobtainthe final predictions. However, despite that multiple imputation and other imputationmethodscangive satisfyingresultsinsomesce- narios,thesearead-hocsolutionsthatleadtoamulti-stepproce- dure inwhich themissingdataare handledseparately andinde- pendentlyfromtherestoftheanalysis.Moreover,theinformation aboutwhich valuesare actually missing(the missingpatterns)is lost, i.e. imputationmethods cannot exploit informative missing- ness.

Due totheaforementionedlimitations,severalresearchefforts havebeendevotedoverthelast yearstoprocessincompletetime series without relying on imputation [5–7,16–23]. In this regard, powerful kernelmethods havebeen proposed, ofwhich the time series clusterkernel(TCK)[24]isaprominentexample.TheTCKis designedusingan ensemblelearningapproachinwhichBayesian mixturemodelsformthebasemodels.AnadvantageofTCK,com- paredtoimputationmethods,isthatthemissingdataarehandled automaticallyandnoadditionaltasksarelefttotheuser.Multiple imputation instead requiresa carefulselection of the imputation model and other variables are needed to do the imputation [8], which particularly inan unsupervised setting can turnout to be problematic.

AshortcomingoftheTCKisthatunbiasedpredictionsareonly guaranteed forignorable missingness, i.e.the kernel cannot take advantage ofinformativemissingpatternsfrequently occurringin medicalapplications.Toovercomethislimitation,inthiswork,we presentanoveltimeseriesclusterkernel,TCKIM.Inourapproach, we createarepresentationofthemissingpatternsusingmasking, i.e.werepresentthemissingpatternsusingbinaryindicator time series.Bydoing so,we obtain MTSconsistingofboth continuous and discreteattributes. To modelthesetime series,we introduce mixed mode Bayesian mixture models, which can effectively ex- ploitinformationprovidedbythemissingpatterns.

The time series cluster kernels are particularly useful in un- supervised settings. In many practical applications such as e.g.

medicine it is not feasible to obtain completely labeled training sets [25],butinsome casesitispossibleto annotatea fewsam- ples withlabels, i.e.incomplete label informationis available. In order to exploit the incomplete label information, we propose a semi-supervised MTSkernel,ssTCK.Inour approach,weincorpo- rateideasfrominformationtheorytomeasuresimilaritiesbetween distributions.Morespecifically,weemploytheKullback-Leiblerdi- vergencetoassignlabelstounlabeleddata.

ExperimentsonbenchmarkMTSdatasetsandareal-worldcase study of patients suffering from hospital-acquired infections, de-

scribedbylongitudinalelectronic healthrecorddata,demonstrate theeffectivenessoftheproposedTCKIMandssTCKkernels.

The remainder of this paper is organized as follows.

Section 2 presents background on MTS kernels. The two pro- posed kernels are described in Sections 3 and 4, respectively.

Experiments on synthetic andbenchmark datasets are presented in Section 5, whereas the case study is described in Section 6. Section7concludesthepaper.

2. Multivariatetimeserieskernelstohandlemissingdata

Kernel methods have been of great importance in machine learning for several decades and have applications in many dif- ferent fields [26–28]. Within the context of time series, a kernel isa similarity measurethat also ispositive semi-definite[29,30]. Oncedefined, such similaritiesbetweenpairs of time seriesmay be utilized ina wide range of applications,such as classification orclustering,benefitingfromthevastbodyofworkinthefieldof kernelmethods.HereweprovideanoverviewofMTSkernels,and describehowtheydealwithmissingdata.

Thesimplestofallkernelfunctionsisthelinearkernel,which fortwodatapointsrepresentedasvectors,xandy,isgivenbythe innerproduct

x,y

,possiblyplusaconstantc.Onecanalsoapply alinearkerneltopairsofMTSoncetheyareunfoldedintovectors.

However,bydoingsotheinformationthattheyareMTSandthere mightbeinherentdependenciesintimeandbetweenattributes,is then lost. Nevertheless,in some cases such a kernelcan be effi- cient,especiallyiftheMTSareshort[31].IftheMTScontainmiss- ingdata,thelinearkernelrequiresa preprocessingstepinvolving e.g.imputation.

Themostwidelyusedtimeseriessimilaritymeasureisdynamic timewarping(DTW)[32–34],wherethesimilarityisquantifiedas the alignment cost between the MTS. More specifically, in DTW the time dimension of one or both of thetime seriesis warped toachieveabetteralignment.DespitethesuccessofDTWinmany applications,similarlytomanyothersimilaritymeasures,itisnon- metricandthereforecannotnon-triviallybeusedtodesignapos- itive semi-definite kernel [35]. Hence, it is not suited for kernel methodsin itsoriginal formulation.However, becauseofits pop- ularity therehavebeen attemptsto designkernels exploitingthe DTW.Forexample,Cuturietal.designedaDTW-basedkernelus- ingglobalalignments[36].Anefficientversionoftheglobalalign- ment kernel(GAK)is provided inCuturi[37].The latterhas two hyperparameters,namelythekernelbandwidthandthetriangular parameter.GAKdoesnotnaturallydealwithmissingdataandin- completedatasets,andthereforealsorequiresapreprocessingstep involvingimputation.

Two MTS kernels that can naturally deal with missing data withouthavingtoresorttoimputationarethelearnedpatternsim- ilarity(LPS)[38] andTCK. LPSgeneralizesthewell-known autore- gressivemodelstolocalautopatternsusingmultiplelagvaluesfor autocorrelation.Theseautopatternsaresupposedtocapturethelo- caldependencystructureinthetimeseriesandarelearnedusing atree-based(randomforest)learningstrategy.Morespecifically,a time series is represented as a matrix ofsegments. Randomness isinjectedtothelearningprocessbyrandomlychoosingtimeseg- ment(columninthematrix)andlagpforeachtreeintherandom forest. Abag-of-words type compressed representationis created fromtheoutputoftheleaf-nodesforeachtree.Thefinaltimese- riesrepresentationiscreatedby concatenating therepresentation obtainedfromtheindividualtrees,whichinturnareusedtocom- putethesimilarityusingahistogramintersectionkernel[39].

The TCK isbased on an ensemble learningapproach wherein robustnessto hyperparameters is ensured by joiningthe cluster- ingresults ofmanyGaussianmixturemodels (GMM)toformthe

(3)

final kernel.Hence, no criticalhyperparametershave tobe tuned by theuser,andtheTCKcanbe learnedinanunsupervisedman- ner.Toensurerobustnesstosparselysampleddata,theGMMsthat arethebasemodelsintheensemble,areextendedusinginforma- tivepriordistributionssuchthatthemissingdataisexplicitlydealt with.Morespecifically,theTCKmatrixisbuiltbyfittingGMMsto thesetofMTSforarangeofnumberofmixturecomponents.The idea isthat by generating partitions atdifferent resolutions, one cancaptureboththelocalandglobalstructureofthedata.More- over,tocapturediversityinthedata,randomnessisinjectedbyfor each resolution (number of components) estimating the mixture parametersforarangeofrandominitializationsandrandomlycho- senhyperparameters.Inaddition,eachGMMseesarandomsubset ofattributesandsegmentsintheMTS.Theposteriordistributions foreachmixturecomponentarethenusedtobuildtheTCKmatrix bytakingtheinnerproductbetweenallpairsofposteriordistribu- tions. Eventually,givenan ensembleofGMMs,theTCKiscreated inanadditivewaybyusingthefactthatthesumofkernelsisalso a kernel. Recently, TCKhas also been extendedto handle spatial dependencies[40].

Despite that LPS and TCKkernels share many properties,the waymissingdataaredealtwithisverydifferent.InLPS,themiss- ing data handling abilities of decision trees are exploited. Along withensemblemethods,fuzzyapproachesandsupportvectorso- lutions, decisiontrees canbe categorizedasmachinelearningap- proaches for handling missing data [12], i.e.the missing data are handlednaturallybythemachinelearningalgorithm.Onecanalso arguethatthewaymissingdataaredealtwithintheTCKbelongs to this category, since an ensemble approach is exploited. How- ever,itcanalsobecategorizedasalikelihood-basedapproachsince theunderlyingmodelsintheensembleareGaussianmixturemod- els.Inthelikelihood-basedapproaches,thefull,incompletedataset is analysed using maximum likelihood (or maximum a posteri- ori, equivalently), typically in combination with the expectation- maximization (EM)algorithm[8,9].Theseapproachesassumethat themissingnessisignorable.

3. Timeseriesclusterkerneltoexploitinformativemissingness

Inthissection,wepresentthenoveltimeseriesclusterkernel, TCKIM,whichiscapableofexploitinginformativemissingness.

Akeycomponentinthetimeseriesclusterkernelframeworkis ensemble learning, inwhichthe basicideaconsistsin combining acollectionofmanybasemodelsintoacompositemodel.Agood suchcompositemodelwillhavestatistical,computationalandrep- resentational advantagessuch aslowervariance, lower sensitivity tolocaloptimaandiscapableofrepresentingabroaderspanfunc- tions (increasedexpressiveness),respectively,comparedtothe in- dividualbasemodels[41].Keytoachievethisisdiversityandaccu- racy[42],i.e.thebasemodelscannotmakethesameerrorsonnew test data andhaveto performbetter than randomguessing. This canbedonebyintegratingmultipleoutcomesofthesame(weak) basemodelasitistrainedunderdifferent,oftenrandomlychosen, settings(parameters,initialization,subsampling,etc.)toensuredi- versity[43].

IntheTCKIM kernel,thebasemodelisamixedmodeBayesian mixturemodel.Next,weprovidethedetailsofthismodel.

Notation

Thefollowingnotationisused.Amultivariatetimeseries(MTS) X is defined as a (finite) combination of univariate time series (UTS), X=

{

xv∈RT

| v

=1,2,...,V

}

, where each attribute, xv, is a UTS of length T. The number of UTS, V, is the dimension of X. The length T of the UTSxv is also the length of the MTS X. Hence, a V-dimensional MTS, X, of length T can be represented

as a matrix in RV×T. Given a dataset of N MTS, we denote X(n) then-thMTS.An incompletelyobserved MTSisdescribedby the pair U(n)=(X(n),R(n)), where R(n) is a binary MTS with entry r(vn)(t)=0ifthe realizationxv(n)(t) ismissingandrv(n)(t)=1ifit isobserved.

Mixedmodemixturemodel

AssumethataMTSU=(X,R)isgeneratedfromtwo modes.X isaV-variatereal-valuedMTS(X∈RV×T),whereasRisaV-variate binaryMTS(R

{

0,1

}

V×T).Further,weassumethatU isgenerated

fromafinitemixturedensity, p

(

U

|

,

)

=

G

g=1

θ

gf

(

U

| φ

g

)

, (1)

whereGisthenumberofcomponents, fisthedensityofthecom- ponents parametrized by =(

φ

1,...,

φ

G), and =(

θ

1,...,

θ

g) arethemixingcoefficients,0≤

θ

G≤1andG

g=1

θ

g=1.

Now,introducealatentrandomvariableZ,representedasaG- dimensional one-hotvector Z=(Z1,. . .,ZG), whosemarginal dis- tributionisgivenby p(Z

|

)=G

g=1

θ

gZg. Theunobservedvariable Z records the membership of U and therefore Zg=1 if U be- longs to componentg andZg=0otherwise. Hence, p(U

|

Z,)= G

g=1f(U

| φ

g)Zg,andthereforeitfollowsthat p

(

U,Z

|

,

)

=p

(

U

|

Z,

)

p

(

Z

| )

=

G

g=1

[f

(

U

| φ

g

) θ

g]Zg (2)

U=(X,R)consistsoftwomodalitiesX andR.Wenownaivelyas- sumethat

f

(

U

| φ

g

)

=f

(

X

|

R,

μ

g,

g

)

f

(

R

| β

g

)

, (3) where f(X

|

R,

μ

g,g)isadensityfunctiongivenby

f

(

X

|

R,

μ

g,

g

)

= V

v=1

T

t=1

N

(

xv

(

t

) | μ

gv

(

t

)

,

σ

gv

)

rv(t), (4) and f(R

| β

g)isaprobabilitymassgivenby

f

(

R

| β

g

)

= V

v=1

T

t=1

β

grvv(tt)

(

1

β

gvt

)

1rv(t). (5) The parameters of each component are

φ

g=(

μ

g,g,

β

g), where

μ

g=

{ μ

gv∈RT

| v

=1,...,V

}

isatime-dependentmean(

μ

gv isa UTSoflength T),g=diag

{ σ

g12,...,

σ

gV2

}

isa time-constantdiag- onalcovariancematrixinwhich

σ

g2visthevarianceofattribute

v

,

and

β

gvt∈[0,1]aretheparametersoftheBernoullimixturemodel inEq.(5).Theideaisthateventhoughthemissingnessmechanism isignoredin f(X

|

R,

μ

g,g),whichisonlycomputedovertheob- served data,theBernoulli term f(R

| β

g) willcaptureinformation fromthemissingpatterns.

The conditional probability of Z given U, can be found using Bayes’theorem,

π

gP

(

Zg=1

|

U,

,

)

=

θ

g

V v=1

T

t=1[N

(

xv

(

t

) | μ

gv

(

t

)

,

σ

gv

) β

gvt]rv(t)

(

1

β

gvt

)

1rv(t) G

g=1

θ

gV

v=1

T t=1

[N

(

xv

(

t

) | μ

gv

(

t

)

,

σ

gv

) β

gvt]rv(t)

(

1

β

gvt

)

1−rv(t) .

(6) Similarlyto[24],weintroduceaBayesianextensionandputin- formative priors over the parameters of the normal distribution, whichenforcessmoothnessovertimeandthatclusterscontaining

(4)

few time series,to haveparameters similar to themean andco- variance computed overthe wholedataset. Akernel-basedGaus- sianpriorisdefinedforthemean,P(

μ

gv)=N

μ

gv

|

mv,Sv

.mvare theempiricalmeansandthepriorcovariancematrices,Sv,arede- fined asSv=svK, wheresv areempiricalstandard deviationsand K is a kernel matrix, whose elements are Ktt =b0exp(a0(tt)2), t,t =1,...,T.a0,b0 areuser-definedhyperparameters.An inverseGammadistributionpriorisputonthestandarddeviation

σ

gv,P(

σ

gv)

σ

g−Nv0exp

N2σ0s2v

gv ,whereN0 isauser-definedhyper- parameter.Wedenote=

{

a0,b0,N0

}

thesetofhyperparameters.

Then, givena dataset

{

U(n)

}

Nn=1,theparameters

{

,

}

canbe

estimated using maximuma posteriori expectation maximization (MAP-EM)[44,45].ThisleadstoAlgorithm1.

Algorithm1 MAP-EMformixedmodemixturemodel.

Require: Dataset

{

U(n)=(X(n),R(n))

}

Nn=1, hyperparameters and numberofmixturesG.

1: Initialize the parameters =(

θ

1,...,

θ

G) and =

{ μ

g,

σ

g,

β

g

}

Gg=1.

2: E-step. ForeachMTSU(n),evaluate theposteriorprobabilities usingEq.(6)withthecurrentparameterestimates.

3: M-step.Updateparametersusingthecurrentposteriors

θ

g=N1N n=1

π

g(n)

σ

g2v=N0s2v+N

n=1

T

t=1r(vn)

(

t

) π

g(n)

x(vn)

(

t

)

μ

gv

(

t

)

2

N0+N

n=1

T

t=1rv(n)

(

t

) π

g(n)

μ

gv=S−1v mv+

σ

g−2v

N

n=1

π

g(n)diag

(

rv(n)

)

x(vn) S−1v +

σ

g−2v

N

n=1

π

g(n)diag

(

r(vn)

) β

gvt=

(

Nn=1

π

g(n)

)

1Nn=1

π

g(n)r(vn)

(

t

)

4: Repeatstep2-3untilconvergence.

Ensure: Posteriors (n)

π

1(n),...,

π

G(n) T

and parameter esti- matesand.

3.1. Formingthekernel

We now explainhow themixedmode mixture modelisused toformtheTCKIM kernel.

We use themixed mode Bayesianmixture model asthe base model inanensemble approach.Toensurediversity, wevarythe number of components forthe base models by sampling froma setofintegersIC=

{

I,...,I+C

}

.Foreach numberofcomponents, we applyQ differentrandominitialconditionsandhyperparame- ters. We let Q=

{

q=(q1,q2)

|

q1=1,...Q,q2IC

}

be theindex

set keeping trackof initial conditions and hyperparameters (q1), andthenumberofcomponents(q2).Eachbasemodelqistrained onarandomsubsetofMTS

{

(X(n),R(n))

}

nη(q).Moreover,foreach q,we selectrandomsubsets ofvariablesV(q) aswellasrandom timesegmentsT(q).

The inner products of the normalized posterior distributions from each mixture component are then added up to build the TCKIM kernel matrix. Note that, in addition to introducing novel basemodelstoaccountforinformativemissingness,wealsomod- ify the kernel by normalizing the vectors of posteriors to have unit lengthinthel2-norm.Thisprovidesan additionalregulariza- tionthatmayincreasethegeneralizationcapabilityofthelearned model. The details of the method are presented in Algorithm 2. The kernelforMTSnotavailable duringtrainingcanbeevaluated accordingtoAlgorithm3.

Algorithm2 Timeseriesclusterkernel.Trainingphase.

Require: TrainingsetofMTS

{

(X(n),R(n))

}

Nn=1,Q initializations,set ofintegersIC controllingnumberofcomponentsforeachbase model.

1: InitializekernelmatrixK=0N×N. 2: forqQdo

3: Computeposteriors (n)(q)(

π

1(n),...,

π

q(2n))T,by fittinga mixedmodemixturemodelwithq2clusterstothedatasetand byrandomlyselecting:

i. hyperparameters(q),

ii. a time segment T(q) of length Tmin

|

T

(

q

) |

Tmax to

extractfromeachX(n)andR(n),

iv. a subset of attributes V(q), with cardinality Vmin

|

V

(

q

) |

Vmax,toextractfromeachX(n)andR(n), vi. asubsetofMTS,

η

(q),withNmin

| η (

q

) |

N,

vii. initializationofthemixtureparameters(q)and(q).

4: Updatekernelmatrix,Knm=Knm+(n(n))((qq))T·(m(m)()q(q)). 5: endfor

Ensure:K kernelmatrix,timesegmentsT(q),subsetsofattributes V(q),subsetsofMTS

η

(q),parameters(q),(q)andposteri- ors(n)(q).

Algorithm3 Timeseriesclusterkernel.Testphase.

Require: Test set

X(m)

M

m=1,time segments T(q)subsets ofat- tributes V(q), VR(q), subsets of MTS

η

(q), parameters (q), (q)andposteriors(n)(q).

1: InitializekernelmatrixK=0N×M. 2: forqQdo

3: Compute posteriors (m)(q), m=1,...,M using the mix- tureparameters(q),(q).

4: Updatekernelmatrix,Knm =Knm +(n(n))((qq))T·(m(m)()q(q)). 5: endfor

Ensure:Ktestkernelmatrix.

4. Semi-supervisedtimeseriesclusterkernel

Thissectionpresentsasemi-supervisedMTSkernel,ssTCK,ca- pable of exploiting incomplete label information. In ssTCK, the base mixture models are learnedexactly in the same wayas in TCKorTCKIM,i.e.ifthere isno missingdata,or themissingness is ignorable, the base models will be the Bayesian GMMs. Con- versely,ifthemissingnessisinformative,thebasemodelsarethe mixedmode Bayesian mixture models presented inthe previous section. Bothapproacheswillassociateeach MTS X(n) witha q2- dimensionalposterior(n)

π

1(n),...,

π

q(n2)

T,where

π

g(n)repre-

sentstheprobabilitythattheMTSbelongstocomponentgandq2 isthetotalnumberofcomponentsinthebasemixturemodel.

InssTCK, labelinformation isincorporated inan intermediate processingstepinwhichtheposteriors (n)are transformed,be- forethetransformed posteriorsare sentinto Algorithms2and3. Moreprecisely,thetransformationconsistsinmappingtheposte- riorforthemixturecomponentstoaclass“posterior” (probability), i.e. we seek to find a function M: [0,1]q2→[0,1]Nc, (n)−→M ˜(n).Hence,wewanttoexploit theincompletelabelinformation to find a transformation that merges the q2 components of the mixturemodelintoNcclusters,whereNcisthenumber ofclasses.

ThemappingMcanbethoughtofasa(soft)Nc-classclassifier, andhencetherecouldbemanypossiblewaysoflearningM.How- ever,choosingatooflexibleclassifier forthispurposeleadstoan increasedriskofoverfittingandcould alsounnecessarilyincrease

(5)

thealgorithmiccomplexity.Forthesereasons,werestrictourselves tosearchingforalineartransformation

M

(

(n)

)

=WT

(n), W ∈[0,1]q2×Nc. (7) SincetheNc-dimensionaloutput˜(n)=M((n))shouldrepresent aprobability distribution,weaddtheconstraintNc

i=1Wji=1, j= 1,...,q2.

Anaturalfirststepistofirstassumethatthelabelinformation iscompleteandlookatthecorrespondingsupervisedkernel.Inthe followingtwo subsections,wedescribeourproposed methodsfor learningthetransformationM insupervisedandsemi-supervised settings,respectively.

4.1. Supervisedtimeseriesclusterkernel(sTCK)

Supervisedsetting.Eachbasemixturemodelconsistsofq2com- ponents,andweassumethatthenumberofcomponentsisgreater or equal to the number of classes Nc. Further, assume that each MTS X(n) in the trainingsetis associatedwith a Nc–dimensional one-hot vector y(n),which represents its label. Hence, thelabels ofthetrainingsetcanberepresentedvia amatrixY

{

0,1

}

N×Nc, whereNisthenumberofMTSinthetrainingset.

We approach this problem by considering one component at thetime.Foragivencomponentg,thetaskistoassociateitwith a class. One naturalwayto dothisis to identifyall members of component gandthen simply count how manytimes each label occur. To account for class imbalance, one can then divide each count bythenumberofMTSinthecorrespondingclass.Onepos- sible optionwouldthen be to assignthe componentto theclass with the largest normalizedcount. However, by doing so, one is not accounting for uncertainty/disagreement within the compo- nent. Hence, amore elegant alternativeis tosimply usethenor- malized countsasthe weightsinthe matrixW.Additionally,one hastoaccountforthateachMTScansimultaneouslybelongtosev- eralcomponents,i.e.eachMTSX(n)hasaonlysoftmembershipto thecomponentg,determinedbythevalue

π

g(n).Thiscanbedone using(n)asweightsinthefirststep.Thisprocedure issumma- rizedinAlgorithm4.

Algorithm4 Supervisedposteriortransformation.

Require: Posteriors

{

(n)

}

Nn=1 from mixture models consisting of q2 componentsandlabels

{

y(n)

}

Nn=1,

1: fori=1,...,q2, j=1,...,Ncdo 2: ComputeWi j=

N n=1y(n)

j πi(n) N

n=1y(n) j

.

3: Wi j=NWci j j=1Wi j. 4: endfor

5: Transformtrainingandtestposteriorsvia˜ =WT Ensure: Transformedposteriors˜(n)

4.2. Semi-supervisedtimeseriesclusterkernel(ssTCK)

Setting Assumethatthelabels

{

y(n)

}

Ln=1,L<N,areknownand

{

y(n)

}

Nn=L+1 areunknown.

In thissetting,ifone naively triestoapply Algorithm4based ononlythelabeledpartofthedataset,oneendsupdividingby0s.

The reasonisthat someofthecomponentsinthemixturemodel willcontainonlyunlabeledMTS(thesoftlabelanalogyisthatthe probability that anyofthe labeled MTS belong tothat particular component iszeroor veryclose tozero). Hence, we needa way toassignlabelstothecomponentsthatdonotcontainanylabeled MTS.

Notethateach componentisdescribedbyaprobability distri- bution.Anaturalmeasureofdissimilaritybetweenprobabilitydis- tributions is the Kullback–Leibler(KL) divergence[46]. Moreover, since the components are described by parametric distributions, theKLdivergencehasasimpleclosed-formexpression.TheKLdi- vergencebetweentwocomponents,i and j,inourBayesianGMM isgivenby

DKL

(

f(i)

f(j)

)

= 1 2

V

v=1

T

t=1

σ

i2v

σ

jv2+

σ

jv2

( μ

jv

(

t

)

μ

iv

(

t

))

2−1+log

( σ

j2v

)

−log

( σ

i2v

)

, (8)

where f(i)= f(X

|

R,

μ

i,i) is the density given in Eq. (4). The KL-divergencecanbemadesymmetricviathetransformation DSKL

(

f(i)

f(j)

)

= 1

2

DKL

(

f(i)

f(j)

)

+DKL

(

f(j)

f(i)

)

. (9)

Theunderlyingideainoursemi-supervisedframework istolearn thetransformationW fortheclusterswithonlyunlabeledpoints by finding the nearest cluster (in the DSKL-sense) that contain la- beledpoints.ThisleadstoAlgorithm5.

Algorithm5 Semi-supervisedposteriortransformation.

Require: Posteriors

{

(n)

}

Nn=1 frommixture models consisting of q2components,labels

{

y(n)

}

Ln=1,andhyperparameterh. 1: fori=1,...,q2, j=1,...,Nc do

2: ComputeWi j=

N n=1y(n)

j πi(n) N

n=1y(n) j

. 3: endfor

4: forallks.t.Nc

j=1Wk j<hdo 5: LetL=

{

ls.t.Nc

j=1Wl jh

}

6: Wk j=Wl jwherel=argmin

l∈L DSKL(f(k)

f(l)).

7: endfor

8: fori=1,...,q2, j=1,...,Nc do 9: Wi j= NWci j

j=1Wi j. 10: endfor

11: Transformtrainingortestposteriorvia˜ =WT Ensure:Transformedposteriors˜(n)

5. Experimentsonsyntheticandbenchmarkdatasets

The experimentsin thispaperconsists oftwo parts. Thepur- poseofthefirstpartwasto demonstratewithina controlleden- vironment situations where the proposed TCKIM and ssTCK ker- nels might prove more useful than the TCK. In the second part (Section6),wepresentacasestudyfromareal-worldmedicalap- plicationinwhichwecomparedtoseveralbaselinemethods.

In the first part, we considered synthetic and benchmark datasets. The following experimental setup was considered. We performedkernelprincipalcomponentanalysis(KPCA) usingtime series cluster kernels and let the dimensionality of the embed- ding be 10.Thereafter, we traineda kNN-classifierwithk=1 on the embedding and evaluated performance in terms of classifi- cation accuracy on an independent test set. We let Q=30 and

Table 1

Accuracy on the synthetic VAR(1) dataset.

Unsupervised Semi-supervised Supervised

TCK 0.826 0.854 0.867

TCK IM 0.933 0.967 0.970

(6)

Table 2

Description of benchmark time series datasets. Column 2 to 5 show the number of attributes, samples in training and test set, and number of classes, respectively. T min

is the length of the shortest MTS in the dataset and T maxthe longest MTS. T is the length of the MTS after the transformation.

Datasets Attributes Train Test N c T min T max T Source

uWave 3 200 4278 8 315 315 25 UCR

Char.Traj. 3 300 2558 20 109 205 23 UCI

Wafer 6 298 896 2 104 198 25 Olsz.

Japan.vow. 12 270 370 9 7 29 15 UCI

IC=

{

Nc,. . .,Nc+20

}

.An additionalhyperparameter hwasintro- duced for ssTCK. We set h to 10−1 in our experiments. We also standardizedeachattributetozeromeanandunitstandarddevia- tion.

5.1. Syntheticexample

To illustrate the effectiveness of the proposed methods, we firstconsideredacontrolledexperimentinwhichasyntheticMTS datasetwithtwoclasseswassampledfromafirst-ordervectorau- toregressivemodel,

x1

(

t

)

x2

(

t

)

=

α

1

α

2

+

ρ

1 0 0

ρ

2

x1

(

t−1

)

x2

(

t1

)

+

ξ

1

(

t

) ξ

2

(

t

)

(10)

Tomakex1(t)andx2(t)correlatedwithcorr(x1(t),x2(t))=

ρ

,we chose the noise term s.t., corr(

ξ

1(t),

ξ

2(t))=

ρ

(1

ρ

1

ρ

2)[(1

ρ

12)(1

ρ

22)]−1. Forthefirstclass(y=1),we generated100two- variate MTS of length 50 for the training and 100 for the test, from the VAR(1)-model with parameters

ρ

=

ρ

1=

ρ

2=0.8 and E[(x1(t),x2(t))T

|

y=1]=(0.5,−0.5)T. Analogously, the MTS of the second class (y=2) were generated using parameters

ρ

=

−0.8,

ρ

1=

ρ

2=0.6andE[(x1(t),x2(t))T

|

y=2]=(0,0)T. TosimulateMNARandinjectinformativemissingpatterns,we let x(in)(t) have a probability p(n) of being missing, given that xi(n)(t)>−1, i=1,2. We let p(n)=0.9 if y(n)=1 and p(n)=0.8 otherwise.Bydoingso,themissingratiowasroughly63%inboth classes.

Table 1 shows theaccuracy on the test data for the different kernels.Asexpected,theTCKgivesthelowestaccuracy,0.826.The ssTCK improvestheaccuracy considerably(0.854),andthesuper- vised version(sTCK)givesfurther improvement(0.867).However, as we can see, theeffect of explicitlymodelling the missingness mechanism in the TCKIM is larger. In this case the accuracy in- creases from 0.826 to 0.933. The two corresponding embeddings areplottedinFig.1(a)and(d),respectively.IntheTCKembedding, therearemanypointsfromdifferentclassesthatoverlapwitheach other,whereas fortheTCKIM thenumberofoverlapping pointsis muchlower.

The ssTCKIM improves the accuracy to 0.967 (from 0.933 for TCKIM and 0.854 for ssTCK).The two embeddings obtainedusing the semi-supervisedmethods are showninFig. 1(b)and(e).The supervisedversionsTCKIMyieldsaslightimprovementintermsof accuracycomparedtossTCKIM(0.970vs.0.967).Plotsofthesuper- visedembeddingsare showninFig.1(c)and(f).We canseethat forthesTCKIMtheclassesareclearlyseparated.

5.2. PerformanceofssTCKonbenchmarkdatasets

Thepurposeoftheexperimentsreportedinthefollowingpara- graphwastoevaluatetheimpactofincorporatingincompletelabel informationinthessTCK.Towardsthatend,weconsideredbench- markdatasetsandartificiallymodifiedthenumberoflabeledMTS in thetraining sets.We applied theproposed ssTCK to fourMTS benchmark datasets fromtheUCR andUCI databases [47,48]and

Table 3

Classification accuracy for benchmark datasets obtained using TCK, ssTCK and sTCK.

Datasets TCK ssTCK sTCK

Char. Traj. 0.908 0.928 0.934

uWave 0.867 0.881 0.894

Wafer 0.956 0.970 0.970

Japanese vowels 0.946 0.962 0.968

other published work [49], described in Table 2. Since some of the datasets contain MTSof varying length, we followed the ap- proach of Wang et al. [50] and transformed all the MTS in the samedatasettothesamelength,T,determinedbyT=

Tmax Tmax 25

, where Tmax is the length of the longest MTS in the dataset and

isthe ceilingoperator. ThenumberoflabeledMTSwassetto

max

{

20,Nc

}

.ssTCKwascomparedtoordinaryTCKandsTCK(as-

sumingcompletelabelinformationinthelattercase).

Table3 showstheperformance ofssTCK forthe4 benchmark datasets.Aswecansee,comparedtoTCK,theaccuracyingeneral increasesusingssTCK.FortheWaferdataset,ssTCKyieldsthesame performanceasthesupervisedkernel.Forthethreeotherdatasets, theperformanceofssTCKisslightlyworsethansTCK.Theseexper- imentsdemonstratethatssTCKiscapableofexploitingincomplete labelinformation.

Further, we created 8 datasets by randomly removing 50%

and 80%, respectively, of the values in each of the 4 benchmark datasets.AswecanseefromtheresultspresentedinTable4,also inpresenceofmissingdatatheaccuracyingeneralincreasesusing ssTCK,comparedtoTCK.

Forcomparison,inTable4we alsoaddedtheresultsobtained usingthreeotherkernels;GAK,thelinearkernel,andLPS.GAKand thelinearkernelcannotprocessincompleteMTSandthereforewe created complete datasets using mean imputation for these two kernels.LPS2wasrunusingdefaulthyperparameters,withtheex- ceptionthat weadjusted thesegmentlength tobesampledfrom theinterval[6,0.8T]toaccountfortherelativelyshortMTSinour datasets.Inaccordancewith[51],forGAK3 we setthebandwidth

σ

to0.1timesthemediandistanceofall MTSinthetrainingset

scaledbythesquarerootofthemedianlengthofallMTS,andthe triangular parameter to 0.2 timesthe medianlength of all MTS.

Distances were measured usingthe canonical metric induced by theFrobeniusnorm.In thelinearkernelwesetthe constantc to 0.Aswecansee,theperformanceofthesekernelsisconsiderably worsethanthetime seriesclusterkernels for7outof8datasets.

ForuWavewith50%missingness,theperformanceofGAKandthe linearkernelissimilartotheTCKkernels.

2Matlab implementation: http://www.mustafabaydogan.com/ .

3Matlab implementation: http://www.marcocuturi.net/GA.html .

(7)

Fig. 1. Plot of the two-dimensional KPCA representation of the synthetic data obtained using 6 different time series cluster kernels. The datapoints are colour-coded according to their labels.

Table 4

Classification accuracy for benchmark datasets obtained using TCK, ssTCK and sTCK.

Missing rate Datasets TCK ssTCK sTCK GAK Linear LPS 50% Char. Traj. 0.751 0.780 0.797 0.588 0.589 0.127

uWave 0.812 0.834 0.850 0.828 0.813 0.411 Wafer 0.956 0.970 0.972 0.792 0.791 0.823 Japanese vowels 0.929 0.948 0.947 0.827 0.824 0.746 80% Char. Traj. 0.282 0.310 0.331 0.194 0.192 0.062 uWave 0.589 0.592 0.603 0.441 0.464 0.234 Wafer 0.926 0.934 0.934 0.796 0.805 0.819 Japanese vowels 0.809 0.836 0.847 0.473 0.489 0.389

5.3. Exploitinginformativemissingnessinbenchmarkdatasets

To evaluate the effect of modelling the missing patterns in TCKIM,we generated8 versionsof theWafer andJapanese vow- elsdatasetsby manuallyinjectingmissingelementsusingthefol- lowingprocedure.Foreach attribute

v

{

1,...,V

}

,anumbercv

{

−1,1

}

wasrandomly sampledwithequalprobabilities. Ifcv=1, the attribute

v

ispositively correlated with the labels,otherwise negatively correlated. For each MTS X(n) and attribute, a miss- ingrate

γ

nvwassampledfromtheuniformdistributionU[0.3+E· cv·(y(n)−1),0.7+E·cv·(y(n)−1)]. Thisensures that theoverall missingrateofeachdatasetisapproximately50%.y(n)

{

1,...Nc

}

isthelabeloftheMTSX(n) andE isaparameter,whichwetune foreachdatasetinsuchawaythattheabsolutevalueofthePear- soncorrelationbetweenthemissingratesfortheattributes

γ

vand the labels y(n) takes the values

{

0.2,0.4,0.6,0.8

}

, respectively.

The higherthevalue ofthePearson correlation,thehigheristhe informativemissingness.

Table 5 shows the performance of the proposed TCKIM and threebaselinemodels (TCK,TCKB,andTCK0).The firstbaselineis ordinary TCK,which ignores themissingness mechanism. Forthe Waferdataset,the performanceof thisbaselinewasquite similar across allfour settings.Forthe Japanese vowels dataset,theper- formanceactuallydecreasesastheinformationinthemissingpat- terns increases.In the second baseline, TCKB, we tried to model the missingpatternsby concatenating the binarymissingindica- tor MTS R to the MTS X and creating a new MTS with 2V at- tributes.Then,wetrainedordinaryTCKonthisrepresentation.For theWafer dataset,the performancedecreases considerablyasthe informative missingness increases. For the Japanese vowels, this baselineyields thebestperformancewhenthecorrelation is20%. However, the performance actually decreases as the informative missingnessincreases.Hence,informative missingnessisnot cap- turedwiththisbaseline.Inthelastbaseline,TCK0,weinvestigated ifitispossibletocaptureinformativemissingnessbyimputingze- rosfor the missingvaluesand then trainingthe TCKon theim-

Referanser

RELATERTE DOKUMENTER

We assume that historical time series for

Based on the monthly equally weighted and value weighted return series of 17 industries in the US stock market, I construct and analyze the performance of four different

When GARCH models were introduced for modelling financial time series, the error distribution was assumed to be normal.. However, many studies have shown that financial time series

The patterns use the time series data on historical hourly electricity prices collected from the OSL (that provided in the Nord Pool) and the EEX market. 3) And we estimate that

We use the chemistry transport model OsloCTM2 to model historical time series of BC concentration and deposition from energy and industrial sources and compare these to

Abstract Cryptocurrencies have recently gained a lot of interest from investors, central banks and governments worldwide. The lack of any form of political regu- lation and

We estimated a SVAR model based on monthly data for the vector time

analysis; Time series decomposition; Bayesian model averaging; Disturbances; Nonlinear dynamics; Regime 38 ..