Time series cluster kernels to exploit informative missingness and incomplete label information


Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog


Karl Øyvind Mikalsen^{a,b,1,∗}, Cristina Soguero-Ruiz^{c}, Filippo Maria Bianchi^{d}, Arthur Revhaug^{b}, Robert Jenssen^{a,1}

^a Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø, Norway

^b Department of Gastrointestinal Surgery, University Hospital of North Norway (UNN), Tromsø, Norway

^c Department of Signal Theory and Comm., Telematics and Computing, Universidad Rey Juan Carlos, Fuenlabrada, Spain

^d Department of Mathematics and Statistics, UiT, Tromsø, Norway

Article info

Article history: Received 27 November 2018; Revised 11 November 2020; Accepted 8 February 2021; Available online 20 February 2021

Keywords: Multivariate time series; Kernel methods; Missing data; Informative missingness; Semi-supervised learning

Abstract

The time series cluster kernel (TCK) provides a powerful tool for analysing multivariate time series subject to missing data. TCK is designed using an ensemble learning approach in which Bayesian mixture models form the base models. Because of the Bayesian approach, TCK can naturally deal with missing values without resorting to imputation, and the ensemble strategy ensures robustness to hyperparameters, making it particularly well suited for unsupervised learning.

However, TCK assumes missing at random and that the underlying missingness mechanism is ignorable, i.e. uninformative, an assumption that does not hold in many real-world applications, such as medicine. To overcome this limitation, we present a kernel capable of exploiting the potentially rich information in the missing values and patterns, as well as the information from the observed data. In our approach, we create a representation of the missing pattern, which is incorporated into mixed mode mixture models in such a way that the information provided by the missing patterns is effectively exploited. Moreover, we also propose a semi-supervised kernel, capable of taking advantage of incomplete label information to learn more accurate similarities.

Experiments on benchmark data, as well as a real-world case study of patients described by longitudinal electronic health record data who potentially suffer from hospital-acquired infections, demonstrate the effectiveness of the proposed methods.

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

1. Introduction

Multivariate time series (MTS) frequently occur in a whole range of practical applications such as medicine, biology, and climate studies, to name a few. A challenge that complicates the analysis is that real-world MTS are often subject to large amounts of missing data. Traditionally, missingness mechanisms have been categorized into missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [1]. The main difference between these mechanisms consists in whether the missingness is ignorable (MCAR and MAR) or non-ignorable (MNAR) [1–3]. In medicine, for example, non-ignorable missingness can occur when the missing patterns R are related to the disease under study Y. In this case, the distribution of the missing patterns for diseased patients is not equal to the corresponding distribution for the control group, i.e. $p(R \mid Y = 1) \neq p(R \mid Y = 0)$. Hence, the missingness is informative [4–7]. By contrast, uninformative missingness will be referred to as ignorable in the remainder of this paper.

∗ Corresponding author.

E-mail address: [email protected] (K. Øyvind Mikalsen).

1 KM, RJ are with the UiT Machine Learning Group: machine-learning.uit.no.

Both ignorable and informative missingness occur in real-world data. An example from medicine of ignorable missingness occurs if a clinician orders lab tests for a patient and the tests are performed, but because of an error the results are not recorded. On the other hand, informative missingness could occur if it is decided not to perform lab tests because the doctor thinks the patient is in good shape. In the latter case, the missing values and patterns potentially contain rich information about the diseases and clinical outcomes for the patient. Efficient data-driven approaches aiming to extract knowledge, perform predictive modelling, etc., must be capable of capturing this information.

https://doi.org/10.1016/j.patcog.2021.107896


Various methods have been proposed to handle missing data in MTS [8–11]. One simple approach is to create a complete dataset by discarding the time series with missing data. However, this gives unbiased predictions only if the missingness mechanism is MCAR. As an alternative, a preprocessing step involving imputation of missing values with some estimated value, such as the mean, is common. Other so-called single imputation methods exploit machine learning based methods such as multilayer perceptrons, self-organizing maps, k-nearest neighbors, recurrent neural networks and regression-based imputation [12,13]. Alternatively, one can impute missing values using various smoothing and interpolation techniques [12,14]. Among these, a prominent example is the last observation carried forward (LOCF) scheme that imputes the last non-missing value for the following missing values. Limitations of imputation methods are that they introduce additional bias and they ignore uncertainty associated with the missing values.
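As a small illustration of the LOCF scheme mentioned above, the following sketch fills missing values (encoded here as NaN, which is an assumption of this example) with the most recent observed value. The function name `locf` is hypothetical, and leading missing values are left untouched because there is nothing to carry forward.

```python
import numpy as np

def locf(x):
    """Carry the last observed value forward over NaNs in a 1-D series."""
    x = np.asarray(x, dtype=float).copy()
    for t in range(1, len(x)):
        if np.isnan(x[t]):
            x[t] = x[t - 1]   # reuse the previous (possibly already imputed) value
    return x

filled = locf([1.0, np.nan, np.nan, 4.0, np.nan])
```

Note that the imputed series no longer records which entries were missing, which is exactly the information an imputation-based pipeline discards.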

Multiple imputation [15] resolves this problem, to some extent, by estimating the missing values multiple times and thereby creating multiple complete datasets. Thereafter, e.g. a classifier is trained on all datasets and the results are combined to obtain the final predictions. However, although multiple imputation and other imputation methods can give satisfying results in some scenarios, these are ad-hoc solutions that lead to a multi-step procedure in which the missing data are handled separately and independently from the rest of the analysis. Moreover, the information about which values are actually missing (the missing patterns) is lost, i.e. imputation methods cannot exploit informative missingness.

Due to the aforementioned limitations, several research efforts have been devoted over the last years to processing incomplete time series without relying on imputation [5–7,16–23]. In this regard, powerful kernel methods have been proposed, of which the time series cluster kernel (TCK) [24] is a prominent example. The TCK is designed using an ensemble learning approach in which Bayesian mixture models form the base models. An advantage of TCK, compared to imputation methods, is that the missing data are handled automatically and no additional tasks are left to the user. Multiple imputation instead requires a careful selection of the imputation model and of the other variables needed to do the imputation [8], which can turn out to be problematic, particularly in an unsupervised setting.

A shortcoming of the TCK is that unbiased predictions are only guaranteed for ignorable missingness, i.e. the kernel cannot take advantage of informative missing patterns frequently occurring in medical applications. To overcome this limitation, in this work, we present a novel time series cluster kernel, TCK_IM. In our approach, we create a representation of the missing patterns using masking, i.e. we represent the missing patterns using binary indicator time series. By doing so, we obtain MTS consisting of both continuous and discrete attributes. To model these time series, we introduce mixed mode Bayesian mixture models, which can effectively exploit information provided by the missing patterns.

The time series cluster kernels are particularly useful in unsupervised settings. In many practical applications, such as medicine, it is not feasible to obtain completely labeled training sets [25], but in some cases it is possible to annotate a few samples with labels, i.e. incomplete label information is available. In order to exploit the incomplete label information, we propose a semi-supervised MTS kernel, ssTCK. In our approach, we incorporate ideas from information theory to measure similarities between distributions. More specifically, we employ the Kullback-Leibler divergence to assign labels to unlabeled data.

Experiments on benchmark MTS datasets and a real-world case study of patients suffering from hospital-acquired infections, described by longitudinal electronic health record data, demonstrate the effectiveness of the proposed TCK_IM and ssTCK kernels.

The remainder of this paper is organized as follows. Section 2 presents background on MTS kernels. The two proposed kernels are described in Sections 3 and 4, respectively. Experiments on synthetic and benchmark datasets are presented in Section 5, whereas the case study is described in Section 6. Section 7 concludes the paper.

2. Multivariate time series kernels to handle missing data

Kernel methods have been of great importance in machine learning for several decades and have applications in many different fields [26–28]. Within the context of time series, a kernel is a similarity measure that is also positive semi-definite [29,30]. Once defined, such similarities between pairs of time series may be utilized in a wide range of applications, such as classification or clustering, benefiting from the vast body of work in the field of kernel methods. Here we provide an overview of MTS kernels, and describe how they deal with missing data.

The simplest of all kernel functions is the linear kernel, which for two data points represented as vectors, x and y, is given by the inner product $\langle x, y \rangle$, possibly plus a constant c. One can also apply a linear kernel to pairs of MTS once they are unfolded into vectors. However, by doing so the information that they are MTS, with possibly inherent dependencies in time and between attributes, is lost. Nevertheless, in some cases such a kernel can be efficient, especially if the MTS are short [31]. If the MTS contain missing data, the linear kernel requires a preprocessing step involving e.g. imputation.

The most widely used time series similarity measure is dynamic time warping (DTW) [32–34], where the similarity is quantified as the alignment cost between the MTS. More specifically, in DTW the time dimension of one or both of the time series is warped to achieve a better alignment. Despite the success of DTW in many applications, similarly to many other similarity measures, it is non-metric and therefore cannot straightforwardly be used to design a positive semi-definite kernel [35]. Hence, it is not suited for kernel methods in its original formulation. However, because of its popularity there have been attempts to design kernels exploiting the DTW. For example, Cuturi et al. designed a DTW-based kernel using global alignments [36]. An efficient version of the global alignment kernel (GAK) is provided in Cuturi [37]. The latter has two hyperparameters, namely the kernel bandwidth and the triangular parameter. GAK does not naturally deal with missing data and incomplete datasets, and therefore also requires a preprocessing step involving imputation.

Two MTS kernels that can naturally deal with missing data without having to resort to imputation are the learned pattern similarity (LPS) [38] and TCK. LPS generalizes the well-known autoregressive models to local autopatterns using multiple lag values for autocorrelation. These autopatterns are supposed to capture the local dependency structure in the time series and are learned using a tree-based (random forest) learning strategy. More specifically, a time series is represented as a matrix of segments. Randomness is injected into the learning process by randomly choosing a time segment (column in the matrix) and lag p for each tree in the random forest. A bag-of-words type compressed representation is created from the output of the leaf nodes for each tree. The final time series representation is created by concatenating the representations obtained from the individual trees, which in turn are used to compute the similarity using a histogram intersection kernel [39].

The TCK is based on an ensemble learning approach wherein robustness to hyperparameters is ensured by joining the clustering results of many Gaussian mixture models (GMM) to form the final kernel. Hence, no critical hyperparameters have to be tuned by the user, and the TCK can be learned in an unsupervised manner. To ensure robustness to sparsely sampled data, the GMMs that are the base models in the ensemble are extended using informative prior distributions such that the missing data is explicitly dealt with. More specifically, the TCK matrix is built by fitting GMMs to the set of MTS for a range of numbers of mixture components. The idea is that by generating partitions at different resolutions, one can capture both the local and global structure of the data. Moreover, to capture diversity in the data, randomness is injected by estimating, for each resolution (number of components), the mixture parameters for a range of random initializations and randomly chosen hyperparameters. In addition, each GMM sees a random subset of attributes and segments in the MTS. The posterior distributions for each mixture component are then used to build the TCK matrix by taking the inner product between all pairs of posterior distributions. Eventually, given an ensemble of GMMs, the TCK is created in an additive way by using the fact that the sum of kernels is also a kernel. Recently, TCK has also been extended to handle spatial dependencies [40].

Although the LPS and TCK kernels share many properties, the way missing data are dealt with is very different. In LPS, the missing data handling abilities of decision trees are exploited. Along with ensemble methods, fuzzy approaches and support vector solutions, decision trees can be categorized as machine learning approaches for handling missing data [12], i.e. the missing data are handled naturally by the machine learning algorithm. One can also argue that the way missing data are dealt with in the TCK belongs to this category, since an ensemble approach is exploited. However, it can also be categorized as a likelihood-based approach since the underlying models in the ensemble are Gaussian mixture models. In the likelihood-based approaches, the full, incomplete dataset is analysed using maximum likelihood (or, equivalently, maximum a posteriori), typically in combination with the expectation-maximization (EM) algorithm [8,9]. These approaches assume that the missingness is ignorable.

3. Time series cluster kernel to exploit informative missingness

In this section, we present the novel time series cluster kernel, TCK_IM, which is capable of exploiting informative missingness.

A key component in the time series cluster kernel framework is ensemble learning, in which the basic idea consists in combining a collection of many base models into a composite model. A good such composite model will have statistical, computational and representational advantages, such as lower variance, lower sensitivity to local optima, and the capability of representing a broader span of functions (increased expressiveness), respectively, compared to the individual base models [41]. Key to achieving this is diversity and accuracy [42], i.e. the base models cannot make the same errors on new test data and have to perform better than random guessing. This can be done by integrating multiple outcomes of the same (weak) base model as it is trained under different, often randomly chosen, settings (parameters, initialization, subsampling, etc.) to ensure diversity [43].

In the TCK_IM kernel, the base model is a mixed mode Bayesian mixture model. Next, we provide the details of this model.

Notation

The following notation is used. A multivariate time series (MTS) X is defined as a (finite) combination of univariate time series (UTS), $X = \{x_v \in \mathbb{R}^T \mid v = 1, 2, \ldots, V\}$, where each attribute, $x_v$, is a UTS of length T. The number of UTS, V, is the dimension of X. The length T of the UTS $x_v$ is also the length of the MTS X. Hence, a V-dimensional MTS, X, of length T can be represented as a matrix in $\mathbb{R}^{V \times T}$. Given a dataset of N MTS, we denote by $X^{(n)}$ the n-th MTS. An incompletely observed MTS is described by the pair $U^{(n)} = (X^{(n)}, R^{(n)})$, where $R^{(n)}$ is a binary MTS with entry $r_v^{(n)}(t) = 0$ if the realization $x_v^{(n)}(t)$ is missing and $r_v^{(n)}(t) = 1$ if it is observed.
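To make the notation concrete, the following minimal sketch builds the pair U = (X, R) for a toy MTS with V = 2 attributes and T = 4 time steps. Encoding missing values as NaN is an assumption of this example, not something prescribed by the paper.

```python
import numpy as np

# Toy MTS stored as a V x T matrix; NaN marks a missing realization.
X = np.array([[0.1, np.nan, 0.3, 0.4],
              [np.nan, np.nan, 1.2, 1.0]])

# Binary indicator MTS: r_v(t) = 1 if x_v(t) is observed, 0 if it is missing.
R = (~np.isnan(X)).astype(int)

U = (X, R)  # the incompletely observed MTS
```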

Mixed mode mixture model

Assume that a MTS $U = (X, R)$ is generated from two modes. X is a V-variate real-valued MTS ($X \in \mathbb{R}^{V \times T}$), whereas R is a V-variate binary MTS ($R \in \{0, 1\}^{V \times T}$). Further, we assume that U is generated from a finite mixture density,

$$p(U \mid \Phi, \Theta) = \sum_{g=1}^{G} \theta_g f(U \mid \phi_g), \tag{1}$$

where G is the number of components, f is the density of the components parametrized by $\Phi = (\phi_1, \ldots, \phi_G)$, and $\Theta = (\theta_1, \ldots, \theta_G)$ are the mixing coefficients, $0 \le \theta_g \le 1$ and $\sum_{g=1}^{G} \theta_g = 1$.

Now, introduce a latent random variable Z, represented as a G-dimensional one-hot vector $Z = (Z_1, \ldots, Z_G)$, whose marginal distribution is given by $p(Z \mid \Theta) = \prod_{g=1}^{G} \theta_g^{Z_g}$. The unobserved variable Z records the membership of U and therefore $Z_g = 1$ if U belongs to component g and $Z_g = 0$ otherwise. Hence, $p(U \mid Z, \Phi) = \prod_{g=1}^{G} f(U \mid \phi_g)^{Z_g}$, and therefore it follows that

$$p(U, Z \mid \Phi, \Theta) = p(U \mid Z, \Phi)\, p(Z \mid \Theta) = \prod_{g=1}^{G} \left[ f(U \mid \phi_g)\, \theta_g \right]^{Z_g}. \tag{2}$$

$U = (X, R)$ consists of two modalities X and R. We now naively assume that

$$f(U \mid \phi_g) = f(X \mid R, \mu_g, \Sigma_g)\, f(R \mid \beta_g), \tag{3}$$

where $f(X \mid R, \mu_g, \Sigma_g)$ is a density function given by

$$f(X \mid R, \mu_g, \Sigma_g) = \prod_{v=1}^{V} \prod_{t=1}^{T} \mathcal{N}\left(x_v(t) \mid \mu_{gv}(t), \sigma_{gv}\right)^{r_v(t)}, \tag{4}$$

and $f(R \mid \beta_g)$ is a probability mass given by

$$f(R \mid \beta_g) = \prod_{v=1}^{V} \prod_{t=1}^{T} \beta_{gvt}^{\,r_v(t)} \left(1 - \beta_{gvt}\right)^{1 - r_v(t)}. \tag{5}$$

The parameters of each component are $\phi_g = (\mu_g, \Sigma_g, \beta_g)$, where $\mu_g = \{\mu_{gv} \in \mathbb{R}^T \mid v = 1, \ldots, V\}$ is a time-dependent mean ($\mu_{gv}$ is a UTS of length T), $\Sigma_g = \mathrm{diag}\{\sigma_{g1}^2, \ldots, \sigma_{gV}^2\}$ is a time-constant diagonal covariance matrix in which $\sigma_{gv}^2$ is the variance of attribute v, and $\beta_{gvt} \in [0, 1]$ are the parameters of the Bernoulli mixture model in Eq. (5). The idea is that even though the missingness mechanism is ignored in $f(X \mid R, \mu_g, \Sigma_g)$, which is only computed over the observed data, the Bernoulli term $f(R \mid \beta_g)$ will capture information from the missing patterns.

The conditional probability of Z given U can be found using Bayes' theorem,

$$\pi_g \equiv P(Z_g = 1 \mid U, \Phi, \Theta) = \frac{\theta_g \prod_{v=1}^{V} \prod_{t=1}^{T} \left[\mathcal{N}\left(x_v(t) \mid \mu_{gv}(t), \sigma_{gv}\right) \beta_{gvt}\right]^{r_v(t)} \left(1 - \beta_{gvt}\right)^{1 - r_v(t)}}{\sum_{g'=1}^{G} \theta_{g'} \prod_{v=1}^{V} \prod_{t=1}^{T} \left[\mathcal{N}\left(x_v(t) \mid \mu_{g'v}(t), \sigma_{g'v}\right) \beta_{g'vt}\right]^{r_v(t)} \left(1 - \beta_{g'vt}\right)^{1 - r_v(t)}}. \tag{6}$$

Similarly to [24], we introduce a Bayesian extension and put informative priors over the parameters of the normal distribution, which enforce smoothness over time and ensure that clusters containing few time series have parameters similar to the mean and covariance computed over the whole dataset. A kernel-based Gaussian prior is defined for the mean, $P(\mu_{gv}) = \mathcal{N}(\mu_{gv} \mid m_v, S_v)$. $m_v$ are the empirical means and the prior covariance matrices, $S_v$, are defined as $S_v = s_v K$, where $s_v$ are empirical standard deviations and K is a kernel matrix, whose elements are $K_{tt'} = b_0 \exp(-a_0 (t - t')^2)$, $t, t' = 1, \ldots, T$. $a_0, b_0$ are user-defined hyperparameters. An inverse Gamma distribution prior is put on the standard deviation $\sigma_{gv}$, $P(\sigma_{gv}) \propto \sigma_{gv}^{-N_0} \exp\left(-\frac{N_0 s_v^2}{2 \sigma_{gv}^2}\right)$, where $N_0$ is a user-defined hyperparameter. We denote by $\Omega = \{a_0, b_0, N_0\}$ the set of hyperparameters.
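The posterior in Eq. (6) is a normalized product of the mixing coefficients and the component densities. A minimal sketch of that normalization step, assuming the per-component log-densities $\log f(U^{(n)} \mid \phi_g)$ have already been computed and stacked into an array `log_f` (a hypothetical name), is given below. It uses the standard log-sum-exp shift for numerical stability.

```python
import numpy as np

def posteriors(log_f, theta):
    """Posterior pi[n, g] = P(Z_g = 1 | U^(n)) as in Eq. (6).

    log_f : (N, G) array of log f(U^(n) | phi_g)
    theta : (G,) mixing coefficients
    """
    a = log_f + np.log(theta)                # unnormalized log posterior
    a = a - a.max(axis=1, keepdims=True)     # shift so the largest entry per row is 0
    w = np.exp(a)
    return w / w.sum(axis=1, keepdims=True)  # each row sums to one
```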

Then, given a dataset $\{U^{(n)}\}_{n=1}^{N}$, the parameters $\{\Phi, \Theta\}$ can be estimated using maximum a posteriori expectation maximization (MAP-EM) [44,45]. This leads to Algorithm 1.

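Algorithm 1 MAP-EM for mixed mode mixture model.

Require: Dataset $\{U^{(n)} = (X^{(n)}, R^{(n)})\}_{n=1}^{N}$, hyperparameters $\Omega$ and number of mixtures G.

1: Initialize the parameters $\Theta = (\theta_1, \ldots, \theta_G)$ and $\Phi = \{\mu_g, \sigma_g, \beta_g\}_{g=1}^{G}$.

2: E-step. For each MTS $U^{(n)}$, evaluate the posterior probabilities using Eq. (6) with the current parameter estimates.

3: M-step. Update parameters using the current posteriors

$$\theta_g = N^{-1} \sum_{n=1}^{N} \pi_g^{(n)}$$

$$\sigma_{gv}^2 = \frac{N_0 s_v^2 + \sum_{n=1}^{N} \sum_{t=1}^{T} r_v^{(n)}(t)\, \pi_g^{(n)} \left(x_v^{(n)}(t) - \mu_{gv}(t)\right)^2}{N_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} r_v^{(n)}(t)\, \pi_g^{(n)}}$$

$$\mu_{gv} = \left(S_v^{-1} + \sigma_{gv}^{-2} \sum_{n=1}^{N} \pi_g^{(n)} \mathrm{diag}\left(r_v^{(n)}\right)\right)^{-1} \left(S_v^{-1} m_v + \sigma_{gv}^{-2} \sum_{n=1}^{N} \pi_g^{(n)} \mathrm{diag}\left(r_v^{(n)}\right) x_v^{(n)}\right)$$

$$\beta_{gvt} = \left(\sum_{n=1}^{N} \pi_g^{(n)}\right)^{-1} \sum_{n=1}^{N} \pi_g^{(n)} r_v^{(n)}(t)$$

4: Repeat steps 2-3 until convergence.

Ensure: Posteriors $\Pi^{(n)} \equiv \left(\pi_1^{(n)}, \ldots, \pi_G^{(n)}\right)^T$ and parameter estimates $\Theta$ and $\Phi$.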
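The M-step updates of Algorithm 1 can be sketched in NumPy as below. This is a hedged illustration rather than the authors' implementation: the function name `m_step` and the array layout are assumptions, the variance is updated with the current means before the means are updated with the new variance (the ordering is not specified in the text), and empty components (zero total posterior mass) are not handled.

```python
import numpy as np

def m_step(X, R, Pi, mu, m, s, K, N0):
    """One M-step of Algorithm 1 (sketch).

    X, R : (N, V, T) data and binary observation masks (1 = observed)
    Pi   : (N, G) posteriors from the E-step
    mu   : (G, V, T) current component means
    m, s : (V, T) empirical means and (V,) empirical standard deviations
    K    : (T, T) prior kernel matrix; N0 : prior strength
    """
    G = Pi.shape[1]
    V, T = X.shape[1], X.shape[2]
    Rf = R.astype(float)
    Xf = np.where(R == 1, X, 0.0)                 # zero out missing entries
    theta = Pi.mean(axis=0)
    w = np.einsum('ng,nvt->gvt', Pi, Rf)          # sum_n pi_g^(n) r_v^(n)(t)
    wx = np.einsum('ng,nvt->gvt', Pi, Rf * Xf)    # sum_n pi_g^(n) r_v^(n)(t) x_v^(n)(t)
    beta = w / Pi.sum(axis=0)[:, None, None]
    sigma2 = np.empty((G, V))
    mu_new = np.empty((G, V, T))
    for g in range(G):
        for v in range(V):
            sq = Rf[:, v, :] * (Xf[:, v, :] - mu[g, v]) ** 2
            sigma2[g, v] = (N0 * s[v] ** 2 + Pi[:, g] @ sq.sum(axis=1)) / (N0 + w[g, v].sum())
            S_inv = np.linalg.inv(s[v] * K)       # inverse prior covariance S_v^{-1}
            A = S_inv + np.diag(w[g, v]) / sigma2[g, v]
            b = S_inv @ m[v] + wx[g, v] / sigma2[g, v]
            mu_new[g, v] = np.linalg.solve(A, b)  # solves A mu = b instead of forming A^{-1}
    return theta, mu_new, sigma2, beta
```

With a very weak prior (large prior covariance), the mean update reduces to the posterior-weighted average of the observed values, which is a convenient sanity check.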

3.1. Forming the kernel

We now explain how the mixed mode mixture model is used to form the TCK_IM kernel.

We use the mixed mode Bayesian mixture model as the base model in an ensemble approach. To ensure diversity, we vary the number of components for the base models by sampling from a set of integers $I_C = \{I, \ldots, I + C\}$. For each number of components, we apply Q different random initial conditions and hyperparameters. We let $\mathcal{Q} = \{q = (q_1, q_2) \mid q_1 = 1, \ldots, Q,\ q_2 \in I_C\}$ be the index set keeping track of initial conditions and hyperparameters ($q_1$), and the number of components ($q_2$). Each base model q is trained on a random subset of MTS $\{(X^{(n)}, R^{(n)})\}_{n \in \eta(q)}$. Moreover, for each q, we select random subsets of variables $\mathcal{V}(q)$ as well as random time segments $\mathcal{T}(q)$.

The inner products of the normalized posterior distributions from each mixture component are then added up to build the TCK_IM kernel matrix. Note that, in addition to introducing novel base models to account for informative missingness, we also modify the kernel by normalizing the vectors of posteriors to have unit length in the $l_2$-norm. This provides an additional regularization that may increase the generalization capability of the learned model. The details of the method are presented in Algorithm 2. The kernel for MTS not available during training can be evaluated according to Algorithm 3.

Algorithm 2 Time series cluster kernel. Training phase.

Require: Training set of MTS $\{(X^{(n)}, R^{(n)})\}_{n=1}^{N}$, Q initializations, set of integers $I_C$ controlling number of components for each base model.

1: Initialize kernel matrix $K = 0_{N \times N}$.
2: for $q \in \mathcal{Q}$ do
3:   Compute posteriors $\Pi^{(n)}(q) \equiv \left(\pi_1^{(n)}, \ldots, \pi_{q_2}^{(n)}\right)^T$, by fitting a mixed mode mixture model with $q_2$ clusters to the dataset and by randomly selecting:
     i. hyperparameters $\Omega(q)$,
     ii. a time segment $\mathcal{T}(q)$ of length $T_{min} \le |\mathcal{T}(q)| \le T_{max}$ to extract from each $X^{(n)}$ and $R^{(n)}$,
     iii. a subset of attributes $\mathcal{V}(q)$, with cardinality $V_{min} \le |\mathcal{V}(q)| \le V_{max}$, to extract from each $X^{(n)}$ and $R^{(n)}$,
     iv. a subset of MTS, $\eta(q)$, with $N_{min} \le |\eta(q)| \le N$,
     v. initialization of the mixture parameters $\Theta(q)$ and $\Phi(q)$.
4:   Update kernel matrix, $K_{nm} = K_{nm} + \frac{\Pi^{(n)}(q)^T\, \Pi^{(m)}(q)}{\|\Pi^{(n)}(q)\|\, \|\Pi^{(m)}(q)\|}$.
5: end for

Ensure: K kernel matrix, time segments $\mathcal{T}(q)$, subsets of attributes $\mathcal{V}(q)$, subsets of MTS $\eta(q)$, parameters $\Theta(q)$, $\Phi(q)$ and posteriors $\Pi^{(n)}(q)$.

Algorithm 3 Time series cluster kernel. Test phase.

Require: Test set $\{X^{*(m)}\}_{m=1}^{M}$, time segments $\mathcal{T}(q)$, subsets of attributes $\mathcal{V}(q)$, $\mathcal{V}_R(q)$, subsets of MTS $\eta(q)$, parameters $\Theta(q)$, $\Phi(q)$ and posteriors $\Pi^{(n)}(q)$.

1: Initialize kernel matrix $K^{*} = 0_{N \times M}$.
2: for $q \in \mathcal{Q}$ do
3:   Compute posteriors $\Pi^{*(m)}(q)$, $m = 1, \ldots, M$ using the mixture parameters $\Theta(q)$, $\Phi(q)$.
4:   Update kernel matrix, $K^{*}_{nm} = K^{*}_{nm} + \frac{\Pi^{(n)}(q)^T\, \Pi^{*(m)}(q)}{\|\Pi^{(n)}(q)\|\, \|\Pi^{*(m)}(q)\|}$.
5: end for

Ensure: $K^{*}$ test kernel matrix.
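The kernel update in step 4 of Algorithms 2 and 3 amounts to summing, over all base models, the inner products of the $l_2$-normalized posterior vectors. The sketch below assumes the posteriors of each base model have already been computed and are passed in as a list of arrays. The function names `normalize_rows` and `tck_kernel` are hypothetical, and fitting the base models is outside the scope of this example.

```python
import numpy as np

def normalize_rows(P):
    """Scale each posterior vector (row) to unit l2 norm."""
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def tck_kernel(train_posts, test_posts=None):
    """Sum inner products of normalized posteriors over base models.

    train_posts : list of (N, q2) arrays, one per base model q
    test_posts  : optional list of (M, q2) arrays from the same base models;
                  if given, the N x M test kernel is returned instead.
    """
    other = train_posts if test_posts is None else test_posts
    K = np.zeros((train_posts[0].shape[0], other[0].shape[0]))
    for P_tr, P_ot in zip(train_posts, other):
        K += normalize_rows(P_tr) @ normalize_rows(P_ot).T
    return K
```

Because each term is a Gram matrix of explicit feature vectors, the training kernel is positive semi-definite by construction.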

4. Semi-supervised time series cluster kernel

This section presents a semi-supervised MTS kernel, ssTCK, capable of exploiting incomplete label information. In ssTCK, the base mixture models are learned in exactly the same way as in TCK or TCK_IM, i.e. if there is no missing data, or the missingness is ignorable, the base models will be the Bayesian GMMs. Conversely, if the missingness is informative, the base models are the mixed mode Bayesian mixture models presented in the previous section. Both approaches will associate each MTS $X^{(n)}$ with a $q_2$-dimensional posterior $\Pi^{(n)} \equiv \left(\pi_1^{(n)}, \ldots, \pi_{q_2}^{(n)}\right)^T$, where $\pi_g^{(n)}$ represents the probability that the MTS belongs to component g and $q_2$ is the total number of components in the base mixture model.

In ssTCK, label information is incorporated in an intermediate processing step in which the posteriors $\Pi^{(n)}$ are transformed, before the transformed posteriors are sent into Algorithms 2 and 3. More precisely, the transformation consists in mapping the posterior for the mixture components to a class "posterior" (probability), i.e. we seek to find a function $M: [0, 1]^{q_2} \to [0, 1]^{N_c}$, $\Pi^{(n)} \mapsto \tilde{\Pi}^{(n)}$. Hence, we want to exploit the incomplete label information to find a transformation that merges the $q_2$ components of the mixture model into $N_c$ clusters, where $N_c$ is the number of classes.

The mapping M can be thought of as a (soft) $N_c$-class classifier, and hence there could be many possible ways of learning M. However, choosing a too flexible classifier for this purpose leads to an increased risk of overfitting and could also unnecessarily increase the algorithmic complexity. For these reasons, we restrict ourselves to searching for a linear transformation

$$M\left(\Pi^{(n)}\right) = W^T \Pi^{(n)}, \quad W \in [0, 1]^{q_2 \times N_c}. \tag{7}$$

Since the $N_c$-dimensional output $\tilde{\Pi}^{(n)} = M(\Pi^{(n)})$ should represent a probability distribution, we add the constraint $\sum_{i=1}^{N_c} W_{ji} = 1$, $j = 1, \ldots, q_2$.

A natural first step is to assume that the label information is complete and look at the corresponding supervised kernel. In the following two subsections, we describe our proposed methods for learning the transformation M in supervised and semi-supervised settings, respectively.

4.1. Supervised time series cluster kernel (sTCK)

Supervised setting. Each base mixture model consists of $q_2$ components, and we assume that the number of components is greater than or equal to the number of classes $N_c$. Further, assume that each MTS $X^{(n)}$ in the training set is associated with an $N_c$-dimensional one-hot vector $y^{(n)}$, which represents its label. Hence, the labels of the training set can be represented via a matrix $Y \in \{0, 1\}^{N \times N_c}$, where N is the number of MTS in the training set.

We approach this problem by considering one component at a time. For a given component g, the task is to associate it with a class. One natural way to do this is to identify all members of component g and then simply count how many times each label occurs. To account for class imbalance, one can then divide each count by the number of MTS in the corresponding class. One possible option would then be to assign the component to the class with the largest normalized count. However, by doing so, one is not accounting for uncertainty/disagreement within the component. Hence, a more elegant alternative is to simply use the normalized counts as the weights in the matrix W. Additionally, one has to account for the fact that each MTS can simultaneously belong to several components, i.e. each MTS $X^{(n)}$ has only a soft membership to the component g, determined by the value $\pi_g^{(n)}$. This can be done using $\Pi^{(n)}$ as weights in the first step. This procedure is summarized in Algorithm 4.

Algorithm 4 Supervised posterior transformation.

Require: Posteriors $\{\Pi^{(n)}\}_{n=1}^{N}$ from mixture models consisting of $q_2$ components and labels $\{y^{(n)}\}_{n=1}^{N}$.

1: for $i = 1, \ldots, q_2$, $j = 1, \ldots, N_c$ do
2:   Compute $W_{ij} = \frac{\sum_{n=1}^{N} y_j^{(n)} \pi_i^{(n)}}{\sum_{n=1}^{N} y_j^{(n)}}$.
3:   $W_{ij} = \frac{W_{ij}}{\sum_{j=1}^{N_c} W_{ij}}$.
4: end for
5: Transform training and test posteriors via $\tilde{\Pi} = W^T \Pi$.

Ensure: Transformed posteriors $\tilde{\Pi}^{(n)}$
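Algorithm 4 can be written compactly with matrix operations. In the sketch below, posteriors are stacked row-wise into an array `P` of shape (N, q2) and labels into a one-hot array `Y` of shape (N, Nc). These names and the row-wise layout are assumptions of this example, so the transformation $\tilde{\Pi} = W^T \Pi$ becomes `P @ W`. The row normalization is applied after all entries of W have been computed, and every class is assumed to have at least one labeled MTS.

```python
import numpy as np

def supervised_transform(P, Y):
    """Learn W (q2 x Nc) from posteriors P (N x q2) and one-hot labels Y (N x Nc)."""
    W = (P.T @ Y) / Y.sum(axis=0)             # soft counts divided by class sizes
    return W / W.sum(axis=1, keepdims=True)   # each row becomes a distribution over classes

# Toy example: three components, two classes.
P = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
W = supervised_transform(P, Y)
P_tilde = P @ W                                # transformed posteriors, one row per MTS
```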

4.2. Semi-supervised time series cluster kernel (ssTCK)

Setting. Assume that the labels $\{y^{(n)}\}_{n=1}^{L}$, $L < N$, are known and $\{y^{(n)}\}_{n=L+1}^{N}$ are unknown.

In this setting, if one naively tries to apply Algorithm 4 based on only the labeled part of the dataset, one ends up dividing by zero. The reason is that some of the components in the mixture model will contain only unlabeled MTS (the soft label analogy is that the probability that any of the labeled MTS belong to that particular component is zero or very close to zero). Hence, we need a way to assign labels to the components that do not contain any labeled MTS.

Note that each component is described by a probability distribution. A natural measure of dissimilarity between probability distributions is the Kullback–Leibler (KL) divergence [46]. Moreover, since the components are described by parametric distributions, the KL divergence has a simple closed-form expression. The KL divergence between two components, i and j, in our Bayesian GMM is given by

$$D_{KL}\left(f^{(i)} \,\|\, f^{(j)}\right) = \frac{1}{2} \sum_{v=1}^{V} \sum_{t=1}^{T} \left[ \sigma_{iv}^{2} \sigma_{jv}^{-2} + \sigma_{jv}^{-2} \left(\mu_{jv}(t) - \mu_{iv}(t)\right)^2 - 1 + \log\left(\sigma_{jv}^{2}\right) - \log\left(\sigma_{iv}^{2}\right) \right], \tag{8}$$

where $f^{(i)} = f(X \mid R, \mu_i, \Sigma_i)$ is the density given in Eq. (4). The KL-divergence can be made symmetric via the transformation

$$D^{S}_{KL}\left(f^{(i)} \,\|\, f^{(j)}\right) = \frac{1}{2} \left\{ D_{KL}\left(f^{(i)} \,\|\, f^{(j)}\right) + D_{KL}\left(f^{(j)} \,\|\, f^{(i)}\right) \right\}. \tag{9}$$
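Eqs. (8) and (9) translate directly into a few lines of NumPy. In this sketch, the function names `kl` and `sym_kl` are hypothetical; each component is described by a (V, T) array of means and a (V,) array of variances, following the time-constant diagonal covariance used in the model.

```python
import numpy as np

def kl(mu_i, var_i, mu_j, var_j):
    """D_KL(f_i || f_j) from Eq. (8); mu_*: (V, T) means, var_*: (V,) variances."""
    vi, vj = var_i[:, None], var_j[:, None]   # broadcast variances over time
    terms = vi / vj + (mu_j - mu_i) ** 2 / vj - 1.0 + np.log(vj) - np.log(vi)
    return 0.5 * float(np.sum(terms))

def sym_kl(mu_i, var_i, mu_j, var_j):
    """Symmetrized divergence from Eq. (9)."""
    return 0.5 * (kl(mu_i, var_i, mu_j, var_j) + kl(mu_j, var_j, mu_i, var_i))
```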

The underlying idea in our semi-supervised framework is to learn the transformation W for the clusters with only unlabeled points by finding the nearest cluster (in the $D^{S}_{KL}$-sense) that contains labeled points. This leads to Algorithm 5.

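Algorithm 5 Semi-supervised posterior transformation.

Require: Posteriors $\{\Pi^{(n)}\}_{n=1}^{N}$ from mixture models consisting of $q_2$ components, labels $\{y^{(n)}\}_{n=1}^{L}$, and hyperparameter h.

1: for $i = 1, \ldots, q_2$, $j = 1, \ldots, N_c$ do
2:   Compute $W_{ij} = \frac{\sum_{n=1}^{N} y_j^{(n)} \pi_i^{(n)}}{\sum_{n=1}^{N} y_j^{(n)}}$.
3: end for
4: for all k s.t. $\sum_{j=1}^{N_c} W_{kj} < h$ do
5:   Let $\mathcal{L} = \{l \ \text{s.t.}\ \sum_{j=1}^{N_c} W_{lj} \ge h\}$.
6:   $W_{kj} = W_{lj}$ where $l = \arg\min_{l \in \mathcal{L}} D^{S}_{KL}\left(f^{(k)} \,\|\, f^{(l)}\right)$.
7: end for
8: for $i = 1, \ldots, q_2$, $j = 1, \ldots, N_c$ do
9:   $W_{ij} = \frac{W_{ij}}{\sum_{j=1}^{N_c} W_{ij}}$.
10: end for
11: Transform training or test posterior via $\tilde{\Pi} = W^T \Pi$.

Ensure: Transformed posteriors $\tilde{\Pi}^{(n)}$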
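A possible NumPy rendering of Algorithm 5 is sketched below. It assumes that only the labeled MTS are passed in (`P_lab`, `Y_lab`), that every class has at least one labeled example, that at least one component reaches the threshold h, and that the symmetric KL divergences between all pairs of components are supplied as a precomputed matrix `D`. All of these names are hypothetical.

```python
import numpy as np

def semi_supervised_transform(P_lab, Y_lab, D, h=0.1):
    """Learn W (q2 x Nc) from labeled posteriors and component divergences.

    P_lab : (L, q2) posteriors of the labeled MTS
    Y_lab : (L, Nc) one-hot labels
    D     : (q2, q2) symmetric KL divergences between components
    h     : components whose row sum of W falls below h borrow a row
    """
    W = (P_lab.T @ Y_lab) / Y_lab.sum(axis=0)
    mass = W.sum(axis=1)
    labeled = np.flatnonzero(mass >= h)        # components with enough labeled mass
    for k in np.flatnonzero(mass < h):
        nearest = labeled[np.argmin(D[k, labeled])]
        W[k] = W[nearest]                      # copy weights from the nearest labeled component
    return W / W.sum(axis=1, keepdims=True)
```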

5. Experiments on synthetic and benchmark datasets

The experiments in this paper consist of two parts. The purpose of the first part was to demonstrate within a controlled environment situations where the proposed TCK_IM and ssTCK kernels might prove more useful than the TCK. In the second part (Section 6), we present a case study from a real-world medical application in which we compared to several baseline methods.

In the first part, we considered synthetic and benchmark datasets. The following experimental setup was considered. We performed kernel principal component analysis (KPCA) using time series cluster kernels and let the dimensionality of the embedding be 10. Thereafter, we trained a kNN-classifier with k = 1 on the embedding and evaluated performance in terms of classification accuracy on an independent test set. We let Q = 30 and $I_C = \{N_c, \ldots, N_c + 20\}$. An additional hyperparameter h was introduced for ssTCK. We set h to $10^{-1}$ in our experiments. We also standardized each attribute to zero mean and unit standard deviation.

Table 1
Accuracy on the synthetic VAR(1) dataset.

          Unsupervised  Semi-supervised  Supervised
TCK       0.826         0.854            0.867
TCK_IM    0.933         0.967            0.970

Table 2
Description of benchmark time series datasets. Columns 2 to 5 show the number of attributes, samples in training and test set, and number of classes, respectively. T_min is the length of the shortest MTS in the dataset and T_max the longest MTS. T is the length of the MTS after the transformation.

Datasets     Attributes  Train  Test  N_c  T_min  T_max  T   Source
uWave        3           200    4278  8    315    315    25  UCR
Char. Traj.  3           300    2558  20   109    205    23  UCI
Wafer        6           298    896   2    104    198    25  Olsz.
Japan. vow.  12          270    370   9    7      29     15  UCI
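The evaluation protocol used in this section (kernel PCA followed by a 1-nearest-neighbour classifier) can be sketched with plain NumPy, given a precomputed training kernel and a train-by-test kernel. The function name `kpca_1nn`, the argument layout, and the eigenvalue floor of 1e-12 are assumptions of this example. It is not the code used for the reported experiments.

```python
import numpy as np

def kpca_1nn(K_train, K_test, y_train, dim=10):
    """Embed with kernel PCA and classify test points by their nearest training point.

    K_train : (N, N) training kernel; K_test : (N, M) train-by-test kernel
    y_train : (N,) array of training labels
    """
    N = K_train.shape[0]
    one_n = np.full((N, N), 1.0 / N)
    one_m = np.full((N, K_test.shape[1]), 1.0 / N)
    # Center both kernels with respect to the training mean in feature space
    Kc = K_train - one_n @ K_train - K_train @ one_n + one_n @ K_train @ one_n
    Kc_test = K_test - one_n @ K_test - K_train @ one_m + one_n @ K_train @ one_m
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:dim]             # leading eigenpairs
    vals, vecs = np.maximum(vals[idx], 1e-12), vecs[:, idx]
    alphas = vecs / np.sqrt(vals)                  # normalized expansion coefficients
    Z_train = Kc @ alphas
    Z_test = Kc_test.T @ alphas
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d2, axis=1)]          # label of the nearest training embedding
```

Any positive semi-definite kernel matrix can be plugged in here, so the same routine serves for comparing different kernels under an identical protocol.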

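5.1. Synthetic example

To illustrate the effectiveness of the proposed methods, we first considered a controlled experiment in which a synthetic MTS dataset with two classes was sampled from a first-order vector autoregressive model,

$$\begin{pmatrix} x_1(t) \\ x_2(t) \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \rho_1 & 0 \\ 0 & \rho_2 \end{pmatrix} \begin{pmatrix} x_1(t-1) \\ x_2(t-1) \end{pmatrix} + \begin{pmatrix} \xi_1(t) \\ \xi_2(t) \end{pmatrix}. \tag{10}$$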

To make $x_1(t)$ and $x_2(t)$ correlated with $\mathrm{corr}(x_1(t), x_2(t)) = \rho$, we chose the noise term s.t. $\mathrm{corr}(\xi_1(t), \xi_2(t)) = \rho\, (1 - \rho_1 \rho_2) \left[(1 - \rho_1^2)(1 - \rho_2^2)\right]^{-1/2}$. For the first class (y = 1), we generated 100 two-variate MTS of length 50 for the training and 100 for the test, from the VAR(1)-model with parameters $\rho = \rho_1 = \rho_2 = 0.8$ and $E[(x_1(t), x_2(t))^T \mid y = 1] = (0.5, -0.5)^T$. Analogously, the MTS of the second class (y = 2) were generated using parameters $\rho = -0.8$, $\rho_1 = \rho_2 = 0.6$ and $E[(x_1(t), x_2(t))^T \mid y = 2] = (0, 0)^T$.

To simulate MNAR and inject informative missing patterns, we let $x_i^{(n)}(t)$ have a probability $p^{(n)}$ of being missing, given that $x_i^{(n)}(t) > -1$, $i = 1, 2$. We let $p^{(n)} = 0.9$ if $y^{(n)} = 1$ and $p^{(n)} = 0.8$ otherwise. By doing so, the missing ratio was roughly 63% in both classes.
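The data-generating process described above can be sketched as follows. Several details are assumptions of this example because the text does not state them: the noise has unit variance, the intercept is chosen as $\alpha = (I - A)\,E[x]$ so that the stationary mean matches the stated class mean, and a burn-in period is used to reach stationarity. The function name `sample_var1` is hypothetical.

```python
import numpy as np

def sample_var1(n_series, length, rho, rho1, rho2, mean, p_missing, rng, burn_in=100):
    """Sample two-variate VAR(1) series, Eq. (10), with an MNAR observation mask."""
    A = np.diag([rho1, rho2])
    mean = np.asarray(mean, dtype=float)
    alpha = (np.eye(2) - A) @ mean                 # intercept giving the requested stationary mean
    # Noise correlation chosen so that corr(x1, x2) = rho (unit noise variance assumed)
    c = rho * (1 - rho1 * rho2) / np.sqrt((1 - rho1**2) * (1 - rho2**2))
    cov = np.array([[1.0, c], [c, 1.0]])
    X = np.empty((n_series, 2, length))
    x = np.tile(mean, (n_series, 1))
    for t in range(burn_in + length):
        noise = rng.multivariate_normal(np.zeros(2), cov, size=n_series)
        x = alpha + x @ A.T + noise
        if t >= burn_in:
            X[:, :, t - burn_in] = x
    # Values above -1 are dropped with probability p_missing (R = 0 means missing)
    drop = (X > -1) & (rng.random(X.shape) < p_missing)
    R = (~drop).astype(int)
    return X, R

rng = np.random.default_rng(0)
X1, R1 = sample_var1(100, 50, 0.8, 0.8, 0.8, (0.5, -0.5), 0.9, rng)   # class y = 1
X2, R2 = sample_var1(100, 50, -0.8, 0.6, 0.6, (0.0, 0.0), 0.8, rng)   # class y = 2
```

Under these assumptions the expected missing ratio is close to the value reported in the text for both classes, since the higher drop probability in class 1 is offset by fewer values exceeding the threshold.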

Table 1 shows the accuracy on the test data for the different kernels. As expected, the TCK gives the lowest accuracy, 0.826. The ssTCK improves the accuracy considerably (0.854), and the supervised version (sTCK) gives further improvement (0.867). However, as we can see, the effect of explicitly modelling the missingness mechanism in the TCK_IM is larger. In this case the accuracy increases from 0.826 to 0.933. The two corresponding embeddings are plotted in Fig. 1(a) and (d), respectively. In the TCK embedding, there are many points from different classes that overlap with each other, whereas for the TCK_IM the number of overlapping points is much lower.

The ssTCK_IM improves the accuracy to 0.967 (from 0.933 for TCK_IM and 0.854 for ssTCK). The two embeddings obtained using the semi-supervised methods are shown in Fig. 1(b) and (e). The supervised version sTCK_IM yields a slight improvement in terms of accuracy compared to ssTCK_IM (0.970 vs. 0.967). Plots of the supervised embeddings are shown in Fig. 1(c) and (f). We can see that for the sTCK_IM the classes are clearly separated.

5.2. Performance of ssTCK on benchmark datasets

The purpose of the experiments reported in the following paragraph was to evaluate the impact of incorporating incomplete label information in the ssTCK. Towards that end, we considered benchmark datasets and artificially modified the number of labeled MTS in the training sets. We applied the proposed ssTCK to four MTS benchmark datasets from the UCR and UCI databases [47,48] and other published work [49], described in Table 2. Since some of the datasets contain MTS of varying length, we followed the approach of Wang et al. [50] and transformed all the MTS in the same dataset to the same length, $T$, determined by $T = \lceil T_{max} / \lceil T_{max}/25 \rceil \rceil$, where $T_{max}$ is the length of the longest MTS in the dataset and $\lceil \cdot \rceil$ is the ceiling operator. The number of labeled MTS was set to $\max\{20, 3 \cdot N_c\}$. ssTCK was compared to ordinary TCK and sTCK (assuming complete label information in the latter case).

Table 3

Classification accuracy for benchmark datasets obtained using TCK, ssTCK and sTCK.

| Datasets | TCK | ssTCK | sTCK |
|---|---|---|---|
| Char. Traj. | 0.908 | 0.928 | 0.934 |
| uWave | 0.867 | 0.881 | 0.894 |
| Wafer | 0.956 | 0.970 | 0.970 |
| Japanese vowels | 0.946 | 0.962 | 0.968 |
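As a quick check of the two rules stated above, the common length $T$ and the number of labeled MTS can be computed as follows. This is a minimal sketch: the function names are hypothetical, $N_c$ is read as the number of classes, and the resampling of each MTS to length $T$ (done following Wang et al. [50]) is not shown.

```python
import math


def common_length(t_max, target=25):
    """T = ceil(T_max / ceil(T_max / 25)); `target` is the constant 25."""
    return math.ceil(t_max / math.ceil(t_max / target))


def num_labeled(n_classes):
    """Number of labeled MTS: max{20, 3 * N_c}."""
    return max(20, 3 * n_classes)


# Illustrative inputs (not dataset statistics from the paper).
for t_max in (20, 29, 50, 205):
    print(t_max, common_length(t_max))
for n_classes in (2, 9, 20):
    print(n_classes, num_labeled(n_classes))
```

The inner ceiling gives the factor by which the longest series is shortened, so $T$ never exceeds 25 and equals $T_{max}$ whenever $T_{max} \le 25$.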

Table 3 shows the performance of ssTCK for the 4 benchmark datasets. As we can see, compared to TCK, the accuracy in general increases using ssTCK. For the Wafer dataset, ssTCK yields the same performance as the supervised kernel. For the three other datasets, the performance of ssTCK is slightly worse than sTCK. These experiments demonstrate that ssTCK is capable of exploiting incomplete label information.

Further, we created 8 datasets by randomly removing 50% and 80%, respectively, of the values in each of the 4 benchmark datasets. As we can see from the results presented in Table 4, also in the presence of missing data the accuracy in general increases using ssTCK, compared to TCK.

For comparison, in Table 4 we also added the results obtained using three other kernels: GAK, the linear kernel, and LPS. GAK and the linear kernel cannot process incomplete MTS and therefore we created complete datasets using mean imputation for these two kernels. LPS² was run using default hyperparameters, with the exception that we adjusted the segment length to be sampled from the interval [6, 0.8T] to account for the relatively short MTS in our datasets. In accordance with [51], for GAK³ we set the bandwidth $\sigma$ to 0.1 times the median distance of all MTS in the training set scaled by the square root of the median length of all MTS, and the triangular parameter to 0.2 times the median length of all MTS. Distances were measured using the canonical metric induced by the Frobenius norm. In the linear kernel we set the constant $c$ to 0. As we can see, the performance of these kernels is considerably worse than the time series cluster kernels for 7 out of 8 datasets. For uWave with 50% missingness, the performance of GAK and the linear kernel is similar to the TCK kernels.
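The preprocessing and hyperparameter heuristics used for the GAK and linear-kernel baselines can be sketched as below. This is an illustration under stated assumptions rather than the setup actually used: missing values are encoded as `None`, the imputed value is taken to be the per-attribute mean over all observed entries in the dataset, all MTS are assumed to have been resampled to a common length so that the Frobenius distance is defined, and the function names are hypothetical.

```python
import math
import statistics


def mean_impute(dataset):
    """Fill None entries with the per-attribute mean of the observed values."""
    n_attr = len(dataset[0][0])
    means = []
    for v in range(n_attr):
        observed = [row[v] for mts in dataset for row in mts
                    if row[v] is not None]
        means.append(statistics.fmean(observed))
    return [[[means[v] if row[v] is None else row[v] for v in range(n_attr)]
             for row in mts] for mts in dataset]


def frobenius_distance(a, b):
    """Distance induced by the Frobenius norm between two equal-shape MTS."""
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def gak_hyperparameters(train):
    """Bandwidth and triangular parameter following the stated heuristics."""
    median_length = statistics.median(len(mts) for mts in train)
    distances = [frobenius_distance(train[i], train[j])
                 for i in range(len(train))
                 for j in range(i + 1, len(train))]
    sigma = 0.1 * statistics.median(distances) * math.sqrt(median_length)
    triangular = 0.2 * median_length
    return sigma, triangular


# Usage on a toy dataset of three complete MTS (2 time steps, 2 attributes).
toy = [[[0, 0], [0, 0]], [[3, 4], [0, 0]], [[0, 0], [6, 8]]]
sigma, triangular = gak_hyperparameters(mean_impute(toy))
```

Whether the mean is taken per attribute across the dataset or per individual series is not specified in the text; the per-attribute choice here is only one plausible reading.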

² Matlab implementation: http://www.mustafabaydogan.com/

³ Matlab implementation: http://www.marcocuturi.net/GA.html


Fig. 1. Plot of the two-dimensional KPCA representation of the synthetic data obtained using 6 different time series cluster kernels. The datapoints are colour-coded according to their labels.

Table 4

Classification accuracy for benchmark datasets obtained using TCK, ssTCK and sTCK.

| Missing rate | Datasets | TCK | ssTCK | sTCK | GAK | Linear | LPS |
|---|---|---|---|---|---|---|---|
| 50% | Char. Traj. | 0.751 | 0.780 | 0.797 | 0.588 | 0.589 | 0.127 |
| 50% | uWave | 0.812 | 0.834 | 0.850 | 0.828 | 0.813 | 0.411 |
| 50% | Wafer | 0.956 | 0.970 | 0.972 | 0.792 | 0.791 | 0.823 |
| 50% | Japanese vowels | 0.929 | 0.948 | 0.947 | 0.827 | 0.824 | 0.746 |
| 80% | Char. Traj. | 0.282 | 0.310 | 0.331 | 0.194 | 0.192 | 0.062 |
| 80% | uWave | 0.589 | 0.592 | 0.603 | 0.441 | 0.464 | 0.234 |
| 80% | Wafer | 0.926 | 0.934 | 0.934 | 0.796 | 0.805 | 0.819 |
| 80% | Japanese vowels | 0.809 | 0.836 | 0.847 | 0.473 | 0.489 | 0.389 |

5.3. Exploiting informative missingness in benchmark datasets

To evaluate the effect of modelling the missing patterns in TCK_IM, we generated 8 versions of the Wafer and Japanese vowels datasets by manually injecting missing elements using the following procedure. For each attribute $v \in \{1, \dots, V\}$, a number $c_v \in \{-1, 1\}$ was randomly sampled with equal probabilities. If $c_v = 1$, the attribute $v$ is positively correlated with the labels, otherwise negatively correlated. For each MTS $X^{(n)}$ and attribute, a missing rate $\gamma_{nv}$ was sampled from the uniform distribution $U[0.3 + E \cdot c_v \cdot (y^{(n)} - 1),\ 0.7 + E \cdot c_v \cdot (y^{(n)} - 1)]$. This ensures that the overall missing rate of each dataset is approximately 50%. $y^{(n)} \in \{1, \dots, N_c\}$ is the label of the MTS $X^{(n)}$ and $E$ is a parameter, which we tune for each dataset in such a way that the absolute value of the Pearson correlation between the missing rates for the attributes $\gamma_v$ and the labels $y^{(n)}$ takes the values $\{0.2, 0.4, 0.6, 0.8\}$, respectively. The higher the value of the Pearson correlation, the higher is the informative missingness.
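The injection procedure described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, missing entries are encoded as `None`, each entry of attribute $v$ in $X^{(n)}$ is assumed to be removed independently with probability $\gamma_{nv}$, and the sampled rates are clipped to $[0, 1]$ (a safeguard added here for large $E$ or many classes).

```python
import math
import random


def sample_missing_rates(labels, n_attributes, e, rng=random):
    """Sample signs c_v and per-MTS, per-attribute missing rates gamma_nv."""
    signs = [rng.choice((-1, 1)) for _ in range(n_attributes)]
    rates = []
    for y in labels:
        row = []
        for c in signs:
            shift = e * c * (y - 1)
            gamma = rng.uniform(0.3 + shift, 0.7 + shift)
            row.append(min(1.0, max(0.0, gamma)))  # clipping: assumption
        rates.append(row)
    return signs, rates


def apply_missingness(mts, gammas, rng=random):
    """Remove each entry of attribute v independently with prob. gammas[v]."""
    return [[None if rng.random() < gammas[v] else x
             for v, x in enumerate(row)] for row in mts]


def pearson(xs, ys):
    """Pearson correlation, used to check how informative the rates are."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Usage: two balanced classes, three attributes, E = 0.2 (illustrative).
rng = random.Random(1)
labels = [1] * 200 + [2] * 200
signs, rates = sample_missing_rates(labels, n_attributes=3, e=0.2, rng=rng)
correlations = [pearson([r[v] for r in rates], labels) for v in range(3)]
```

In practice $E$ would be adjusted, for example by a simple search over candidate values, until the absolute correlations returned by `pearson` reach the desired target; that tuning loop is omitted here.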

Table 5 shows the performance of the proposed TCK_IM and three baseline models (TCK, TCK_B, and TCK₀). The first baseline is ordinary TCK, which ignores the missingness mechanism. For the Wafer dataset, the performance of this baseline was quite similar across all four settings. For the Japanese vowels dataset, the performance actually decreases as the information in the missing patterns increases. In the second baseline, TCK_B, we tried to model the missing patterns by concatenating the binary missing indicator MTS R to the MTS X and creating a new MTS with 2V attributes. Then, we trained ordinary TCK on this representation. For the Wafer dataset, the performance decreases considerably as the informative missingness increases. For the Japanese vowels, this baseline yields the best performance when the correlation is 20%. However, the performance actually decreases as the informative missingness increases. Hence, informative missingness is not captured with this baseline. In the last baseline, TCK₀, we investigated if it is possible to capture informative missingness by imputing zeros for the missing values and then training the TCK on the im-