• No results found

Fedor V. Fomin, Petr A. Golovach, Kirill Simonov

N/A
N/A
Protected

Academic year: 2022

Share "Fedor V. Fomin, Petr A. Golovach, Kirill Simonov"

Copied!
25
0
0

Laster.... (Se fulltekst nå)

Fulltekst

(1)

Contents lists available atScienceDirect

Journal of Computer and System Sciences

www.elsevier.com/locate/jcss

Parameterized k-Clustering: Tractability island

Fedor V. Fomin, Petr A. Golovach, Kirill Simonov

DepartmentofInformatics,UniversityofBergen,ThormøhlensGate55,5008Bergen,Norway

a rt i c l e i nf o a b s t ra c t

Articlehistory:

Received18December2019

Receivedinrevisedform 17August2020 Accepted12October2020

Availableonline19November2020

Keywords:

Clustering

Parameterizedcomplexity k-means

k-median

Ink-ClusteringwearegivenamultisetofnvectorsXZdandanonnegativenumberD, andweneedtodecidewhether XcanbepartitionedintokclustersC1,. . . ,Ck suchthat thecost

k

i=1 min

ciRd

xCi

xcippD,

where· p istheLp-norm.Forp=2,k-Clusteringisk-Means.Westudyk-Clustering fromtheperspectiveofparameterizedcomplexity.TheproblemisknowntobeNP-hardfor k=2 andalsoford=2.Itisalong-standingopenquestion,whethertheproblemisfixed- parametertractable(FPT)forthecombinedparameterd+k.Inthispaper,wefocusonthe parameterization byD.We complementtheknown negativeresults byshowingthat for p=0 and p= ∞,k-ClusteringisW[1]-hardwhenparameterizedby D.Interestingly,we discoveratractabilityislandofk-Clustering:foreveryp(0,1],k-Clusteringissolvable intime2O(DlogD)(nd)O(1).

©2020TheAuthor(s).PublishedbyElsevierInc.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Recallthatforp

>

0,theMinkowskiorLp-normofavectorx

= (

x

[

1

] , . . . ,

x

[

d

] )

Rd isdefinedas

x

p

=

d

i=1

|

x

[

i

]|

p

1/p

.

Respectively,wedefinethe(Lp-norm)distancebetweentwovectorsx

= (

x

[

1

], . . . ,

x

[

d

])

and y

= (

y

[

1

], . . . ,

y

[

d

])

as

distp

(

x

,

y

) =

x

y

pp

=

d

i=1

|

x

[

i

] −

y

[

i

]|

p

.

We also consider distp for p

=

0 and p

= ∞

. For p

=

0, distp is L0 (or the Hamming) distance, that is the number of differentcoordinatesinxand y:

ThisworkissupportedbytheResearchCouncilofNorwayviatheproject“MULTIVAL”.Thegrantnumberis263317.

*

Correspondingauthor.

E-mailaddresses:[email protected](F.V. Fomin),[email protected](P.A. Golovach),[email protected](K. Simonov).

https://doi.org/10.1016/j.jcss.2020.10.005

0022-0000/©2020TheAuthor(s).PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).

(2)

Fig. 1.Optimalclusteringsofthesamesetofvectorswithdifferentdistances:dist1 intheleftsubfigure,dist1/4 intherightsubfigure.Shapesdenote clusters,crossesdenoteclustercentroids.

dist0

(

x

,

y

) = |{

i

∈ {

1

, . . . ,

d

} |

x

[

i

] =

y

[

i

]}| .

Forp

= ∞

,distpis L-distance,whichisdefinedas dist

(

x

,

y

) =

max

i∈{1,...,d}

|

x

[

i

] −

y

[

i

]|.

Thek-Clusteringproblemisdefinedasfollows.Foragiven(multi)datasetofn vectors(points) X

Zd,thetaskisto findapartitionof X intokclustersC1

, . . . ,

Ckminimizingthecost

k i=1

min

ci∈Rd

xCi

distp

(

x

,

ci

),

intuitively,ci isacentroidoftheclusterCi.

Inparticular,forp

=

1,distp isthe L1-distanceandthecorrespondingclusteringproblemisknownask-Median.(Often intheliterature,k-MedianisalsousedforclusteringminimizingthesumsoftheEuclideandistances.) Forp

=

2,distp is theL2(Euclidean)distance,andthentheclusteringproblembecomesk-Means.

Letusnote that optimalclusteringsforthesame setofvectorscan bedrasticallydifferent forvariousvaluesof p,as showninFig.1.Asweshowinthepaper,thecomplexityofk-Clusteringalsostronglydependsonthechoiceofp.

k-Clustering, and especially k-Median and k-Means, are among the most prevalent problems occurring in virtually every subareaof data science. We refer to the survey of Jain [1] for an extensive overview. While inpractice the most commonapproachestoclusteringare basedondifferentvariationsofLloyd’sheuristic[2],theproblemisinterestingfrom the theoretical perspective aswell. Inparticular, thereis a vastamount ofliterature on approximation algorithms fork- Clusteringwhosebehaviorcanbeanalyzedrigorously,seee.g.[3–17].

When it comesto exact solutions,we observethe followingphenomena. While heuristic algorithms fork-Clustering work surprisinglywell in practice,from the perspective of parameterized complexity, k-Clustering is intractable forall previously studiedparameterizations,seeTable1.Thek-Clusteringproblemisnaturally“multivariate”:inadditiontothe number ofpoints n,there are alsoparameters like spacedimension d,number ofclustersk or thecost ofclustering D. TheproblemisknowntobeNP-completefork

=

2 [18,19] andford

=

2 [20,21].BytheclassicalworkofInabaetal. [22], inthecasewhenbothd andk areconstants, k-Clustering issolvableinpolynomial time O(ndk+1

)

.It isalong-standing openproblemwhetherk-ClusteringisFPTparameterizedbyd

+

k.UnderETH,thelowerboundofn(k),evenwhend

=

4, wasshownbyCohen-Addadetal.in [23] forthesettingswherethesetofpotentialcandidatecentersisexplicitlygivenas input. Howeverthelower bound ofCohen-Addadetal.doesnot generalizeto thesettingsofthispaperwhereanypoint inEuclideanspacecanserveasacenter.Forthespecialcase,whentheinputconsistsofbinaryvectorsandthedistanceis Hamming,theproblemissolvableintime2O(DlogD)

(

nd

)

O(1) [24].

Ourresultsandapproaches.Inthispaperweinvestigatethedependenceofthecomplexityofk-Clusteringonthecostof clustering D. Itappears thataddingthisnew“dimension” makes thecomplexity landscapeofk-Clusteringintricate and interesting.Moreprecisely,weconsiderthefollowingproblem.

Input: Amultiset X ofnvectorsinZd,apositiveintegerk,andanonnegativenumberD.

Task: DecidewhetherthereisapartitionofXintokclusters

{

Ci

}

ki=1andkvectors

{

ci

}

ki=1,calledcentroids, inRdsuchthat

k i=1

xCi

dist

(

x

,

ci

)

D

.

k-Clusteringwith distance dist

(3)

Letusremarkthatvectorset X (like thecolumnsetofamatrix)cancontainmanyequalvectors. Alsoweconsiderthe situationwhenvectorsfromX areintegervectors,whilecentroidvectorsarenotnecessarilyfromX.Moreover,coordinates ofcentroidscanbereals.

Ourmainalgorithmicresultisthefollowingtheorem.

Theorem1.k-Clusteringwithdistancedistpissolvableintime2O(DlogD)

(

nd

)

O(1)foreveryp

(

0

,

1

]

.

Thus k-Clusteringwhenparameterized by D isfixed-parametertractable (FPT) forMinkowskidistancedistp oforder 0

<

p

1.Inthe firststep ofouralgorithm weuse colorcodingto reducethe problemto ClusterSelection,whichwe findinterestingonitsown.InClusterSelectionwehavet groupsofweightedvectorsandthetaskistoselectexactlyone vectorfromeachgroupsuchthattheweightedcostofthecompositeclusterisatmostD.Moreformally,

Input: AsetofmvectorsX giventogetherwithapartition X

=

X1

∪ · · · ∪

Xt intot disjointsets,aweight functionw

:

X

Z+,andanonnegativenumberD.

Task: Decidewhetheritispossible toselectexactlyone vector xi fromeach set Xi suchthat thetotal costofthecompositeclusterformedbyx1,. . . ,xt isatmostD:

min

c∈Rd

t i=1

w

(

xi

) ·

dist

(

xi

,

c

)

D

.

Cluster Selectionwith distance dist

The ClusterSelection problemis closelyrelatedto variantsof thewell-known ConsensusPatternproblem. Namely, fortheHammingdistance,thedefinitionofClusterSelectionnearlycoincideswiththeColoredConsensusStringswith Outliersproblemstudiedin[25],onlyinthelatterthealphabetisassumedtobeofconstantsize.

Informally (see Theorem 10for the precise statement),our reduction showsthat if the distancenorm satisfies some specific properties (which distp satisfies for all p) and if ClusterSelection is FPT parameterized by D, then so is k- Clustering.Therefore,inorderto proveTheorem1,allwe needistoshow thatClusterSelectionis FPTparameterized by D when p

(

0

,

1

]

.Thisisthemostdifficultpartoftheproof.HereweinvokethetheoremofMarx[26] onthenumber ofsubhypergraphsinhypergraphsofboundedfractionaledgecover.

Superficially,thegeneralideaoftheproofofTheorem1issimilartotheideabehindthealgorithmforBinaryr-Means for L0 from[24]. In both cases,the classical color coding technique ofAlon etal. [27] is used as a preprocessing step.

However, the further steps in [24] strongly exploit the fact that the data is binary. As we will see in Theorem 2, the existenceofanFPTalgorithmfork-ClusteringinL0ishighlyunlikely.Thusthereductionsfrom[24] cannotbeappliedin ourcase,andweneedanewapproach.

Moreprecisely,forclusteringinL0weprovethefollowingtheorem.

Theorem2.Withdistancedist0, k-Clusteringparameterizedbyd

+

D andClusterSelectionparameterizedbyd

+

t

+

D are W

[

1

]

-hard.

Inparticular,thismeansthatup toawidely-believedassumptionincomplexity that FPT

=

W

[

1

]

,Theorem2rulesout algorithms solvingk-Clusteringintime f

(

d

,

D

) ·

nO(1)andalgorithmssolvingClusterSelectioninL0 intime g

(

t

,

d

,

D

) ·

nO(1)foranyfunctions f

(

d

,

D

)

andg

(

t

,

d

,

D

)

.AsimilarhardnessresultholdsforL.

Theorem3.Withdistancedist, k-ClusteringparameterizedbyD andClusterSelectionparameterizedbyt

+

D areW

[

1

]

-hard.

Thisnaturally brings ustothequestion:What happenswithk-Clustering for p

(

1

,)

,especially fortheEuclidean distance,thatis p

=

2.Unfortunately,wearenotabletoanswerthisquestion whentheparameterisD only.However,we canprovethat

Theorem4.k-ClusteringandClusterSelectionwithdistancedist2areFPTwhenparameterizedbyd

+

D.

Thusinparticular,Theorem4impliesthatk-Clusteringwithdistancedist2isFPTparameterizedbyd

+

D.Ontheother hand,weprovethat

Theorem5.ClusterSelectionwithdistancedistpisW

[

1

]

-hardforeveryp

(

1

,)

whenparameterizedbyt

+

D.

(4)

Table 1

Complexityofk-ClusteringandClusterSelection.

distp k-Clustering Cluster Selection

p=0 W[1]-hard param.d+D[Theorem2]

NP-c fork=2 [19] W[1]-hard param.d+t+D[Theorem2]

0<p1 2

O(DlogD)(nd)O(1)[Theorem1]

NP-c fork=2 whenp=1 [19]

NP-c ford=2 whenp=1 [20]

2O(DlogD)(nd)O(1)[Theorem15]

W[1]-hard param.t+dforp=1 [Theorem20]

1<p<+∞ FPTparam.d+Dforp=2 [Theorem4]

NP-c fork=2 whenp=2 [18]

NP-c ford=2 whenp=2 [21]

FPTparam.d+Dforp=2 [Theorem4]

W[1]-hard param.t+D[Theorem5]

p= ∞ W[1]-hard param.D[Theorem3]

NP-c fork=2 [Theorem30] W[1]-hard param.t+D[Theorem3]

In particular, Theorem 5 yields that the approach we used to establish the tractability (with parameter D) of k- Clusteringforp

=

1 willnotworkfor p

>

1.

We summarize our and previously known algorithmic and hardness results for k-Clustering and ClusterSelection withdifferentdistancesinTable 1.Observethat Theorem10worksalsointhesettingwherepossible clustercentersare restricted to be from a set givenin the input, and so doour algorithmic Theorems 1 and4 since ClusterSelection is triviallysolvableinpolynomialtimeinthissetting.

NowwediscussthechoiceoftheparameterD.ItmightbenotedthattheregimewherethecostofclusteringD issmall comparedtothenumberofpointsn,isquitespecial.Indeed,ifthecostofclusteringisatmostD,thentherearebutafew pointsthatarenotequaltotherespectiveclustercenters.Thus,theproblemwestudyhasthespiritofaneditingproblem:

check whether a given instanceis close to a “structured”one, where in our casea “structured” instance hasat mostk distinct points, and closeness is measured via the sumof Lp-distances. Editing problems are extensivelystudied in the parameterizedalgorithmsliterature,rangingfromthevastareaofgraphmodification(seee.g.arecentsurveybyCrespelle et al. [28]) to studies very closeto ours, like theConsensusPatterns algorithm by Marx [26], andthe study ofBinary r-Meansby Fominetal. [24] thatis essentiallya specialcaseofourk-Clusteringproblem. Andstill, eveninthishighly structured regime, our results show a very intricate picture: forinstance, fork-Clustering parameterized just by D,we provideahighlynon-trivialFPTalgorithminthecase0

<

p

1.Whileontheother hand,conditionally,thesamescheme could notlead toan analogousalgorithm inthecase p

=

2,andthere couldnot beanyFPT algorithmatallinthe cases p

=

0 andp

= ∞

.Finally webelievethat studyingk-Clusteringwithrespecttotheparameter D isan essentialquestion providedthe notorioushardness oftheproblem.Recall thatforthecombinationofthetwoother naturalparameters, the dimensiondandthenumberofclustersk,onlya O

(

ndk+1

)

algorithmofInabaetal.isknown [22],andthehardnessresult byCohen-Addadetal.in [23] servesasastrongindicationthatabetteralgorithmmightnotexist.

Observethatwealwaysconsiderinteger-valuedinstances.Webelievethisisthemostnaturalmodelforstudyingcom- plexity ofk-Clusteringwithrespectto theparameter D. Hereitis importanttonote thatconsidering D asa parameter only makes sense ifthe input values are suitablydiscretized. Imagineinput vectorscould have arbitraryreal-valued (or rational-valued)entries, thenfora giveninstanceit isalways possibleto scalethe valuesdown bythe samefactorsuch thatthecostofanoptimalclusteringisarbitrarilysmall,butthestructureoftheinstanceiscompletelypreserved.Thusthe restrictiontointegervaluesinourstudyisanaturaldiscretizationoftheproblem.Itallowstheparameter D tobeardeep structuralsignificance,asourresultsdemonstrate.

Theremaining partofthispaperisorganizedasfollows.Section 2containspreliminaries.InSection 3we proveTheo- rem10whichprovidesuswithFPTTuringreductionfromk-ClusteringtoClusterSelection.Theorem10appearstobe a handy tool toestablish tractabilityof k-Clustering. InSection 4we collect theresults onclusteringwith Lp-normfor p

(

0

,

1

]

.Inparticular,inSubsection4.1,weproveTheorem1,themainalgorithmicresultofthiswork,statingthatwhen p

(

0

,

1

]

,k-ClusteringandClusterSelectionadmit FPTalgorithmswithparameter D.InSubsection4.2wecomplement thealgorithmicupperboundswithlower boundsbyprovingthatClusterSelectionisW

[

1

]

-hardwhen p

=

1 andparam- eterist

+

d(Theorem20).In Section5,we considerthe case p

=

0 andprove Theorem2establishing W

[

1

]

-hardness of k-ClusteringandClusterSelection.Section6isdevotedtothecasep

= ∞

.Hereweestablishtwohardnessresultsabout k-Clustering:W

[

1

]

-hardness when parameterizedby D andNP-hardnessin thecasek

=

2.In Section 7, welook atthe case p

(

1

,)

,withtheparticularemphasison themostcommonlyusedcase p

=

2.Weshow that whend

+

D is the parameter, thenClusterSelectionandk-Clusteringinthe L2 distanceareFPT.Wealso showthat ClusterSelectionis W

[

1

]

-hardwhenparameterizedbyt

+

D forall p

(

1

,)

.WeconcludewithopenproblemsinSection8.

2. Preliminariesandnotation

Clusternotation. Bya cluster we always mean a multisetof vectorsin Zd. Fordistance dist, the cost ofa givencluster C is the total distance fromall vectors in the cluster to the optimally selected cluster centroid, minc∈Rd

xCdist

(

x

,

c

)

. An optimal cluster centroid for a given cluster C is any c

Rd minimizing

xCdist

(

x

,

c

)

. For most of the considered distances,wearguethatanoptimalclustercentroidcouldalwaysbechosenamongaspecificfamilyofvectors(e.g.integral).

Wheneverweshowthis,weonlyconsideroptimalclustercentroidsofthestatedformafterwards.

(5)

Complexity. Aparameterizedproblem isa language Q

×

Nwhere

isthe setof strings overa finite alphabet

. Respectively,an input of Q isa pair

(

I

,

k

)

where I

andk

N; k isthe parameteroftheproblem. A parameterized problemQ isfixed-parametertractable(FPT)ifitcanbedecidedwhether

(

I

,

k

)

Q intime f

(

k

) ·|

I

|

O(1)forsomefunction f thatdependsoftheparameterkonly.Respectively,theparameterizedcomplexityclassFPTiscomposedbyfixed-parameter tractable problems.The W-hierarchyisacollectionofcomputationalcomplexityclasses:we omitthetechnicaldefinitions here. Thefollowingrelationisknownamongst theclassesintheW-hierarchy:FPT

=

W

[

0

] ⊆

W

[

1

] ⊆

W

[

2

] ⊆ . . .

W

[

P

]

.It iswidelybelievedthatFPT

=

W

[

1

]

,andhenceifaproblemishardfortheclassW

[

i

]

(foranyi

1)thenitisconsideredto befixed-parameterintractable.Werefertobooks[29,30] forthedetailedintroductiontoparameterizedcomplexity.

WealsoprovideconditionallowerboundsbymakinguseofthefollowingcomplexityhypothesisformulatedbyImpagli- azzo,Paturi,andZane[31].

ExponentialTimeHypothesis(ETH):Thereisapositiverealssuchthat3-CNF-SATwithnvariablesandmclausescannot besolvedintime2sn

(

n

+

m

)

O(1).

Graphs.In ourW

[

1

]

-hardness proofs, we heavily employgraph-theoretical notation.Whenever we workwitha graph G, wealwaysfixsomeorderingonthevertices

π

V

:

V

(

G

) → {

1

, . . . , |

V

(

G

) |}

andontheedges

π

E

:

E

(

G

) → {

1

, . . . , |

E

(

G

) |}

.We drop

π

V and

π

E tosimplifynotation, sowhen weconsider avertex v

V

(

G

)

oranedge e

E

(

G

)

, v ande alsodenote integers—numbersofv andeaccordingtotheorderings

π

V and

π

E correspondingly.

Realcomputations. Since we deal with the problem concerning real-valued matrices, we express the running time of algorithms intermsofnumberofoperationsoverthereals.Thisisnaturalsincetocompute Lp-distanceswehavetodeal withnumbersofformxp wherexisanintegerandpisanyrealnumber.However,inspecialcasestheboundsholdeven for morerestrictive models, e.g. when p

=

1 or p

=

2 thealgorithms operate only onintegers of polynomially bounded length.

3. Fromk-CLUSTERINGto CLUSTERSELECTION

Inthissection wepresentageneralschemeforobtainingan FPTalgorithmparameterizedby D,whichislaterapplied tovariousdistances.

First,weformalizethefollowingintuition:thereisnoreasontoassignequalvectorstodifferentclusters.

Definition6(Initialclusterandregularpartition).Foramultisetofvectors X,an inclusion-wisemaximalmultisetI

X such thatallvectorsinI areequaliscalledaninitialcluster.

Wesaythataclustering

{

C1

, . . . ,

Ck

}

ofX isregularifforeveryinitialclusterI thereisai

∈ {

1

, . . . ,

k

}

suchthatI

Ci. Nowweprovethatitsufficestolookonlyforregularsolutions.

Proposition7.Let

(

X

,

k

,

D

)

beayes-instanceto k-Clustering.Thenthereexistsasolutionof

(

X

,

k

,

D

)

whichisaregularclustering.

Proof. Letusassume that theinstance

(

X

,

k

,

D

)

has a solution.There are k clusters

{

Ci

}

ki=1 andk vectors

{

ci

}

ki=1 in Rd such that k

i=1

xCidist

(

x

,

ci

)

D. Notethat forevery x

Cj, dist

(

x

,

cj

)

min1ikdist

(

x

,

ci

)

.So ifwe considera new clustering

{

C1

, . . . ,

Ck

}

withthe samecentroids,whereCj areall vectorsfrom X forwhichcj istheclosest centroid,the total distancedoesnot increase.Ifwe alsobreak tiesinfavorofthe lower index, thenforanyinitial cluster I the same centroidciwillbetheclosest,andallvectorsfromI willendupinCi,so

{

C1

, . . . ,

Ck

}

isaregularclustering.

Fromnowon,weconsideronlyregularsolutions.

Definition8(Simpleandcompositeclusters).Wesaythatacluster C issimpleifitisaninitialcluster.Otherwise,thecluster iscomposite.

Nextwestateapropertyofk-Clusteringwithaparticulardistance,whichisrequiredforthealgorithm.Intuitively,each uniquevectoraddsatleastsomeconstanttotheclustercost.

Definition9(

α

-property).Wesaythat adistancehasthe

α

-propertyforsome

α >

0 ifforanysthecostofanycomposite clusterwhichconsistsofsinitialclustersisatleast

α (

s

1

)

.

In the subsequent sections we show that the

α

-property holds for all the distance measures for which we present algorithms.Namely,theLp-distancehasthe

α

-propertywithacertainconstant

α

,foreach p

∈ [

0

,

1

] ∪ {

2

, ∞}

.Analogously

(6)

Fig. 2.AnillustrationofthealgorithminTheorem10.WestartwithaparticularrandomcoloringandaparticularpartitionofcolorsP= {P1,P2},where P1= { , }andP2= { , , }.WemaketwocallstoClusterSelectionwithrespecttoP1andP2andconstructtheresultingclustering.Intheexample, allinputvectorsaredistinct.

tothecase p

=

2,onecanshowthatitholdsforallothervaluesofpbetween1 and

aswell,althoughwedonotneed thisfact.

The ClusterSelection problem defined inthe introduction is a key subroutine in our algorithm. In some cases the problemissolvabletrivially,butitpresentsthemain challengeforourmainalgorithmic resultwiththe L1 distance.The intuitionto theweightfunctioninthedefinitionofClusterSelectionisthat itrepresentssizes ofinitialclusters,that is, howmanyequalvectorsarethere.

We also need a procedure to enumerate all values of the cost ofeach possible cluster, withrespect to an optimally selected cluster centroid, that are at most D.It may not be straightforward since not all distances in ourconsideration are integer.So forthepurposeofstatingTheorem 10forgeneralmetrics,we assumethat theset ofallpossible optimal clustercostswhicharelessthanD isalsogivenintheinput.FortheLp-distancesweconsider,intherespectivealgorithmic theoremsweshowhowtoprovidethissetwithoutraisinganyadditionalassumptionsorincreasingtherunningtime.Now wearereadytostatetheresultformally.

Theorem10.Assumethatthe

α

-propertyholds,ClusterSelectionissolvableintime

(

m

,

d

,

t

,

D

)

,where

isanon-decreasing functionofitsarguments,andwearegiventhesetDofallpossibleoptimalclustercostswhichareatmostD.Then k-Clusteringis solvableintime

2O(DlogD)

(

nd

)

O(1)

|

D

|(

n

,

d

,

2D

/ α ,

D

).

Proof. Bythe

α

-property,inanysolutionthereareatmostD

/ α

compositeclusters,sinceeachcontainsatleasttwoinitial clusters.Moreover,thereareatmost2D

/ α

initialclustersinallcompositeclusters.

ThusbyProposition7,solvingk-ClusteringisequivalenttoselectingatmostT

:=

2D

/ α

initialclustersandgrouping themintocompositeclusterssuchthatthetotalcostoftheseclustersisatmost D.Wedesignanalgorithmwhich,taking asasubroutineanalgorithmforClusterSelection,solvesk-Clustering.ThealgorithmissketchedinFig.3,anexampleis showninFig.2.

Toperformtheselection andgrouping,ouralgorithm usesthecolorcodingtechnique ofAlon,Yuster,andZwickfrom [27]. Consider the input as a family of initial clusters I. We color initial clusters from I independently and uniformly at random by T colors 1,2, . . . , T. Consider anysolution, andthe particular set ofat most T initial clusters whichare includedintocompositeclustersinthissolution.Theseinitialclustersarecoloredbydistinctcolorswithprobabilityatleast

T!

TT

eT.Nowweconstructanalgorithmforfindingacolorfulsolution.

Weconsiderallpossiblewaystosplitcolorsbetweenclusters(somecolorsmaybeunused).Henceweconsiderallpos- siblefamilies P

= {

P1

, . . . ,

Ph

}

ofpairwise disjointnon-emptysubsetsof

{

c

∈ {

1

, . . . ,

T

} :

there exists J

Icolored byc

}

. EachfamilyP correspondstoapartitionofthesetofcolors

{

1

, . . . ,

T

}

ifweaddonefictitioussubsetforcolorswhichare notusedinthecompositeclusters.ThetotalnumberofpartitionsdoesnotexceedTT

=

2O(DlogD).

(7)

k-Clustering(X,k,D,α,D)

Input :AmultisetXZd,apositiveintegerk,realnonnegativevaluesDandα,asetD,analgorithmAforClusterSelection Output:YesorNo

1 T2D

2 Iinitial clusters ofX 3 foreTiterationsdo

4 FixarandomcoloringcofIwithcolors{1,. . . ,T} 5 forvalidpartitionsPof{1,. . . ,T}do

6 fori=1to|P|do 7 Pi= {i1,. . . ,it} 8 forj=1totdo

9 Xj← ∅

10 for JI:c(J)=ijdo

11 xa point from J

12 XjXj∪ {x}

13 w(x)← |J|

14 diD+1

15 foreachdDdo

16 ifA( X1,. . . ,Xt,w,d)then

17 did

18 BREAK

19 ift

i=1diDthen

20 Yes,STOP

21 No,STOP

Fig. 3.k-Clusteringalgorithm from Theorem10.

Whenpartition P isfixed,weformclustersbysolvinginstances ofClusterSelection: Foreach i

∈ {

1

, . . . ,

h

}

,wetake initial clusters colored by elements of Pi, bundle together those with the same color, and pass the resulting familyto ClusterSelection.Firstnotethattherecannot be P

P ofsizeatmostone,sincethenClusterSelectionhastomakea simpleclusterwhileweassume thatallclustersobtainedfromP arecomposite.Second,thetotalnumberofclustershas to bek,the numberofclustersis

|I| −

P∈P

|

P

| + |P|

. Foreach P we check thatboth conditionshold, andifnot, we discardthechoiceofP andmovetothenextone,beforecallingtheClusterSelectionsubroutine.

Next,weformalizehowwecalltheClusterSelectionsubroutine.Wefixthesetofcolors Pi

= {

c1

, . . . ,

ct

}

,thentakethe sets Ij

= {

J

I

:

Jis colored bycj

}

for j

∈ {

1

, . . . ,

t

}

.Weturneachsetofinitialclusters Ij intoasetofweighted vectors Xj naturally: Foreach J

Ij,we putone vector x

J into Xj, and w

(

x

) := |

J

|

. Thefamilyof setsofvectors X1,. . . , Xt andtheweightfunctionwaretheinputforClusterSelection.Thenwesearchfortheminimumclustercostbounddi

D fromD,forwhichtheinstance

(

X1

, . . . ,

Xt

,

di

)

ofClusterSelectionisayes-instance,runningeachtimethealgorithmfor ClusterSelection.

Ifforsomei settingdito D leadstoano-instance,orifh

i=1di

>

D,thenwediscardthechoiceofthepartitionPand movetothenextone.Otherwise,wereportthatk-Clusteringhasasolutionandstop.Next,weprovethatinthiscasethe solutionindeedexists.

We reconstruct the solutionto k-Clustering asfollows: For each i

∈ {

1

, . . . ,

h

}

the corresponding to Pi

= {

c1

, . . . ,

ct

}

instanceof ClusterSelectionhasasolution

{

x1

, . . . ,

xt

}

.Foreach j

∈ {

1

, . . . ,

t

}

, considerthecorresponding initialcluster Jj consistingof w

(

xj

)

vectorsequaltoxj.Foreach i

∈ {

1

, . . . ,

h

}

we obtainacompositecluster

tj=1Jj,allother clusters are simple. So the total cost is h

i=1di, which isat most D. Thus, ifthe algorithm finds a solution, then

(

X

,

d

,

D

)

isa yes-instance.

Intheopposite direction.Ifthereisasolutiontok-Clustering,thenthereisaregularsolution,andwithprobabilityat leasteT initialclusterswhicharepartsofcompositeclustersinthissolutionarecoloredbydistinctcolors.Then,thereisa partitionP

= {

P1

, . . . ,

Ph

}

whichcorrespondstothissolution.Thispartitionisobtainedasfollows:putinto P1 colorsfrom thefirstcompositecluster,intoP2 fromthesecondandsoon.Atsome pointouralgorithmchecksthepartitionP,andas itfindstheoptimalcostvalueforeachcluster,thenitisatmostthecostofthecorrespondingclusterofthesolutionfrom whichwestarted.

Toanalyzetherunningtime,weconsider 2O(DlogD) partitionsP,foreachP we

|P| =

O

(

D

)

timessearchforoptimal di.And foreachof

|D|

possiblevalues1 ofdi we makeone callto theClusterSelection algorithm,whichtakestime at most

(

n

,

d

,

T

,

D

)

.

Toamplifythe errorprobability tobe atmost1

/

e,we doN

=

eT

iterationsofthe algorithm,eachtime withanew randomcoloring.AseachiterationsucceedswithprobabilityatleasteT,theprobabilityofnotfindingacolorful solution afterN iterationsisatmost

(

1

eT

)

eT

e1

<

1.Sothetotalrunningtimeis2O(DlogD)

· (

nd

)

O(1)

|D|(

n

,

d

,

2D

/ α ,

D

)

.

1 Wecouldalsobinarysearchforthe optimaldiD instead,thusreplacing|D|bylog|D|intherunningtime.However,forallchoicesofD we considerthisdoesnotmakeadifference.

(8)

0 2 4 6 8 10 5

10 15 20

z

cost

0 2 4 6 8 10

4 6 8 10

z

(a) cost(z)= |z2| + |z3| + |z6| + |z8| (b) cost(z)= |z2|1/2+ |z3|1/2+ |z6|1/2+ |z8|1/2

Fig. 4.Graphsofclustercostoverdifferentvaluesofz:dist1intheleftplot,dist1/2intherightplot.Thesetofcoordinatevaluesisgivenasy1=2,y2=3, y3=6,y4=8.

The algorithmcould be derandomizedbythe standardderandomization technique usingperfecthash families[27,32].

Sok-Clusteringissolvableinthesamedeterministictime.

4. Algorithmsandcomplexityfordistanceswithp(0,1]

The mainmotivationfortheresultsinthissection isthestudy ofk-Clusteringwiththe L1 distance,the casewidely knownask-Medians.However,ourmainalgorithmicresultalsoextendstodistancesoforderp

(

0

,

1

)

sinceinsomesense theybehavesimilarlytotheL1 distance.

4.1. FPTalgorithmwhenparameterizedbyD

Inthissubsection,weproveTheorem1:when p

(

0

,

1

]

,k-ClusteringadmitsanFPTalgorithmwithparameter D.First we state basicgeometricalobservations forcases p

=

1 and p

(

0

,

1

)

. Thenwepropose a generalalgorithmforCluster Selectionwhichreliesonlyontheseproperties.Finally,weshowhowTheorem10couldbeapplied.

The next two claims deal with the structure of optimal cluster centroids. We state and prove them in the case of weighted vectorswhereeach vector hasa positiveinteger weightgivenby aweight function w.The unweighted caseis justaspecialcasewhentheweightofeachvectorisone.

First, we show that coordinates of cluster centroids could always be selected amongthe values presentin theinput, whichhelpsgreatlyinenumeratingclustercentroidsthatmaybeoptimal.

Claim11.Assumep

(

0

,

1

]

,letC

= {

x1

, . . . ,

xt

}

beaclusterandw

: {

x1

, · · · ,

xt

} →

Z+beaweightfunction.Thereisanoptimal (subjecttotheweighteddistancew

(

xi

) ·

distp

(

xi

,

c

)

)centroidc ofC suchthatforeachi

∈ {

1

, . . . ,

d

}

,thei-thcoordinatec

[

i

]

ofthe centroidisfromthevaluespresentintheinputinthiscoordinate,thatisc

[

i

] ∈ {

x1

[

i

], . . . ,

xt

[

i

]}

.Moreover,forp

=

1wemayassume thattheoptimalvalueisaweightedmedianofthevaluespresentinthei-thcoordinate.

Proof. Forcluster C,considerthecorrespondingmultisetofunweightedvectorsC

= {

x1

, . . . ,

xt

}

,whereeachvector x

C isrepeated w

(

x

)

times.Wedefine yj

=

xj

[

i

]

for j

∈ {

1

, . . . ,

t

}

.Assumethat y1

y2

≤ · · · ≤

yt.Letusconsideran optimal clustercentroidc forC anddenotez

=

c

[

i

]

.Fig.4showshowtheclustercostbehaves withrespecttoz onaconcreteset ofvalues

{

yi

}

forp

=

1 and p

=

1

/

2.

Fortheformalproof,westartwiththecasep

=

1.ThetotalcostofC contributedbythei-thecoordinateis

|

y1

z

| + |

y2

z

| + · · · + |

yt

z

| .

Ifz

(

yi

,

yi+1

)

fori

∈ {

1

, . . . ,

t

1

}

,thenthederivativewithrespecttoz is

((

z

y1

) + · · · + (

z

yi

) + (

yi+1

z

) + · · · + (

yt

z

))

=

i

(

t

i

).

Analogously, when z

=

yi fori

∈ {

1

, . . . ,

t

}

,the derivative is i

1

(

t

i

)

. When z

<

y1 thederivative is

t,and when z

>

yt thederivative ist.Soift isodd,then thederivative iszero at yt/2, strictlynegative before andstrictlypositive after,so yt/2,whichistheonlymedian,istheoptimalvalueforz.Ift iseven,thenthederivativeiszeroon

[

yt/2

,

yt/2+1

]

, strictlynegativebeforeandstrictlypositiveafter.Soanyvaluefrom

[

yt/2

,

yt/2+1

]

isoptimal,andwemayassumethatitis oneofthetwomedians yt/2,yt/2+1.

Referanser

RELATERTE DOKUMENTER