Fedor V. Fomin, Petr A. Golovach, Kirill Simonov


Journal of Computer and System Sciences


Parameterized k-Clustering: Tractability island ✩

Fedor V. Fomin, Petr A. Golovach, Kirill Simonov ∗

Department of Informatics, University of Bergen, Thormøhlens Gate 55, 5008 Bergen, Norway

Article info

Article history:
Received 18 December 2019
Received in revised form 17 August 2020
Accepted 12 October 2020
Available online 19 November 2020

Keywords:
Clustering
Parameterized complexity
k-means
k-median

Abstract

In k-Clustering we are given a multiset of n vectors $X \subset \mathbb{Z}^d$ and a nonnegative number D, and we need to decide whether X can be partitioned into k clusters $C_1, \dots, C_k$ such that the cost

$$\sum_{i=1}^{k} \min_{c_i \in \mathbb{R}^d} \sum_{x \in C_i} \|x - c_i\|_p^p \le D,$$

where $\|\cdot\|_p$ is the $L_p$-norm. For $p = 2$, k-Clustering is k-Means. We study k-Clustering from the perspective of parameterized complexity. The problem is known to be NP-hard for $k = 2$ and also for $d = 2$. It is a long-standing open question whether the problem is fixed-parameter tractable (FPT) for the combined parameter $d + k$. In this paper, we focus on the parameterization by D. We complement the known negative results by showing that for $p = 0$ and $p = \infty$, k-Clustering is W[1]-hard when parameterized by D. Interestingly, we discover a tractability island of k-Clustering: for every $p \in (0, 1]$, k-Clustering is solvable in time $2^{O(D \log D)} (nd)^{O(1)}$.

© 2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Recall that for $p > 0$, the Minkowski or $L_p$-norm of a vector $x = (x[1], \dots, x[d]) \in \mathbb{R}^d$ is defined as

$$\|x\|_p = \left( \sum_{i=1}^{d} |x[i]|^p \right)^{1/p}.$$

Respectively, we define the ($L_p$-norm) distance between two vectors $x = (x[1], \dots, x[d])$ and $y = (y[1], \dots, y[d])$ as

$$\mathrm{dist}_p(x, y) = \|x - y\|_p^p = \sum_{i=1}^{d} |x[i] - y[i]|^p.$$

We also consider $\mathrm{dist}_p$ for $p = 0$ and $p = \infty$. For $p = 0$, $\mathrm{dist}_p$ is the $L_0$ (or the Hamming) distance, that is the number of different coordinates in x and y:

✩ This work is supported by the Research Council of Norway via the project “MULTIVAL”. The grant number is 263317.

∗ Corresponding author.

E-mail addresses: [email protected] (F.V. Fomin), [email protected] (P.A. Golovach), [email protected] (K. Simonov).

https://doi.org/10.1016/j.jcss.2020.10.005


Fig. 1. Optimal clusterings of the same set of vectors with different distances: $\mathrm{dist}_1$ in the left subfigure, $\mathrm{dist}_{1/4}$ in the right subfigure. Shapes denote clusters, crosses denote cluster centroids.

$$\mathrm{dist}_0(x, y) = |\{ i \in \{1, \dots, d\} : x[i] \neq y[i] \}|.$$

For $p = \infty$, $\mathrm{dist}_p$ is the $L_\infty$-distance, which is defined as

$$\mathrm{dist}_\infty(x, y) = \max_{i \in \{1, \dots, d\}} |x[i] - y[i]|.$$
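The three definitions above can be collected into one small routine. The following is a minimal sketch, not code from the paper; the function name `dist_p` and the use of `math.inf` to encode $p = \infty$ are our own conventions. Note that, following the definition above, the value for $p = 2$ is the squared Euclidean norm of the difference.

```python
import math
from typing import Sequence


def dist_p(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Distance dist_p from the definitions above.

    p = 0 gives the Hamming distance, p = math.inf the maximum
    coordinate difference, and 0 < p < inf the sum of |x[i] - y[i]|^p.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    if p < 0:
        raise ValueError("p must be nonnegative")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == 0:
        # Number of coordinates in which x and y differ.
        return float(sum(1 for t in diffs if t != 0))
    if p == math.inf:
        return float(max(diffs, default=0.0))
    return float(sum(t ** p for t in diffs))
```

For example, `dist_p([0, 0], [3, 4], 1)` sums the coordinate differences, while `dist_p([0, 0], [3, 4], math.inf)` keeps only the largest one.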

The k-Clustering problem is defined as follows. For a given (multi)dataset of n vectors (points) $X \subset \mathbb{Z}^d$, the task is to find a partition of X into k clusters $C_1, \dots, C_k$ minimizing the cost

$$\sum_{i=1}^{k} \min_{c_i \in \mathbb{R}^d} \sum_{x \in C_i} \mathrm{dist}_p(x, c_i),$$

intuitively, $c_i$ is a centroid of the cluster $C_i$.

In particular, for $p = 1$, $\mathrm{dist}_p$ is the $L_1$-distance and the corresponding clustering problem is known as k-Median. (Often in the literature, k-Median is also used for clustering minimizing the sums of the Euclidean distances.) For $p = 2$, $\mathrm{dist}_p$ is the $L_2$ (Euclidean) distance, and then the clustering problem becomes k-Means.

Let us note that optimal clusterings for the same set of vectors can be drastically different for various values of p, as shown in Fig. 1. As we show in the paper, the complexity of k-Clustering also strongly depends on the choice of p.

k-Clustering, and especially k-Median and k-Means, are among the most prevalent problems occurring in virtually every subarea of data science. We refer to the survey of Jain [1] for an extensive overview. While in practice the most common approaches to clustering are based on different variations of Lloyd’s heuristic [2], the problem is interesting from the theoretical perspective as well. In particular, there is a vast amount of literature on approximation algorithms for k-Clustering whose behavior can be analyzed rigorously, see e.g. [3–17].

When it comes to exact solutions, we observe the following phenomenon. While heuristic algorithms for k-Clustering work surprisingly well in practice, from the perspective of parameterized complexity, k-Clustering is intractable for all previously studied parameterizations, see Table 1. The k-Clustering problem is naturally “multivariate”: in addition to the number of points n, there are also parameters like the space dimension d, the number of clusters k, or the cost of clustering D. The problem is known to be NP-complete for $k = 2$ [18,19] and for $d = 2$ [20,21]. By the classical work of Inaba et al. [22], in the case when both d and k are constants, k-Clustering is solvable in polynomial time $O(n^{dk+1})$. It is a long-standing open problem whether k-Clustering is FPT parameterized by $d + k$. Under ETH, the lower bound of $n^{\Omega(k)}$, even when $d = 4$, was shown by Cohen-Addad et al. in [23] for the settings where the set of potential candidate centers is explicitly given as input. However, the lower bound of Cohen-Addad et al. does not generalize to the settings of this paper, where any point in Euclidean space can serve as a center. For the special case when the input consists of binary vectors and the distance is Hamming, the problem is solvable in time $2^{O(D \log D)} (nd)^{O(1)}$ [24].

Our results and approaches. In this paper we investigate the dependence of the complexity of k-Clustering on the cost of clustering D. It appears that adding this new “dimension” makes the complexity landscape of k-Clustering intricate and interesting. More precisely, we consider the following problem.

k-Clustering with distance dist

Input: A multiset X of n vectors in $\mathbb{Z}^d$, a positive integer k, and a nonnegative number D.

Task: Decide whether there is a partition of X into k clusters $\{C_i\}_{i=1}^{k}$ and k vectors $\{c_i\}_{i=1}^{k}$, called centroids, in $\mathbb{R}^d$ such that

$$\sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, c_i) \le D.$$

Let us remark that the vector set X (like the column set of a matrix) can contain many equal vectors. Also, we consider the situation when vectors from X are integer vectors, while centroid vectors are not necessarily from X. Moreover, coordinates of centroids can be reals.

Our main algorithmic result is the following theorem.

Theorem 1. k-Clustering with distance $\mathrm{dist}_p$ is solvable in time $2^{O(D \log D)} (nd)^{O(1)}$ for every $p \in (0, 1]$.

Thus k-Clustering when parameterized by D is fixed-parameter tractable (FPT) for the Minkowski distance $\mathrm{dist}_p$ of order $0 < p \le 1$. In the first step of our algorithm we use color coding to reduce the problem to Cluster Selection, which we find interesting on its own. In Cluster Selection we have t groups of weighted vectors and the task is to select exactly one vector from each group such that the weighted cost of the composite cluster is at most D. More formally,

Cluster Selection with distance dist

Input: A set of m vectors X given together with a partition $X = X_1 \cup \dots \cup X_t$ into t disjoint sets, a weight function $w : X \to \mathbb{Z}_+$, and a nonnegative number D.

Task: Decide whether it is possible to select exactly one vector $x_i$ from each set $X_i$ such that the total cost of the composite cluster formed by $x_1, \dots, x_t$ is at most D:

$$\min_{c \in \mathbb{R}^d} \sum_{i=1}^{t} w(x_i) \cdot \mathrm{dist}(x_i, c) \le D.$$
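To make the problem statement concrete, here is a naive sketch that decides Cluster Selection for $0 < p \le 1$ by trying every choice of one vector per group. It is exponential in t and only illustrates the definition; it is not the FPT algorithm of the paper. It assumes the fact, stated later as Claim 11, that for $p \in (0, 1]$ each coordinate of an optimal centroid can be taken from the values present in that coordinate. The names `weighted_cluster_cost` and `cluster_selection_bruteforce` are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Vector = Tuple[int, ...]


def weighted_cluster_cost(points: Sequence[Vector],
                          weights: Sequence[int],
                          p: float) -> float:
    """Optimal weighted cost of a cluster for 0 < p <= 1.

    Assumption (Claim 11): in each coordinate it suffices to try the
    values that occur among the selected vectors.
    """
    dimension = len(points[0])
    total = 0.0
    for j in range(dimension):
        candidates = {x[j] for x in points}
        total += min(
            sum(w * abs(x[j] - z) ** p for x, w in zip(points, weights))
            for z in candidates
        )
    return total


def cluster_selection_bruteforce(groups: Sequence[Sequence[Vector]],
                                 weight: Dict[Vector, int],
                                 p: float,
                                 D: float) -> bool:
    """Try every way to pick exactly one vector from each group."""
    for choice in product(*groups):
        chosen_weights = [weight[x] for x in choice]
        if weighted_cluster_cost(choice, chosen_weights, p) <= D:
            return True
    return False
```

The weights enter only through the cost: a vector of weight w behaves like w identical copies of that vector in the composite cluster.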

The Cluster Selection problem is closely related to variants of the well-known Consensus Pattern problem. Namely, for the Hamming distance, the definition of Cluster Selection nearly coincides with the Colored Consensus Strings with Outliers problem studied in [25], only in the latter the alphabet is assumed to be of constant size.

Informally (see Theorem 10 for the precise statement), our reduction shows that if the distance norm satisfies some specific properties (which $\mathrm{dist}_p$ satisfies for all p) and if Cluster Selection is FPT parameterized by D, then so is k-Clustering. Therefore, in order to prove Theorem 1, all we need is to show that Cluster Selection is FPT parameterized by D when $p \in (0, 1]$. This is the most difficult part of the proof. Here we invoke the theorem of Marx [26] on the number of subhypergraphs in hypergraphs of bounded fractional edge cover.

Superficially, the general idea of the proof of Theorem 1 is similar to the idea behind the algorithm for Binary r-Means for $L_0$ from [24]. In both cases, the classical color coding technique of Alon et al. [27] is used as a preprocessing step. However, the further steps in [24] strongly exploit the fact that the data is binary. As we will see in Theorem 2, the existence of an FPT algorithm for k-Clustering in $L_0$ is highly unlikely. Thus the reductions from [24] cannot be applied in our case, and we need a new approach.

More precisely, for clustering in $L_0$ we prove the following theorem.

Theorem 2. With distance $\mathrm{dist}_0$, k-Clustering parameterized by $d + D$ and Cluster Selection parameterized by $d + t + D$ are W[1]-hard.

In particular, this means that up to the widely believed assumption in complexity theory that $\mathrm{FPT} \neq W[1]$, Theorem 2 rules out algorithms solving k-Clustering in time $f(d, D) \cdot n^{O(1)}$ and algorithms solving Cluster Selection in $L_0$ in time $g(t, d, D) \cdot n^{O(1)}$ for any functions $f(d, D)$ and $g(t, d, D)$. A similar hardness result holds for $L_\infty$.

Theorem 3. With distance $\mathrm{dist}_\infty$, k-Clustering parameterized by D and Cluster Selection parameterized by $t + D$ are W[1]-hard.

This naturally brings us to the question: What happens with k-Clustering for $p \in (1, \infty)$, especially for the Euclidean distance, that is $p = 2$? Unfortunately, we are not able to answer this question when the parameter is D only. However, we can prove that

Theorem 4. k-Clustering and Cluster Selection with distance $\mathrm{dist}_2$ are FPT when parameterized by $d + D$.

Thus in particular, Theorem 4 implies that k-Clustering with distance $\mathrm{dist}_2$ is FPT parameterized by $d + D$. On the other hand, we prove that

Theorem 5. Cluster Selection with distance $\mathrm{dist}_p$ is W[1]-hard for every $p \in (1, \infty)$ when parameterized by $t + D$.

Table 1
Complexity of k-Clustering and Cluster Selection.

| $\mathrm{dist}_p$ | k-Clustering | Cluster Selection |
|---|---|---|
| $p = 0$ | W[1]-hard param. $d + D$ [Theorem 2]; NP-c for $k = 2$ [19] | W[1]-hard param. $d + t + D$ [Theorem 2] |
| $0 < p \le 1$ | $2^{O(D \log D)} (nd)^{O(1)}$ [Theorem 1]; NP-c for $k = 2$ when $p = 1$ [19]; NP-c for $d = 2$ when $p = 1$ [20] | $2^{O(D \log D)} (nd)^{O(1)}$ [Theorem 15]; W[1]-hard param. $t + d$ for $p = 1$ [Theorem 20] |
| $1 < p < +\infty$ | FPT param. $d + D$ for $p = 2$ [Theorem 4]; NP-c for $k = 2$ when $p = 2$ [18]; NP-c for $d = 2$ when $p = 2$ [21] | FPT param. $d + D$ for $p = 2$ [Theorem 4]; W[1]-hard param. $t + D$ [Theorem 5] |
| $p = \infty$ | W[1]-hard param. D [Theorem 3]; NP-c for $k = 2$ [Theorem 30] | W[1]-hard param. $t + D$ [Theorem 3] |

In particular, Theorem 5 yields that the approach we used to establish the tractability (with parameter D) of k-Clustering for $p = 1$ will not work for $p > 1$.

We summarize our and previously known algorithmic and hardness results for k-Clustering and Cluster Selection with different distances in Table 1. Observe that Theorem 10 works also in the setting where possible cluster centers are restricted to be from a set given in the input, and so do our algorithmic Theorems 1 and 4 since Cluster Selection is trivially solvable in polynomial time in this setting.

Now we discuss the choice of the parameter D. It might be noted that the regime where the cost of clustering D is small compared to the number of points n is quite special. Indeed, if the cost of clustering is at most D, then there are but a few points that are not equal to the respective cluster centers. Thus, the problem we study has the spirit of an editing problem: check whether a given instance is close to a “structured” one, where in our case a “structured” instance has at most k distinct points, and closeness is measured via the sum of $L_p$-distances. Editing problems are extensively studied in the parameterized algorithms literature, ranging from the vast area of graph modification (see e.g. a recent survey by Crespelle et al. [28]) to studies very close to ours, like the Consensus Patterns algorithm by Marx [26], and the study of Binary r-Means by Fomin et al. [24] that is essentially a special case of our k-Clustering problem. And still, even in this highly structured regime, our results show a very intricate picture: for instance, for k-Clustering parameterized just by D, we provide a highly non-trivial FPT algorithm in the case $0 < p \le 1$, while on the other hand, conditionally, the same scheme could not lead to an analogous algorithm in the case $p = 2$, and there could not be any FPT algorithm at all in the cases $p = 0$ and $p = \infty$. Finally, we believe that studying k-Clustering with respect to the parameter D is an essential question given the notorious hardness of the problem. Recall that for the combination of the two other natural parameters, the dimension d and the number of clusters k, only an $O(n^{dk+1})$ algorithm of Inaba et al. is known [22], and the hardness result by Cohen-Addad et al. in [23] serves as a strong indication that a better algorithm might not exist.

Observe that we always consider integer-valued instances. We believe this is the most natural model for studying the complexity of k-Clustering with respect to the parameter D. Here it is important to note that considering D as a parameter only makes sense if the input values are suitably discretized. Imagine input vectors could have arbitrary real-valued (or rational-valued) entries; then for a given instance it is always possible to scale the values down by the same factor such that the cost of an optimal clustering is arbitrarily small, but the structure of the instance is completely preserved. Thus the restriction to integer values in our study is a natural discretization of the problem. It allows the parameter D to bear deep structural significance, as our results demonstrate.
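The scaling remark can be spelled out with a one-line identity. This is a worked illustration for $0 < p < \infty$ rather than a statement taken from the paper; the scaling factor $\lambda$ is our own notation.

```latex
\[
\mathrm{dist}_p(\lambda x, \lambda y)
  = \sum_{i=1}^{d} |\lambda x[i] - \lambda y[i]|^p
  = \lambda^p \sum_{i=1}^{d} |x[i] - y[i]|^p
  = \lambda^p \, \mathrm{dist}_p(x, y),
\qquad \lambda > 0 .
\]
```

Scaling all input vectors and all centroids by the same $\lambda$ therefore multiplies the cost of every clustering by $\lambda^p$. The optimal partition is unchanged, while its cost tends to 0 as $\lambda \to 0$, which is why D carries no information for unrestricted real-valued inputs.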

The remaining part of this paper is organized as follows. Section 2 contains preliminaries. In Section 3 we prove Theorem 10, which provides us with an FPT Turing reduction from k-Clustering to Cluster Selection. Theorem 10 appears to be a handy tool to establish tractability of k-Clustering. In Section 4 we collect the results on clustering with the $L_p$-norm for $p \in (0, 1]$. In particular, in Subsection 4.1, we prove Theorem 1, the main algorithmic result of this work, stating that when $p \in (0, 1]$, k-Clustering and Cluster Selection admit FPT algorithms with parameter D. In Subsection 4.2 we complement the algorithmic upper bounds with lower bounds by proving that Cluster Selection is W[1]-hard when $p = 1$ and the parameter is $t + d$ (Theorem 20). In Section 5, we consider the case $p = 0$ and prove Theorem 2 establishing W[1]-hardness of k-Clustering and Cluster Selection. Section 6 is devoted to the case $p = \infty$. Here we establish two hardness results about k-Clustering: W[1]-hardness when parameterized by D and NP-hardness in the case $k = 2$. In Section 7, we look at the case $p \in (1, \infty)$, with particular emphasis on the most commonly used case $p = 2$. We show that when $d + D$ is the parameter, then Cluster Selection and k-Clustering in the $L_2$ distance are FPT. We also show that Cluster Selection is W[1]-hard when parameterized by $t + D$ for all $p \in (1, \infty)$. We conclude with open problems in Section 8.

2. Preliminaries and notation

Cluster notation. By a cluster we always mean a multiset of vectors in $\mathbb{Z}^d$. For distance dist, the cost of a given cluster C is the total distance from all vectors in the cluster to the optimally selected cluster centroid, $\min_{c \in \mathbb{R}^d} \sum_{x \in C} \mathrm{dist}(x, c)$. An optimal cluster centroid for a given cluster C is any $c \in \mathbb{R}^d$ minimizing $\sum_{x \in C} \mathrm{dist}(x, c)$. For most of the considered distances, we argue that an optimal cluster centroid could always be chosen among a specific family of vectors (e.g. integral). Whenever we show this, we only consider optimal cluster centroids of the stated form afterwards.

Complexity. A parameterized problem is a language $Q \subseteq \Sigma^* \times \mathbb{N}$ where $\Sigma^*$ is the set of strings over a finite alphabet $\Sigma$. Respectively, an input of Q is a pair $(I, k)$ where $I \in \Sigma^*$ and $k \in \mathbb{N}$; k is the parameter of the problem. A parameterized problem Q is fixed-parameter tractable (FPT) if it can be decided whether $(I, k) \in Q$ in time $f(k) \cdot |I|^{O(1)}$ for some function f that depends on the parameter k only. Respectively, the parameterized complexity class FPT is composed of fixed-parameter tractable problems. The W-hierarchy is a collection of computational complexity classes; we omit the technical definitions here. The following relation is known amongst the classes in the W-hierarchy: $\mathrm{FPT} = W[0] \subseteq W[1] \subseteq W[2] \subseteq \dots \subseteq W[P]$. It is widely believed that $\mathrm{FPT} \neq W[1]$, and hence if a problem is hard for the class $W[i]$ (for any $i \ge 1$) then it is considered to be fixed-parameter intractable. We refer to books [29,30] for the detailed introduction to parameterized complexity.

We also provide conditional lower bounds by making use of the following complexity hypothesis formulated by Impagliazzo, Paturi, and Zane [31].

Exponential Time Hypothesis (ETH): There is a positive real s such that 3-CNF-SAT with n variables and m clauses cannot be solved in time $2^{sn} (n + m)^{O(1)}$.

Graphs. In our W[1]-hardness proofs, we heavily employ graph-theoretical notation. Whenever we work with a graph G, we always fix some ordering on the vertices $\pi_V : V(G) \to \{1, \dots, |V(G)|\}$ and on the edges $\pi_E : E(G) \to \{1, \dots, |E(G)|\}$. We drop $\pi_V$ and $\pi_E$ to simplify notation, so when we consider a vertex $v \in V(G)$ or an edge $e \in E(G)$, v and e also denote integers: the numbers of v and e according to the orderings $\pi_V$ and $\pi_E$ correspondingly.

Real computations. Since we deal with a problem concerning real-valued matrices, we express the running time of algorithms in terms of the number of operations over the reals. This is natural since to compute $L_p$-distances we have to deal with numbers of the form $x^p$ where x is an integer and p is any real number. However, in special cases the bounds hold even for more restrictive models, e.g. when $p = 1$ or $p = 2$ the algorithms operate only on integers of polynomially bounded length.

3. From k-Clustering to Cluster Selection

In this section we present a general scheme for obtaining an FPT algorithm parameterized by D, which is later applied to various distances.

First, we formalize the following intuition: there is no reason to assign equal vectors to different clusters.

Definition 6 (Initial cluster and regular partition). For a multiset of vectors X, an inclusion-wise maximal multiset $I \subset X$ such that all vectors in I are equal is called an initial cluster. We say that a clustering $\{C_1, \dots, C_k\}$ of X is regular if for every initial cluster I there is an $i \in \{1, \dots, k\}$ such that $I \subset C_i$.

Now we prove that it suffices to look only for regular solutions.

Proposition 7. Let $(X, k, D)$ be a yes-instance to k-Clustering. Then there exists a solution of $(X, k, D)$ which is a regular clustering.

Proof. Let us assume that the instance $(X, k, D)$ has a solution. There are k clusters $\{C_i\}_{i=1}^{k}$ and k vectors $\{c_i\}_{i=1}^{k}$ in $\mathbb{R}^d$ such that $\sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, c_i) \le D$. Note that for every $x \in C_j$, $\mathrm{dist}(x, c_j) \ge \min_{1 \le i \le k} \mathrm{dist}(x, c_i)$. So if we consider a new clustering $\{C'_1, \dots, C'_k\}$ with the same centroids, where $C'_j$ consists of all vectors from X for which $c_j$ is the closest centroid, the total distance does not increase. If we also break ties in favor of the lower index, then for any initial cluster I the same centroid $c_i$ will be the closest, and all vectors from I will end up in $C'_i$, so $\{C'_1, \dots, C'_k\}$ is a regular clustering.
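The reassignment step used in the proof can be sketched as follows. This is an illustration under our own naming (`initial_clusters`, `regularize`), with the distance passed in as a callable. As in the proof, the centroids stay fixed and ties are broken in favor of the lower index; the sketch makes no attempt to handle clusters that become empty after reassignment.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[int, ...]
Centroid = Tuple[float, ...]


def initial_clusters(X: Sequence[Vector]) -> Dict[Vector, int]:
    """Map each distinct vector to its multiplicity (its initial cluster size)."""
    return dict(Counter(X))


def regularize(X: Sequence[Vector],
               centroids: Sequence[Centroid],
               dist: Callable[[Vector, Centroid], float]) -> List[List[Vector]]:
    """Reassign every vector to its closest centroid, lowest index on ties.

    Equal vectors always land in the same cluster, so the result is a
    regular clustering whose total cost is not larger than before.
    """
    clusters: List[List[Vector]] = [[] for _ in centroids]
    for x in X:
        best = min(range(len(centroids)),
                   key=lambda i: (dist(x, centroids[i]), i))
        clusters[best].append(x)
    return clusters
```

Including the index `i` in the sort key makes the tie-breaking rule explicit instead of relying on the order in which `min` scans its candidates.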

From now on, we consider only regular solutions.

Definition 8 (Simple and composite clusters). We say that a cluster C is simple if it is an initial cluster. Otherwise, the cluster is composite.

Next we state a property of k-Clustering with a particular distance, which is required for the algorithm. Intuitively, each unique vector adds at least some constant to the cluster cost.

Definition 9 ($\alpha$-property). We say that a distance has the $\alpha$-property for some $\alpha > 0$ if for any s the cost of any composite cluster which consists of s initial clusters is at least $\alpha (s - 1)$.

In the subsequent sections we show that the $\alpha$-property holds for all the distance measures for which we present algorithms. Namely, the $L_p$-distance has the $\alpha$-property with a certain constant $\alpha$, for each $p \in [0, 1] \cup \{2, \infty\}$. Analogously to the case $p = 2$, one can show that it holds for all other values of p between 1 and $\infty$ as well, although we do not need this fact.

Fig. 2. An illustration of the algorithm in Theorem 10. We start with a particular random coloring and a particular partition of colors $\mathcal{P} = \{P_1, P_2\}$, where $P_1$ consists of two colors and $P_2$ consists of three colors. We make two calls to Cluster Selection with respect to $P_1$ and $P_2$ and construct the resulting clustering. In the example, all input vectors are distinct.

The Cluster Selection problem defined in the introduction is a key subroutine in our algorithm. In some cases the problem is solvable trivially, but it presents the main challenge for our main algorithmic result with the $L_1$ distance. The intuition behind the weight function in the definition of Cluster Selection is that it represents sizes of initial clusters, that is, how many equal vectors there are.

We also need a procedure to enumerate all values of the cost of each possible cluster, with respect to an optimally selected cluster centroid, that are at most D. It may not be straightforward since not all distances in our consideration are integer. So for the purpose of stating Theorem 10 for general metrics, we assume that the set of all possible optimal cluster costs which are at most D is also given in the input. For the $L_p$-distances we consider, in the respective algorithmic theorems we show how to provide this set without raising any additional assumptions or increasing the running time. Now we are ready to state the result formally.

Theorem 10. Assume that the $\alpha$-property holds, Cluster Selection is solvable in time $\Phi(m, d, t, D)$, where $\Phi$ is a non-decreasing function of its arguments, and we are given the set $\mathcal{D}$ of all possible optimal cluster costs which are at most D. Then k-Clustering is solvable in time

$$2^{O(D \log D)} (nd)^{O(1)} \, |\mathcal{D}| \, \Phi(n, d, 2D/\alpha, D).$$

Proof. By the $\alpha$-property, in any solution there are at most $D/\alpha$ composite clusters, since each contains at least two initial clusters. Moreover, there are at most $2D/\alpha$ initial clusters in all composite clusters.
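Both bounds follow from a short calculation, written out here as a worked step. The notation is ours: h is the number of composite clusters in a solution and $s_1, \dots, s_h \ge 2$ are the numbers of initial clusters they contain. Simple clusters contribute zero cost, so only composite clusters count towards D.

```latex
\[
D \;\ge\; \sum_{j=1}^{h} \alpha\,(s_j - 1) \;\ge\; \alpha h
\quad\Longrightarrow\quad
h \le \frac{D}{\alpha},
\qquad
\sum_{j=1}^{h} s_j \;\le\; \sum_{j=1}^{h} 2\,(s_j - 1) \;\le\; \frac{2D}{\alpha}.
\]
```

The first chain uses $s_j - 1 \ge 1$, and the second uses $s_j \le 2(s_j - 1)$, which holds exactly when $s_j \ge 2$.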

Thus by Proposition 7, solving k-Clustering is equivalent to selecting at most $T := 2D/\alpha$ initial clusters and grouping them into composite clusters such that the total cost of these clusters is at most D. We design an algorithm which, taking as a subroutine an algorithm for Cluster Selection, solves k-Clustering. The algorithm is sketched in Fig. 3, an example is shown in Fig. 2.

To perform the selection and grouping, our algorithm uses the color coding technique of Alon, Yuster, and Zwick from [27]. Consider the input as a family of initial clusters $\mathcal{I}$. We color initial clusters from $\mathcal{I}$ independently and uniformly at random by T colors $1, 2, \dots, T$. Consider any solution, and the particular set of at most T initial clusters which are included into composite clusters in this solution. These initial clusters are colored by distinct colors with probability at least $\frac{T!}{T^T} \ge e^{-T}$. Now we construct an algorithm for finding a colorful solution.

We consider all possible ways to split colors between clusters (some colors may be unused). Hence we consider all possible families $\mathcal{P} = \{P_1, \dots, P_h\}$ of pairwise disjoint non-empty subsets of $\{c \in \{1, \dots, T\} : \text{there exists } J \in \mathcal{I} \text{ colored by } c\}$. Each family $\mathcal{P}$ corresponds to a partition of the set of colors $\{1, \dots, T\}$ if we add one fictitious subset for colors which are not used in the composite clusters. The total number of partitions does not exceed $T^T = 2^{O(D \log D)}$.
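The enumeration of color families can be sketched with a standard recursive scheme: each color is either left unused, added to an existing subset, or used to open a new subset. The generator below is our own illustration (the name `disjoint_families` is hypothetical) and produces every family of pairwise disjoint non-empty subsets exactly once. It does not yet apply the later validity checks on subset sizes and on the number of clusters.

```python
from typing import Iterator, List, Sequence


def disjoint_families(colors: Sequence[int]) -> Iterator[List[List[int]]]:
    """Yield all families of pairwise disjoint non-empty subsets of `colors`.

    Colors that appear in no subset play the role of the fictitious
    "unused" block mentioned in the text.
    """
    def extend(idx: int, blocks: List[List[int]]) -> Iterator[List[List[int]]]:
        if idx == len(colors):
            # Copy, because `blocks` is mutated while backtracking.
            yield [list(b) for b in blocks]
            return
        c = colors[idx]
        # Option 1: leave color c unused.
        yield from extend(idx + 1, blocks)
        # Option 2: put c into one of the subsets opened so far.
        for b in blocks:
            b.append(c)
            yield from extend(idx + 1, blocks)
            b.pop()
        # Option 3: open a new subset that starts with c.
        blocks.append([c])
        yield from extend(idx + 1, blocks)
        blocks.pop()

    yield from extend(0, [])
```

Because subsets are opened in the order of their smallest color, no family is produced twice under a different ordering of its subsets.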

k-Clustering(X, k, D, α, 𝒟)
Input: A multiset X ⊂ Z^d, a positive integer k, real nonnegative values D and α, a set 𝒟, an algorithm A for Cluster Selection
Output: Yes or No

T ← 2D/α
ℐ ← initial clusters of X
for e^T iterations do
    Fix a random coloring c of ℐ with colors {1, …, T}
    for valid partitions 𝒫 of {1, …, T} do
        for i = 1 to |𝒫| do
            P_i = {i_1, …, i_t}
            for j = 1 to t do
                X_j ← ∅
                for J ∈ ℐ : c(J) = i_j do
                    x ← a point from J
                    X_j ← X_j ∪ {x}
                    w(x) ← |J|
            d_i ← D + 1
            for each d ∈ 𝒟 (in increasing order) do
                if A(X_1, …, X_t, w, d) then
                    d_i ← d
                    break
        if Σ_{i=1}^{|𝒫|} d_i ≤ D then
            Yes, STOP
No, STOP

Fig. 3. k-Clustering algorithm from Theorem 10.

When the partition $\mathcal{P}$ is fixed, we form clusters by solving instances of Cluster Selection: For each $i \in \{1, \dots, h\}$, we take initial clusters colored by elements of $P_i$, bundle together those with the same color, and pass the resulting family to Cluster Selection. First note that there cannot be $P \in \mathcal{P}$ of size at most one, since then Cluster Selection has to make a simple cluster while we assume that all clusters obtained from $\mathcal{P}$ are composite. Second, the total number of clusters has to be k; the number of clusters is $|\mathcal{I}| - \sum_{P \in \mathcal{P}} |P| + |\mathcal{P}|$. For each $\mathcal{P}$ we check that both conditions hold, and if not, we discard the choice of $\mathcal{P}$ and move to the next one, before calling the Cluster Selection subroutine.
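The two validity conditions translate directly into a small predicate. This is a sketch with hypothetical names (`is_valid_family`, `num_initial_clusters`), where a family is given as a list of color subsets.

```python
from typing import Sequence


def is_valid_family(family: Sequence[Sequence[int]],
                    num_initial_clusters: int,
                    k: int) -> bool:
    """Check the two conditions a family of color subsets must satisfy.

    1. Every subset has at least two colors, so every cluster built
       from the family is composite.
    2. The resulting number of clusters, |I| - sum |P| + |family|, is k:
       one initial cluster per color is merged, the rest stay simple.
    """
    if any(len(P) < 2 for P in family):
        return False
    merged = sum(len(P) for P in family)
    num_clusters = num_initial_clusters - merged + len(family)
    return num_clusters == k
```

For instance, with ten initial clusters and the family `[[1, 2], [3, 4, 5]]`, five initial clusters are merged into two composite clusters, leaving five simple clusters and seven clusters in total.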

Next, we formalize how we call the Cluster Selection subroutine. We fix the set of colors $P_i = \{c_1, \dots, c_t\}$, then take the sets $\mathcal{I}_j = \{J \in \mathcal{I} : J \text{ is colored by } c_j\}$ for $j \in \{1, \dots, t\}$. We turn each set of initial clusters $\mathcal{I}_j$ into a set of weighted vectors $X_j$ naturally: For each $J \in \mathcal{I}_j$, we put one vector $x \in J$ into $X_j$, and $w(x) := |J|$. The family of sets of vectors $X_1, \dots, X_t$ and the weight function w are the input for Cluster Selection. Then we search for the minimum cluster cost bound $d_i \le D$ from $\mathcal{D}$ for which the instance $(X_1, \dots, X_t, w, d_i)$ of Cluster Selection is a yes-instance, running each time the algorithm for Cluster Selection.

If for some i setting $d_i$ to D leads to a no-instance, or if $\sum_{i=1}^{h} d_i > D$, then we discard the choice of the partition $\mathcal{P}$ and move to the next one. Otherwise, we report that k-Clustering has a solution and stop. Next, we prove that in this case the solution indeed exists.

We reconstruct the solution to k-Clustering as follows: For each $i \in \{1, \dots, h\}$ the instance of Cluster Selection corresponding to $P_i = \{c_1, \dots, c_t\}$ has a solution $\{x_1, \dots, x_t\}$. For each $j \in \{1, \dots, t\}$, consider the corresponding initial cluster $J_j$ consisting of $w(x_j)$ vectors equal to $x_j$. For each $i \in \{1, \dots, h\}$ we obtain a composite cluster $\bigcup_{j=1}^{t} J_j$; all other clusters are simple. So the total cost is $\sum_{i=1}^{h} d_i$, which is at most D. Thus, if the algorithm finds a solution, then $(X, k, D)$ is a yes-instance.

In the opposite direction, if there is a solution to k-Clustering, then there is a regular solution, and with probability at least $e^{-T}$ the initial clusters which are parts of composite clusters in this solution are colored by distinct colors. Then, there is a partition $\mathcal{P} = \{P_1, \dots, P_h\}$ which corresponds to this solution. This partition is obtained as follows: put into $P_1$ the colors from the first composite cluster, into $P_2$ from the second, and so on. At some point our algorithm checks the partition $\mathcal{P}$, and as it finds the optimal cost value for each cluster, this value is at most the cost of the corresponding cluster of the solution from which we started.

To analyze the running time, we consider $2^{O(D \log D)}$ partitions $\mathcal{P}$, and for each $\mathcal{P}$ we search $|\mathcal{P}| = O(D)$ times for the optimal $d_i$. For each of the $|\mathcal{D}|$ possible values¹ of $d_i$ we make one call to the Cluster Selection algorithm, which takes time at most $\Phi(n, d, T, D)$.

To reduce the error probability to at most $1/e$, we do $N = e^T$ iterations of the algorithm, each time with a new random coloring. As each iteration succeeds with probability at least $e^{-T}$, the probability of not finding a colorful solution after N iterations is at most $(1 - e^{-T})^{e^T} \le e^{-1} < 1$. So the total running time is $2^{O(D \log D)} \cdot (nd)^{O(1)} \, |\mathcal{D}| \, \Phi(n, d, 2D/\alpha, D)$.
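The two probability bounds used in the analysis are easy to check numerically for small T. The snippet below is only a sanity check under our own naming (`colorful_probability`, `failure_bound`). It rounds the number of iterations $e^T$ up to an integer, which is an assumption on how the iteration count would be implemented.

```python
import math


def colorful_probability(T: int) -> float:
    """Probability T!/T^T that T fixed items get T distinct colors."""
    return math.factorial(T) / T ** T


def failure_bound(T: int) -> float:
    """Upper bound on failing in all ceil(e^T) independent iterations,
    when each iteration succeeds with probability at least e^(-T)."""
    per_iteration_success = math.exp(-T)
    iterations = math.ceil(math.exp(T))
    return (1.0 - per_iteration_success) ** iterations


for T in range(1, 11):
    # T!/T^T >= e^(-T) and (1 - e^(-T))^(e^T) <= 1/e, as used above.
    assert colorful_probability(T) >= math.exp(-T)
    assert failure_bound(T) <= math.exp(-1)
```

The check is restricted to small T because for large T the margin in the second inequality falls below floating-point precision, although the inequality itself holds for every T.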

1 We could also binary search for the optimal $d_i \in \mathcal{D}$ instead, thus replacing $|\mathcal{D}|$ by $\log |\mathcal{D}|$ in the running time. However, for all choices of $\mathcal{D}$ we consider this does not make a difference.

Fig. 4. Graphs of cluster cost over different values of z: $\mathrm{dist}_1$ in the left plot, $\mathrm{dist}_{1/2}$ in the right plot. The set of coordinate values is given as $y_1 = 2$, $y_2 = 3$, $y_3 = 6$, $y_4 = 8$. Panel (a): $\mathrm{cost}(z) = |z - 2| + |z - 3| + |z - 6| + |z - 8|$. Panel (b): $\mathrm{cost}(z) = |z - 2|^{1/2} + |z - 3|^{1/2} + |z - 6|^{1/2} + |z - 8|^{1/2}$. Horizontal axis: z; vertical axis: cost.

The algorithm could be derandomized by the standard derandomization technique using perfect hash families [27,32]. So k-Clustering is solvable in the same deterministic time.

4. Algorithms and complexity for distances with $p \in (0, 1]$

The main motivation for the results in this section is the study of k-Clustering with the $L_1$ distance, the case widely known as k-Medians. However, our main algorithmic result also extends to distances of order $p \in (0, 1)$ since in some sense they behave similarly to the $L_1$ distance.

4.1. FPT algorithm when parameterized by D

In this subsection, we prove Theorem 1: when $p \in (0, 1]$, k-Clustering admits an FPT algorithm with parameter D. First we state basic geometrical observations for the cases $p = 1$ and $p \in (0, 1)$. Then we propose a general algorithm for Cluster Selection which relies only on these properties. Finally, we show how Theorem 10 could be applied.

The next two claims deal with the structure of optimal cluster centroids. We state and prove them in the case of weighted vectors where each vector has a positive integer weight given by a weight function w. The unweighted case is just a special case when the weight of each vector is one.

First, we show that coordinates of cluster centroids could always be selected among the values present in the input, which helps greatly in enumerating cluster centroids that may be optimal.

Claim 11. Assume $p \in (0, 1]$, let $C = \{x_1, \dots, x_t\}$ be a cluster and $w : \{x_1, \dots, x_t\} \to \mathbb{Z}_+$ be a weight function. There is an optimal (subject to the weighted distance $w(x_i) \cdot \mathrm{dist}_p(x_i, c)$) centroid c of C such that for each $i \in \{1, \dots, d\}$, the i-th coordinate $c[i]$ of the centroid is from the values present in the input in this coordinate, that is $c[i] \in \{x_1[i], \dots, x_t[i]\}$. Moreover, for $p = 1$ we may assume that the optimal value is a weighted median of the values present in the i-th coordinate.

Proof. For cluster C, consider the corresponding multiset of unweighted vectors $C = \{x_1, \dots, x_t\}$, where each vector $x \in C$ is repeated $w(x)$ times. We define $y_j = x_j[i]$ for $j \in \{1, \dots, t\}$. Assume that $y_1 \le y_2 \le \dots \le y_t$. Let us consider an optimal cluster centroid c for C and denote $z = c[i]$. Fig. 4 shows how the cluster cost behaves with respect to z on a concrete set of values $\{y_i\}$ for $p = 1$ and $p = 1/2$.

For the formal proof, we start with the case $p = 1$. The total cost of C contributed by the i-th coordinate is

$$|y_1 - z| + |y_2 - z| + \dots + |y_t - z|.$$

If $z \in (y_i, y_{i+1})$ for $i \in \{1, \dots, t - 1\}$, then the derivative with respect to z is

$$\big( (z - y_1) + \dots + (z - y_i) + (y_{i+1} - z) + \dots + (y_t - z) \big)' = i - (t - i).$$

Analogously, when $z = y_i$ for $i \in \{1, \dots, t\}$, the derivative is $i - 1 - (t - i)$. When $z < y_1$ the derivative is $-t$, and when $z > y_t$ the derivative is t. So if t is odd, then the derivative is zero at $y_{\lceil t/2 \rceil}$, strictly negative before and strictly positive after, so $y_{\lceil t/2 \rceil}$, which is the only median, is the optimal value for z. If t is even, then the derivative is zero on $[y_{t/2}, y_{t/2+1}]$, strictly negative before and strictly positive after. So any value from $[y_{t/2}, y_{t/2+1}]$ is optimal, and we may assume that it is one of the two medians $y_{t/2}$, $y_{t/2+1}$.
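The slope argument for $p = 1$ can be replayed on the values from Fig. 4. The helper names below (`l1_cost`, `slope_between`, `medians`) are ours, and the sketch covers only the unweighted one-dimensional case treated in this part of the proof.

```python
from typing import List, Sequence


def l1_cost(ys: Sequence[float], z: float) -> float:
    """Cost |y_1 - z| + ... + |y_t - z| of placing the coordinate at z."""
    return sum(abs(y - z) for y in ys)


def slope_between(t: int, i: int) -> int:
    """Derivative of the cost on the open interval (y_i, y_{i+1}),
    with 1-indexed i: i terms increase and t - i terms decrease."""
    return i - (t - i)


def medians(ys: Sequence[float]) -> List[float]:
    """The single median for odd t, the two middle values for even t."""
    s = sorted(ys)
    t = len(s)
    if t % 2 == 1:
        return [s[t // 2]]
    return [s[t // 2 - 1], s[t // 2]]


ys = [2, 3, 6, 8]  # the coordinate values used in Fig. 4
# The slope changes sign around the middle interval, where it is zero.
assert [slope_between(len(ys), i) for i in (1, 2, 3)] == [-2, 0, 2]
# Both medians, and every point between them, give the same cost.
assert medians(ys) == [3, 6]
assert l1_cost(ys, 3) == l1_cost(ys, 6) == l1_cost(ys, 4.5) == 9
```

With an odd number of values, for example `[2, 3, 6]`, the slope is never zero on an open interval and the unique median 3 is the only minimizer.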
