Contents lists available atScienceDirect
Journal of Computer and System Sciences
www.elsevier.com/locate/jcss
Parameterized k-Clustering: Tractability island ✩
Fedor V. Fomin, Petr A. Golovach, Kirill Simonov
∗DepartmentofInformatics,UniversityofBergen,ThormøhlensGate55,5008Bergen,Norway
a rt i c l e i nf o a b s t ra c t
Articlehistory:
Received18December2019
Receivedinrevisedform 17August2020 Accepted12October2020
Availableonline19November2020
Keywords:
Clustering
Parameterizedcomplexity k-means
k-median
Ink-ClusteringwearegivenamultisetofnvectorsX⊂ZdandanonnegativenumberD, andweneedtodecidewhether XcanbepartitionedintokclustersC1,. . . ,Ck suchthat thecost
k
i=1 min
ci∈Rd
x∈Ci
x−cipp≤D,
where· p istheLp-norm.Forp=2,k-Clusteringisk-Means.Westudyk-Clustering fromtheperspectiveofparameterizedcomplexity.TheproblemisknowntobeNP-hardfor k=2 andalsoford=2.Itisalong-standingopenquestion,whethertheproblemisfixed- parametertractable(FPT)forthecombinedparameterd+k.Inthispaper,wefocusonthe parameterization byD.We complementtheknown negativeresults byshowingthat for p=0 and p= ∞,k-ClusteringisW[1]-hardwhenparameterizedby D.Interestingly,we discoveratractabilityislandofk-Clustering:foreveryp∈(0,1],k-Clusteringissolvable intime2O(DlogD)(nd)O(1).
©2020TheAuthor(s).PublishedbyElsevierInc.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Recallthatforp
>
0,theMinkowskiorLp-normofavectorx= (
x[
1] , . . . ,
x[
d] ) ∈
Rd isdefinedas xp=
di=1
|
x[
i]|
p1/p.
Respectively,wedefinethe(Lp-norm)distancebetweentwovectorsx
= (
x[
1], . . . ,
x[
d])
and y= (
y[
1], . . . ,
y[
d])
asdistp
(
x,
y) =
x−
ypp=
di=1
|
x[
i] −
y[
i]|
p.
We also consider distp for p
=
0 and p= ∞
. For p=
0, distp is L0 (or the Hamming) distance, that is the number of differentcoordinatesinxand y:✩ ThisworkissupportedbytheResearchCouncilofNorwayviatheproject“MULTIVAL”.Thegrantnumberis263317.
*
Correspondingauthor.E-mailaddresses:[email protected](F.V. Fomin),[email protected](P.A. Golovach),[email protected](K. Simonov).
https://doi.org/10.1016/j.jcss.2020.10.005
0022-0000/©2020TheAuthor(s).PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).
Fig. 1.Optimalclusteringsofthesamesetofvectorswithdifferentdistances:dist1 intheleftsubfigure,dist1/4 intherightsubfigure.Shapesdenote clusters,crossesdenoteclustercentroids.
dist0
(
x,
y) = |{
i∈ {
1, . . . ,
d} |
x[
i] =
y[
i]}| .
Forp
= ∞
,distpis L∞-distance,whichisdefinedas dist∞(
x,
y) =
maxi∈{1,...,d}
|
x[
i] −
y[
i]|.
Thek-Clusteringproblemisdefinedasfollows.Foragiven(multi)datasetofn vectors(points) X
⊂
Zd,thetaskisto findapartitionof X intokclustersC1, . . . ,
Ckminimizingthecost k i=1min
ci∈Rd
x∈Ci
distp
(
x,
ci),
intuitively,ci isacentroidoftheclusterCi.
Inparticular,forp
=
1,distp isthe L1-distanceandthecorrespondingclusteringproblemisknownask-Median.(Often intheliterature,k-MedianisalsousedforclusteringminimizingthesumsoftheEuclideandistances.) Forp=
2,distp is theL2(Euclidean)distance,andthentheclusteringproblembecomesk-Means.Letusnote that optimalclusteringsforthesame setofvectorscan bedrasticallydifferent forvariousvaluesof p,as showninFig.1.Asweshowinthepaper,thecomplexityofk-Clusteringalsostronglydependsonthechoiceofp.
k-Clustering, and especially k-Median and k-Means, are among the most prevalent problems occurring in virtually every subareaof data science. We refer to the survey of Jain [1] for an extensive overview. While inpractice the most commonapproachestoclusteringare basedondifferentvariationsofLloyd’sheuristic[2],theproblemisinterestingfrom the theoretical perspective aswell. Inparticular, thereis a vastamount ofliterature on approximation algorithms fork- Clusteringwhosebehaviorcanbeanalyzedrigorously,seee.g.[3–17].
When it comesto exact solutions,we observethe followingphenomena. While heuristic algorithms fork-Clustering work surprisinglywell in practice,from the perspective of parameterized complexity, k-Clustering is intractable forall previously studiedparameterizations,seeTable1.Thek-Clusteringproblemisnaturally“multivariate”:inadditiontothe number ofpoints n,there are alsoparameters like spacedimension d,number ofclustersk or thecost ofclustering D. TheproblemisknowntobeNP-completefork
=
2 [18,19] andford=
2 [20,21].BytheclassicalworkofInabaetal. [22], inthecasewhenbothd andk areconstants, k-Clustering issolvableinpolynomial time O(ndk+1)
.It isalong-standing openproblemwhetherk-ClusteringisFPTparameterizedbyd+
k.UnderETH,thelowerboundofn(k),evenwhend=
4, wasshownbyCohen-Addadetal.in [23] forthesettingswherethesetofpotentialcandidatecentersisexplicitlygivenas input. Howeverthelower bound ofCohen-Addadetal.doesnot generalizeto thesettingsofthispaperwhereanypoint inEuclideanspacecanserveasacenter.Forthespecialcase,whentheinputconsistsofbinaryvectorsandthedistanceis Hamming,theproblemissolvableintime2O(DlogD)(
nd)
O(1) [24].Ourresultsandapproaches.Inthispaperweinvestigatethedependenceofthecomplexityofk-Clusteringonthecostof clustering D. Itappears thataddingthisnew“dimension” makes thecomplexity landscapeofk-Clusteringintricate and interesting.Moreprecisely,weconsiderthefollowingproblem.
Input: Amultiset X ofnvectorsinZd,apositiveintegerk,andanonnegativenumberD.
Task: DecidewhetherthereisapartitionofXintokclusters
{
Ci}
ki=1andkvectors{
ci}
ki=1,calledcentroids, inRdsuchthat k i=1x∈Ci
dist
(
x,
ci) ≤
D.
k-Clusteringwith distance dist
Letusremarkthatvectorset X (like thecolumnsetofamatrix)cancontainmanyequalvectors. Alsoweconsiderthe situationwhenvectorsfromX areintegervectors,whilecentroidvectorsarenotnecessarilyfromX.Moreover,coordinates ofcentroidscanbereals.
Ourmainalgorithmicresultisthefollowingtheorem.
Theorem1.k-Clusteringwithdistancedistpissolvableintime2O(DlogD)
(
nd)
O(1)foreveryp∈ (
0,
1]
.Thus k-Clusteringwhenparameterized by D isfixed-parametertractable (FPT) forMinkowskidistancedistp oforder 0
<
p≤
1.Inthe firststep ofouralgorithm weuse colorcodingto reducethe problemto ClusterSelection,whichwe findinterestingonitsown.InClusterSelectionwehavet groupsofweightedvectorsandthetaskistoselectexactlyone vectorfromeachgroupsuchthattheweightedcostofthecompositeclusterisatmostD.Moreformally,Input: AsetofmvectorsX giventogetherwithapartition X
=
X1∪ · · · ∪
Xt intot disjointsets,aweight functionw:
X→
Z+,andanonnegativenumberD.Task: Decidewhetheritispossible toselectexactlyone vector xi fromeach set Xi suchthat thetotal costofthecompositeclusterformedbyx1,. . . ,xt isatmostD:
min
c∈Rd
t i=1w
(
xi) ·
dist(
xi,
c) ≤
D.
Cluster Selectionwith distance dist
The ClusterSelection problemis closelyrelatedto variantsof thewell-known ConsensusPatternproblem. Namely, fortheHammingdistance,thedefinitionofClusterSelectionnearlycoincideswiththeColoredConsensusStringswith Outliersproblemstudiedin[25],onlyinthelatterthealphabetisassumedtobeofconstantsize.
Informally (see Theorem 10for the precise statement),our reduction showsthat if the distancenorm satisfies some specific properties (which distp satisfies for all p) and if ClusterSelection is FPT parameterized by D, then so is k- Clustering.Therefore,inorderto proveTheorem1,allwe needistoshow thatClusterSelectionis FPTparameterized by D when p
∈ (
0,
1]
.Thisisthemostdifficultpartoftheproof.HereweinvokethetheoremofMarx[26] onthenumber ofsubhypergraphsinhypergraphsofboundedfractionaledgecover.Superficially,thegeneralideaoftheproofofTheorem1issimilartotheideabehindthealgorithmforBinaryr-Means for L0 from[24]. In both cases,the classical color coding technique ofAlon etal. [27] is used as a preprocessing step.
However, the further steps in [24] strongly exploit the fact that the data is binary. As we will see in Theorem 2, the existenceofanFPTalgorithmfork-ClusteringinL0ishighlyunlikely.Thusthereductionsfrom[24] cannotbeappliedin ourcase,andweneedanewapproach.
Moreprecisely,forclusteringinL0weprovethefollowingtheorem.
Theorem2.Withdistancedist0, k-Clusteringparameterizedbyd
+
D andClusterSelectionparameterizedbyd+
t+
D are W[
1]
-hard.Inparticular,thismeansthatup toawidely-believedassumptionincomplexity that FPT
=
W[
1]
,Theorem2rulesout algorithms solvingk-Clusteringintime f(
d,
D) ·
nO(1)andalgorithmssolvingClusterSelectioninL0 intime g(
t,
d,
D) ·
nO(1)foranyfunctions f(
d,
D)
andg(
t,
d,
D)
.AsimilarhardnessresultholdsforL∞.Theorem3.Withdistancedist∞, k-ClusteringparameterizedbyD andClusterSelectionparameterizedbyt
+
D areW[
1]
-hard.Thisnaturally brings ustothequestion:What happenswithk-Clustering for p
∈ (
1, ∞ )
,especially fortheEuclidean distance,thatis p=
2.Unfortunately,wearenotabletoanswerthisquestion whentheparameterisD only.However,we canprovethatTheorem4.k-ClusteringandClusterSelectionwithdistancedist2areFPTwhenparameterizedbyd
+
D.Thusinparticular,Theorem4impliesthatk-Clusteringwithdistancedist2isFPTparameterizedbyd
+
D.Ontheother hand,weprovethatTheorem5.ClusterSelectionwithdistancedistpisW
[
1]
-hardforeveryp∈ (
1, ∞ )
whenparameterizedbyt+
D.Table 1
Complexityofk-ClusteringandClusterSelection.
distp k-Clustering Cluster Selection
p=0 W[1]-hard param.d+D[Theorem2]
NP-c fork=2 [19] W[1]-hard param.d+t+D[Theorem2]
0<p≤1 2
O(DlogD)(nd)O(1)[Theorem1]
NP-c fork=2 whenp=1 [19]
NP-c ford=2 whenp=1 [20]
2O(DlogD)(nd)O(1)[Theorem15]
W[1]-hard param.t+dforp=1 [Theorem20]
1<p<+∞ FPTparam.d+Dforp=2 [Theorem4]
NP-c fork=2 whenp=2 [18]
NP-c ford=2 whenp=2 [21]
FPTparam.d+Dforp=2 [Theorem4]
W[1]-hard param.t+D[Theorem5]
p= ∞ W[1]-hard param.D[Theorem3]
NP-c fork=2 [Theorem30] W[1]-hard param.t+D[Theorem3]
In particular, Theorem 5 yields that the approach we used to establish the tractability (with parameter D) of k- Clusteringforp
=
1 willnotworkfor p>
1.We summarize our and previously known algorithmic and hardness results for k-Clustering and ClusterSelection withdifferentdistancesinTable 1.Observethat Theorem10worksalsointhesettingwherepossible clustercentersare restricted to be from a set givenin the input, and so doour algorithmic Theorems 1 and4 since ClusterSelection is triviallysolvableinpolynomialtimeinthissetting.
NowwediscussthechoiceoftheparameterD.ItmightbenotedthattheregimewherethecostofclusteringD issmall comparedtothenumberofpointsn,isquitespecial.Indeed,ifthecostofclusteringisatmostD,thentherearebutafew pointsthatarenotequaltotherespectiveclustercenters.Thus,theproblemwestudyhasthespiritofaneditingproblem:
check whether a given instanceis close to a “structured”one, where in our casea “structured” instance hasat mostk distinct points, and closeness is measured via the sumof Lp-distances. Editing problems are extensivelystudied in the parameterizedalgorithmsliterature,rangingfromthevastareaofgraphmodification(seee.g.arecentsurveybyCrespelle et al. [28]) to studies very closeto ours, like theConsensusPatterns algorithm by Marx [26], andthe study ofBinary r-Meansby Fominetal. [24] thatis essentiallya specialcaseofourk-Clusteringproblem. Andstill, eveninthishighly structured regime, our results show a very intricate picture: forinstance, fork-Clustering parameterized just by D,we provideahighlynon-trivialFPTalgorithminthecase0
<
p≤
1.Whileontheother hand,conditionally,thesamescheme could notlead toan analogousalgorithm inthecase p=
2,andthere couldnot beanyFPT algorithmatallinthe cases p=
0 andp= ∞
.Finally webelievethat studyingk-Clusteringwithrespecttotheparameter D isan essentialquestion providedthe notorioushardness oftheproblem.Recall thatforthecombinationofthetwoother naturalparameters, the dimensiondandthenumberofclustersk,onlya O(
ndk+1)
algorithmofInabaetal.isknown [22],andthehardnessresult byCohen-Addadetal.in [23] servesasastrongindicationthatabetteralgorithmmightnotexist.Observethatwealwaysconsiderinteger-valuedinstances.Webelievethisisthemostnaturalmodelforstudyingcom- plexity ofk-Clusteringwithrespectto theparameter D. Hereitis importanttonote thatconsidering D asa parameter only makes sense ifthe input values are suitablydiscretized. Imagineinput vectorscould have arbitraryreal-valued (or rational-valued)entries, thenfora giveninstanceit isalways possibleto scalethe valuesdown bythe samefactorsuch thatthecostofanoptimalclusteringisarbitrarilysmall,butthestructureoftheinstanceiscompletelypreserved.Thusthe restrictiontointegervaluesinourstudyisanaturaldiscretizationoftheproblem.Itallowstheparameter D tobeardeep structuralsignificance,asourresultsdemonstrate.
Theremaining partofthispaperisorganizedasfollows.Section 2containspreliminaries.InSection 3we proveTheo- rem10whichprovidesuswithFPTTuringreductionfromk-ClusteringtoClusterSelection.Theorem10appearstobe a handy tool toestablish tractabilityof k-Clustering. InSection 4we collect theresults onclusteringwith Lp-normfor p
∈ (
0,
1]
.Inparticular,inSubsection4.1,weproveTheorem1,themainalgorithmicresultofthiswork,statingthatwhen p∈ (
0,
1]
,k-ClusteringandClusterSelectionadmit FPTalgorithmswithparameter D.InSubsection4.2wecomplement thealgorithmicupperboundswithlower boundsbyprovingthatClusterSelectionisW[
1]
-hardwhen p=
1 andparam- eterist+
d(Theorem20).In Section5,we considerthe case p=
0 andprove Theorem2establishing W[
1]
-hardness of k-ClusteringandClusterSelection.Section6isdevotedtothecasep= ∞
.Hereweestablishtwohardnessresultsabout k-Clustering:W[
1]
-hardness when parameterizedby D andNP-hardnessin thecasek=
2.In Section 7, welook atthe case p∈ (
1, ∞ )
,withtheparticularemphasison themostcommonlyusedcase p=
2.Weshow that whend+
D is the parameter, thenClusterSelectionandk-Clusteringinthe L2 distanceareFPT.Wealso showthat ClusterSelectionis W[
1]
-hardwhenparameterizedbyt+
D forall p∈ (
1, ∞ )
.WeconcludewithopenproblemsinSection8.2. Preliminariesandnotation
Clusternotation. Bya cluster we always mean a multisetof vectorsin Zd. Fordistance dist, the cost ofa givencluster C is the total distance fromall vectors in the cluster to the optimally selected cluster centroid, minc∈Rd
x∈Cdist
(
x,
c)
. An optimal cluster centroid for a given cluster C is any c∈
Rd minimizingx∈Cdist
(
x,
c)
. For most of the considered distances,wearguethatanoptimalclustercentroidcouldalwaysbechosenamongaspecificfamilyofvectors(e.g.integral).Wheneverweshowthis,weonlyconsideroptimalclustercentroidsofthestatedformafterwards.
Complexity. Aparameterizedproblem isa language Q
⊆
∗×
Nwhere∗ isthe setof strings overa finite alphabet
. Respectively,an input of Q isa pair
(
I,
k)
where I⊆
∗ andk∈
N; k isthe parameteroftheproblem. A parameterized problemQ isfixed-parametertractable(FPT)ifitcanbedecidedwhether(
I,
k) ∈
Q intime f(
k) ·|
I|
O(1)forsomefunction f thatdependsoftheparameterkonly.Respectively,theparameterizedcomplexityclassFPTiscomposedbyfixed-parameter tractable problems.The W-hierarchyisacollectionofcomputationalcomplexityclasses:we omitthetechnicaldefinitions here. Thefollowingrelationisknownamongst theclassesintheW-hierarchy:FPT=
W[
0] ⊆
W[
1] ⊆
W[
2] ⊆ . . . ⊆
W[
P]
.It iswidelybelievedthatFPT=
W[
1]
,andhenceifaproblemishardfortheclassW[
i]
(foranyi≥
1)thenitisconsideredto befixed-parameterintractable.Werefertobooks[29,30] forthedetailedintroductiontoparameterizedcomplexity.WealsoprovideconditionallowerboundsbymakinguseofthefollowingcomplexityhypothesisformulatedbyImpagli- azzo,Paturi,andZane[31].
ExponentialTimeHypothesis(ETH):Thereisapositiverealssuchthat3-CNF-SATwithnvariablesandmclausescannot besolvedintime2sn
(
n+
m)
O(1).Graphs.In ourW
[
1]
-hardness proofs, we heavily employgraph-theoretical notation.Whenever we workwitha graph G, wealwaysfixsomeorderingontheverticesπ
V:
V(
G) → {
1, . . . , |
V(
G) |}
andontheedgesπ
E:
E(
G) → {
1, . . . , |
E(
G) |}
.We dropπ
V andπ
E tosimplifynotation, sowhen weconsider avertex v∈
V(
G)
oranedge e∈
E(
G)
, v ande alsodenote integers—numbersofv andeaccordingtotheorderingsπ
V andπ
E correspondingly.Realcomputations. Since we deal with the problem concerning real-valued matrices, we express the running time of algorithms intermsofnumberofoperationsoverthereals.Thisisnaturalsincetocompute Lp-distanceswehavetodeal withnumbersofformxp wherexisanintegerandpisanyrealnumber.However,inspecialcasestheboundsholdeven for morerestrictive models, e.g. when p
=
1 or p=
2 thealgorithms operate only onintegers of polynomially bounded length.3. Fromk-CLUSTERINGto CLUSTERSELECTION
Inthissection wepresentageneralschemeforobtainingan FPTalgorithmparameterizedby D,whichislaterapplied tovariousdistances.
First,weformalizethefollowingintuition:thereisnoreasontoassignequalvectorstodifferentclusters.
Definition6(Initialclusterandregularpartition).Foramultisetofvectors X,an inclusion-wisemaximalmultisetI
⊂
X such thatallvectorsinI areequaliscalledaninitialcluster.Wesaythataclustering
{
C1, . . . ,
Ck}
ofX isregularifforeveryinitialclusterI thereisai∈ {
1, . . . ,
k}
suchthatI⊂
Ci. Nowweprovethatitsufficestolookonlyforregularsolutions.Proposition7.Let
(
X,
k,
D)
beayes-instanceto k-Clustering.Thenthereexistsasolutionof(
X,
k,
D)
whichisaregularclustering.Proof. Letusassume that theinstance
(
X,
k,
D)
has a solution.There are k clusters{
Ci}
ki=1 andk vectors{
ci}
ki=1 in Rd such that ki=1
x∈Cidist
(
x,
ci) ≤
D. Notethat forevery x∈
Cj, dist(
x,
cj) ≥
min1≤i≤kdist(
x,
ci)
.So ifwe considera new clustering{
C1, . . . ,
Ck}
withthe samecentroids,whereCj areall vectorsfrom X forwhichcj istheclosest centroid,the total distancedoesnot increase.Ifwe alsobreak tiesinfavorofthe lower index, thenforanyinitial cluster I the same centroidciwillbetheclosest,andallvectorsfromI willendupinCi,so{
C1, . . . ,
Ck}
isaregularclustering.Fromnowon,weconsideronlyregularsolutions.
Definition8(Simpleandcompositeclusters).Wesaythatacluster C issimpleifitisaninitialcluster.Otherwise,thecluster iscomposite.
Nextwestateapropertyofk-Clusteringwithaparticulardistance,whichisrequiredforthealgorithm.Intuitively,each uniquevectoraddsatleastsomeconstanttotheclustercost.
Definition9(
α
-property).Wesaythat adistancehastheα
-propertyforsomeα >
0 ifforanysthecostofanycomposite clusterwhichconsistsofsinitialclustersisatleastα (
s−
1)
.In the subsequent sections we show that the
α
-property holds for all the distance measures for which we present algorithms.Namely,theLp-distancehastheα
-propertywithacertainconstantα
,foreach p∈ [
0,
1] ∪ {
2, ∞}
.AnalogouslyFig. 2.AnillustrationofthealgorithminTheorem10.WestartwithaparticularrandomcoloringandaparticularpartitionofcolorsP= {P1,P2},where P1= { , }andP2= { , , }.WemaketwocallstoClusterSelectionwithrespecttoP1andP2andconstructtheresultingclustering.Intheexample, allinputvectorsaredistinct.
tothecase p
=
2,onecanshowthatitholdsforallothervaluesofpbetween1 and∞
aswell,althoughwedonotneed thisfact.The ClusterSelection problem defined inthe introduction is a key subroutine in our algorithm. In some cases the problemissolvabletrivially,butitpresentsthemain challengeforourmainalgorithmic resultwiththe L1 distance.The intuitionto theweightfunctioninthedefinitionofClusterSelectionisthat itrepresentssizes ofinitialclusters,that is, howmanyequalvectorsarethere.
We also need a procedure to enumerate all values of the cost ofeach possible cluster, withrespect to an optimally selected cluster centroid, that are at most D.It may not be straightforward since not all distances in ourconsideration are integer.So forthepurposeofstatingTheorem 10forgeneralmetrics,we assumethat theset ofallpossible optimal clustercostswhicharelessthanD isalsogivenintheinput.FortheLp-distancesweconsider,intherespectivealgorithmic theoremsweshowhowtoprovidethissetwithoutraisinganyadditionalassumptionsorincreasingtherunningtime.Now wearereadytostatetheresultformally.
Theorem10.Assumethatthe
α
-propertyholds,ClusterSelectionissolvableintime(
m,
d,
t,
D)
,whereisanon-decreasing functionofitsarguments,andwearegiventhesetDofallpossibleoptimalclustercostswhichareatmostD.Then k-Clusteringis solvableintime
2O(DlogD)
(
nd)
O(1)|
D|(
n,
d,
2D/ α ,
D).
Proof. Bythe
α
-property,inanysolutionthereareatmostD/ α
compositeclusters,sinceeachcontainsatleasttwoinitial clusters.Moreover,thereareatmost2D/ α
initialclustersinallcompositeclusters.ThusbyProposition7,solvingk-ClusteringisequivalenttoselectingatmostT
:=
2D/ α
initialclustersandgrouping themintocompositeclusterssuchthatthetotalcostoftheseclustersisatmost D.Wedesignanalgorithmwhich,taking asasubroutineanalgorithmforClusterSelection,solvesk-Clustering.ThealgorithmissketchedinFig.3,anexampleis showninFig.2.Toperformtheselection andgrouping,ouralgorithm usesthecolorcodingtechnique ofAlon,Yuster,andZwickfrom [27]. Consider the input as a family of initial clusters I. We color initial clusters from I independently and uniformly at random by T colors 1,2, . . . , T. Consider anysolution, andthe particular set ofat most T initial clusters whichare includedintocompositeclustersinthissolution.Theseinitialclustersarecoloredbydistinctcolorswithprobabilityatleast
T!
TT
≥
e−T.Nowweconstructanalgorithmforfindingacolorfulsolution.Weconsiderallpossiblewaystosplitcolorsbetweenclusters(somecolorsmaybeunused).Henceweconsiderallpos- siblefamilies P
= {
P1, . . . ,
Ph}
ofpairwise disjointnon-emptysubsetsof{
c∈ {
1, . . . ,
T} :
there exists J∈
Icolored byc}
. EachfamilyP correspondstoapartitionofthesetofcolors{
1, . . . ,
T}
ifweaddonefictitioussubsetforcolorswhichare notusedinthecompositeclusters.ThetotalnumberofpartitionsdoesnotexceedTT=
2O(DlogD).k-Clustering(X,k,D,α,D)
Input :AmultisetX⊂Zd,apositiveintegerk,realnonnegativevaluesDandα,asetD,analgorithmAforClusterSelection Output:YesorNo
1 T← 2D/α
2 I←initial clusters ofX 3 foreTiterationsdo
4 FixarandomcoloringcofIwithcolors{1,. . . ,T} 5 forvalidpartitionsPof{1,. . . ,T}do
6 fori=1to|P|do 7 Pi= {i1,. . . ,it} 8 forj=1totdo
9 Xj← ∅
10 for J∈I:c(J)=ijdo
11 x←a point from J
12 Xj←Xj∪ {x}
13 w(x)← |J|
14 di←D+1
15 foreachd∈Ddo
16 ifA( X1,. . . ,Xt,w,d)then
17 di←d
18 BREAK
19 ift
i=1di≤Dthen
20 Yes,STOP
21 No,STOP
Fig. 3.k-Clusteringalgorithm from Theorem10.
Whenpartition P isfixed,weformclustersbysolvinginstances ofClusterSelection: Foreach i
∈ {
1, . . . ,
h}
,wetake initial clusters colored by elements of Pi, bundle together those with the same color, and pass the resulting familyto ClusterSelection.Firstnotethattherecannot be P∈
P ofsizeatmostone,sincethenClusterSelectionhastomakea simpleclusterwhileweassume thatallclustersobtainedfromP arecomposite.Second,thetotalnumberofclustershas to bek,the numberofclustersis|I| −
P∈P
|
P| + |P|
. Foreach P we check thatboth conditionshold, andifnot, we discardthechoiceofP andmovetothenextone,beforecallingtheClusterSelectionsubroutine.Next,weformalizehowwecalltheClusterSelectionsubroutine.Wefixthesetofcolors Pi
= {
c1, . . . ,
ct}
,thentakethe sets Ij= {
J∈
I:
Jis colored bycj}
for j∈ {
1, . . . ,
t}
.Weturneachsetofinitialclusters Ij intoasetofweighted vectors Xj naturally: Foreach J∈
Ij,we putone vector x∈
J into Xj, and w(
x) := |
J|
. Thefamilyof setsofvectors X1,. . . , Xt andtheweightfunctionwaretheinputforClusterSelection.Thenwesearchfortheminimumclustercostbounddi≤
D fromD,forwhichtheinstance(
X1, . . . ,
Xt,
di)
ofClusterSelectionisayes-instance,runningeachtimethealgorithmfor ClusterSelection.Ifforsomei settingdito D leadstoano-instance,orifh
i=1di
>
D,thenwediscardthechoiceofthepartitionPand movetothenextone.Otherwise,wereportthatk-Clusteringhasasolutionandstop.Next,weprovethatinthiscasethe solutionindeedexists.We reconstruct the solutionto k-Clustering asfollows: For each i
∈ {
1, . . . ,
h}
the corresponding to Pi= {
c1, . . . ,
ct}
instanceof ClusterSelectionhasasolution{
x1, . . . ,
xt}
.Foreach j∈ {
1, . . . ,
t}
, considerthecorresponding initialcluster Jj consistingof w(
xj)
vectorsequaltoxj.Foreach i∈ {
1, . . . ,
h}
we obtainacompositecluster∪
tj=1Jj,allother clusters are simple. So the total cost is hi=1di, which isat most D. Thus, ifthe algorithm finds a solution, then
(
X,
d,
D)
isa yes-instance.Intheopposite direction.Ifthereisasolutiontok-Clustering,thenthereisaregularsolution,andwithprobabilityat leaste−T initialclusterswhicharepartsofcompositeclustersinthissolutionarecoloredbydistinctcolors.Then,thereisa partitionP
= {
P1, . . . ,
Ph}
whichcorrespondstothissolution.Thispartitionisobtainedasfollows:putinto P1 colorsfrom thefirstcompositecluster,intoP2 fromthesecondandsoon.Atsome pointouralgorithmchecksthepartitionP,andas itfindstheoptimalcostvalueforeachcluster,thenitisatmostthecostofthecorrespondingclusterofthesolutionfrom whichwestarted.Toanalyzetherunningtime,weconsider 2O(DlogD) partitionsP,foreachP we
|P| =
O(
D)
timessearchforoptimal di.And foreachof|D|
possiblevalues1 ofdi we makeone callto theClusterSelection algorithm,whichtakestime at most(
n,
d,
T,
D)
.Toamplifythe errorprobability tobe atmost1
/
e,we doN=
eTiterationsofthe algorithm,eachtime withanew randomcoloring.Aseachiterationsucceedswithprobabilityatleaste−T,theprobabilityofnotfindingacolorful solution afterN iterationsisatmost(
1−
e−T)
eT≤
e−1<
1.Sothetotalrunningtimeis2O(DlogD)· (
nd)
O(1)|D|(
n,
d,
2D/ α ,
D)
.1 Wecouldalsobinarysearchforthe optimaldi∈D instead,thusreplacing|D|bylog|D|intherunningtime.However,forallchoicesofD we considerthisdoesnotmakeadifference.
0 2 4 6 8 10 5
10 15 20
z
cost
0 2 4 6 8 10
4 6 8 10
z
(a) cost(z)= |z−2| + |z−3| + |z−6| + |z−8| (b) cost(z)= |z−2|1/2+ |z−3|1/2+ |z−6|1/2+ |z−8|1/2
Fig. 4.Graphsofclustercostoverdifferentvaluesofz:dist1intheleftplot,dist1/2intherightplot.Thesetofcoordinatevaluesisgivenasy1=2,y2=3, y3=6,y4=8.
The algorithmcould be derandomizedbythe standardderandomization technique usingperfecthash families[27,32].
Sok-Clusteringissolvableinthesamedeterministictime.
4. Algorithmsandcomplexityfordistanceswithp∈(0,1]
The mainmotivationfortheresultsinthissection isthestudy ofk-Clusteringwiththe L1 distance,the casewidely knownask-Medians.However,ourmainalgorithmicresultalsoextendstodistancesoforderp
∈ (
0,
1)
sinceinsomesense theybehavesimilarlytotheL1 distance.4.1. FPTalgorithmwhenparameterizedbyD
Inthissubsection,weproveTheorem1:when p
∈ (
0,
1]
,k-ClusteringadmitsanFPTalgorithmwithparameter D.First we state basicgeometricalobservations forcases p=
1 and p∈ (
0,
1)
. Thenwepropose a generalalgorithmforCluster Selectionwhichreliesonlyontheseproperties.Finally,weshowhowTheorem10couldbeapplied.The next two claims deal with the structure of optimal cluster centroids. We state and prove them in the case of weighted vectorswhereeach vector hasa positiveinteger weightgivenby aweight function w.The unweighted caseis justaspecialcasewhentheweightofeachvectorisone.
First, we show that coordinates of cluster centroids could always be selected amongthe values presentin theinput, whichhelpsgreatlyinenumeratingclustercentroidsthatmaybeoptimal.
Claim11.Assumep
∈ (
0,
1]
,letC= {
x1, . . . ,
xt}
beaclusterandw: {
x1, · · · ,
xt} →
Z+beaweightfunction.Thereisanoptimal (subjecttotheweighteddistancew(
xi) ·
distp(
xi,
c)
)centroidc ofC suchthatforeachi∈ {
1, . . . ,
d}
,thei-thcoordinatec[
i]
ofthe centroidisfromthevaluespresentintheinputinthiscoordinate,thatisc[
i] ∈ {
x1[
i], . . . ,
xt[
i]}
.Moreover,forp=
1wemayassume thattheoptimalvalueisaweightedmedianofthevaluespresentinthei-thcoordinate.Proof. Forcluster C,considerthecorrespondingmultisetofunweightedvectorsC
= {
x1, . . . ,
xt}
,whereeachvector x∈
C isrepeated w(
x)
times.Wedefine yj=
xj[
i]
for j∈ {
1, . . . ,
t}
.Assumethat y1≤
y2≤ · · · ≤
yt.Letusconsideran optimal clustercentroidc forC anddenotez=
c[
i]
.Fig.4showshowtheclustercostbehaves withrespecttoz onaconcreteset ofvalues{
yi}
forp=
1 and p=
1/
2.Fortheformalproof,westartwiththecasep
=
1.ThetotalcostofC contributedbythei-thecoordinateis|
y1−
z| + |
y2−
z| + · · · + |
yt−
z| .
Ifz
∈ (
yi,
yi+1)
fori∈ {
1, . . . ,
t−
1}
,thenthederivativewithrespecttoz is((
z−
y1) + · · · + (
z−
yi) + (
yi+1−
z) + · · · + (
yt−
z))
=
i− (
t−
i).
Analogously, when z