
The Cost Function

We define the cost function for GHAS to be

    J(\theta, U) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \, d(x_i, \theta_j)    (2.3)

where d(x_i, θ_j) is a dissimilarity measure between data vector x_i and cluster C_j, parameterized by its associated parameter "vector" θ_j. Note that the parameter "vector" might contain a combination of scalars, vectors and matrices; for instance, for a multivariate normal distribution it might contain the expectation vector and covariance matrix of the distribution. It might also contain so-called cluster representatives: one or more data vectors which in some way are representative of the cluster.

Ideally, a cost function gives a large value when we make incorrect assignments. In unsupervised learning the correct labels are unknown, and in practice our choice of cost function will define what kind of clusters we are looking for.
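As a simple illustration, the cost in Eq. (2.3) can be evaluated directly once a hard assignment and a dissimilarity measure have been chosen. The sketch below is in Python/NumPy; the function name ghas_cost and its argument names are illustrative only and not part of the GHAS formulation itself.

```python
def ghas_cost(X, U, thetas, d):
    """Evaluate J(theta, U) = sum_i sum_j u_ij * d(x_i, theta_j), cf. Eq. (2.3).

    X      : (n, l) array of data vectors.
    U      : (n, k) hard membership matrix (exactly one 1 per row, rest 0).
    thetas : sequence of k cluster parameter "vectors".
    d      : callable d(x, theta) giving the dissimilarity between a data
             vector and a cluster's parameter vector.
    """
    n, k = U.shape
    return sum(U[i, j] * d(X[i], thetas[j])
               for i in range(n) for j in range(k))
```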

The Iterative Algorithm

Let us fix the parameter vectors θ_j, j = 1, . . . , k. Since for each x_i only one u_ij is non-zero, it is easy to see that we minimize the cost function in Eq. (2.3) by assigning x_i to the closest cluster, as defined by the specified dissimilarity measure. Thus

    u_{ip} = \begin{cases} 1, & \text{if } d(x_i, \theta_p) = \min_{j=1,\dots,k} d(x_i, \theta_j) \\ 0, & \text{otherwise} \end{cases}    (2.4)

Now, having found a cluster assignment, we keep the membership functions constant and use the points assigned to each cluster to estimate its associated parameter vector θ_j. We do this by minimizing the cost function with respect to each of the parameter vectors. Taking the derivative with respect to a particular parameter vector θ_p gives

    \frac{\partial J(\theta, U)}{\partial \theta_p} = \sum_{i=1}^{n} u_{ip} \frac{\partial d(x_i, \theta_p)}{\partial \theta_p}

and equating this to 0 gives

    \sum_{i=1}^{n} u_{ip} \frac{\partial d(x_i, \theta_p)}{\partial \theta_p} = 0,

which can be further simplified by using the expression for u_ip from (2.1):

    \sum_{x_i \in C_p} \frac{\partial d(x_i, \theta_p)}{\partial \theta_p} = 0, \qquad p = 1, \dots, k    (2.6)

The iteration procedure for GHAS can then be summarized:

1. Initialize the parameter vectors.

2. Assign data vectors to the nearest cluster.

3. Update the estimate of the parameter vector of each cluster based on the data assigned to it.

4. Repeat Steps 2 and 3 until some convergence criterion is met.
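The four steps above can be written as a single loop once the dissimilarity and the parameter update are supplied. The following is a minimal Python/NumPy sketch of the GHAS iteration, assuming hypothetical helpers d (the dissimilarity), update_theta (the per-cluster parameter estimate solving Eq. (2.6)) and init_thetas (the initial parameter vectors); it is an illustration, not a reference implementation.

```python
import numpy as np

def ghas(X, k, d, update_theta, init_thetas, max_iter=100):
    """Generic hard clustering scheme: alternate assignment and update steps."""
    thetas = list(init_thetas)          # Step 1: initial parameter vectors
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each data vector to the closest cluster, Eq. (2.4).
        new_labels = np.array([np.argmin([d(x, t) for t in thetas]) for x in X])
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: re-estimate each cluster's parameters from its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:        # leave empty clusters unchanged
                thetas[j] = update_theta(members)
    return labels, thetas
```

For k-means, d would be the squared Euclidean distance and update_theta the sample mean, as derived in the next subsection.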

2.2.2 The K-Means Algorithm as a Special Case of GHAS

The k-means algorithm is a hard algorithmic clustering method and can be seen as a special case of the Generalized Hard Algorithm Scheme (GHAS).

Note that we will see later, in Section 2.3.3, that k-means can also be seen as a special case of the Expectation-Maximization algorithm. Because k-means is intuitive and simple, finding the special conditions under which another algorithm is equivalent to k-means can provide valuable insight into an otherwise complicated clustering algorithm.

Assumptions

As a special case of GHAS, k-means uses a hard membership function as defined in Eq. (2.1) and Eq. (2.2). It also characterizes a cluster by a single point representative; thus the parameter vector θ_j is a data vector assumed to belong to cluster C_j. This also gives us the option of choosing the dissimilarity measure between point and cluster to be a distance measure between two points.

The k-means algorithm can be derived from the cost function in Eq. (2.3) if the dissimilarity measure is the squared Euclidean distance between the data vector x_i and cluster representative θ_j:

    d(x_i, \theta_j) = \|x_i - \theta_j\|^2 = \sum_{p=1}^{l} [x_i(p) - \theta_j(p)]^2

where || · || denotes the Euclidean norm. Some write the Euclidean norm as || · ||_2 to indicate the use of an L2-norm, but here the subscript will be dropped to simplify the notation. Hence || · || = || · ||_2, and the subscript will be reserved for cases where we use other norms. In vector notation we can write this (dropping the subscripts on x and θ) as

    \|x - \theta\|^2 = (x - \theta)^T (x - \theta)    (2.7)

Inserting this into the cost function in Eq. (2.3) gives

    J(\theta, U) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \, \|x_i - \theta_j\|^2    (2.8)
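The identity in Eq. (2.7), and the resulting cost in Eq. (2.8), are easy to check numerically. Below is a small illustrative snippet (NumPy; the vectors are arbitrary example values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 1.5, 2.5])

# Squared Euclidean distance as a sum of squared components ...
d_sum = np.sum((x - theta) ** 2)
# ... and as the inner product (x - theta)^T (x - theta), cf. Eq. (2.7).
d_dot = (x - theta) @ (x - theta)
assert np.isclose(d_sum, d_dot)

# The k-means cost of Eq. (2.8) is then the GHAS cost of Eq. (2.3) with this
# dissimilarity, e.g. using the ghas_cost sketch from earlier:
# J = ghas_cost(X, U, thetas, d=lambda x, t: np.sum((x - t) ** 2))
```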

Before we proceed to minimize this cost function, we will look at the initialization of the k-means algorithm.

Initialization

The name of the algorithm comes from the fact that we seek a solution involving k clusters. This is a weakness of k-means and several other clustering algorithms, as we might not know in advance the number of clusters we are seeking. However, due to its low computational complexity, it is possible to run the algorithm several times for different values of k and evaluate the results.

One way of achieving this is to optimize the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). An algorithm that uses AIC or BIC and does not require k as an input parameter is presented in [29] as X-means. This provides a solution which seeks to optimize both the cost function and the number of clusters according to the chosen criterion.

In general these strategies seek to compensate for the fact that the overall cost decreases as the number of clusters increases. In the extreme case, when the number of clusters k is equal to the number of data points n, each point will be in its own one-point cluster and the cost defined in Eq. (2.8) will be zero. We will not focus on this here and simply assume that we know the number of clusters k.

Having chosen a k, we need to initialize the k cluster representatives θ_j. This is usually done by randomly drawing (without replacement) k of the data vectors. As k-means might converge to a local minimum, it can be beneficial to re-run the algorithm with different initializations.
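A sketch of this initialization step, drawing k distinct data vectors at random, might look as follows (Python/NumPy; the function name is illustrative):

```python
import numpy as np

def init_representatives(X, k, seed=None):
    """Draw k distinct data vectors as initial cluster representatives."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)   # without replacement
    return X[idx].copy()
```

Re-running the algorithm then amounts to calling this with different seeds and keeping the solution with the lowest cost.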

The Algorithm

We now fix the (initial) cluster representatives and wish to minimize the cost function in Eq. (2.8) with respect to the cluster assignments u_ij. As mentioned for GHAS, it is easy to see that this is done by assigning each data vector to its closest cluster representative. Then Eq. (2.4) can be written as

    u_{ip} = \begin{cases} 1, & \text{if } \|x_i - \theta_p\|^2 = \min_{j=1,\dots,k} \|x_i - \theta_j\|^2 \\ 0, & \text{otherwise} \end{cases}    (2.9)

We then fix the cluster assignments and optimize with respect to the cluster representatives. For a given cluster representative θ_p we do this by taking the derivative with respect to θ_p and equating it to zero:

    \frac{\partial J(\theta, U)}{\partial \theta_p} = \frac{\partial}{\partial \theta_p} \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \|x_i - \theta_j\|^2 = -2 \sum_{x_i \in C_p} (x_i - \theta_p) = 0

By denoting the number of data vectors in cluster C_p as n_p and using that n_p = \sum_{x_i \in C_p} 1, we get the final expression for updating the cluster representatives:

    \theta_p = \frac{1}{n_p} \sum_{x_i \in C_p} x_i

We recognize this as the mean of the data vectors assigned to cluster C_p. This is done for each of the k cluster representatives, hence the name k-means.
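Putting the assignment step of Eq. (2.9) and the mean update above together gives the familiar k-means loop. The following is a compact Python/NumPy sketch under the same assumptions as the earlier snippets (illustrative only, not a reference implementation):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=None):
    """k-means: squared Euclidean dissimilarity, mean updates."""
    rng = np.random.default_rng(seed)
    thetas = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step, Eq. (2.9): closest representative in squared
        # Euclidean distance; dists has shape (n, k).
        dists = ((X[:, None, :] - thetas[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each representative becomes the mean of the data
        # vectors assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                thetas[j] = members.mean(axis=0)
    return labels, thetas
```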

Here we started out with an assumption of a hard membership function and k clusters that we wanted to represent (parametrize) by a single point
