K-means clustering (k-means for short), also known as Forgy's algorithm [6], is one of the best-known methods for data clustering. The goal of k-means is to find k points of a dataset that can best represent the dataset in a certain mathematical sense (to be detailed later). These k points are also known as cluster centers, prototypes, centroids, or codewords. After obtaining these cluster centers, we can use them for numerous tasks, including:
- Data compression: We can use these cluster centers to represent the original dataset. Since the number of centers is much smaller than the size of the original dataset, the goal of data compression is achieved.
- Data classification: We can use these cluster centers for data classification, so that the computational load is reduced and the influence of noisy data is suppressed.
K-means is also a method of partitional clustering, in which we must specify the number of clusters before starting the clustering process. Suppose the number of clusters is k. We can then define an objective function as the sum of squared distances between each data point and its nearest cluster center, and follow an iterative procedure that lowers the value of the objective function at each iteration by finding a new set of cluster centers.
More specifically, suppose that the dataset is to be divided into k clusters. Let the data points in cluster p be represented by a set Gp with cluster center cp. Then the squared error of cluster p can be defined as
$$ e_p=\sum_{x_i \in G_p} \| x_i-c_p \|^2. $$
We can then define the objective function as the sum of the squared errors over all k clusters:
$$ E=\sum_{p=1}^{k} e_p. $$
The above objective function is also referred to as the distortion incurred when using these cluster centers to represent the whole dataset. The objective function obviously depends on the cluster centers, so we need a systematic method for identifying the clusters and their corresponding centers that minimize E. In other words, the problem of clustering can be formulated as a problem of optimization, which requires more rigorous notation, as explained next.
Suppose that the dataset X consists of n vectors, X = {x1, ..., xn}, where each vector is of dimension d. The goal of k-means is to divide the dataset into k clusters with cluster centers C = {c1, ..., ck} such that the following objective function is minimized:
$$ J(X; C, A)=\sum_{p=1}^{k} \sum_{x_i \in G_p} \| x_i-c_p \|^2, $$
where Gp is the set of data points assigned to cluster p and cp is the center of Gp. In the above objective function, A is a k×n membership matrix (or partition matrix) whose entries are 0 or 1, with A(i, j) = 1 if and only if data point j belongs to cluster i. Note that each column of A contains exactly one 1, since each data point belongs to exactly one cluster. Also note that A appears on the right-hand side of the above equation only implicitly, through the cluster assignments Gp.
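As a small illustration (not part of the original text), the following numpy sketch evaluates this objective function; it uses a label vector in place of the 0/1 membership matrix A, which carries the same assignment information:

```python
import numpy as np

def distortion(X, C, labels):
    """Sum of squared distances between each data point and its assigned center.

    X: (n, d) data matrix; C: (k, d) matrix of cluster centers;
    labels: length-n integer array, labels[i] = index of the cluster that x_i belongs to
    (a label vector is used here in place of the 0/1 membership matrix A).
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    return float(np.sum((X - C[labels]) ** 2))
```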
It would be rather difficult to optimize the above objective function directly, since it involves the optimization of d*k (for matrix C) plus k*n (for matrix A) tunable parameters, subject to the constraint that each column of A contains exactly one 1. However, we can make two observations:
- When C (the cluster centers) is fixed, we can easily find the A that minimizes the objective function J(X; C, A). This optimal A corresponds to assigning each data point to the cluster whose center is nearest to it.
- When A (the membership matrix) is fixed, we can directly find the C that minimizes the objective function J(X; C, A). Since the objective function is a sum of squared errors, the minimizing center of each cluster is the mean vector of all the data points in that cluster.
According to the above observations, we can describe the k-means clustering algorithm as follows (a minimal code sketch is given after the steps):
1. Randomly select k data points as the initial centers of the k clusters. These centers form the columns of C.
2. Generate the best membership matrix A based on the given C. In other words, assign each data point x to the cluster whose center is nearest to x.
3. Compute the objective function J(X; C, A). Stop the iteration if the objective function does not improve much over the previous iteration.
4. Generate the best C based on the given A.
5. Repeat steps 2 to 4 until a maximum number of iterations is reached.
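The following is a minimal Python/numpy sketch of these steps; the function name kmeans and its interface are illustrative assumptions, not the original implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-9, rng=None):
    """Batch k-means following the steps above (a sketch, not the original code).

    Returns the centers C, the cluster index of each point, and the distortion
    J recorded at every iteration.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(rng)
    n = len(X)
    # Step 1: pick k distinct data points as the initial centers.
    C = X[rng.choice(n, size=k, replace=False)].copy()
    J_history = []
    for _ in range(max_iter):
        # Step 2: assignment -- each point joins the cluster of its nearest center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
        labels = d2.argmin(axis=1)
        # Step 3: evaluate the objective; stop when it no longer improves.
        J = d2[np.arange(n), labels].sum()
        if J_history and J_history[-1] - J <= tol:
            break
        J_history.append(J)
        # Step 4: update -- each center becomes the mean vector of its members.
        for p in range(k):
            members = X[labels == p]
            if len(members) > 0:          # keep the old center if a cluster turns up empty
                C[p] = members.mean(axis=0)
    return C, labels, J_history
```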
In the above algorithm, we set the initial centers C first and obtain A for carrying out the subsequent iterations. It is also possible to set the initial clusters A first and obtain C for carrying out subsequent iterations. The result will be similar.
At each iteration, based on the two observations mentioned above, we can prove that the value of the objective function J(X; C, A) is monotonically non-increasing. Moreover, if the centers stay the same in two consecutive iterations, then the objective function has reached a plateau and never changes afterwards.
The following example demonstrates the use of k-means on a two-dimensional dataset:
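The code and figures of the original example are not reproduced here. The sketch below is a rough stand-in that generates a synthetic dataset of four Gaussian blobs (an assumption about the data) and reuses the kmeans() sketch above to produce similar plots:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 2D dataset of four Gaussian blobs (a stand-in for the original dataset).
rng = np.random.default_rng()
blob_centers = np.array([[0, 0], [4, 0], [0, 4], [4, 4]])
X = np.vstack([c + rng.normal(scale=0.6, size=(100, 2)) for c in blob_centers])

C, labels, J_history = kmeans(X, k=4)    # kmeans() from the sketch above

fig, ax = plt.subplots(1, 3, figsize=(12, 3.5))
ax[0].plot(X[:, 0], X[:, 1], '.')                 # scatter plot of the raw data
ax[0].set_title('Dataset')
ax[1].scatter(X[:, 0], X[:, 1], c=labels, s=8)    # clustering result
ax[1].plot(C[:, 0], C[:, 1], 'rx', markersize=12)
ax[1].set_title('Clustering result')
ax[2].plot(J_history, 'o-')                       # distortion per iteration
ax[2].set_title('Distortion')
plt.show()
```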
The upper-left plot in the above example shows the scatter plot of the dataset. The upper-right plot shows the clustering result. The lower plot shows how the objective function (distortion) decreases with each iteration. Since the initial centers are selected randomly, you are likely to get slightly different results on each run of the example. The next example uses the same dataset and plots the trajectories of the centers during the clustering process (a code sketch follows the panel descriptions):
In the above example:
- The upper-left plot shows the initial centers and the corresponding clusters.
- The upper-right plot shows the final centers and the corresponding clusters.
- The lower-left plot shows the distortion as a function of the number of iterations.
- The lower-right plot shows the trajectories of the centers during the clustering process.
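A rough way to produce such a trajectory plot is sketched below (not the original code); it continues from the dataset X of the previous snippet, assumes 4 clusters, and re-runs the two k-means steps while recording the centers after each iteration:

```python
import numpy as np
import matplotlib.pyplot as plt

# X is the (n, 2) dataset from the previous snippet; k = 4 clusters are assumed.
rng = np.random.default_rng()
C = X[rng.choice(len(X), size=4, replace=False)].copy()
trajectory = [C.copy()]
for _ in range(100):
    labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    # Mean of each cluster; assumes no cluster becomes empty in this simple example.
    new_C = np.array([X[labels == p].mean(axis=0) for p in range(4)])
    if np.allclose(new_C, C):
        break
    C = new_C
    trajectory.append(C.copy())

trajectory = np.array(trajectory)                 # shape: (iterations, 4, 2)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=8)
for p in range(4):
    plt.plot(trajectory[:, p, 0], trajectory[:, p, 1], 'k.-')   # path of center p
plt.title('Trajectories of the cluster centers')
plt.show()
```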
In the above example, we can clearly identify 4 clusters by visual inspection. If we set the number of clusters to 4 and run k-means, the result is usually satisfactory. However, if visual inspection is impossible (say, when the data has more than 3 dimensions), then we need to use methods of cluster validation (see the exercises) to identify the optimal number of clusters.
The following example shows the use of k-means on another 2D dataset of a donut shape:
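The original donut dataset and its figure are not reproduced here. The sketch below generates an assumed stand-in (a ring of points around a central blob) and runs the kmeans() sketch from above with k = 2; because the objective is based on squared Euclidean distance, k-means tends to carve such data into compact pieces rather than separate the ring from the core:

```python
import numpy as np
import matplotlib.pyplot as plt

# Donut-like dataset: a ring of points around a central blob (an assumed stand-in
# for the dataset used in the original example).
rng = np.random.default_rng()
theta = rng.uniform(0, 2 * np.pi, 300)
radius = rng.normal(loc=4.0, scale=0.3, size=300)
ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
blob = rng.normal(scale=0.7, size=(150, 2))
X = np.vstack([ring, blob])

C, labels, _ = kmeans(X, k=2)       # kmeans() from the earlier sketch
plt.scatter(X[:, 0], X[:, 1], c=labels, s=8)
plt.plot(C[:, 0], C[:, 1], 'rx', markersize=12)
plt.title('k-means on a donut-shaped dataset')
plt.show()
```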
There are some other facts about k-means that we should be aware of:
- The iterations of k-means can only guarantee that the objective function J(X; C, A) is non-increasing; they cannot guarantee that the global minimum of the objective function is found. In fact, there are no efficient methods that can guarantee finding the global minimum. Hence it is advisable to run k-means multiple times starting from different initial centers and then keep the best result (see the sketch after this list).
- A better set of initial centers has a positive influence on the final clustering result, so it is an important issue to be able to rapidly select good initial centers heuristically. Some commonly used methods for initial center selection include:
  - Randomly select k data points from the dataset.
  - Select the k data points farthest from the mean of the dataset.
  - Select the k data points nearest to the mean of the dataset.
  - Select the k data points that have the largest sum of pairwise squared distances.

  Intuitively, the last method can usually select a better set of initial centers at the cost of more computation. But again, there is no guarantee as to which method will generate the better result (see the sketch after this list).
- During the iterations of k-means, it can happen that a cluster ends up with no elements at all; this is more likely for datasets of higher dimension. Once we encounter such a situation, the most straightforward remedy is to remove the empty cluster and split another cluster into two, so that the total number of clusters stays the same. There are at least two criteria for selecting the cluster to be split:
  - Select the cluster that contributes most to the objective function.
  - Select the cluster with the maximal number of data points.
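The sketch below illustrates three of the initialization heuristics listed above (the pairwise-distance criterion is omitted because it calls for a combinatorial search) together with the multiple-restart idea from the first bullet; the function name init_centers and its interface are assumptions made for illustration:

```python
import numpy as np

def init_centers(X, k, method='random', rng=None):
    """Pick k initial centers using one of the heuristics listed above (a sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(rng)
    dist_to_mean = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    if method == 'random':            # k random data points
        idx = rng.choice(len(X), size=k, replace=False)
    elif method == 'farthest':        # the k points farthest from the overall mean
        idx = np.argsort(dist_to_mean)[-k:]
    elif method == 'nearest':         # the k points nearest to the overall mean
        idx = np.argsort(dist_to_mean)[:k]
    else:
        raise ValueError('unknown method: ' + method)
    return X[idx].copy()

# Multiple restarts: run k-means several times (X and kmeans() are from the sketches
# above) and keep the run with the smallest final distortion.
runs = [kmeans(X, k=4) for _ in range(10)]
C, labels, J_history = min(runs, key=lambda run: run[2][-1])
```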
The k-means method we have discussed so far is also referred to as the batch k-means algorithm. A similar method for on-line clustering is the sequential (on-line) k-means algorithm, which updates the centers whenever a new data point x becomes available, as follows (a minimal sketch appears after the steps):
1. Randomly select k data points as the initial centers of the k clusters.
2. Find the cluster center that is nearest to the incoming data point x. Add x to that cluster and update the cluster center to be the mean vector of all the data points in the cluster.
3. Check whether the nearest center of each data point is the center of the cluster it belongs to. If this holds for all data points, stop the iteration; otherwise go back to step 2.
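Below is a single-pass Python/numpy sketch of the center-update rule in step 2; the complete procedure described above would repeat passes over the data until the assignments stop changing. Taking the first k points of the stream as the initial centers is a simplifying assumption:

```python
import numpy as np

def sequential_kmeans(points, k):
    """Single-pass sequential (on-line) k-means sketch.

    The first k points are taken as the initial centers (an assumption made for
    simplicity); each later point updates only its nearest center, and per-cluster
    counts keep every center equal to the mean of the points assigned to it so far.
    """
    points = iter(points)
    C = np.array([next(points) for _ in range(k)], dtype=float)
    counts = np.ones(k)
    for x in points:
        x = np.asarray(x, dtype=float)
        p = ((C - x) ** 2).sum(axis=1).argmin()   # nearest center
        counts[p] += 1
        C[p] += (x - C[p]) / counts[p]            # incremental update of the running mean
    return C
```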
In general, the sequential k-means algorithm is useful in the following two situations:
- Time-varying systems, where the characteristics of the dataset change over time.
- Low-end platforms with limited computing power.
In practice, we seldom use the sequential k-means algorithm except for the two reasons stated above.