3-3 K-means Clustering


K-means clustering (k-means for short), also known as Forgy's algorithm, is one of the most well-known methods for data clustering. The goal of k-means is to find k points that best represent a dataset in a certain mathematical sense (to be detailed later). These k points are also known as cluster centers, prototypes, centroids, or codewords. After obtaining these cluster centers, we can use them for various subsequent tasks.

K-means is a method of partitional clustering in which we need to specify the number of clusters before starting the clustering process. Suppose that the number of clusters is k; we can then define an objective function as the sum of the square distances between each data point and its nearest cluster center. We then follow an iterative procedure that minimizes the objective function by finding, at each iteration, a new set of cluster centers that lowers the value of the objective function.

More specifically, suppose that we have a dataset to be divided into k clusters. The data points in cluster p are represented by a set Gp with cluster center cp. The square error of cluster p can then be defined as

$$ e_p=\sum_{x_i \in G_p} \| x_i-c_p\|^2. $$

We can then define the objective function as the sum of the square errors of all k clusters:

$$ E=\sum_{p=1}^{k} e_p. $$

The above objective function is also referred to as the distortion when these cluster centers are used to represent the whole dataset. Obviously the objective function depends on the cluster centers. Therefore we need a systematic method for identifying the clusters and their corresponding centers that minimize E. In other words, the problem of clustering can be formulated as a problem of optimization, which requires more rigorous notation, as explained next.

Suppose that the dataset X is composed of n vectors, X = {x1, ..., xn}, where each vector is of length d. The goal of k-means is to find a way of dividing the original dataset into k clusters with cluster centers C = {c1, ..., ck} such that the following objective function is minimized:

$$ J(X; C, A)=\sum_{p=1}^{k} \sum_{x_i \in G_p} \| x_i-c_p\|^2, $$
where Gp is the set of data points in cluster p and cp is the center of Gp. In the above objective function, A is a k×n membership matrix (or partition matrix) with values of 0 or 1, where A(i, j) = 1 if and only if data point j belongs to cluster i. Note that each column of A contains exactly one 1, since each data point belongs to exactly one cluster. Moreover, A appears on the right-hand side of the above equation only implicitly, through the sets Gp.

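To make the roles of C and A concrete, the following is a minimal MATLAB sketch (not part of the toolbox used in the examples below) that builds the membership matrix A from a given set of centers and evaluates the objective function J(X; C, A). The dataset X and the way the centers are picked are purely illustrative assumptions.

% Minimal sketch: build the membership matrix A and compute J(X; C, A)
% X: d-by-n data matrix (each column is a data point)
% C: d-by-k matrix of cluster centers (each column is a center)
X = rand(2, 100);               % hypothetical 2-D dataset with 100 points
idx = randperm(100);
C = X(:, idx(1:3));             % pick 3 data points as centers, for illustration
n = size(X, 2);
k = size(C, 2);
% Squared distance from every center to every data point
dist = zeros(k, n);
for p = 1:k
  diff = X - repmat(C(:,p), 1, n);
  dist(p, :) = sum(diff.^2, 1);
end
% Assign each data point to its nearest center
[minDist, index] = min(dist, [], 1);
A = zeros(k, n);
A(sub2ind([k, n], index, 1:n)) = 1;   % exactly one 1 per column
% The objective function (distortion) is the sum of the minimum squared distances
J = sum(minDist);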

It would be rather difficult to optimize the above objective function directly since it involves the optimization of d*k (for matrix C) plus k*n (for matrix A) tunable parameters, subject to n constraints on matrix A (each column must contain exactly one 1). However, we can observe two facts:

  1. When C (the cluster centers) is fixed, we can easily identify the optimizing A that minimizes the objective function J(X; C, A). The optimizing A corresponds to assigning each data point to the cluster whose center is nearest to it.
  2. When A (the membership matrix) is fixed, we can easily find the optimizing C that minimizes the objective function J(X; C, A). Since the objective function is a sum of square errors, the minimizing center of each cluster is the mean vector of all the data points in that cluster (a short derivation is given after this list).
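To see why the mean vector is the optimizing center in the second observation, set the gradient of a cluster's square error with respect to its center to zero:

$$ \frac{\partial e_p}{\partial c_p}=\frac{\partial}{\partial c_p}\sum_{x_i \in G_p} \| x_i-c_p\|^2 = -2\sum_{x_i \in G_p} (x_i-c_p)=0 \quad\Longrightarrow\quad c_p=\frac{1}{|G_p|}\sum_{x_i \in G_p} x_i, $$

where |Gp| is the number of data points in cluster p. Since the square error is a convex quadratic function of cp, this stationary point is indeed the minimum.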
According to the above observations, we can describe the k-means clustering algorithm as follows (a minimal implementation sketch is given after the steps):
  1. Randomly select k data points as the initial centers of the k clusters. These centers form the columns of C.
  2. Generate the best membership matrix A based on the given C. In other words, assign each data point x to the cluster whose center is nearest to x.
  3. Compute the objective function J(X; C, A). Stop the iteration if the objective function does not improve much from the previous iteration.
  4. Generate the best C based on the given A; that is, set each cluster center to the mean vector of the data points in its cluster.
  5. Repeat steps 2~4 until convergence or until a maximum number of iterations is reached.
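The following is a minimal from-scratch MATLAB sketch of the above steps, given purely for illustration; it is not the kMeansClustering function of the toolbox used in the examples below, and the dataset, iteration cap, and stopping threshold are assumptions.

% Minimal from-scratch k-means sketch (for illustration only)
% X: d-by-n data matrix; k: desired number of clusters
X = rand(2, 200);                       % hypothetical 2-D dataset
k = 4;
n = size(X, 2);
idx = randperm(n);
C = X(:, idx(1:k));                     % step 1: random initial centers
prevJ = inf;
for iter = 1:100                        % step 5: cap on the number of iterations
  % Step 2: assign each data point to its nearest center
  dist = zeros(k, n);
  for p = 1:k
    dist(p, :) = sum((X - repmat(C(:,p), 1, n)).^2, 1);
  end
  [minDist, label] = min(dist, [], 1);
  % Step 3: compute the objective function and check for convergence
  J = sum(minDist);
  if prevJ - J < 1e-10, break; end      % assumed stopping threshold
  prevJ = J;
  % Step 4: move each center to the mean of its members
  for p = 1:k
    members = X(:, label == p);
    if ~isempty(members), C(:, p) = mean(members, 2); end
  end
end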

In the above algorithm, we set the initial centers C first and obtain A for carrying out the subsequent iterations. It is also possible to set the initial clusters A first and obtain C for carrying out subsequent iterations. The result will be similar.

At each iteration, we can prove that the value of the objective function J(X; C, A) is monotonically non-increasing, based on the two observations mentioned previously. Moreover, if the centers stay the same in two consecutive iterations, then the objective function has reached a plateau and never changes afterwards.

The following example demonstrates the use of k-means on a two-dimensional dataset:

Example 1: kMeans01.m
% ====== Get the data set
DS = dcData(5);
subplot(2,2,1); plot(DS.input(1,:), DS.input(2,:), '.'); axis tight, axis image
% ====== Run kmeans
centerNum=4;
[center, U, distortion, allCenters] = kMeansClustering(DS.input, centerNum);
% ====== Plot the result
subplot(2,2,2); vqDataPlot(DS.input, center);
subplot(2,1,2); plot(distortion, 'o-');
xlabel('No. of iterations'); ylabel('Distortion'); grid on

The upper-left plot in the above example shows the scatter plot of the dataset. The upper-right plot shows the clustering result. The lower plot demonstrates how the objective function (distortion) decreases with each iteration. Since the initial centers are selected randomly, you are likely to get slightly different results for each run of the example.

The next example uses the same dataset for k-means and plots the trajectories of centers during the process of clustering:

Example 2: kMeans02.m
% ====== Get the data set
DS = dcData(5);
centerNum=4;
% ====== Get initial cluster centers
initCenter=vqCenterInit(DS.input, centerNum);
subplot(2,2,1); vqDataPlot(DS.input, initCenter);
% ====== Run k-means to get the final cluster centers
[center, U, distortion, allCenters] = kMeansClustering(DS.input, initCenter);
subplot(2,2,2); vqDataPlot(DS.input, center);
% ====== Plot the distortion
subplot(2,2,3); plot(1:length(distortion), distortion, 'o-');
xlabel('No. of iterations'); ylabel('Distortion'); grid on; axis tight
% ====== Plot the moving paths of the centers
subplot(2,2,4);
for i=1:centerNum
  xData=allCenters(1,i,:); xData=xData(:);
  yData=allCenters(2,i,:); yData=yData(:);
  line(xData, yData, 'color', getColor(i), 'lineWidth', 3);
end
box on; axis image

In the above example, the upper-left plot shows the initial cluster centers, the upper-right plot shows the final cluster centers, the lower-left plot shows the distortion versus the number of iterations, and the lower-right plot shows the moving paths of the centers during the clustering process.

In the above example, we can clearly identify 4 clusters by visual inspection. If we set the number of clusters to 4 before running k-means, the result is usually satisfactory. However, if visual inspection is not possible (say, when the data has more than three dimensions), then we need to use methods of cluster validation (see the exercises) to identify the optimum number of clusters.

The following example shows the use of k-means on another 2D dataset of a donut shape:

Example 3: kMeans03.m
% ====== Get the data set
DS = dcData(2);
centerNum=8;
% ====== Get initial cluster centers
initCenter=vqCenterInit(DS.input, centerNum);
clf; subplot(2,2,1); vqDataPlot(DS.input, initCenter);
% ====== Run k-means to get the final cluster centers
[center, U, distortion, allCenters] = kMeansClustering(DS.input, initCenter);
subplot(2,2,2); vqDataPlot(DS.input, center);
% ====== Plot the distortion
subplot(2,2,3); plot(1:length(distortion), distortion, 'o-');
xlabel('No. of iterations'); ylabel('Distortion'); grid on; axis tight
% ====== Plot the moving paths of the centers
subplot(2,2,4);
for i=1:centerNum
  xData=allCenters(1,i,:); xData=xData(:);
  yData=allCenters(2,i,:); yData=yData(:);
  line(xData, yData, 'color', getColor(i), 'lineWidth', 3);
end
box on; axis image

There are some other facts about k-means that we should be aware of:

Although we can prove that the method always converges, it can only find a local minimum of the objective function; there is no guarantee that the clusters and centers it finds achieve the global minimum. In fact, no method proposed so far can guarantee that the resulting partition minimizes the objective function globally.
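One common remedy, sketched below, is to run k-means several times with different random initializations and keep the run with the lowest final distortion. The sketch assumes, as in the examples above, that kMeansClustering returns the distortion values over the iterations; the number of restarts is an arbitrary choice.

% Run k-means several times and keep the result with the lowest distortion
DS = dcData(5);
centerNum = 4;
runNum = 10;                 % assumed number of random restarts
bestDistortion = inf;
for run = 1:runNum
  [center, U, distortion] = kMeansClustering(DS.input, centerNum);
  if distortion(end) < bestDistortion
    bestDistortion = distortion(end);
    bestCenter = center;
  end
end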

The k-means method we have discussed so far is also referred to as the batch k-means algorithm. A similar method for on-line clustering is called the sequential (or on-line) k-means algorithm, which updates the centers whenever a new data point x becomes available, as follows (a minimal sketch of the incremental center update is given after the steps):

  1. Randomly select k data points as the initial centers for the k clusters.
  2. Find the cluster center that is nearest to the incoming data point x. Add x to that cluster and update the cluster center as the mean vector of all the data points in the cluster.
  3. Check, for every data point, whether its nearest center is the center of the cluster it currently belongs to. If this holds for all data points, stop the iteration. Otherwise go back to step 2.
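The key computation in step 2 is updating a center when a new point joins its cluster; instead of re-averaging all members, the mean can be updated recursively. Below is a minimal sketch of this incremental update, in which the data point, the current centers, and the cluster counts are hypothetical placeholders.

% Incremental (sequential) update of the nearest center for a new point x
% C: d-by-k matrix of current centers; count(p): number of points in cluster p
x = rand(2, 1);                                % hypothetical incoming data point
C = rand(2, 4); count = ones(1, 4);            % hypothetical current state
dist = sum((C - repmat(x, 1, size(C, 2))).^2, 1);
[~, p] = min(dist);                            % index of the nearest center
count(p) = count(p) + 1;
C(:, p) = C(:, p) + (x - C(:, p)) / count(p);  % new mean after adding x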

In general, the sequential k-means algorithm is useful when the data points arrive one at a time (for instance, in a streaming setting) or when the dataset is too large to be processed in a single batch. In practice, we seldom use the sequential k-means algorithm except in such situations.