3-2 Hierarchical Clustering (hsk)



Hierarchical clustering iteratively merges or divides a dataset into clusters, producing a tree-like hierarchical structure that is useful primarily for data visualization. Such a hierarchical tree can be constructed in two ways:

  1. Agglomerative (bottom-up): treat each data point as a cluster and iteratively merge the nearest clusters.
  2. Divisive (top-down): treat the whole dataset as a single cluster and iteratively split clusters.

We shall focus on agglomerative (bottom-up) hierarchical clustering in this section.


For agglomerative hierarchical clustering, we start with the leaves of the tree and move up. In other words, if we have a dataset of size n, then we have n clusters at the beginning, with each data point being a cluster. Then we merge clusters iteratively to move up the tree structure. The algorithm is described next.

  1. At the beginning, each data point forms its own cluster, denoted by $C_i, i=1, 2, \dots, n$.
  2. Find the nearest two clusters $C_i$ and $C_j$ among all clusters.
  3. Merge $C_i$ and $C_j$ into a new cluster.
  4. If the number of clusters equals the desired one, stop; otherwise, go back to step 2 to continue merging.
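The steps above can be sketched as a minimal, unoptimized Python implementation operating on a precomputed distance matrix (the `agglomerative` function below is a hypothetical helper for illustration, not part of the toolbox used in the examples):

```python
import numpy as np

def agglomerative(dist, num_clusters, method='single'):
    """Naive O(n^3) agglomerative clustering on a precomputed
    distance matrix; 'single' = min-linkage, 'complete' = max-linkage."""
    n = dist.shape[0]
    clusters = [[i] for i in range(n)]    # step 1: every point is a cluster
    link = min if method == 'single' else max
    while len(clusters) > num_clusters:   # step 4: stop at the desired count
        # step 2: find the nearest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # step 3: merge them into a single cluster
        clusters[i] += clusters.pop(j)
    return clusters

# Two well-separated groups on a line merge as expected:
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
dist = np.abs(pts[:, None] - pts[None, :])
print(agglomerative(dist, 2))   # → [[0, 1, 2], [3, 4]]
```

A real implementation would cache cluster-to-cluster distances between iterations instead of recomputing them from scratch, which is what makes library routines usable on larger datasets.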
The following example demonstrates the dendrogram after agglomerative hierarchical clustering.


Example 1: hierClusteringPlot01.m

data=rand(2, 50);               % 50 data instances of dim 2
distMat=distPairwise(data);     % Distance matrix of 50 by 50
hcOutput=hierClustering(distMat);
hierClusteringPlot(hcOutput);   % Plot the dendrogram

In order to make the algorithm more concrete, we need to define what is meant by "the nearest two clusters" in step 2. There are in fact several distance functions for computing the distance between two clusters, and different cluster distance functions result in different tree structures. Commonly used cluster distance functions are described next.

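The formulas themselves are not reproduced on this page; the standard definitions, for two clusters $C_i$ and $C_j$ with a point-to-point distance $d(a,b)$, are:

```latex
% Single linkage: distance between the closest pair of points
d_{\min}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)

% Complete linkage: distance between the farthest pair of points
d_{\max}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)

% Average linkage: mean distance over all cross-cluster pairs
d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{a \in C_i} \sum_{b \in C_j} d(a, b)
```

The examples below use the single and complete linkages, which correspond to the `'single'` and `'complete'` options of `hierClustering`.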

We apply different cluster distance functions to obtain the corresponding tree structures, as follows:


Example 2: hierClusteringPlot02.m

data=rand(2, 50);               % 50 data instances of dim 2
distMat=distPairwise(data);     % Distance matrix of 50 by 50
method='single';
hcOutput=hierClustering(distMat, method);
subplot(1,2,1); hierClusteringPlot(hcOutput); title(['method=', method]);
method='complete';
hcOutput=hierClustering(distMat, method);
subplot(1,2,2); hierClusteringPlot(hcOutput); title(['method=', method]);

From the resultant tree structures, we can observe the following trends:

  1. Single linkage joins clusters through their closest members, so it tends to chain nearby clusters together and produce elongated clusters.
  2. Complete linkage is governed by the farthest pair of points, so it tends to produce compact clusters of similar diameter.

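The chaining behavior of single linkage can be seen on a small synthetic dataset: two tight blobs joined by a sparse "bridge" of points. The sketch below uses SciPy as a stand-in for the MATLAB toolbox (an assumption for illustration only); single linkage pulls the whole bridge into the first blob's cluster, while complete linkage breaks the chain.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight blobs joined by a sparse "bridge" of stepping-stone points.
blob1  = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
bridge = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
blob2  = np.array([[4.2, 0.0], [4.3, 0.0], [4.2, 0.1]])
pts = np.vstack([blob1, bridge, blob2])   # rows 0-2, 3-5, 6-8

for method in ('single', 'complete'):
    # Cut the dendrogram so that exactly 2 clusters remain.
    labels = fcluster(linkage(pts, method=method), t=2, criterion='maxclust')
    print(method, labels)
```

With single linkage, rows 0 to 5 (blob1 plus the entire bridge) end up in one cluster and blob2 in the other, because each bridge gap is smaller than the final gap to blob2.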

If you want to see an animation of the clustering process (this example uses single linkage as the cluster distance), try the next example:


Example 3: hierClusteringAnim01.m

data=dcData(6);
data=data.input;
dataNum=size(data,2);
distMat=distPairwise(data, data);
distMat(1:dataNum+1:dataNum^2)=inf;   % Diagonal elements should always be inf
method='single';                      % 'single' or 'complete'
level=hierClustering(distMat, method);
hierClusteringAnim(data, distMat, level);

It can be proved that the connections produced by single linkage over a 2D dataset form the minimum spanning tree of the dataset.

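This connection can be checked numerically: the distances at which single linkage performs its merges coincide with the edge weights of the minimum spanning tree. The sketch below (both functions are illustrative helpers, not part of the toolbox) compares a naive single-linkage run against Prim's algorithm:

```python
import numpy as np

def mst_edge_weights(dist):
    """Prim's algorithm on a full distance matrix; returns the
    sorted weights of the n-1 minimum-spanning-tree edges."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].astype(float).copy()   # cheapest edge from the tree to each node
    weights = []
    for _ in range(n - 1):
        cand = np.where(in_tree, np.inf, best)
        k = int(np.argmin(cand))          # nearest node not yet in the tree
        weights.append(float(cand[k]))
        in_tree[k] = True
        best = np.minimum(best, dist[k])
    return sorted(weights)

def single_linkage_merge_distances(dist):
    """Distances at which naive single-linkage clustering merges clusters."""
    clusters = [[i] for i in range(dist.shape[0])]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(float(d))
        clusters[i] += clusters.pop(j)
    return sorted(merges)

rng = np.random.default_rng(0)
pts = rng.random((12, 2))                                    # 12 random 2-D points
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)  # pairwise Euclidean
print(np.allclose(mst_edge_weights(dist), single_linkage_merge_distances(dist)))  # prints True
```

This equivalence is also what makes single linkage computable in O(n^2) time: one can build the MST first and then read the merges off its edges in increasing order.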

In general, the advantages of hierarchical clustering can be summarized as follows:

  1. The concept is simple, and the result can be represented by a tree structure (dendrogram) that shows the clustering at every level of granularity.
  2. Only the pairwise distances between data points are required; the data points themselves (their features) are not needed.

The major disadvantage is that hierarchical clustering is computationally expensive, since each iteration scans all pairs of clusters, so it can only handle a small amount of data.

Data Clustering and Pattern Recognition (資料分群與樣式辨認)