3-2 Hierarchical Clustering

Hierarchical clustering iteratively merges or splits a dataset into clusters, creating a tree-like hierarchical structure that is used primarily for data visualization. To construct such a hierarchical tree structure, we can adopt one of two approaches:

  1. Agglomerative (bottom-up): start with each data point as its own cluster and repeatedly merge the closest clusters until a single cluster (the root of the tree) remains.
  2. Divisive (top-down): start with all data points in a single cluster and repeatedly split clusters until each data point forms its own cluster.

We shall focus on the agglomerative (bottom-up) hierarchical clustering in this section.

For agglomerative hierarchical clustering, we start with the leaves of the tree and move up. In other words, if we have a dataset of size n, then we have n clusters at the beginning, with each data point being a cluster. Then we merge clusters iteratively to move up the tree structure. The algorithm is described next.

  1. At the beginning, each data point is a cluster, denoted by $C_i, i=1, 2, \dots, n$.
  2. Find the nearest two clusters $C_i$ and $C_j$ among all clusters.
  3. Merge $C_i$ and $C_j$ into a new cluster.
  4. If the number of clusters equals the desired number, stop. Otherwise, go back to step 2 to keep merging clusters.
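To make these four steps concrete, here is a minimal MATLAB sketch of the merge loop, written from scratch with plain matrix operations rather than the toolbox functions used in the examples below. The function name aggCluster, the single-linkage choice, and the target cluster count k are assumptions made for illustration only.

function clusters = aggCluster(distMat, k)
% Minimal sketch (not the toolbox's hierClustering): naive agglomerative
% clustering with single linkage, stopping when k clusters remain.
n = size(distMat, 1);
clusters = num2cell(1:n);                   % Step 1: each data point is its own cluster
while numel(clusters) > k                   % Step 4: stop when k clusters remain
    best = inf; bi = 0; bj = 0;
    for i = 1:numel(clusters)               % Step 2: find the nearest two clusters
        for j = i+1:numel(clusters)
            d = min(min(distMat(clusters{i}, clusters{j})));   % single-linkage distance
            if d < best, best = d; bi = i; bj = j; end
        end
    end
    clusters{bi} = [clusters{bi}, clusters{bj}];               % Step 3: merge the nearest pair
    clusters(bj) = [];
end

Saved as aggCluster.m, it could be called with, say, clusters = aggCluster(distPairwise(data), 3), using a pairwise distance matrix like the one computed in Example 1 below.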
The following example demonstrates the dendrogram after agglomerative hierarchical clustering.

Example 1: hierClusteringPlot01.m

data=rand(2, 50);                  % 50 data instances of dim 2
distMat=distPairwise(data);        % Distance matrix of 50 by 50
hcOutput=hierClustering(distMat);
hierClusteringPlot(hcOutput);      % Plot the dendrogram
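The functions distPairwise, hierClustering, and hierClusteringPlot come with the book's companion toolbox. For readers without it, a roughly comparable dendrogram can presumably be produced with MATLAB's Statistics and Machine Learning Toolbox; the choice of 'single' linkage below is an assumption, since the default method used by hierClustering is not spelled out here.

data = rand(2, 50);                    % 50 data instances of dim 2
Z = linkage(pdist(data'), 'single');   % pdist/linkage expect one observation per row
dendrogram(Z, 0);                      % 0 => show all 50 leaves of the dendrogram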

To make the algorithm more concrete, we need to define what is meant by "the nearest two clusters" in step 2. In fact, there are several distance functions for computing the distance between two clusters, and different cluster distance functions result in different tree structures. Commonly used cluster distance functions are listed next.

  1. Single linkage: the distance between two clusters is the smallest distance over all pairs of points drawn from the two clusters, $d(C_i, C_j)=\min_{x \in C_i, y \in C_j} d(x, y)$.
  2. Complete linkage: the distance between two clusters is the largest such pairwise distance, $d(C_i, C_j)=\max_{x \in C_i, y \in C_j} d(x, y)$.
  3. Average linkage: the distance between two clusters is the average over all such pairwise distances, $d(C_i, C_j)=\frac{1}{|C_i|\,|C_j|}\sum_{x \in C_i}\sum_{y \in C_j} d(x, y)$.
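As a quick illustration, here is a small MATLAB sketch that computes these three cluster distances directly from a pairwise distance matrix. The index vectors idxI and idxJ, and the use of pdist/squareform from the Statistics and Machine Learning Toolbox, are assumptions made for the example.

D = squareform(pdist(rand(50, 2)));   % 50-by-50 pairwise distance matrix (one observation per row)
idxI = 1:10;  idxJ = 11:25;           % hypothetical members of clusters Ci and Cj
block = D(idxI, idxJ);                % distances between every pair (x in Ci, y in Cj)
dSingle   = min(block(:));            % single linkage: minimum pairwise distance
dComplete = max(block(:));            % complete linkage: maximum pairwise distance
dAverage  = mean(block(:));           % average linkage: mean pairwise distance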

We apply different cluster distance functions to obtain the corresponding tree structures, as follows:

Example 2: hierClusteringPlot02.m

data=rand(2, 50);                          % 50 data instances of dim 2
distMat=distPairwise(data);                % Distance matrix of 50 by 50
method='single';
hcOutput=hierClustering(distMat, method);
subplot(1,2,1); hierClusteringPlot(hcOutput); title(['method=', method]);
method='complete';
hcOutput=hierClustering(distMat, method);
subplot(1,2,2); hierClusteringPlot(hcOutput); title(['method=', method]);

From the resultant tree structures, we can observe the following trends:

  1. Single linkage tends to chain nearby points together, so its clusters are often elongated and loosely connected, with points absorbed one at a time at small merge distances.
  2. Complete linkage tends to produce compact clusters of roughly similar diameter, and the resulting dendrogram is usually more balanced.

If you want to see an animation of the clustering process, try the next example (which uses single linkage as the cluster distance function):

Example 3: hierClusteringAnim01.m

data=dcData(6);
data=data.input;
dataNum=size(data,2);
distMat=distPairwise(data, data);
distMat(1:dataNum+1:dataNum^2)=inf;        % Diagonal elements should always be inf.
method='single';                           % 'single' or 'complete'
level=hierClustering(distMat, method);
hierClusteringAnim(data, distMat, level);

It can be proved that, for a 2D dataset, the connections created by single-linkage clustering form the minimum spanning tree of the dataset.
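This correspondence can be checked numerically: under single linkage, the distances at which clusters are merged coincide with the edge weights of the minimum spanning tree. The sketch below relies on MATLAB's pdist, squareform, linkage, graph, and minspantree rather than the book's toolbox, and the random dataset is an assumption made for the check.

X = rand(50, 2);                           % 50 random 2D points, one per row
Z = linkage(pdist(X), 'single');           % single-linkage merge tree; Z(:,3) holds merge distances
G = graph(squareform(pdist(X)));           % complete graph weighted by pairwise Euclidean distances
T = minspantree(G);                        % minimum spanning tree of that graph
disp([sum(Z(:,3)), sum(T.Edges.Weight)])   % the two totals should agree (49 merges vs 49 MST edges)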

In general, the advantages of hierarchical clustering can be summarized as follows:

  1. The concept is simple, and the whole merging process can be visualized and explained via the resulting dendrogram.
  2. The number of clusters does not need to be fixed in advance; it can be chosen afterwards by cutting the dendrogram at a suitable level.
  3. Only the pairwise distance matrix is required, so the method still works when raw feature vectors are unavailable.

The major disadvantage is that hierarchical clustering can only handle a relatively small amount of data, since it requires computing and storing the full n-by-n pairwise distance matrix, whose cost grows quadratically with the dataset size.
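A back-of-the-envelope MATLAB snippet makes this quadratic cost concrete; the dataset size of 100,000 points is an arbitrary assumption.

n = 1e5;                                             % hypothetical number of data points
bytes = 8*n^2;                                       % full n-by-n distance matrix of doubles
fprintf('Distance matrix alone needs about %.0f GB\n', bytes/2^30);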
