3-2 Hierarchical Clustering

Hierarchical clustering iteratively merges or splits a dataset into clusters, producing a tree-like hierarchical structure that is useful primarily for data visualization. To construct such a hierarchical tree, we can adopt two approaches:

  1. Agglomerative (bottom-up): treat every data point as its own cluster and repeatedly merge the nearest clusters until the tree is complete.
  2. Divisive (top-down): treat the whole dataset as a single cluster and repeatedly split clusters into smaller ones.

We shall focus on the agglomerative (bottom-up) hierarchical clustering in this section.

For agglomerative hierarchical clustering, we start with the leaves of the tree and move up. In other words, if we have a dataset of size n, then we have n clusters at the beginning, with each data point being a cluster. Then we merge clusters iteratively to move up the tree structure. The algorithm is described next.

  1. At the beginning, each data point is a cluster denoted by $C_i, i=1, 2, \dots, n$.
  2. Find the two nearest clusters $C_i$ and $C_j$ among all current clusters.
  3. Merge $C_i$ and $C_j$ into a new cluster.
  4. If the number of clusters equals the desired number, stop. Otherwise, go back to step 2 to continue merging.
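
To make these steps concrete, here is a minimal MATLAB sketch of the merge loop. It is illustrative only, not the toolbox's hierClustering function; the variable names and the stopping threshold are assumptions. The cluster distance used is the smallest point-to-point distance between two clusters, i.e. single linkage, which is defined later in this section.

data = rand(2, 20);                                 % 20 random data points in 2D
n = size(data, 2);
sq = sum(data.^2, 1);                               % squared norm of each point
distMat = sqrt(max(sq' + sq - 2*(data'*data), 0));  % pairwise Euclidean distances
clusters = num2cell(1:n);                           % step 1: n singleton clusters
desiredNum = 3;                                     % stop when this many clusters remain
while numel(clusters) > desiredNum
  best = inf; bi = 0; bj = 0;
  for i = 1:numel(clusters)                         % step 2: find the nearest pair
    for j = i+1:numel(clusters)
      d = min(min(distMat(clusters{i}, clusters{j})));   % single-linkage distance
      if d < best, best = d; bi = i; bj = j; end
    end
  end
  clusters{bi} = [clusters{bi}, clusters{bj}];      % step 3: merge the two clusters
  clusters(bj) = [];                                % step 4: repeat until desiredNum
end
fprintf('Number of clusters after merging: %d\n', numel(clusters));
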
The following example demonstrates the dendrogram after agglomerative hierarchical clustering.

籈Α顶糷だ竤猭agglomerative hierarchical clusteringパ攫挡篶┏场秨﹍糷糷籈秨﹍и盢–掸戈跌竤籈cluster安砞и瞷局Τn掸戈玥盢硂n掸戈跌n竤籈ョ–竤籈掸戈

  1. 盢–掸戈跌竤籈 Ci, i=1 1 to n.
  2. т┮Τ竤籈丁禯瞒程钡ㄢ竤籈 CiCj
  3. ㄖ Ci Cj Θ穝竤籈
  4. 安ヘ玡竤籈计ヘи箇戳竤籈计ヘ玥は滦狡˙艼竤籈计ヘ盢и┮璶―计ヘ
絛ㄒ甶ボ籈Α顶糷だ竤猭┮玻ネ攫瓜dendrogram

Example 1: hierClusteringPlot01.m

data = rand(2, 50);              % 50 data instances of dim 2
distMat = distPairwise(data);    % Distance matrix of 50 by 50
hcOutput = hierClustering(distMat);
hierClusteringPlot(hcOutput);    % Plot the dendrogram

To make the algorithm more concrete, we need to define what is meant by "the nearest two clusters". In fact, there are several distance functions for computing the distance between two clusters, and different cluster distance functions result in different tree structures. Commonly used cluster distance functions are listed next.
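
For reference, the standard definitions of the most common cluster distance functions are given below; the examples that follow use 'single' and 'complete' (whether hierClustering supports further choices is not shown here, so treat the third item as general background). For two clusters $C_i$ and $C_j$ and a point-to-point distance $d(x, y)$:

  1. Single linkage: $d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$, the distance between the closest pair of points.
  2. Complete linkage: $d_{\max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$, the distance between the farthest pair of points.
  3. Average linkage: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$, the average distance over all cross-cluster pairs.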

We apply different cluster distance functions to obtain the corresponding tree structures, as follows:

иㄏノぃ竤籈禯瞒ㄓ玻ネ顶糷Α竤籈攫┮眔挡狦

Example 2: hierClusteringPlot02.m

data = rand(2, 50);              % 50 data instances of dim 2
distMat = distPairwise(data);    % Distance matrix of 50 by 50
method = 'single';
hcOutput = hierClustering(distMat, method);
subplot(1,2,1); hierClusteringPlot(hcOutput); title(['method=', method]);
method = 'complete';
hcOutput = hierClustering(distMat, method);
subplot(1,2,2); hierClusteringPlot(hcOutput); title(['method=', method]);

From the resultant tree structures, we can observe the following trends:

  1. Single linkage merges clusters through their closest points, so it is prone to a "chaining" effect that produces elongated clusters and a rather skewed dendrogram.
  2. Complete linkage merges clusters based on their farthest points, so it tends to produce compact clusters of comparable size and a more balanced dendrogram.

パ瓃竤籈攫и祇谋疭┦

If you want to see an animation of the clustering process, try the next example, which uses single linkage as the cluster distance:

Example 3: hierClusteringAnim01.m

data = dcData(6);
data = data.input;
dataNum = size(data, 2);
distMat = distPairwise(data, data);
distMat(1:dataNum+1:dataNum^2) = inf;   % Diagonal elements should always be inf.
method = 'single';                      % 'single' or 'complete'
level = hierClustering(distMat, method);
hierClusteringAnim(data, distMat, level);

It can be proved that the connections formed via single linkage over a 2D dataset are actually the minimum spanning tree of the dataset.
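
The sketch below gives a quick numerical check of this relationship. It uses MATLAB's built-in graph functions (graph, minspantree) and the Statistics and Machine Learning Toolbox (pdist, squareform, linkage) rather than the toolbox functions of the earlier examples, so treat it as an assumption-laden illustration: the sum of the single-linkage merge distances should equal the total edge weight of the minimum spanning tree.

data = rand(2, 50);                  % 50 points in 2D, one per column
X = data';                           % pdist/linkage expect one point per row
Z = linkage(pdist(X), 'single');     % single-linkage merge distances in Z(:,3)
G = graph(squareform(pdist(X)));     % complete weighted graph on the points
T = minspantree(G);                  % its minimum spanning tree
fprintf('Sum of merge distances: %g\n', sum(Z(:, 3)));
fprintf('MST total edge weight:  %g\n', sum(T.Edges.Weight));
% The two values coincide, because every single-linkage merge follows an MST edge.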

ㄆ龟и靡狦戈栋琌蝴キ翴┮Θ栋τи蹦ノ single linkage玥┮眔硈挡瓜琌硂ㄇ翴 minimum spanning tree

In general, the advantages of hierarchical clustering can be summarized as follows:

  1. The concept is simple, and the resulting dendrogram gives an intuitive visualization of how clusters are merged at each level.
  2. The number of clusters does not have to be fixed in advance; cutting the dendrogram at different levels yields different partitions.
  3. Only the pairwise distance matrix is required, so the method applies even when raw feature vectors are unavailable.

The major disadvantage is that hierarchical clustering requires the full pairwise distance matrix and repeated searches over it, so it can only handle a relatively small amount of data.
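
As a rough, illustrative calculation of why data size matters (the figure of 10,000 points is an assumption for illustration, not from the original text): the full distance matrix for $n$ data points has $n^2$ entries, so for $n = 10000$ we already need $10^8$ entries, or about 800 MB in double precision, before any merging begins.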
