Hierarchical clustering iteratively merges or splits a dataset into clusters, creating a tree-like hierarchical structure that is used primarily for data visualization. To construct such a hierarchical tree structure, we can adopt two approaches:
- Agglomerative hierarchical clustering: This is a bottom-up approach. We start from the leaves of the tree (the individual data instances) and combine the two nearest clusters into one at each iteration.
- Divisive hierarchical clustering: This is a top-down approach. We start from the root of the tree (which contains all data instances) and select a cluster to split at each iteration.
We shall focus on agglomerative (bottom-up) hierarchical clustering in this section.
For agglomerative hierarchical clustering, we start with the leaves of the tree and move up. In other words, if we have a dataset of size n, we start with n clusters, each containing a single data point. We then merge clusters iteratively to move up the tree structure. The algorithm is described next.
1. At the beginning, treat each data point as a cluster, denoted by $C_i$, $i=1, 2, \dots, n$.
2. Find the nearest two clusters $C_i$ and $C_j$ among all clusters.
3. Combine $C_i$ and $C_j$ into a new cluster.
4. If the number of clusters is equal to the desired one, stop. Otherwise, go back to step 2 to continue merging.
The following example demonstrates the dendrogram produced by agglomerative hierarchical clustering.
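The original example is not reproduced here, but the procedure can be sketched with SciPy's hierarchical clustering routines; the synthetic dataset and the choice of single linkage below are illustrative assumptions, not part of the original text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# A small synthetic 2D dataset; each row is one data point.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.3, size=(5, 2)) for loc in ([0, 0], [3, 3], [6, 0])])

# Agglomerative clustering with single linkage; the returned matrix
# records which clusters are merged at which distance.
Z = linkage(data, method='single')

# The dendrogram visualizes the sequence of merges as a tree.
dendrogram(Z)
plt.show()
```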
To make the algorithm more concrete, we need to define what is meant by "the nearest two clusters". In fact, there are several distance functions for computing the distance between two clusters, and different cluster distance functions result in different tree structures. Commonly used cluster distance functions are listed next, followed by a small sketch that implements them directly.
- Single-linkage agglomerative algorithm: The distance between two clusters is the shortest distance between members of these two clusters: $$d(C_i, C_j)=\min_{\mathbf{a}\in C_i, \mathbf{b}\in C_j} d(\mathbf{a}, \mathbf{b})$$
- Complete-linkage agglomerative algorithm: The distance between two clusters is the longest distance between members of these two clusters: $$d(C_i, C_j)=\max_{\mathbf{a}\in C_i, \mathbf{b}\in C_j} d(\mathbf{a}, \mathbf{b})$$
- Average-linkage agglomerative algorithm: The distance between two clusters is the average distance between members of these two clusters: $$d(C_i, C_j)=\sum_{\mathbf{a}\in C_i, \mathbf{b}\in C_j} \frac{d(\mathbf{a}, \mathbf{b})}{|C_i||C_j|},$$ where $|C_i|$ and $|C_j|$ are the sizes of $C_i$ and $C_j$, respectively.
- Ward's method: The distance between two clusters is defined as the sum of the squared distances to the mean of the combined clusters: $$d(C_i, C_j)=\sum_{\mathbf{a}\in C_i \cup C_j} \|\mathbf{a}-\boldsymbol{\mu}\|^2,$$ where $\boldsymbol{\mu}$ is the mean vector of $C_i \cup C_j$.
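As a concrete reference, the four definitions above can be implemented directly with NumPy. The following sketch mirrors the formulas rather than any optimized library routine; the two example clusters are made up for illustration.

```python
import numpy as np

def single_linkage(Ci, Cj):
    """Shortest distance between any member of Ci and any member of Cj."""
    return min(np.linalg.norm(a - b) for a in Ci for b in Cj)

def complete_linkage(Ci, Cj):
    """Longest distance between any member of Ci and any member of Cj."""
    return max(np.linalg.norm(a - b) for a in Ci for b in Cj)

def average_linkage(Ci, Cj):
    """Average of all pairwise distances between members of Ci and Cj."""
    return sum(np.linalg.norm(a - b) for a in Ci for b in Cj) / (len(Ci) * len(Cj))

def ward_distance(Ci, Cj):
    """Sum of squared distances of all members to the mean of the merged cluster."""
    merged = np.vstack([Ci, Cj])
    mu = merged.mean(axis=0)
    return float(((merged - mu) ** 2).sum())

# Two small 2D clusters used purely for illustration.
Ci = np.array([[0.0, 0.0], [0.0, 1.0]])
Cj = np.array([[3.0, 0.0], [4.0, 1.0]])
for f in (single_linkage, complete_linkage, average_linkage, ward_distance):
    print(f.__name__, f(Ci, Cj))
```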
We apply different cluster distance functions to obtain the corresponding tree structures, as follows:
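The figures from the original are not included here, but a similar comparison can be generated with SciPy; the synthetic dataset and plotting details below are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Synthetic 2D data with clusters of unequal size and spread.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([4, 4], 1.0, (10, 2))])

# Build one dendrogram per cluster distance function for side-by-side comparison.
methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(1, len(methods), figsize=(16, 4))
for ax, method in zip(axes, methods):
    dendrogram(linkage(data, method=method), ax=ax, no_labels=True)
    ax.set_title(method)
plt.show()
```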
From the resultant tree structures, we can observe the following trends:
- Single linkage tends to make the tree slide to one side, with bigger clusters staying big and smaller clusters staying small.
- Complete linkage (and, similarly, average linkage) tends to generate balanced trees, with all clusters growing big simultaneously.
If you want to see an animation of the clustering process, try the next example (which uses single linkage as the cluster distance):
It can be proved that the connections formed by single linkage over a 2D dataset actually constitute the minimum spanning tree of the dataset.
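This equivalence can be checked numerically. The sketch below (assuming SciPy; the random dataset is made up) compares the single-linkage merge distances with the edge weights of the minimum spanning tree built on the pairwise-distance graph.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# A random 2D dataset.
rng = np.random.default_rng(0)
X = rng.random((30, 2))

# Merge distances from single-linkage clustering (third column of the linkage matrix).
merge_dists = np.sort(linkage(X, method='single')[:, 2])

# Edge weights of the minimum spanning tree over the pairwise-distance graph.
mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
mst_weights = np.sort(mst[mst > 0])

# The two sorted lists coincide, illustrating the single-linkage/MST equivalence.
print(np.allclose(merge_dists, mst_weights))
```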
In general, the advantages of hierarchical clustering can be summarized as follows:
- The concept is quite straightforward, and the result can be represented by a tree structure.
- You only need the pairwise distances between objects to start clustering; that is, you do not need to represent the objects as points in a high-dimensional space (see the sketch below).
The major disadvantage is that hierarchical clustering can only handle a small amount of data.
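The second advantage can be illustrated with objects that have no natural coordinates. The sketch below is a made-up example that measures string dissimilarity with Python's standard difflib module and feeds the precomputed pairwise distances straight into SciPy's linkage function.

```python
import numpy as np
import matplotlib.pyplot as plt
from difflib import SequenceMatcher
from scipy.cluster.hierarchy import linkage, dendrogram

words = ['cluster', 'clusters', 'clustering', 'pattern', 'patterns', 'recognition']

# Condensed pairwise-dissimilarity vector (same ordering as scipy.spatial.distance.pdist).
dist = np.array([1 - SequenceMatcher(None, a, b).ratio()
                 for i, a in enumerate(words) for b in words[i + 1:]])

# linkage accepts precomputed distances, so the objects never need coordinates.
Z = linkage(dist, method='average')
dendrogram(Z, labels=words)
plt.show()
```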