The objective of data clustering is to identify clusters within a given dataset such that similar data instances are likely to fall in the same cluster. The original dataset is thus decomposed into disjoint (or fuzzy) clusters, each represented by a center. We can use the cluster centers (also known as centroids or prototypes) to represent the original dataset and achieve the following goals:
- Data visualization
- Data compression
- Noise suppression
- Computation reduction
In general, clustering algorithms can be divided into two types:
- Hierarchical clustering: For agglomerative hierarchical clustering, each data instance starts as its own cluster, and the number of clusters is decreased from the size of the dataset, by merging clusters, until the desired number of clusters is reached. On the other hand, for divisive hierarchical clustering, all data start in a single cluster, and the number of clusters is increased from 1, by splitting clusters, until the desired number of clusters is reached.
- Partitional clustering: The number of clusters is fixed in advance. A number of iterations are then performed to identify the best clusters together with their cluster centers.
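The two types above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the centroid-distance merge rule, and the choice of the first k points as initial centers for the k-means style loop are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def agglomerative(points, k):
    """Bottom-up: start with one cluster per point and merge the two
    closest clusters (by centroid distance, an assumed rule) until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def kmeans(points, k, iters=20):
    """Partitional: k is fixed in advance; alternate between assigning
    points to the nearest center and recomputing each center."""
    centers = list(points[:k])  # naive initialization (assumption)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else centers[c] for c, g in enumerate(groups)]
    return centers, groups

# Toy dataset with two well-separated groups (made up for illustration).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
merged = agglomerative(data, 2)
centers, groups = kmeans(data, 2)
```

On this toy dataset both approaches recover the same two groups. In practice the result of the partitional loop depends on how the centers are initialized.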
Each data clustering task follows a similar procedure:
- Collect dataset.
- Apply a certain clustering algorithm to get clustering results.
- Test the clustering results.
- If the test passes, stop. Otherwise, go back to step 2 (applying the clustering algorithm) and repeat the clustering process.
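The procedure above amounts to a cluster-then-test loop. The sketch below is one possible reading, with hypothetical names throughout: `split_at_gaps` is a toy 1-D stand-in for "a certain clustering algorithm", and `tight_enough` is an assumed acceptance test based on the spread within each cluster.

```python
def split_at_gaps(values, k):
    """Toy 1-D clustering (a stand-in): cut the sorted values at the
    k - 1 largest gaps between neighbors."""
    xs = sorted(values)
    # Index i means "cut between xs[i] and xs[i + 1]".
    gaps = sorted(range(len(xs) - 1),
                  key=lambda i: xs[i + 1] - xs[i], reverse=True)
    cuts = sorted(gaps[:k - 1])
    clusters, start = [], 0
    for i in cuts:
        clusters.append(xs[start:i + 1])
        start = i + 1
    clusters.append(xs[start:])
    return clusters

def tight_enough(clusters, max_spread=1.0):
    """Assumed test: every cluster spans at most max_spread."""
    return all(max(c) - min(c) <= max_spread for c in clusters)

def cluster_until_accepted(data, cluster, passes, settings):
    """Steps 2 to 4: cluster, test, and retry with the next setting."""
    for setting in settings:
        result = cluster(data, setting)
        if passes(result):
            return setting, result
    return None, None  # no setting passed the test

# Step 1: collect the dataset (toy values made up for illustration).
data = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]
k, clusters = cluster_until_accepted(data, split_at_gaps, tight_enough,
                                     settings=range(1, 6))
```

Here the loop retries with a larger number of clusters each time. What changes between repetitions (the number of clusters, the initialization, or the algorithm itself) is a design choice that the procedure leaves open.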
Vector quantization (VQ) is a specific field of data clustering that emphasizes the algorithmic aspect of minimizing a distortion measure over a large amount of data. VQ is commonly used for image and audio data, with a goal similar to that of partitional clustering but a process similar to that of hierarchical clustering.
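To make the distortion measure concrete, the following sketch encodes each vector as the index of its nearest codeword and reports the mean squared error of the reconstruction. It assumes squared Euclidean distance as the distortion measure and a codebook that is already given; the function names and data are illustrative only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(vectors, codebook):
    """Replace each vector by the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
            for v in vectors]

def decode(indices, codebook):
    """Reconstruct an approximation of the data from codeword indices."""
    return [codebook[i] for i in indices]

def distortion(vectors, codebook):
    """Mean squared distance between each vector and its reconstruction."""
    approx = decode(encode(vectors, codebook), codebook)
    return sum(sq_dist(v, a) for v, a in zip(vectors, approx)) / len(vectors)

# Toy data and a two-entry codebook (made up for illustration).
vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
codebook = [(0.0, 1.0), (10.0, 11.0)]
indices = encode(vectors, codebook)
error = distortion(vectors, codebook)
```

Storing `indices` plus the small codebook instead of the full vectors is the compression step. A VQ design algorithm then searches for the codebook that makes `error` as small as possible.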
Data Clustering and Pattern Recognition (資料分群與樣式辨認)