In the subsequent chapters of this book, we shall use several datasets to demonstrate the concepts of data clustering and pattern recognition. For pattern recognition, the following three datasets are commonly used:
- Iris dataset
- Wine dataset
- Abalone dataset
In fact, these datasets are available from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. The website is maintained by the Donald Bren School of Information and Computer Sciences at the University of California, Irvine. It hosts numerous datasets together with their documentation, so researchers in pattern recognition and machine learning can use them to evaluate and compare their approaches.
To facilitate access to these datasets via MATLAB, we usually use a structure variable DS (dataset) for storing all the information of a dataset, as explained next:
- DS: the structure variable for storing all information of a dataset
- DS.input: the input part (also known as features) of the dataset
- DS.output: the output part (also known as desired classes or ground truth) of the dataset. Each entry of this vector is an index into a class name in DS.outputName.
- DS.dataName: a string representing the name of this dataset
- DS.inputName: a cell string that holds the names of the inputs.
- DS.outputName: a cell string that holds the names of the output classes. Note that each entry in DS.output is an index into this cell string. As a result, the values of DS.output should be between 1 and length(DS.outputName), inclusive.
We also have a function prData.m that returns these three datasets. For instance, it can be used to read the commonly used Iris dataset.
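A minimal sketch of reading the Iris dataset into DS follows. The call prData('iris') is an assumption about the function's signature (a dataset name passed as a string), which this text does not spell out. So that the snippet runs without the toolbox, it falls back to a hand-built placeholder that has the same field layout but contains dummy zeros, not real Iris measurements; the feature and class names in the placeholder are likewise illustrative.

```matlab
% Sketch only: prData('iris') is an assumed call signature.
if exist('prData', 'file') == 2
    DS = prData('iris');                 % assumed: dataset name given as a string
else
    % Placeholder with the same field layout (NOT real Iris data)
    DS = struct();
    DS.dataName   = 'iris (placeholder)';
    DS.inputName  = {'sepal length', 'sepal width', 'petal length', 'petal width'};
    DS.outputName = {'setosa', 'versicolor', 'virginica'};
    DS.input      = zeros(4, 6);         % 4 features x 6 dummy samples
    DS.output     = [1 1 2 2 3 3];       % class index of each sample
end

% Each column of DS.input is one sample; each row is one feature.
[featureNum, dataNum] = size(DS.input);
classNum = length(DS.outputName);
fprintf('dataset=%s, features=%d, samples=%d, classes=%d\n', ...
    DS.dataName, featureNum, dataNum, classNum);
```

Storing samples as columns means that DS.input(:, k) is the k-th feature vector and DS.outputName{DS.output(k)} is the name of its class.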
From the above example, we know that:
- DS.dataName is 'iris', representing the name of this dataset.
- DS.inputName is a cell string of length 4, representing the names of the 4 input features.
- DS.input is a 4x150 matrix, indicating that each feature vector has 4 dimensions and that there are 150 feature vectors.
- DS.output is a 1x150 vector, representing the output classes of these 150 feature vectors. Possible values of DS.output are 1, 2, and 3, which index into the 3 classes whose names are given in DS.outputName.
- DS.outputName is a cell string of length 3, representing the names of the classes in this dataset.
Based on the dataset DS, we have a set of functions for visualizing the dataset from different perspectives, as listed below. For the usage of these functions, please refer to the subsequent sections of this chapter.
- classClassSize(DS): Compute the size of each class
- dsProjPlot1(DS): Plot the classes w.r.t. the projected 1D features
- dsProjPlot2(DS): Plot the classes w.r.t. the projected 2D features
- dsProjPlot3(DS): Plot the classes w.r.t. the projected 3D features
- dsFormatCheck(DS): Check the format of the dataset
- dsNameAdd(DS): Add names to the inputs and outputs of a given dataset
- dsRangePlot(DS): Plot the range of each input of a given dataset
- dsDistPlot(DS): Plot the distributions of inputs over different classes of a given dataset
- dsScatterPlot(DS): Scatter plot of a dataset in a 2D space
- dsScatterPlot3(DS): Scatter plot of a dataset in a 3D space
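To make the DS conventions concrete, the sketch below uses only built-in MATLAB functions to check the consistency of a DS structure, count the size of each class, and compute the range of each input. It is not the implementation of the functions listed above; it only illustrates the kind of computation such functions could perform. The toy dataset and all variable names are made up for illustration.

```matlab
% Toy dataset in the DS layout (values are made up for illustration)
DS = struct();
DS.dataName   = 'toy';
DS.inputName  = {'x1', 'x2'};
DS.outputName = {'classA', 'classB'};
DS.input      = [1 2 3 4 5; 10 20 30 40 50];   % 2 features x 5 samples
DS.output     = [1 1 2 2 2];                   % class index of each sample

% Consistency checks implied by the field descriptions
[featureNum, dataNum] = size(DS.input);
classNum = length(DS.outputName);
assert(length(DS.inputName) == featureNum, 'inputName needs one entry per feature');
assert(length(DS.output) == dataNum, 'output needs one entry per sample');
assert(all(DS.output >= 1 & DS.output <= classNum), 'output must index into outputName');

% Size of each class: count the occurrences of each class index
classSize = accumarray(DS.output(:), 1, [classNum 1])';

% Range of each input over all samples
featureMin = min(DS.input, [], 2)';
featureMax = max(DS.input, [], 2)';

for i = 1:classNum
    fprintf('%s: %d samples\n', DS.outputName{i}, classSize(i));
end
for j = 1:featureNum
    fprintf('%s: [%g, %g]\n', DS.inputName{j}, featureMin(j), featureMax(j));
end
```

Because DS.output holds integer class indices rather than class names, counting and grouping reduce to simple index operations such as accumarray.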
Data Clustering and Pattern Recognition (資料分群與樣式辨認)