2-1 Intro. to Datasets

In the subsequent chapters of this book, we shall use several datasets to demonstrate the concepts of data clustering and pattern recognition. For pattern recognition, the following three datasets are commonly used:

Iris dataset
Wine dataset
Abalone dataset
These datasets are available from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. The website is maintained by the Donald Bren School of Information and Computer Sciences at the University of California, Irvine. It hosts numerous datasets together with their documentation, so researchers in pattern recognition and machine learning can use them to evaluate and compare their approaches.
To facilitate access to these datasets in MATLAB, we usually store all the information of a dataset in a structure variable DS (dataset), as explained next:

DS: the structure variable for storing all information of a dataset
DS.input: the input part (also known as features) of the dataset
DS.output: the output part (also known as desired classes or ground truth) of the dataset. Each entry of this vector is an index into the class names stored in DS.outputName.
DS.dataName: a string representing the name of this dataset
DS.inputName: a cell string that holds the names of the inputs.
DS.outputName: a cell string that holds the names of the output classes. Note that each entry in DS.output is an index into this cell string. As a result, every value in DS.output should be an integer between 1 and length(DS.outputName), inclusive.
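The field descriptions above can be illustrated with a small hand-built dataset. The following is only a sketch: the dataset name, feature names, class names, and numeric values are made up for illustration, and the checks are the consistency conditions implied by the field descriptions, not the toolbox's own validation code.

```matlab
% Build a toy dataset by hand (all names and values are hypothetical).
DS = struct();
DS.dataName   = 'toy';
DS.inputName  = {'feature1', 'feature2'};
DS.outputName = {'classA', 'classB'};
% Each row of DS.input is one feature; each column is one sample.
DS.input  = [1.0 1.2 0.9 3.1 3.0 2.8; 0.5 0.4 0.6 2.2 2.5 2.4];
% DS.output holds the class index of each sample (column of DS.input).
DS.output = [1 1 1 2 2 2];

% Consistency checks implied by the field descriptions.
[dim, dataNum] = size(DS.input);
assert(length(DS.inputName) == dim);        % one name per feature
assert(length(DS.output) == dataNum);       % one class index per sample
assert(all(DS.output == round(DS.output))); % indices are integers
assert(all(DS.output >= 1 & DS.output <= length(DS.outputName)));
```

Storing class indices in DS.output and the names separately in DS.outputName keeps the output numeric, which is convenient for computation, while the names remain available for display.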
We also provide a function prData.m that returns these three datasets. For instance, to load the commonly used Iris dataset, follow this example:
Example 1: irisDataSet01.m

From the above example, we know that:

DS.dataName is 'iris', representing the name of this dataset.
DS.inputName is a cell string of length 4, holding the names of the 4 input features.
DS.input is a 4x150 matrix: each of its 150 columns is a feature vector of dimension 4.
DS.output is a 1x150 vector, holding the output classes of these 150 feature vectors. Possible values of DS.output are 1, 2, and 3, each being an index into the 3 class names stored in DS.outputName.
DS.outputName is a cell string of length 3, holding the names of the classes in this dataset.
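The relations among these fields can be sketched on a stand-in structure with the same shape as the Iris dataset. This is an assumption-laden illustration: the feature values are random and the feature and class names are placeholders, not the actual contents returned by prData.m. Only the sizes (4 features, 150 samples, 3 classes of 50 samples each) follow the Iris data.

```matlab
% Stand-in with the same shape as the Iris dataset (placeholder contents).
DS = struct();
DS.dataName   = 'iris-like';
DS.inputName  = {'in1', 'in2', 'in3', 'in4'};     % placeholder names
DS.outputName = {'class1', 'class2', 'class3'};   % placeholder names
DS.input      = rand(4, 150);                     % random values, not real data
DS.output     = [ones(1,50), 2*ones(1,50), 3*ones(1,50)];

% Dimension of the feature vectors, number of samples, number of classes.
[dim, dataNum] = size(DS.input);
classNum = length(DS.outputName);
fprintf('%s: %d features, %d samples, %d classes\n', ...
    DS.dataName, dim, dataNum, classNum);

% Look up the class name of the k-th sample via its class index.
k = 51;
className = DS.outputName{DS.output(k)};
fprintf('Sample %d has class index %d (%s)\n', k, DS.output(k), className);
```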
Based on the dataset DS, we have a set of functions for checking and visualizing the dataset from different perspectives:

classClassSize(DS): Compute the size of each class
dsProjPlot1(DS): Plot the classes w.r.t. the projected 1D features
dsProjPlot2(DS): Plot the classes w.r.t. the projected 2D features
dsProjPlot3(DS): Plot the classes w.r.t. the projected 3D features
dsFormatCheck(DS): Check the format of the dataset
dsNameAdd(DS): Add names to the inputs and outputs of a given dataset
dsRangePlot(DS): Plot the range of each input of a given dataset
dsDistPlot(DS): Plot the distributions of inputs over different classes of a given dataset
dsScatterPlot(DS): Scatter plot of a dataset in a 2D space
dsScatterPlot3(DS): Scatter plot of a dataset in a 3D space
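To give a feel for the kind of quantities some of these functions work with, the class sizes and the range of each input can be computed directly from the DS fields with base MATLAB. This is a minimal sketch on made-up data. It does not call the toolbox functions listed above and is not meant to reproduce their implementation or output format.

```matlab
% Toy dataset (hypothetical values): 2 features, 5 samples, 2 classes.
DS = struct();
DS.input      = [5.1 4.9 6.3 5.8 7.1; 3.5 3.0 3.3 2.7 3.0];
DS.output     = [1 1 2 2 2];
DS.outputName = {'classA', 'classB'};

% Number of samples in each class: count occurrences of each class index.
classNum  = length(DS.outputName);
classSize = accumarray(DS.output(:), 1, [classNum 1])';

% Range of each input: minimum and maximum along each row of DS.input.
inputMin = min(DS.input, [], 2);
inputMax = max(DS.input, [], 2);

for c = 1:classNum
    fprintf('%s: %d samples\n', DS.outputName{c}, classSize(c));
end
for d = 1:size(DS.input, 1)
    fprintf('input %d: [%g, %g]\n', d, inputMin(d), inputMax(d));
end
```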
For the usage of these functions, please refer to the subsequent sections of this chapter.
Data Clustering and Pattern Recognition (資料分群與樣式辨認)