2-1 Intro. to Datasets

In the subsequent chapters of this book, we use several datasets to demonstrate the concepts of data clustering and pattern recognition, and to test the performance of the related algorithms on real-world data. For pattern recognition, the following three datasets are commonly used:

Iris dataset
Wine dataset
Abalone dataset
These datasets are available from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. The website is maintained by the Donald Bren School of Information and Computer Sciences at the University of California, Irvine. It hosts numerous datasets together with their documentation, so researchers in pattern recognition and machine learning can use them to evaluate and compare their approaches.
To facilitate access to a dataset in MATLAB, we usually use a structure variable DS (short for "dataset") to store all the information of the dataset, as explained next:

DS: the structure variable for storing all information of a dataset
DS.input: the input part (also known as features) of the dataset
DS.output: the output part (also known as desired classes or ground truth) of the dataset. Each entry of this vector is an index into a class denoted by DS.outputName.
DS.dataName: a string representing the name of this dataset
DS.inputName: a cell string that holds the names of the inputs.
DS.outputName: a cell string that holds the names of the output classes. Note that each entry in DS.output is an index into this cell string. As a result, every value in DS.output should lie between 1 and length(DS.outputName), inclusive.
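As a minimal sketch of this layout, the following MATLAB code builds a small DS by hand and checks the index constraint on DS.output. The dataset name, the feature and class names, and all values are made up for illustration; they do not come from any of the datasets above.

```matlab
% Toy dataset: 2 features, 5 data points, 2 classes (all values are hypothetical).
DS = struct();
DS.dataName = 'toy';
DS.inputName = {'length', 'width'};     % one name per row of DS.input
DS.outputName = {'small', 'large'};     % one name per class
DS.input = [1.0 1.2 3.1 2.9 1.1;        % each column is one data point
            0.5 0.4 1.6 1.8 0.6];
DS.output = [1 1 2 2 1];                % class index of each column

% Every entry of DS.output must be an integer between 1 and length(DS.outputName).
isValid = all(DS.output == round(DS.output)) && ...
          min(DS.output) >= 1 && max(DS.output) <= length(DS.outputName);

% Because DS.output holds indices, class names are recovered by indexing.
classOfEachPoint = DS.outputName(DS.output);
fprintf('Class of data point 3: %s\n', classOfEachPoint{3});
```

Storing indices in DS.output and names in DS.outputName keeps the class vector numeric, while the readable class names remain one lookup away.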
We also provide a function, prData.m, that returns these three datasets. For instance, to read the commonly used Iris dataset, see the following example:
Example 1: irisDataSet01.m

From the above example, we know that:

DS.dataName is 'iris', representing the name of this dataset.
DS.inputName is a cell string of length 4, holding the names of the 4 input features.
DS.input is a 4x150 matrix in which each column is one feature vector, so the dataset contains 150 feature vectors of dimension 4.
DS.output is a 1x150 vector holding the output class of each of these 150 feature vectors. Its possible values are 1, 2, and 3, which are indices into the 3 class names stored in DS.outputName.
DS.outputName is a cell string of length 3, holding the names of the classes in this dataset.
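The relations among these fields can be checked programmatically. The sketch below does not load the real Iris data; it builds a stand-in DS with the same sizes, where the feature values are random placeholders, the feature and class names are hypothetical, and the equal split of the class indices is an assumption made only for this stand-in. With the toolbox installed, DS would come from prData.m instead.

```matlab
% Stand-in with the same layout as the Iris DS described above.
% Values and names are placeholders, not the real data.
DS = struct();
DS.dataName = 'iris';
DS.inputName = {'feature1', 'feature2', 'feature3', 'feature4'};
DS.outputName = {'class1', 'class2', 'class3'};
DS.input = rand(4, 150);                        % placeholder values only
DS.output = [ones(1, 50), 2*ones(1, 50), 3*ones(1, 50)];

[dim, dataNum] = size(DS.input);                % rows = features, columns = data points
classNum = length(DS.outputName);

% Consistency checks implied by the field descriptions.
assert(length(DS.inputName) == dim);            % one name per feature
assert(length(DS.output) == dataNum);           % one class index per data point
assert(all(ismember(DS.output, 1:classNum)));   % indices point into DS.outputName

fprintf('%s: %d features, %d data points, %d classes\n', ...
        DS.dataName, dim, dataNum, classNum);
```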
Based on the dataset DS, we have a set of functions for inspecting and visualizing the dataset from different perspectives:

classClassSize(DS): Compute the size of each class
dsProjPlot1(DS): Plot the classes w.r.t. the projected 1D features
dsProjPlot2(DS): Plot the classes w.r.t. the projected 2D features
dsProjPlot3(DS): Plot the classes w.r.t. the projected 3D features
dsFormatCheck(DS): Check the format of the dataset
dsNameAdd(DS): Add names to the inputs and outputs of a given dataset
dsRangePlot(DS): Plot the range of each input of a given dataset
dsDistPlot(DS): Plot the distributions of inputs over different classes of a given dataset
dsScatterPlot(DS): Scatter plot of a dataset in a 2D space
dsScatterPlot3(DS): Scatter plot of a dataset in a 3D space
dsFeatureVsIndexPlot(DS): Plot each feature against the data index
For the usage of these functions, please refer to the subsequent sections of this chapter.
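The toolbox functions themselves are not reproduced here. As a rough sketch of the kind of summary that a class-size count or a range plot is based on, the following code uses only built-in MATLAB functions on a small made-up DS. It is not the implementation of classClassSize or dsRangePlot, and all values and class names are hypothetical.

```matlab
% Small made-up dataset: 2 features, 6 data points, 3 classes.
DS = struct();
DS.input = [5.1 4.9 6.4 6.9 6.3 5.8;
            3.5 3.0 3.2 3.1 3.3 2.7];
DS.output = [1 1 1 2 2 3];
DS.outputName = {'a', 'b', 'c'};

% Number of data points in each class: count how often each index occurs.
classNum = length(DS.outputName);
classSize = accumarray(DS.output(:), 1, [classNum 1])';   % 1 x classNum

% Range of each input: minimum and maximum along each row (over all data points).
inputMin = min(DS.input, [], 2);
inputMax = max(DS.input, [], 2);

for i = 1:classNum
    fprintf('Class %s: %d data points\n', DS.outputName{i}, classSize(i));
end
```

Passing the size [classNum 1] to accumarray makes sure that a class with no data points still appears with a count of zero.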
Data Clustering and Pattern Recognition (資料分群與樣式辨認)