2-1 Intro. to Datasets

In the subsequent chapters of this book, we shall use several datasets to demonstrate the concepts of data clustering and pattern recognition, and to evaluate the performance of the related algorithms on real-world data. For pattern recognition, three datasets are commonly used, one of which is the Iris dataset.

In fact, these datasets are available from the UCI Machine Learning Repository at
http://www.ics.uci.edu/~mlearn/MLRepository.html
The website is maintained by the Donald Bren School of Information and Computer Sciences at the University of California, Irvine. It hosts numerous datasets together with their documentation, so researchers in pattern recognition and machine learning can use them to evaluate and compare their approaches.

To facilitate access to a dataset in MATLAB, we usually use a structure variable DS (short for dataset) to store all the information of that dataset. The fields of this structure are explained after the example in this section.
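As a minimal sketch of such a structure variable, the snippet below builds a DS by hand for a made-up two-class toy dataset. The field names mirror those returned by prData in the Iris example of this section. The data values and the names 'toy', 'x1', 'x2', 'classA', and 'classB' are hypothetical and serve only as illustration.

```matlab
% Hypothetical toy dataset: 2 features, 6 data instances, 2 classes.
% Field names follow the Iris example; all values are made up.
DS.dataName   = 'toy';                  % name of the dataset
DS.inputName  = {'x1', 'x2'};           % one name per input feature
DS.outputName = {'classA', 'classB'};   % one name per class

% Each column of DS.input is the feature vector of one data instance.
DS.input = [1.0 1.2 0.9 3.1 3.0 2.8;
            2.0 2.1 1.8 0.5 0.4 0.7];

% DS.output(i) is the class index (into DS.outputName) of column i.
DS.output = [1 1 1 2 2 2];

% Sanity checks on the layout of the structure
[featureNum, dataNum] = size(DS.input);
assert(featureNum == numel(DS.inputName));
assert(dataNum == numel(DS.output));
assert(max(DS.output) <= numel(DS.outputName));
fprintf('%s: %d features, %d instances\n', DS.dataName, featureNum, dataNum);
```

Keeping the features in rows and the data instances in columns matches the 4x150 layout of DS.input in the Iris example, so code written against the toy structure should carry over to the real datasets.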

We also provide a function prData.m that returns these three datasets. For instance, to read the commonly used Iris dataset, follow the next example:

Example 1: irisDataSet01.m

    DS = prData('iris')

    DS =
          dataName: 'iris'
         inputName: {'sepal length'  'sepal width'  'petal length'  'petal width'}
        outputName: {'Setosa'  'Versicolour'  'Virginica'}
             input: [4x150 double]
            output: [1x150 double]

From the above example, we know that:

  1. DS.dataName is 'iris', representing the name of this dataset.
  2. DS.inputName is a cell array of strings of length 4, holding the names of the 4 input features.
  3. DS.input is a 4x150 matrix in which each column is one data instance, so the dataset has 150 feature vectors, each of dimension 4.
  4. DS.output is a 1x150 vector holding the class of each of the 150 feature vectors. Its possible values are 1, 2, and 3, which index into the 3 class names listed in DS.outputName.
  5. DS.outputName is a cell array of strings of length 3, holding the names of the classes in this dataset.

Based on the dataset DS, we have a set of functions for visualizing the dataset from different perspectives. For the usage of these functions, and for descriptions of the commonly used datasets, please refer to the subsequent sections of this chapter.
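To make the roles of these fields concrete, the following sketch shows a few typical ways of querying a DS structure. It uses a small hand-made DS so that it runs without prData.m. The toy values and the variable names classSize, classNameOfEach, inputOfClass2, and meanOfClass2 are assumptions made for illustration, not part of the toolbox.

```matlab
% Hypothetical toy DS with the same fields as the Iris example
DS.dataName   = 'toy';
DS.inputName  = {'x1', 'x2'};
DS.outputName = {'classA', 'classB'};
DS.input      = [1.0 1.2 0.9 3.1 3.0;
                 2.0 2.1 1.8 0.5 0.4];
DS.output     = [1 1 1 2 2];

% Rows of DS.input are features, columns are data instances
[featureNum, dataNum] = size(DS.input);
classNum = numel(DS.outputName);
fprintf('%s: %d features, %d instances, %d classes\n', ...
    DS.dataName, featureNum, dataNum, classNum);

% Count the data instances in each class
classSize = zeros(1, classNum);
for k = 1:classNum
    classSize(k) = sum(DS.output == k);
    fprintf('%s: %d instances\n', DS.outputName{k}, classSize(k));
end

% Look up the class name of every data instance via its class index
classNameOfEach = DS.outputName(DS.output);   % 1 x dataNum cell array

% Extract the feature vectors of class 2 and compute their mean
inputOfClass2 = DS.input(:, DS.output == 2);
meanOfClass2  = mean(inputOfClass2, 2);       % featureNum x 1 column vector
```

Assuming prData.m is on the MATLAB path and returns the fields shown in Example 1, replacing the toy definition with DS = prData('iris') should let the remaining lines run unchanged on the Iris dataset.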

Data Clustering and Pattern Recognition (資料分群與樣式辨認)