2-1 Intro. to Datasets


In the subsequent chapters of this book, we shall use several datasets to demonstrate the concepts of data clustering and pattern recognition. For pattern recognition, the following three datasets are commonly used:

These datasets are available from the UCI Machine Learning Repository at
http://www.ics.uci.edu/~mlearn/MLRepository.html
The website is maintained by the Donald Bren School of Information and Computer Sciences at the University of California, Irvine. It hosts numerous datasets together with their documentation, so researchers in pattern recognition and machine learning can use them to evaluate and compare their approaches.

To facilitate access to a dataset in MATLAB, we usually use a structure variable DS (dataset) to store all the information of the dataset, as explained next:

We also have a function prData.m that returns these three datasets. For instance, to read the commonly used IRIS dataset, follow the example below:

Example 1: irisDataSet01.m

    DS = prData('iris')

    DS = 
          dataName: 'iris'
         inputName: {'sepal length'  'sepal width'  'petal length'  'petal width'}
        outputName: {'Setosa'  'Versicolour'  'Virginica'}
             input: [4x150 double]
            output: [1x150 double]

From the above example, we know that:

  1. DS.dataName is 'iris', representing the name of this dataset.
  2. DS.inputName is a cell string of length 4, holding the names of the 4 input features.
  3. DS.input is a 4x150 matrix: each of the 150 columns is one feature vector of dimension 4.
  4. DS.output is a 1x150 vector holding the output class of each of these 150 feature vectors. Possible values of DS.output are 1, 2, and 3, each being an index into the 3 classes whose names are listed in DS.outputName.
  5. DS.outputName is a cell string of length 3, holding the names of the classes in this dataset.
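
The field layout described above can be tried out without any toolbox by building a small structure by hand. The following is only a minimal sketch: the toy data values, the name 'toy', and the variable names dim, dataNum, classNum, classSize, and thirdName are assumptions made for illustration, not part of prData.m or of the IRIS dataset.

```matlab
% Hand-made stand-in for DS (hypothetical toy values, NOT the IRIS data):
% 2 features, 6 samples, 3 classes.
DS.dataName   = 'toy';
DS.inputName  = {'feature 1', 'feature 2'};
DS.outputName = {'class A', 'class B', 'class C'};
DS.input      = [1 2 3 4 5 6;      % row = feature, column = sample
                 6 5 4 3 2 1];
DS.output     = [1 1 2 2 3 3];     % class index of each column of DS.input

% Feature dimension and number of samples come from the size of DS.input.
[dim, dataNum] = size(DS.input);

% Number of classes is the length of DS.outputName.
classNum = length(DS.outputName);

% Count how many samples fall into each class.
classSize = zeros(1, classNum);
for i = 1:classNum
    classSize(i) = sum(DS.output == i);
end

% DS.output(k) indexes into DS.outputName, giving the class name of sample k.
thirdName = DS.outputName{DS.output(3)};

fprintf('%s: %d features, %d samples, %d classes\n', ...
    DS.dataName, dim, dataNum, classNum);
fprintf('Sample 3 belongs to %s\n', thirdName);
```

The same statements should work unchanged on a DS returned by prData('iris'), provided it follows the field layout listed above.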

Based on the dataset DS, we have a set of functions for visualizing the dataset from different perspectives. For the usage of these functions, please refer to the subsequent sections of this chapter.
Data Clustering and Pattern Recognition (資料分群與樣式辨認)