2-2 Iris Dataset


The Iris dataset is one of the earliest and most commonly used datasets in the pattern recognition literature. It contains 150 instances of iris flowers, measured by the botanist Edgar Anderson and made famous by Fisher's 1936 paper. The instances are divided evenly into 3 classes (Iris Setosa, Iris Versicolor, and Iris Virginica), and each instance is described by 4 features: sepal length, sepal width, petal length, and petal width.

Typical examples of these 3 iris species are shown next:
[Figure: Iris Setosa, Iris Versicolor, Iris Virginica]
Numerous technical papers use the Iris dataset. Here is a partial list of representative works:
  1. Fisher, R.A. (1936) "The Use of Multiple Measurements in Taxonomic Problems". Annals of Eugenics, 7, Part II, 179-188. Also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
  2. Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
  3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.
  4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.

In this dataset, Iris Setosa is easy to distinguish from the other two classes, while those two classes partially overlap and are harder to separate.

We can count the number of instances in each class, as follows:

Example 1: irisClassDataCount01.m

    ds=prData('iris');
    [classSize, classLabel]=dsClassSize(ds, 1);

Output:

    4 features
    150 instances
    3 classes
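For readers without the toolbox, the counting step can be sketched in base MATLAB. This is only an illustration and not the implementation of dsClassSize: the vector `output` and the cell array `outputName` are hand-built stand-ins for ds.output and ds.outputName, assuming the standard layout of 50 instances per class stored class by class.

```matlab
% Sketch only: base MATLAB/Octave, no toolbox functions.
% Assumption: 150 instances stored class by class, 50 per class.
output = [ones(1,50), 2*ones(1,50), 3*ones(1,50)];    % class index of each instance
outputName = {'Setosa', 'Versicolor', 'Virginica'};   % assumed class names and order

% accumarray adds 1 to the bin of every class index, giving per-class counts.
classSize = accumarray(output(:), 1)';

for k = 1:numel(classSize)
    fprintf('%s: %d instances\n', outputName{k}, classSize(k));
end
```

The same call works for any label vector of positive integers, so it can be applied to ds.output directly once the dataset is loaded.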

We can plot the distribution of each feature within each class, as follows:

Example 2: irisClassDist01.m

    ds=prData('iris');
    dsDistPlot(ds);

We can plot the class labels against each individual feature, as follows:

Example 3: irisProjPlot1.m

    ds = prData('iris');
    dsProjPlot1(ds);

We can also project the dataset onto a 2D plane and use scatter plots to observe how the data is distributed, as follows:

Example 5: irisProjPlot2.m

    ds = prData('iris');
    dsProjPlot2(ds);

From the above plot, it seems that the projection onto features 3 and 4 (petal length and petal width) separates the classes best. To identify a point in the scatter plot, we first need to attach an annotation to each data point. Then, when we move the mouse near a specific data point, the corresponding annotation appears. The next example demonstrates how to achieve this. Move the cursor to any point to see its annotation.

Example 6: irisPlot2dWithAnnotation.m

    ds=prData('iris');
    ds.input=ds.input(3:4, :);
    for i=1:length(ds.output)
        ds.annotation{i}=sprintf('Data index=%d\nPosition=%s\nClass=%s', i, mat2str(ds.input(:,i)), ds.outputName{ds.output(i)});
    end
    opt.showAnnotation=1;
    opt.showLegend=1;
    dsScatterPlot(ds, opt);
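To see what text each annotation holds, the string construction can be tried on its own in base MATLAB. This is a minimal sketch: `input`, `output`, and `outputName` are small illustrative stand-ins for the fields of ds, and the class names are assumed rather than taken from the toolbox.

```matlab
% Sketch only: base MATLAB/Octave, no toolbox functions.
% One column per flower: row 1 is petal length, row 2 is petal width.
input = [1.4 4.7 6.0; 0.2 1.4 2.5];
output = [1 2 3];                                     % class index of each flower
outputName = {'Setosa', 'Versicolor', 'Virginica'};   % assumed class names

annotation = cell(1, length(output));
for i = 1:length(output)
    % mat2str turns the feature column into text such as '[1.4;0.2]'.
    annotation{i} = sprintf('Data index=%d\nPosition=%s\nClass=%s', ...
        i, mat2str(input(:,i)), outputName{output(i)});
end
disp(annotation{1});
```

Each annotation is a three-line string, so the tooltip shows the index, the position, and the class name on separate lines.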

In fact, the areas of the sepals and petals are effective features for classifying the three species of iris. We can approximate the area of a sepal by the product of its width and length, and likewise for a petal. We can then draw a scatter plot based on these two area features, as follows:

Example 7: irisPlot2dfeaCombine.m

    ds=prData('iris');
    ds.input=[ds.input(1,:).*ds.input(2,:); ds.input(3,:).*ds.input(4,:)];
    ds.inputName='';
    ds.inputName{1}='Area of sepal';
    ds.inputName{2}='Area of petal';
    dsScatterPlot(ds);
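The area computation can be checked by hand on a few flowers. The sketch below uses base MATLAB only and nine well-known rows of the Iris data (the first three of each species), laid out one instance per column as in ds.input. The variable names are illustrative, and nine rows are of course too few to draw conclusions about the full dataset.

```matlab
% Sketch only: base MATLAB/Octave, no toolbox functions.
% Rows: sepal length, sepal width, petal length, petal width (cm).
% Columns 1-3: Setosa, 4-6: Versicolor, 7-9: Virginica.
input = [5.1 4.9 4.7 7.0 6.4 6.9 6.3 5.8 7.1;
         3.5 3.0 3.2 3.2 3.2 3.1 3.3 2.7 3.0;
         1.4 1.4 1.3 4.7 4.5 4.9 6.0 5.1 5.9;
         0.2 0.2 0.2 1.4 1.5 1.5 2.5 1.9 2.1];
output = [1 1 1 2 2 2 3 3 3];

% Area approximated as length times width, element by element.
sepalArea = input(1,:) .* input(2,:);
petalArea = input(3,:) .* input(4,:);

for k = 1:3
    idx = (output == k);
    fprintf('Class %d: petal area from %.2f to %.2f\n', ...
        k, min(petalArea(idx)), max(petalArea(idx)));
end
```

On these nine rows the petal areas of the three classes fall into disjoint ranges, which is consistent with the petal area being a useful feature.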

From the plot, Setosa is clearly separated from the other two classes, while the other two classes partially overlap at their boundary.

We can also project the dataset onto a 3D space and use scatter plots to observe how the data is distributed, as follows:

Example 8: irisProjPlot3.m

    ds = prData('iris');
    dsProjPlot3(ds);

Basically, we can only visualize data points scattered in a 2D plane or a 3D space. To visualize data points in a 4D space, we can treat the fourth feature as time, so that the 4D scatter plot becomes an animation of a 3D scatter plot that changes over time.
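One possible reading of this idea is sketched below in base MATLAB: the fourth feature is stepped through its distinct values, and at each step only the points with that value are drawn in the space of the first three features. This is an assumption about how such an animation could be built, not a toolbox function. The data are nine well-known Iris rows, and the axis limits and pause length are arbitrary choices.

```matlab
% Sketch only: base MATLAB/Octave, no toolbox functions.
% Rows: sepal length, sepal width, petal length, petal width (cm).
input = [5.1 4.9 4.7 7.0 6.4 6.9 6.3 5.8 7.1;
         3.5 3.0 3.2 3.2 3.2 3.1 3.3 2.7 3.0;
         1.4 1.4 1.3 4.7 4.5 4.9 6.0 5.1 5.9;
         0.2 0.2 0.2 1.4 1.5 1.5 2.5 1.9 2.1];
output = [1 1 1 2 2 2 3 3 3];

t = input(4,:);            % the fourth feature plays the role of time
timeSteps = unique(t);     % distinct "time" values in ascending order
colors = 'rgb';            % one color per class

figure;
for k = 1:numel(timeSteps)
    clf; hold on;
    for c = 1:3
        idx = find(t == timeSteps(k) & output == c);   % points in this time slice
        if ~isempty(idx)
            plot3(input(1,idx), input(2,idx), input(3,idx), [colors(c) 'o']);
        end
    end
    hold off; view(3); grid on;
    axis([4 8 2 4 1 7]);   % fixed limits so the frames are comparable
    xlabel('Sepal length'); ylabel('Sepal width'); zlabel('Petal length');
    title(sprintf('Petal width = %.1f', timeSteps(k)));
    drawnow; pause(0.5);
end
```

With the full dataset the same loop applies unchanged, since the measurements take only a limited number of distinct values.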


Data Clustering and Pattern Recognition (資料分群與樣式辨認)