6-2 Methods for Recognition Rate Estimate (辨識率預估)



This section introduces methods for estimating the recognition rate (or error rate) of a classifier, regardless of which pattern recognition method was used to construct it.

If we use the same dataset for both training and testing, the obtained recognition rate is referred to as the inside-test recognition rate or the resubstitution recognition rate. The corresponding error rate is known as the inside-test error, the resubstitution error, or the apparent error rate. Although this approach uses every data entry for classifier design, the result is usually overly optimistic, because the classifier is tested on the very data it was trained on (the player is also the referee), so the estimate is not objective. In particular, if we use 1-NNR as the classifier, the inside-test recognition rate is always 100% (as long as no two identical data points carry different labels).
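The 1-NNR remark can be checked with a small experiment. The following is a minimal Python sketch rather than the toolbox code used elsewhere in this section; the toy data and the helper names `nearest_neighbor_predict` and `recognition_rate` are assumptions made for illustration only.

```python
import random

def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training point closest to `query` (1-NNR)."""
    best_index = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best_index]

def recognition_rate(train_x, train_y, test_x, test_y):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_x)

# Hypothetical toy data: two heavily overlapping 2-D classes, 50 points each.
rng = random.Random(0)
points = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0.0, 1.0) for _ in range(50)]
labels = [c for c in (0, 1) for _ in range(50)]

# Inside test: the same data is used for training and for testing.
inside_rr = recognition_rate(points, labels, points, labels)
print(f"Inside-test recognition rate of 1-NNR: {inside_rr:.0%}")  # prints 100%
```

Even though the two classes overlap heavily, every point finds itself as its own nearest neighbor at distance zero, so the inside test cannot reveal any error.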

Though the inside-test recognition rate is not objective, it can serve as an upper bound on the true recognition rate (equivalently, the inside-test error rate can be viewed as a lower bound on the true error rate). In general, we use the inside-test recognition rate as a first step in examining a classifier. If the inside-test recognition rate is already low, there are two possible reasons:

However, this is only a basic check. A high inside-test recognition rate does not mean that we have obtained a reliable classifier, or that the classifier and the data are correct. Usually we need to prepare a set of "unseen" data to test the classifier and obtain the outside-test error, as explained next.

Once a classifier is constructed, it will usually face unseen data in its application. It is therefore better to prepare a set of "unseen" data for evaluating the recognition rate of the classifier. In practice, we usually divide the available dataset into two disjoint parts: a training set (also called the design set, DS) and a test set (TS). The training set is used for designing the classifier, while the test set is used for evaluating its recognition rate. The obtained recognition rate is referred to as the outside-test recognition rate or the holdout recognition rate (the corresponding error rate is the outside-test error or the holdout error), with the following characteristics:
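The holdout procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the 70/30 split, and the helper names `predict_1nn`, `recognition_rate`, and `holdout_split` are hypothetical and are not part of the toolbox used in this section.

```python
import random

def predict_1nn(train, query):
    """Return the label of the training sample nearest to `query` (1-NNR)."""
    _, label = min(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], query)),
    )
    return label

def recognition_rate(train, test):
    """Fraction of test samples that the 1-NNR built on `train` labels correctly."""
    return sum(predict_1nn(train, x) == y for x, y in test) / len(test)

def holdout_split(data, test_fraction, seed):
    """Shuffle a copy of `data` and cut it into disjoint training and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (features, label) pairs from two overlapping 2-D classes.
rng = random.Random(0)
data = [
    ((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
    for label, center in ((0, 0.0), (1, 1.5))
    for _ in range(60)
]

train_set, test_set = holdout_split(data, test_fraction=0.3, seed=1)
inside_rr = recognition_rate(train_set, train_set)   # same data in both roles
outside_rr = recognition_rate(train_set, test_set)   # unseen data only
print(f"inside-test RR = {inside_rr:.2%}, outside-test RR = {outside_rr:.2%}")
```

The split is made once, before any training, so that no test sample can influence the classifier being evaluated.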

We can extend the concept of the outside test to obtain the so-called two-fold cross validation, also known as the two-way outside-test recognition rate (or two-way outside-test error). Namely, we divide the dataset into two parts, A and B, of equal size. In the first run, we use part A as the training set and part B as the test set. In the second run, we reverse the roles of parts A and B. The overall recognition rate is the average of these two outside-test recognition rates.
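The two runs can be written out explicitly as follows. This is a minimal Python sketch on hypothetical toy data, with 1-NNR chosen arbitrarily as the classifier; the names `predict_1nn`, `recognition_rate`, `part_a`, and `part_b` are illustrative assumptions.

```python
import random

def predict_1nn(train, query):
    """Return the label of the training sample nearest to `query` (1-NNR)."""
    _, label = min(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], query)),
    )
    return label

def recognition_rate(train, test):
    """Fraction of test samples that the 1-NNR built on `train` labels correctly."""
    return sum(predict_1nn(train, x) == y for x, y in test) / len(test)

# Hypothetical toy data: (features, label) pairs, 40 samples per class.
rng = random.Random(1)
data = [
    ((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
    for label, center in ((0, 0.0), (1, 2.0))
    for _ in range(40)
]
rng.shuffle(data)  # mix the classes before cutting the data in half

half = len(data) // 2
part_a, part_b = data[:half], data[half:]

rr_ab = recognition_rate(part_a, part_b)  # run 1: train on A, test on B
rr_ba = recognition_rate(part_b, part_a)  # run 2: train on B, test on A
two_fold_rr = (rr_ab + rr_ba) / 2         # plain average, since |A| == |B|
print(f"A->B: {rr_ab:.2%}, B->A: {rr_ba:.2%}, two-fold RR: {two_fold_rr:.2%}")
```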

In two-fold cross validation, the dataset is divided into two equal-size parts, which leads to slightly lower outside-test recognition rates, since each classifier can use only 50% of the dataset for training. To estimate the recognition rate better, we can use m-fold cross validation, in which the dataset S is divided into m sets of about equal size, S1, S2, ..., Sm, with the following characteristics:

Then we estimate the recognition rate according to the following steps:

  1. Use Si as the test set and all the other data, S-Si, as the training set to design a classifier. Test the classifier on Si to obtain the outside-test recognition rate.
  2. Repeat the above step for each Si, i = 1 to m, and compute the overall average outside-test recognition rate.

The procedure above is called m-fold cross validation, and the resulting error rate is called the rotation error rate.
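The two steps above can be sketched generically in Python. This is an illustration under stated assumptions, not the toolbox implementation: the data is synthetic, 1-NNR stands in for an arbitrary classifier, and the helper names `m_fold_partition` and `cross_validate` are hypothetical. The overall rate is averaged with fold sizes as weights, which matters only when the folds are not exactly equal in size.

```python
import random

def predict_1nn(train, query):
    """Return the label of the training sample nearest to `query` (1-NNR)."""
    _, label = min(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], query)),
    )
    return label

def m_fold_partition(data, m, seed=0):
    """Shuffle a copy of `data` and deal it into m disjoint folds of near-equal size."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]

def cross_validate(data, m, seed=0):
    """Return (overall rate, per-fold rates, per-fold sizes) for m-fold CV."""
    folds = m_fold_partition(data, m, seed)
    fold_rates, fold_sizes = [], []
    for i, test_fold in enumerate(folds):
        # Step 1: Si is the test set, everything else (S-Si) is the training set.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        correct = sum(predict_1nn(train, x) == y for x, y in test_fold)
        fold_rates.append(correct / len(test_fold))
        fold_sizes.append(len(test_fold))
    # Step 2: average the m outside-test rates, weighted by fold size.
    overall = sum(r * n for r, n in zip(fold_rates, fold_sizes)) / sum(fold_sizes)
    return overall, fold_rates, fold_sizes

# Hypothetical toy data: (features, label) pairs, 50 samples per class.
rng = random.Random(0)
data = [
    ((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
    for label, center in ((0, 0.0), (1, 2.0))
    for _ in range(50)
]

overall_rr, fold_rates, fold_sizes = cross_validate(data, m=5)
print("per-fold RR:", [f"{r:.2%}" for r in fold_rates])
print(f"5-fold cross-validation RR = {overall_rr:.2%}")
```

Setting `m=2` in this sketch gives the two-fold case, and setting `m=len(data)` makes every fold a single entry.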

The following example demonstrates the use of 5-fold cross validation on the IRIS dataset.

Example 1: rreViaCv01.m

dataSet=prData('iris');
m=5;
cvOpt=cvDataGen('defaultOpt');
cvOpt.foldNum=m;
cvOpt.cvDataType='full';
cvData=cvDataGen(dataSet, cvOpt);
foldNum=length(cvData);    % Actual no. of folds
for i=1:foldNum
    [qcPrm, logProb1, tRr(i)]=qcTrain(cvData(i).TS);
    tSize(i)=length(cvData(i).TS.output);
    [computedClass, logProb2, vRr(i)]=qcEval(cvData(i).VS, qcPrm);
    vSize(i)=length(cvData(i).VS.output);
end
tRrAll=dot(tRr, tSize)/sum(tSize);
vRrAll=dot(vRr, vSize)/sum(vSize);
plot(1:foldNum, tRr, '.-', 1:foldNum, vRr, '.-');
xlabel('Folds'); ylabel('Recog. rate (%)');
legend('Training RR', 'Validating RR', 'location', 'northOutside', 'orientation', 'horizontal');
fprintf('Training RR=%.2f%%, Validating RR=%.2f%%\n', tRrAll*100, vRrAll*100);

Output:

Training RR=97.83%, Validating RR=97.33%

Since this type of performance evaluation via cross validation is used often, we have created a function for this purpose, as shown in the next example, where 10-fold cross validation is applied to the IRIS dataset:

Example 2: perfCv4qc01.m

DS=prData('iris');
showPlot=1;
foldNum=10;
classifier='qc';
[vRrAll, tRrAll]=perfCv(DS, classifier, [], foldNum, showPlot);
fprintf('Training RR=%.2f%%, Validating RR=%.2f%%\n', tRrAll*100, vRrAll*100);

Output:

Training RR=98.07%, Validating RR=98.00%

A larger m requires more computation, since m classifiers must be constructed. In practice, we select the value of m based on the size of the dataset and the time needed to construct a specific classifier. In particular,

The leave-one-out method, whose procedure is also known as the jackknife procedure, is one of the most commonly used methods for error rate estimation in pattern recognition. It is the most objective method for recognition rate estimation, since almost all the data (all but one entry) is used for constructing each classifier, and no test entry takes part in designing the classifier that is tested on it. It involves the following steps:

  1. Use xi (the i-th entry in the dataset) as the test set and all the other data as the training set to design a classifier. Test the classifier on xi to obtain the outside-test recognition rate (which is either 0% or 100%).
  2. Repeat the above step for each xi, i = 1 to n, and compute the overall average outside-test recognition rate.
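The leave-one-out steps above can be sketched as follows. This is a minimal Python illustration on hypothetical toy data, using 1-NNR as the classifier; the function name `leave_one_out` is an assumption and is not the toolbox's knncLoo.

```python
import random

def predict_1nn(train, query):
    """Return the label of the training sample nearest to `query` (1-NNR)."""
    _, label = min(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], query)),
    )
    return label

def leave_one_out(data):
    """Return (LOO recognition rate, list of per-entry outcomes, each 0 or 1)."""
    outcomes = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # all entries except the i-th
        outcomes.append(1 if predict_1nn(train, x) == y else 0)
    return sum(outcomes) / len(outcomes), outcomes

# Hypothetical toy data: (features, label) pairs, 30 samples per class.
rng = random.Random(0)
data = [
    ((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
    for label, center in ((0, 0.0), (1, 2.0))
    for _ in range(30)
]

loo_rr, outcomes = leave_one_out(data)
misclassified = [i for i, hit in enumerate(outcomes) if hit == 0]
print(f"LOO recognition rate = {loo_rr:.2%}")
print("indices of misclassified entries:", misclassified)
```

Each pass designs a new classifier from n-1 entries, so the cost grows with n times the cost of one design, which is why a cheap classifier such as 1-NNR is a convenient choice here.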

The obtained recognition rate is known as the leave-one-out (LOO for short) recognition rate. The leave-one-out method has the following characteristics:

Because the leave-one-out method is computationally expensive, we usually apply it with a simple classifier such as KNNC, and use the result to judge whether the features of the dataset are discriminative enough. In the following example, we generate a random dataset of four classes and use the function knncLoo.m to find the LOO recognition rate based on 1-NNR. Each misclassified data point is marked with a cross for easy visual inspection, as follows:

Example 3: knncLoo01.m

DS=prData('random2');
dsScatterPlot(DS);
knncPrm.k=1;
plotOpt=1;
[recogRate, computed, nearestIndex]=knncLoo(DS, knncPrm, plotOpt);

You can change the value of knncPrm.k in the example above to obtain the LOO recognition rates of KNNC for various values of k.
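To illustrate how the LOO recognition rate can be compared across several values of k, here is a minimal Python sketch. It does not call knncLoo; the toy data, the majority-vote rule, and the helper names `predict_knn` and `loo_rate` are assumptions made for illustration.

```python
import random
from collections import Counter

def predict_knn(train, query, k):
    """Majority vote among the k training samples nearest to `query`.

    Ties in the vote go to the label encountered first among the sorted neighbors.
    """
    neighbors = sorted(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def loo_rate(data, k):
    """Leave-one-out recognition rate of the k-nearest-neighbor classifier."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out the i-th entry
        hits += predict_knn(train, x, k) == y
    return hits / len(data)

# Hypothetical toy data: (features, label) pairs, 30 samples per class.
rng = random.Random(0)
data = [
    ((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
    for label, center in ((0, 0.0), (1, 2.0))
    for _ in range(30)
]

rates = {k: loo_rate(data, k) for k in (1, 3, 5)}
for k, rate in rates.items():
    print(f"k = {k}: LOO recognition rate = {rate:.2%}")
```

Odd values of k are used so that a two-class vote cannot end in a tie.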


Data Clustering and Pattern Recognition (資料分群與樣式辨認)