In this section, we introduce the estimation of recognition rates (or error rates) for a classifier constructed by any pattern recognition method.
If we use the same dataset for both training and test, the obtained recognition rate is referred to as the inside-test (or resubstitution) recognition rate; the corresponding error rate is also known as the apparent error rate. Although this approach makes full use of every data entry for classifier design, the result is usually overly optimistic since the test is based on the very same data used for training. In particular, if we use 1-NNR as our classifier, the inside-test recognition rate will always be 100%.
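The 1-NNR case can be verified with a short sketch. The following Python/NumPy snippet (the `knn_predict` helper and the toy two-class dataset are our own illustrative assumptions, not from the text) shows that the inside-test recognition rate of 1-NNR is always 100%, because every point is its own nearest neighbor:

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=1):
    """Classify each row of test_x by majority vote among its k
    nearest training samples (Euclidean distance)."""
    preds = []
    for q in test_x:
        dist = np.linalg.norm(train_x - q, axis=1)
        nearest = train_y[np.argsort(dist)[:k]]
        preds.append(np.argmax(np.bincount(nearest)))
    return np.array(preds)

# Toy two-class dataset (assumed for illustration)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Inside test: train and test on the same data
pred = knn_predict(x, y, x, k=1)
inside_rate = np.mean(pred == y)
print(inside_rate)  # 1.0 -- each point is its own nearest neighbor
```

With k = 1 the nearest neighbor of any training point is the point itself (distance zero), so the inside-test rate is exactly 1.0 whenever no two identical points carry different labels.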
Though the inside-test recognition rate is not objective, it can serve as an upper bound of the true recognition rate (equivalently, the inside-test error rate is a lower bound of the true error rate). In general, we use the inside-test recognition rate as a first step in examining a classifier. If the inside-test recognition rate is already low, there are two possible reasons:
- The design method for the classifier is not good enough.
- The features of the training set do not have good discriminative power.
However, even if the inside-test recognition rate is high, it does not mean we have reached a reliable classifier. Usually we need to prepare a set of "unseen" data to test the classifier, as explained next.
After a classifier is constructed, it will usually face unseen data in further application. Therefore it is better to prepare a set of "unseen" data for evaluating the recognition rate of the classifier. In practice, we usually divide the available dataset into two disjoint parts: a training (design) set and a test set. The training set is used for designing the classifier, while the test set is used for evaluating its recognition rate. The obtained recognition rate is referred to as the outside-test recognition rate or the holdout recognition rate, with the following characteristics:
- Since the test set is not used in designing the classifier, the obtained recognition rate is more objective.
- Since the available dataset is of limited size in the real world, setting aside part of it for testing makes the outside-test recognition rate a little lower than the true recognition rate.
- The complexity of a classifier is defined as the number of its free parameters. In general, the inside-test recognition rate goes up with the complexity of the classifier. The outside-test recognition rate, on the other hand, goes up with the complexity initially, but then goes down afterwards due to over-training. Usually we select the number of free parameters that optimizes the outside-test recognition rate.
- In general, the larger the training set, the more accurate the classifier. Hence, after the complexity of the classifier is determined, we can use the whole dataset for training. The true recognition rate of the classifier thus constructed can be expected to be a little higher than the optimal outside-test recognition rate mentioned above.
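The holdout procedure can be sketched as follows (a Python/NumPy illustration; the 30% test ratio, the toy dataset, and the `nn1_predict` helper are our own assumptions, not from the text):

```python
import numpy as np

# Assumed toy two-class dataset
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle the indices, then hold out 30% as the test set
idx = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test_idx, train_idx = idx[:n_test], idx[n_test:]

def nn1_predict(tx, ty, qx):
    # 1-NNR: each query takes the label of its nearest training sample
    return np.array([ty[np.argmin(np.linalg.norm(tx - q, axis=1))] for q in qx])

# Design on the training set, evaluate on the held-out test set
pred = nn1_predict(x[train_idx], y[train_idx], x[test_idx])
outside_rate = np.mean(pred == y[test_idx])
print(outside_rate)
```

Because the test set never participates in the design, `outside_rate` is an unbiased (if slightly pessimistic) estimate of the true recognition rate.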
We can extend the concept of the outside test to obtain the so-called two-fold cross validation, or two-way outside-test recognition rate. Namely, we divide the dataset into parts A and B of equal size. In the first run, we use part A as the training set and part B as the test set. In the second run, we reverse the roles of A and B. The overall recognition rate is the average of the two outside-test recognition rates.
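The two runs above can be sketched as follows (again a Python/NumPy illustration with an assumed toy dataset and our own `nn1_predict` helper):

```python
import numpy as np

# Assumed toy two-class dataset
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

def nn1_predict(tx, ty, qx):
    # 1-NNR: each query takes the label of its nearest training sample
    return np.array([ty[np.argmin(np.linalg.norm(tx - q, axis=1))] for q in qx])

# Split the shuffled indices into two equal halves A and B
idx = rng.permutation(len(x))
a, b = idx[:len(x) // 2], idx[len(x) // 2:]

# Run 1: train on A, test on B; run 2: roles reversed
rate_ab = np.mean(nn1_predict(x[a], y[a], x[b]) == y[b])
rate_ba = np.mean(nn1_predict(x[b], y[b], x[a]) == y[a])
two_fold_rate = (rate_ab + rate_ba) / 2
print(two_fold_rate)
```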
In two-fold cross validation, the dataset is divided into two equal-size parts, which leads to slightly lower outside-test recognition rates since each classifier can only use 50% of the dataset. To estimate the recognition rate better, we can adopt m-fold cross validation, in which the dataset S is divided into m subsets of about equal size, S1, S2, ..., Sm, with the following characteristics:
- S = S1∪S2∪...∪Sm
- |S1| = |S2| = ... = |Sm|
- Si∩Sj = φ (empty set) whenever i≠j.
- The class distribution of each Si, i = 1 to m, should be as close as possible to that of the original dataset S.
Then we estimate the recognition rate according to the following steps:
- Use Si as the test set and all the other data S-Si as the training set to design a classifier. Test the classifier using Si to obtain an outside-test recognition rate.
- Repeat the above step for each Si, i = 1 to m. Compute the overall average outside-test recognition rate.

The above procedure is called m-fold cross validation; the resulting error rate is also known as the rotation error rate.
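The m-fold procedure can be sketched as follows (a Python/NumPy illustration; m = 5, the toy dataset, and the `nn1_predict` helper are our own assumptions, and for brevity this plain split does not enforce the class-distribution condition, which a stratified split would):

```python
import numpy as np

# Assumed toy two-class dataset
rng = np.random.default_rng(3)
x = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

def nn1_predict(tx, ty, qx):
    # 1-NNR: each query takes the label of its nearest training sample
    return np.array([ty[np.argmin(np.linalg.norm(tx - q, axis=1))] for q in qx])

m = 5
idx = rng.permutation(len(x))
folds = np.array_split(idx, m)  # S1, ..., Sm, of about equal size

rates = []
for i in range(m):
    test_idx = folds[i]  # Si serves as the test set
    train_idx = np.concatenate([folds[j] for j in range(m) if j != i])  # S - Si
    pred = nn1_predict(x[train_idx], y[train_idx], x[test_idx])
    rates.append(np.mean(pred == y[test_idx]))

mfold_rate = np.mean(rates)  # overall average outside-test recognition rate
print(mfold_rate)
```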
The following example demonstrates the use of 5-fold cross validation on the IRIS dataset.
Since this type of performance evaluation via cross validation is used often, we have created a function to serve this purpose, as shown in the next example, where 10-fold cross validation is applied to the IRIS dataset:
A larger m requires more computation, since m classifiers must be constructed. In practice, we select the value of m based on the size of the dataset and the time needed to construct a specific classifier. In particular,
- When m is equal to 2, we have the simple case of two-fold cross validation.
- When m is equal to n (the size of the dataset), we have the leave-one-out method explained next.
The leave-one-out method is also known as the jackknife procedure. It is the most objective method for estimating the recognition rate, since almost all the data (every entry except one) is used for constructing each classifier, while the single held-out entry never participates in the design. It involves the following steps:
- Use xi (the i-th entry in the dataset) as the test set, and all the other data as the training set to design a classifier. Test the classifier using xi to obtain an outside-test recognition rate (which is either 0% or 100%).
- Repeat the above step for each xi, i = 1 to n. Compute the overall average outside-test recognition rate.
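The steps above can be sketched as follows (a Python/NumPy illustration of leave-one-out with 1-NNR; the toy dataset is our own assumption):

```python
import numpy as np

# Assumed toy two-class dataset
rng = np.random.default_rng(4)
x = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(3, 1, (15, 2))])
y = np.array([0] * 15 + [1] * 15)

n = len(x)
correct = 0
for i in range(n):
    # Leave out entry i; the remaining n-1 entries form the training set
    mask = np.arange(n) != i
    # 1-NNR: label of the nearest remaining training sample
    dist = np.linalg.norm(x[mask] - x[i], axis=1)
    pred = y[mask][np.argmin(dist)]
    correct += (pred == y[i])  # per-entry result is either 0 or 1

loo_rate = correct / n  # overall LOO recognition rate
print(loo_rate)
```

Each iteration builds (implicitly) a classifier from n-1 entries, so the loop constructs n classifiers in total, which is why LOO is costly for design-intensive classifiers.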
The obtained recognition rate is known as the leave-one-out (LOO for short) recognition rate. The leave-one-out method has the following characteristics:
- Each classifier uses almost the whole dataset (every entry except one), so the resulting outside-test recognition rate closely approaches the true recognition rate.
- For a dataset of n entries, n classifiers must be constructed. For classifiers that require massive computation in the design stage (such as artificial neural networks or Gaussian mixture models), the leave-one-out method is therefore impractical even for a moderate-size dataset.
- Since the leave-one-out method requires much more computation, usually we choose a simple classifier such as KNNC for estimating the LOO recognition rate. The obtained LOO recognition rate gives a rough idea of the discriminating power of the features in the dataset.
In the following example, we generate a random dataset of four classes and use the function knncLoo.m to find the LOO recognition rate based on 1-NNR. Each misclassified data point is labeled with a cross for easy visual inspection, as follows:
You can change the value of param.k to obtain the LOO recognition rates of KNNC with various values of k.
Data Clustering and Pattern Recognition (資料分群與樣式辨認)