Once we have constructed a classifier using a certain pattern recognition method, we need to evaluate its performance objectively. The performance evaluation of a classifier usually involves two factors:
- Recognition rate: The higher, the better. The recognition rate is the probability that a sample is classified correctly. Some people prefer to report the error rate, which is the probability of misclassification and equals 1 minus the recognition rate, so the two always sum to 100%.
- Computation load: The lower, the better. There are two types of computation load:
  - Computation load at the design stage
  - Computation load at the application stage
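The relation between recognition rate and error rate can be checked with a few lines of Python. This is only a minimal sketch: the function name `recognition_rate` and the label lists are made up for illustration and do not come from any particular toolbox.

```python
def recognition_rate(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for eight samples (invented for this example).
true_labels = [0, 0, 1, 1, 1, 0, 1, 0]
predicted_labels = [0, 1, 1, 1, 0, 0, 1, 0]

rr = recognition_rate(true_labels, predicted_labels)
error_rate = 1.0 - rr  # by definition, the two sum to 100%
print(f"recognition rate = {rr:.2%}, error rate = {error_rate:.2%}")
# prints: recognition rate = 75.00%, error rate = 25.00%
```

Six of the eight hypothetical samples are labeled correctly, which gives a recognition rate of 75% and an error rate of 25%.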
The computation load depends heavily on the underlying classifier, so we shall not discuss it in detail in this chapter. Instead, this chapter covers several methods for estimating the true recognition rate of a given classifier on a given dataset.
Moreover, for a simple binary classification problem, the misclassified cases can be divided into two types: false positives and false negatives. We shall also address how to select a threshold for the classifier based on the costs of false positives and false negatives.
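To make the idea of cost-based threshold selection concrete, here is a rough sketch in Python. It assumes the classifier outputs a numeric score and predicts "positive" when the score is at or above the threshold. The function names, scores, labels, and cost values are all hypothetical, and an exhaustive search over observed scores is just one simple way to pick a threshold, not necessarily the method used later in this chapter.

```python
def count_errors(scores, labels, threshold):
    """Count (false positives, false negatives) when a sample is predicted
    positive (1) if score >= threshold and negative (0) otherwise."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost.

    Candidates are the observed scores plus one value above the maximum,
    which corresponds to predicting every sample as negative.
    Ties are resolved in favor of the lowest threshold.
    """
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    def total_cost(threshold):
        fp, fn = count_errors(scores, labels, threshold)
        return cost_fp * fp + cost_fn * fn

    return min(candidates, key=total_cost)


# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

# Costly false positives push the threshold up.
print(best_threshold(scores, labels, cost_fp=5, cost_fn=1))  # prints 0.7
# Costly false negatives push the threshold down.
print(best_threshold(scores, labels, cost_fp=1, cost_fn=5))  # prints 0.4
```

With these made-up numbers, raising the cost of a false positive moves the chosen threshold from 0.4 to 0.7, trading one extra false negative for the removal of the false positive at score 0.6.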
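In the real world, all sample data is finite, and collecting it costs time and labor, which makes sample data all the more precious. The more sample data we have, the more accurate the resulting classifier tends to be. However, we also need data to test the performance of the classifier we have designed, so when designing a pattern recognition system we split the sample data into two parts:
- Training data: Also called "design data". We use this data to design the classifier.
- Test data: We use this data to test the performance of the classifier.
Different ways of partitioning the data correspond to different methods of estimating the error rate, as detailed in the following sections.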
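One simple way to carry out such a partition is a single random split, sketched below in Python. The function name `split_train_test`, the 30% test fraction, and the fixed seed are assumptions made for illustration only; other partitioning schemes lead to different error-rate estimates.

```python
import random


def split_train_test(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the samples and split it into (training, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)               # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: ten sample indices standing in for real feature vectors.
samples = list(range(10))
train, test = split_train_test(samples, test_fraction=0.3, seed=0)
print(len(train), len(test))  # prints: 7 3
```

The classifier would then be designed using only `train` and evaluated only on `test`, so that no test sample influences the design.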
Data Clustering and Pattern Recognition (資料分群與樣式辨認)