5-6 Quadratic Classifiers (二次分類器)

If we assume that the data entries in each class of the training set are generated by a probability density function (PDF), we can use MLE (maximum likelihood estimation) to identify the parameters of the PDF. These PDFs (one per class) can then be used to construct a classifier that evaluates the likelihood of a data point belonging to each class. One of the most commonly used PDFs is the d-dimensional Gaussian (normal) distribution: $$ g(\mathbf{x}|\mathbf{\mu}, \Sigma)= \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right) $$ The resulting classifier is often called a quadratic classifier (QC) since the decision boundary between any two classes is a quadratic function in the feature space. In contrast to the commonly used Gaussian mixture models (GMM for short), the quadratic classifier uses only one Gaussian distribution for each class.

The approach of the quadratic classifier can be summarized as follows:

  1. Assume that the data in class i is governed by a d-dimensional Gaussian probability density function: $$ g(\mathbf{x}|\mathbf{\mu}_i, \Sigma_i)= \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\mathbf{\mu}_i)\right) $$ where $\mathbf{\mu}_i$ is the mean vector and $\Sigma_i$ is the covariance matrix of this PDF, and $|\Sigma_i|$ is the determinant of $\Sigma_i$. Given the training data $\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n\}$ in class i, the optimum parameters in the sense of maximum likelihood can be expressed as follows: $$ \left\{ \begin{array}{rcl} \mathbf{\mu}_i & = & \frac{\mathbf{x}_1+\mathbf{x}_2+\cdots+\mathbf{x}_n}{n} \\ \Sigma_i & = & \frac{(\mathbf{x}_1-\mathbf{\mu}_i)(\mathbf{x}_1-\mathbf{\mu}_i)^T + (\mathbf{x}_2-\mathbf{\mu}_i)(\mathbf{x}_2-\mathbf{\mu}_i)^T + \cdots + (\mathbf{x}_n-\mathbf{\mu}_i)(\mathbf{x}_n-\mathbf{\mu}_i)^T}{n} \\ \end{array} \right. $$
  2. Since the Gaussian PDF does not reflect the prior probability of class i (denoted by $p(c_i)$), we usually use $p(c_i)g(\mathbf{x}|\mathbf{\mu}_i, \Sigma_i)$ to represent the probability density of a given vector x. In practice, $p(c_i)$ is computed as the number of entries of class i divided by the total number of entries in the training set.
  3. In application, for a given vector x of unknown class, the higher the probability density $p(c_i)g(\mathbf{x}|\mathbf{\mu}_i, \Sigma_i)$, the more likely it is that x belongs to class i. (A minimal code sketch of these steps is given after this list.)
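
To make these steps concrete, below is a minimal MATLAB sketch (independent of the toolbox functions used in the later examples) that estimates the MLE parameters of each class and then classifies a test vector. The variable names data, label, and x are hypothetical placeholders: data is a d-by-n feature matrix, label is a 1-by-n vector of class indices 1 to k, and x is a d-by-1 test vector.

% Training: MLE mean and covariance of each class, plus class priors
[d, n] = size(data);
k = max(label);
for i = 1:k
  X = data(:, label==i);             % d-by-ni matrix of class-i entries
  ni = size(X, 2);
  mu{i} = mean(X, 2);                % MLE mean vector
  Xc = X - repmat(mu{i}, 1, ni);     % centered data
  sigma{i} = Xc*Xc'/ni;              % MLE covariance (divide by ni, not ni-1)
  prior(i) = ni/n;                   % class prior = class proportion
end
% Classification: pick the class with the largest log probability density
for i = 1:k
  xc = x - mu{i};
  logProb(i) = log(prior(i)) - (d*log(2*pi) + log(det(sigma{i})))/2 - xc'*(sigma{i}\xc)/2;
end
[maxLogProb, predicted] = max(logProb);   % predicted is the assigned class index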

When implementing the classifier, usually we do not calculate $p(c_i)g(\mathbf{x}|\mathbf{\mu}_i, \Sigma_i)$ directly since the exponential function is likely to introduce round-off errors. Instead, we compute the natural logarithm of the probability density, as follows:

$$ \ln \left( p(c_i)g(\mathbf{x}|\mathbf{\mu}, \Sigma) \right) = \ln p(c_i) - \frac{d \ln(2\pi)+ \ln |\Sigma|}{2} - \frac{(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})}{2} $$ The decision boundary between class i and class j is represented by the following trajectory: $$ p(c_i)g(\mathbf{x}|\mathbf{\mu}_i, \Sigma_i) = p(c_j)g(\mathbf{x}|\mathbf{\mu}_j, \Sigma_j) $$ Taking the logarithm of both sides, we have $$ \ln p(c_i) - \frac{d \ln(2\pi)+ \ln |\Sigma_i|}{2} - \frac{(\mathbf{x}-\mathbf{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\mathbf{\mu}_i)}{2} = \ln p(c_j) - \frac{d \ln(2\pi)+ \ln |\Sigma_j|}{2} - \frac{(\mathbf{x}-\mathbf{\mu}_j)^T\Sigma_j^{-1}(\mathbf{x}-\mathbf{\mu}_j)}{2} $$ After simplification, the decision boundary becomes $$ 2\ln p(c_i) - \ln |\Sigma_i| - (\mathbf{x}-\mathbf{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\mathbf{\mu}_i) = 2\ln p(c_j) - \ln |\Sigma_j| - (\mathbf{x}-\mathbf{\mu}_j)^T\Sigma_j^{-1}(\mathbf{x}-\mathbf{\mu}_j) $$ $$ (\mathbf{x}-\mathbf{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\mathbf{\mu}_i) - (\mathbf{x}-\mathbf{\mu}_j)^T\Sigma_j^{-1}(\mathbf{x}-\mathbf{\mu}_j) = \ln \left( \frac{p^2(c_i)|\Sigma_j|}{p^2(c_j)|\Sigma_i|} \right) $$ where the right-hand side is a constant. Since both $(\mathbf{x}-\mathbf{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\mathbf{\mu}_i)$ and $(\mathbf{x}-\mathbf{\mu}_j)^T\Sigma_j^{-1}(\mathbf{x}-\mathbf{\mu}_j)$ are quadratic in $\mathbf{x}$, the above equation represents a decision boundary of quadratic form in the d-dimensional feature space.

In particular, if $\Sigma_i=\Sigma_j=\Sigma$, the determinant terms cancel and the decision boundary reduces to a linear equation: $$ \underbrace{2(\mathbf{\mu}_i-\mathbf{\mu}_j)^T \Sigma^{-1}}_{\mathbf{c}^T} \mathbf{x} = \underbrace{\mathbf{\mu}_i^T \Sigma^{-1} \mathbf{\mu}_i - \mathbf{\mu}_j^T \Sigma^{-1} \mathbf{\mu}_j - 2\ln \frac{p(c_i)}{p(c_j)}}_{constant} \Longrightarrow \mathbf{c}^T\mathbf{x}=constant $$
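
On the implementation side, the log probability density above can be evaluated without explicitly forming $|\Sigma|$ or $\Sigma^{-1}$, which is numerically safer when the covariance matrix is close to singular. The following is a minimal MATLAB sketch for a single class, assuming mu, sigma, and prior hold that class's mean vector, covariance matrix, and prior probability (for instance, taken from the training sketch above), and x is a test vector.

R = chol(sigma);                  % Cholesky factor: sigma = R'*R, with R upper triangular
z = R'\(x - mu);                  % solve R'*z = x - mu
logDet = 2*sum(log(diag(R)));     % log(det(sigma)) computed from the Cholesky factor
d = length(x);
logProb = log(prior) - (d*log(2*pi) + logDet)/2 - (z'*z)/2;   % ln(p(c)*g(x|mu,sigma))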

In the following example, we shall use dimensions 3 and 4 of the Iris dataset for the quadratic classifier. First of all, we can plot the data distribution:

Example 1: qcDataPlot01.m

DS = prData('iris');
DS.input = DS.input(3:4, :);   % Only take dimensions 3 and 4 for 2D visualization
dsScatterPlot(DS);             % Scatter plot of the dataset

Then we can use "qcTrain" to construct a QC for classification:

Example 2: qcTrain01.m

DS = prData('iris');
DS.input = DS.input(3:4, :);   % Only take dimensions 3 and 4 for 2D visualization
[qcPrm, logProb, recogRate, hitIndex] = qcTrain(DS);
fprintf('Recog. rate = %f%%\n', recogRate*100);

Output:

Recog. rate = 98.000000%

For such an inside test (that is, evaluation on the training set itself), the recognition rate of QC is 98% using only features 3 and 4.

In the above example, each misclassified data point is labeled with a cross. Moreover, the prior probability of each class used in computing the probability density is taken as the proportion of that class in the training set. In general, there are two ways to specify class prior probabilities: estimate them from the class proportions in the training set (as done here), or specify them manually based on prior knowledge, for instance according to misclassification costs (discussed later).
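
As a small illustration of these two options (with label as a hypothetical 1-by-n vector of class indices, as in the earlier sketch):

count = histc(label, 1:max(label));   % number of entries in each class
prior = count/sum(count);             % option 1: class proportions in the training set
% prior = [0.2 0.3 0.5];              % option 2: user-specified priors (e.g., cost-based)

How the prior vector is passed to the training routine depends on the toolbox interface and is not shown here.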

For the above example, we can go one step further to plot the Gaussian PDFs as 3D surfaces and display the corresponding contours, as follows:

Example 3: qcPlot01.m

DS = prData('iris');
DS.input = DS.input(3:4, :);
[qcPrm, logProb, recogRate, hitIndex] = qcTrain(DS);
qcPlot(DS, qcPrm, '2dPdf');

Based on the computed PDF for each class, we can plot the decision boundaries, as follows:

Example 4: qcPlot02.m

DS = prData('iris');
DS.input = DS.input(3:4, :);
[qcPrm, logProb, recogRate, hitIndex] = qcTrain(DS);
DS.hitIndex = hitIndex;   % Attach hitIndex to DS for plotting
qcPlot(DS, qcPrm, 'decBoundary');

As derived earlier, the decision boundaries are quadratic curves in the 2D feature plane. The above plots verify this fact.

If the class prior probabilities vary a lot among different classes, the quadratic classifier may produce seemingly wrong (but actually correct) results. This problem is explored in the exercise.

We can also set the class prior probabilities according to the cost associated with false-positive and false-negative cases. This will be explained later.

If the data distribution is too complicated to model using a single Gaussian PDF for each class, then we have other choices such as multiple Gaussian classifiers or Gaussian mixture models, to be detailed later.
