5-5 Naive Bayes Classifiers

Slides

If we assume that the features of the given dataset are independent, then we can perform PDF modeling on each dimension separately. The final PDF for each class is then the product of that class's PDFs over the individual dimensions. In symbols, we assume that

p(X|C) = p(X1|C) p(X2|C) ... p(Xd|C)

where X = [X1, X2, ..., Xd] is a feature vector and C is a class. Although this assumption seems too strong for real-world data, the resulting naive Bayes classifier (NBC for short) is highly successful in many practical applications.
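
As a concrete illustration of this factorization, the following minimal MATLAB sketch evaluates p(x|C) for a single vector as the product of per-dimension 1D Gaussians (anticipating the Gaussian modeling described next). The variable names mu, sigma2, and x are illustrative and not part of the toolbox:

% Evaluate p(x|C) as the product of per-dimension 1D Gaussian PDFs.
% mu and sigma2 are assumed d-by-1 vectors of per-dimension means and variances
% for one class; x is a d-by-1 feature vector. (Illustrative sketch only.)
pdf1d = exp(-(x - mu).^2 ./ (2*sigma2)) ./ sqrt(2*pi*sigma2);   % p(xj|C) for each dimension j
pxGivenC = prod(pdf1d);                                         % p(x|C) under the independence assumption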

In practice, we can use 1D Gaussian functions to model the data of a specific class along a specific dimension. The training of an NBC can then be summarized as follows (a minimal MATLAB sketch of these steps is given after the list):

  1. Assume that the data in class i and dimension j is governed by a 1-dimensional Gaussian PDF:
    g(x, mi, si^2) = [(2*pi)*si^2]^(-0.5) * exp[-(x-mi)^2/(2*si^2)]
    where mi is the mean value and si^2 is the variance of the PDF. Given the dimension-j training data {x1, x2, ..., xn} in class i, the optimum parameters in the sense of maximum likelihood can be expressed as follows:
    mi = (x1+x2+...+xn)/n
    si^2 = [(x1-mi)^2 + (x2-mi)^2 + ... + (xn-mi)^2]/(n-1)
    (Strictly speaking, the maximum-likelihood estimate of the variance divides by n; dividing by n-1 gives the unbiased estimate, and the difference is negligible when n is large.)
  2. Multiplying the per-dimension 1D Gaussians of class i gives its class-conditional density g(x, mi, Si), a d-dimensional Gaussian with mean vector mi and diagonal covariance matrix Si. Since this density does not reflect the prior probability of class i (denoted by p(ci)), we usually use p(ci)g(x, mi, Si) to represent the probability density of a given vector x. In practice, p(ci) is computed as the number of entries of class i divided by the total number of entries in the training set.
  3. In application, for a given vector x of unknown class, the higher the probability density p(ci)g(x, mi, Si), the more likely it is that the vector belongs to class i; the vector is therefore assigned to the class that maximizes p(ci)g(x, mi, Si).
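
The steps above can be written out directly. Here is a minimal MATLAB sketch of the training stage under the stated assumptions; X, y, and the other variable names are illustrative, while the toolbox function nbcTrain used in the examples below performs the actual training:

% Minimal NBC training sketch (illustrative; not the toolbox's nbcTrain).
% X is a d-by-n data matrix and y is a 1-by-n vector of class labels 1..k.
[d, n] = size(X);
k = max(y);
mu = zeros(d, k);        % per-class, per-dimension means
sigma2 = zeros(d, k);    % per-class, per-dimension variances
prior = zeros(1, k);     % class prior probabilities
for i = 1:k
    Xi = X(:, y == i);               % training entries of class i
    mu(:, i) = mean(Xi, 2);          % mean along each dimension
    sigma2(:, i) = var(Xi, 0, 2);    % variance along each dimension (divides by n_i - 1, as in step 1)
    prior(i) = size(Xi, 2)/n;        % fraction of entries in class i, as in step 2
end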

When implementing the classifier, we usually do not calculate p(ci)g(x, mi, Si) directly, since evaluating the exponential function is prone to numerical underflow and round-off errors. Instead, we compute the natural logarithm of the probability density, as follows:

log[p(ci)g(x, mi, Si)] = log(p(ci)) - [d*log(2*pi) + log|Si|]/2 - (x-mi)^T Si^(-1) (x-mi)/2
The decision boundary between classes i and j is represented by the following trajectory:
p(ci)g(x, mi, Si) = p(cj)g(x, mj, Sj).
Taking the logarithm of both sides, we have
log(p(ci)) - [d*log(2*pi) + log|Si|]/2 - (x-mi)^T Si^(-1) (x-mi)/2 = log(p(cj)) - [d*log(2*pi) + log|Sj|]/2 - (x-mj)^T Sj^(-1) (x-mj)/2
After simplification, the decision boundary is given by the following equation:
(x-mj)^T Sj^(-1) (x-mj) - (x-mi)^T Si^(-1) (x-mi) = log{[|Si| p^2(cj)]/[|Sj| p^2(ci)]}
where the right-hand side is a constant. Since both (x-mj)^T Sj^(-1) (x-mj) and (x-mi)^T Si^(-1) (x-mi) are quadratic in x, the above equation defines a quadratic decision boundary in the d-dimensional feature space.
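
As a concrete illustration of the log-density formula above, the following minimal MATLAB sketch scores a test vector against every class and picks the most likely one. It reuses the illustrative variables mu, sigma2, and prior from the training sketch; since Si is diagonal for an NBC, log|Si| is simply the sum of the per-dimension log-variances:

% Classify a d-by-1 test vector x via log-densities to avoid underflow of exp.
logDensity = zeros(1, k);
for i = 1:k
    z = (x - mu(:, i)).^2 ./ sigma2(:, i);                  % (x-mi)^T Si^(-1) (x-mi) with diagonal Si
    logDensity(i) = log(prior(i)) ...
        - 0.5*(d*log(2*pi) + sum(log(sigma2(:, i)))) ...    % -[d*log(2*pi) + log|Si|]/2
        - 0.5*sum(z);
end
[~, predictedClass] = max(logDensity);                      % assign x to the class with the largest log-density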

In the following example, we shall use dimensions 3 and 4 of the Iris dataset for NBC:

Example 1: nbc01dataPlot.m
DS = prData('iris');
DS.input = DS.input(3:4, :);   % Only take dimensions 3 and 4 for 2d visualization
plotOpt = 1;                   % Plotting
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS, [], plotOpt);
fprintf('Recog. rate = %f%%\n', recogRate*100);

Output:
Recog. rate = 96.000000%

In the above example, each misclassified data point is labeled with a cross. Moreover, the fraction of entries belonging to each class is used as that class's prior probability when computing the probability density. In general, there are two ways to specify class prior probabilities: estimate them from the class frequencies in the training set (as done above), or assign them directly, for instance as equal priors for every class when the class distribution of the training set is not representative of the real-world data.
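
The following small sketch shows both conventions; the variable names are illustrative, with y being a 1-by-n vector of class labels 1..k:

% Two ways to set the class priors (illustrative sketch).
nPerClass = zeros(1, k);
for i = 1:k
    nPerClass(i) = sum(y == i);              % number of training entries in class i
end
priorFreq = nPerClass / sum(nPerClass);      % (1) priors from class frequencies, as used above
priorUniform = ones(1, k) / k;               % (2) equal priors for every class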

We can plot the 1d PDFs for all classes and dimensions, together with the corresponding data points, as follows:

Example 2: nbcPlot00.m
DS = prData('iris');
DS.input = DS.input(3:4, :);
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS);
nbcPlot(DS, nbcPrm, '1dPdf');

For the above example, we can go one step further to plot the class-specific PDFs as 3D surfaces and display the corresponding contours, as follows:

Example 3: nbcPlot01.m
DS = prData('iris');
DS.input = DS.input(3:4, :);
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS);
nbcPlot(DS, nbcPrm, '2dPdf');

Based on the computed PDF for each class, we can plot the decision boundaries, as follows:

Example 4: nbcPlot02.m
DS = prData('iris');
DS.input = DS.input(3:4, :);
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS);
DS.hitIndex = hitIndex;   % Attach hitIndex to DS for plotting
nbcPlot(DS, nbcPrm, 'decBoundary');

