If we assume that the features of the given dataset are independent, then we can perform PDF modeling on each dimension separately. The final PDF of each class is then the product of that class's PDFs over all dimensions. In symbols, we assume

$$p(X \mid C) = p(X_1 \mid C)\,p(X_2 \mid C)\cdots p(X_d \mid C),$$

where $X = [X_1, X_2, \dots, X_d]$ is a feature vector and $C$ is a class. Although this assumption seems too strong for real-world data, the resulting naive Bayes classifier (NBC for short) is highly successful in many practical applications.
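As a quick numerical illustration of this factorization, the following sketch evaluates $p(x \mid C)$ as a product of per-dimension 1D Gaussian PDFs (the values of x, mu, and sigma2 are illustrative only, not taken from the original example programs):

```matlab
% Minimal sketch: evaluate p(x|C) as the product of per-dimension 1-D
% Gaussian PDFs, under the naive (independence) assumption.
x      = [1.2, 0.7, 3.4];     % a feature vector (illustrative values)
mu     = [1.0, 0.5, 3.0];     % per-dimension means of class C
sigma2 = [0.2, 0.1, 0.5];     % per-dimension variances of class C

g1d = exp(-(x-mu).^2 ./ (2*sigma2)) ./ sqrt(2*pi*sigma2);  % 1-D Gaussian per dimension
pxGivenC = prod(g1d);                                      % product over dimensions
fprintf('p(x|C) = %g\n', pxGivenC);
```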
In practice, we can use 1D Gaussian functions to model the data of a specific class in a specific dimension. The training of an NBC can be summarized as follows (a code sketch is given after the list):
- Assume that the data of class i in dimension j is governed by a 1D Gaussian PDF:

$$g(x; m_i, \sigma_i^2) = \left(2\pi\sigma_i^2\right)^{-1/2} \exp\!\left[-\frac{(x-m_i)^2}{2\sigma_i^2}\right],$$

where $m_i$ is the mean value and $\sigma_i^2$ is the variance of the PDF (a separate pair of parameters is estimated for every class and every dimension). Given the dimension-j training data $\{x_1, x_2, \dots, x_n\}$ of class i, the optimum parameters in the sense of maximum likelihood can be expressed as follows:

$$m_i = \frac{x_1 + x_2 + \cdots + x_n}{n}, \qquad \sigma_i^2 = \frac{(x_1-m_i)^2 + (x_2-m_i)^2 + \cdots + (x_n-m_i)^2}{n-1}.$$

(Strictly speaking, the maximum-likelihood estimate of the variance divides by $n$; the formula above uses the common unbiased estimate with $n-1$.)
- The product of these 1D Gaussians over all d dimensions is itself a d-dimensional Gaussian whose mean vector $m_i$ collects the per-dimension means and whose diagonal covariance matrix $\Sigma_i$ holds the per-dimension variances on its diagonal. We denote this density by $g(x, m_i, \Sigma_i)$:

$$g(x, m_i, \Sigma_i) = (2\pi)^{-d/2} \det(\Sigma_i)^{-1/2} \exp\!\left[-\frac{(x-m_i)^T \Sigma_i^{-1} (x-m_i)}{2}\right].$$
- Since the Gaussian PDF does not reflect the prior probability of class i (denoted by $p(c_i)$), we usually use $p(c_i)\,g(x, m_i, \Sigma_i)$ to represent the probability density of a given vector x. In practice, $p(c_i)$ is computed as the number of entries of class i divided by the total number of entries in the training set.
- In application, for a given vector x of unknown class, the higher the probability density $p(c_i)\,g(x, m_i, \Sigma_i)$, the more likely it is that x belongs to class i.
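A minimal MATLAB training sketch following these steps; the toy dataset and the variable names (data, labels, mu, sigma2, prior) are illustrative and not taken from the original example programs:

```matlab
% Minimal NBC training sketch.
% data:   n-by-d matrix, one observation per row
% labels: n-by-1 vector of class indices 1..k
rng(0);                                        % for reproducibility
data   = [randn(50,2); randn(50,2)+2];         % toy 2-class, 2-D data
labels = [ones(50,1); 2*ones(50,1)];

k = max(labels);                               % number of classes
[n, d] = size(data);
mu     = zeros(k, d);                          % per-class, per-dimension means
sigma2 = zeros(k, d);                          % per-class, per-dimension variances
prior  = zeros(k, 1);                          % class prior probabilities

for i = 1:k
    classData   = data(labels==i, :);
    mu(i,:)     = mean(classData);             % estimate of the mean
    sigma2(i,:) = var(classData);              % sample variance (n-1 normalization)
    prior(i)    = size(classData,1) / n;       % class size / total size
end
```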
When implementing the classifier, we usually do not calculate $p(c_i)\,g(x, m_i, \Sigma_i)$ directly, since the exponential function is likely to cause numerical underflow or round-off errors. Instead, we compute the natural logarithm of the probability density, $\log[p(c_i)\,g(x, m_i, \Sigma_i)] = \log p(c_i) + \log g(x, m_i, \Sigma_i)$, as follows:

$$\log[p(c_i)\,g(x, m_i, \Sigma_i)] = \log p(c_i) - \frac{d\log(2\pi) + \log|\Sigma_i|}{2} - \frac{(x-m_i)^T \Sigma_i^{-1} (x-m_i)}{2}.$$

The decision boundary between classes i and j is the trajectory on which $p(c_i)\,g(x, m_i, \Sigma_i) = p(c_j)\,g(x, m_j, \Sigma_j)$. Taking the logarithm of both sides, we have

$$\log p(c_i) - \frac{d\log(2\pi) + \log|\Sigma_i|}{2} - \frac{(x-m_i)^T \Sigma_i^{-1} (x-m_i)}{2} = \log p(c_j) - \frac{d\log(2\pi) + \log|\Sigma_j|}{2} - \frac{(x-m_j)^T \Sigma_j^{-1} (x-m_j)}{2}.$$

After simplification, the decision boundary is given by the equation

$$(x-m_j)^T \Sigma_j^{-1} (x-m_j) - (x-m_i)^T \Sigma_i^{-1} (x-m_i) = \log\frac{|\Sigma_i|\,p^2(c_j)}{|\Sigma_j|\,p^2(c_i)},$$

where the right-hand side is a constant. Since both $(x-m_j)^T \Sigma_j^{-1} (x-m_j)$ and $(x-m_i)^T \Sigma_i^{-1} (x-m_i)$ are quadratic in x, the above equation represents a decision boundary of quadratic form in the d-dimensional feature space.
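As a sketch of how this log-density can be evaluated for an NBC, note that $\Sigma_i$ is diagonal, so $\log|\Sigma_i|$ and the quadratic term reduce to simple sums over dimensions. All values below are illustrative:

```matlab
% Sketch: evaluate log[p(c_i) g(x, m_i, Sigma_i)] for an NBC, where Sigma_i
% is diagonal, so the determinant and quadratic terms become simple sums.
x      = [4.5, 1.5];          % test vector (d = 2), illustrative values
mu     = [4.3, 1.3];          % per-dimension means of class i
sigma2 = [0.2, 0.04];         % per-dimension variances of class i
prior  = 1/3;                 % prior probability p(c_i)

d = numel(x);
logDensity = log(prior) ...
    - (d*log(2*pi) + sum(log(sigma2)))/2 ...    % log of the normalization term
    - sum((x-mu).^2 ./ sigma2)/2;               % quadratic (Mahalanobis) term
```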
In the following example, we shall use dimensions 3 and 4 of the Iris dataset for NBC; a code sketch for such an example is given below:
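A minimal sketch of such an experiment, assuming MATLAB's fisheriris sample data (the meas and species variables shipped with the Statistics and Machine Learning Toolbox); this is a sketch under those assumptions, not the original example program:

```matlab
% Minimal NBC on dimensions 3 and 4 (petal length & width) of the Iris data.
load fisheriris                       % provides meas (150x4) and species (150x1 cell)
X = meas(:, 3:4);                     % keep dimensions 3 and 4 only
[classNames, ~, y] = unique(species); % y: class indices 1..3
[n, d] = size(X);
k = numel(classNames);

% Training: per-class means, variances, and priors
mu = zeros(k,d); sigma2 = zeros(k,d); prior = zeros(k,1);
for i = 1:k
    Xi = X(y==i, :);
    mu(i,:) = mean(Xi);
    sigma2(i,:) = var(Xi);
    prior(i) = size(Xi,1)/n;
end

% Classification: pick the class with the largest log p(c_i)*g(x, m_i, Sigma_i)
logDensity = zeros(n, k);
for i = 1:k
    logDensity(:,i) = log(prior(i)) ...
        - (d*log(2*pi) + sum(log(sigma2(i,:))))/2 ...
        - sum((X - mu(i,:)).^2 ./ sigma2(i,:), 2)/2;
end
[~, predicted] = max(logDensity, [], 2);
fprintf('Training accuracy = %.2f%%\n', 100*mean(predicted==y));

% Plot the data, marking each misclassified point with a cross
figure; hold on
markers = {'o', 's', 'd'};
for i = 1:k
    idx = (y==i);
    plot(X(idx,1), X(idx,2), markers{i}, 'DisplayName', classNames{i});
end
wrong = (predicted ~= y);
plot(X(wrong,1), X(wrong,2), 'kx', 'MarkerSize', 10, 'DisplayName', 'misclassified');
xlabel('Dimension 3'); ylabel('Dimension 4'); legend show
```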
In the above example, each misclassified data point is marked with a cross. Moreover, we use the number of entries of each class as the class prior probability when computing the probability density. In general, there are two ways to specify class prior probabilities (see the short sketch after this list):
- Use the number of entries in each class as the prior probabilities. You can divide by the total number of entries in the training set to obtain a true probability between 0 and 1; in practice, this division does not change the classification results since it is applied to every class's PDF. (You can use the function dsClassSize.m to find the number of entries in each class.)
- If we have domain knowledge of the prior probabilities, we can assign them directly.
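A short sketch of the two options in base MATLAB (without dsClassSize.m); the labels vector is illustrative:

```matlab
% Two common ways to set class prior probabilities.
labels = [ones(50,1); 2*ones(30,1); 3*ones(20,1)];   % illustrative class labels
k = max(labels);

% Option 1: priors proportional to class sizes
classSize = histcounts(labels, 0.5:1:k+0.5);         % number of entries per class
prior1 = classSize / numel(labels);                  % normalize to sum to 1

% Option 2: priors assigned directly from domain knowledge (here: equal priors)
prior2 = ones(1, k) / k;
```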
We can plot the 1D PDFs of all classes and dimensions, together with the corresponding data points, as follows:
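A plotting sketch for the two Iris dimensions used above, again assuming the fisheriris sample data; each subplot shows one class in one dimension, with the fitted Gaussian curve and the raw data points along the x-axis:

```matlab
% Plot the fitted 1-D Gaussian PDF of every class in every dimension,
% together with the corresponding data points (Iris dimensions 3 and 4).
load fisheriris
X = meas(:, 3:4);
[classNames, ~, y] = unique(species);
k = numel(classNames);
d = size(X, 2);

figure
for i = 1:k
    for j = 1:d
        xi = X(y==i, j);
        m = mean(xi); s2 = var(xi);
        t = linspace(min(xi)-1, max(xi)+1, 200);
        px = exp(-(t-m).^2/(2*s2)) / sqrt(2*pi*s2);      % fitted 1-D Gaussian
        subplot(k, d, (i-1)*d + j)
        plot(t, px, 'b-', xi, zeros(size(xi)), 'r.');    % curve plus data points
        title(sprintf('%s, dimension %d', classNames{i}, j+2));
    end
end
```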
For the above example, we can go one step further to plot the class-specific PDFs as 3D surfaces and display the corresponding contours, as follows:
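A sketch of the surface-and-contour view for the two selected Iris dimensions, again assuming the fisheriris sample data; the grid resolution and axis ranges are arbitrary choices, and surfc draws each surface with its contours projected underneath:

```matlab
% Plot each class's 2-D PDF (times its prior) as a 3-D surface with contours.
load fisheriris
X = meas(:, 3:4);
[classNames, ~, y] = unique(species);
k = numel(classNames); n = size(X,1);

[x1, x2] = meshgrid(linspace(0,7,80), linspace(0,3,80));  % grid over dims 3 & 4
figure
for i = 1:k
    Xi = X(y==i, :);
    m = mean(Xi); s2 = var(Xi); prior = size(Xi,1)/n;
    % Naive-Bayes density: product of the two 1-D Gaussians, scaled by the prior
    p = prior ...
        .* exp(-(x1-m(1)).^2/(2*s2(1))) / sqrt(2*pi*s2(1)) ...
        .* exp(-(x2-m(2)).^2/(2*s2(2))) / sqrt(2*pi*s2(2));
    subplot(1, k, i)
    surfc(x1, x2, p);                 % surface with contours underneath
    shading interp
    title(classNames{i}); xlabel('Dimension 3'); ylabel('Dimension 4');
end
```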
Based on the computed PDF for each class, we can plot the decision boundaries, as follows:
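A sketch that visualizes the decision regions (whose edges are the quadratic decision boundaries derived above) by classifying every point of a dense grid; the data source and grid settings are assumptions as before:

```matlab
% Visualize NBC decision boundaries on Iris dimensions 3 and 4 by
% classifying a dense grid of points and drawing the region contours.
load fisheriris
X = meas(:, 3:4);
[classNames, ~, y] = unique(species);
k = numel(classNames); [n, d] = size(X);

mu = zeros(k,d); sigma2 = zeros(k,d); prior = zeros(k,1);
for i = 1:k
    Xi = X(y==i,:);
    mu(i,:) = mean(Xi); sigma2(i,:) = var(Xi); prior(i) = size(Xi,1)/n;
end

[x1, x2] = meshgrid(linspace(0,7,300), linspace(0,3,300));
G = [x1(:), x2(:)];                        % grid points as rows
logDensity = zeros(size(G,1), k);
for i = 1:k
    logDensity(:,i) = log(prior(i)) ...
        - (d*log(2*pi) + sum(log(sigma2(i,:))))/2 ...
        - sum((G - mu(i,:)).^2 ./ sigma2(i,:), 2)/2;
end
[~, region] = max(logDensity, [], 2);
region = reshape(region, size(x1));

figure; hold on
contourf(x1, x2, region, (1:k-1)+0.5);     % filled regions; edges are the boundaries
plot(X(y==1,1), X(y==1,2), 'ko', X(y==2,1), X(y==2,2), 'ks', X(y==3,1), X(y==3,2), 'kd');
xlabel('Dimension 3'); ylabel('Dimension 4');
title('NBC decision regions');
```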