Once we grasp the concept of DHMM (discrete HMM), it is straightforward to extend it to CHMM (continuous HMM). The only difference between the two lies in how they represent state probabilities:
- DHMM uses a VQ-based (vector quantization) method to compute the state probability. For instance, frame i must first be converted into its corresponding symbol k = O(i), and the probability of observing symbol k in state j is then retrieved from the entry B(k, j) of matrix B.
- CHMM uses a continuous probability density function (PDF), such as a GMM, to compute the state probability. In other words, B(O(i), j) in CHMM is represented by a continuous PDF:
$$B(O(i), j) = p(\mathbf{x}_i, \theta_j)$$
where $p(\cdot, \cdot)$ is a PDF (such as a GMM), $\mathbf{x}_i$ is the feature vector of frame $i$, and $\theta_j$ is the parameter vector of the PDF of state $j$. The optimum $\theta_j$ is identified by re-estimation based on MLE (maximum likelihood estimation).
In summary, the parameters of a CHMM consist of the transition matrix $A$ and the PDF parameters $\theta = \{\theta_j \mid j = 1, \ldots, m\}$, where $m$ is the number of states. The optimum values of $A$ and $\theta$ are again found by re-estimation: we guess initial values of $A$ and $\theta$, perform Viterbi decoding, and then use the resulting optimum mapping paths to recompute $A$ and $\theta$. This procedure is repeated until $A$ and $\theta$ converge. It can be proved that the overall probability never decreases during the iteration. However, there is no guarantee that the maximum obtained is the global maximum.
The procedure for finding the parameters of a CHMM is summarized next.
- Convert all utterances into acoustic features, say, 39-dimensional MFCCs.
- Guess the initial values of $A$ and $\theta$. If there is no manual transcription, we can adopt the simple strategy of "equal division", which splits the frames of each utterance evenly among the states.
- Iterate the following steps until the values of $A$ and $\theta$ converge:
  - Viterbi decoding: Given $A$ and $\theta$ of a CHMM, find the optimum mapping paths of all the utterances corresponding to this CHMM.
  - Re-estimation: Use the optimum mapping paths to estimate the values of $A$ and $\theta$.
Note that in each iteration of the above procedure, the optimum value of matrix $A$ is identified by frame counting, the same as in DHMM, while the optimum value of $\theta$ is identified via MLE. Since $p(\cdot, \cdot)$ in CHMM is a continuous function that can approximate the true PDF more closely, we can expect a better recognition rate than with DHMM.
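The training procedure above can be sketched in Python as follows. This is a minimal, stdlib-only sketch under assumptions that are not in the text: each state's PDF is a single diagonal-covariance Gaussian (a one-component GMM) rather than a full GMM; the CHMM is left-to-right, starting in the first state and ending in the last; every utterance has at least as many frames as there are states; and the helper names (`train_chmm`, `viterbi_align`, `reestimate`, `gauss_logpdf`) and the tiny 1-D data set are hypothetical. Convergence is declared when the Viterbi paths stop changing, since $A$ and $\theta$ would then be re-estimated from identical paths.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), mapping p == 0 to -inf (a forbidden transition)."""
    return math.log(p) if p > 0 else NEG_INF


def gauss_logpdf(x, mean, var):
    """log p(x_i, theta_j) for a diagonal-covariance Gaussian, theta_j = (mean, var)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xd - mu) ** 2 / v)
               for xd, mu, v in zip(x, mean, var))


def viterbi_align(frames, A, theta):
    """Left-to-right Viterbi decoding: best state path and its log-probability."""
    m, T = len(theta), len(frames)          # assumes T >= m
    delta = [[NEG_INF] * m for _ in range(T)]
    back = [[0] * m for _ in range(T)]
    delta[0][0] = gauss_logpdf(frames[0], *theta[0])   # start in the first state
    for t in range(1, T):
        for j in range(m):
            stay = delta[t - 1][j] + safe_log(A[j][j])
            move = delta[t - 1][j - 1] + safe_log(A[j - 1][j]) if j > 0 else NEG_INF
            back[t][j] = j if stay >= move else j - 1
            delta[t][j] = max(stay, move) + gauss_logpdf(frames[t], *theta[j])
    path = [m - 1]                                      # end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], delta[T - 1][m - 1]


def reestimate(utterances, paths, m, var_floor=1e-3):
    """Re-estimation: MLE of each state's Gaussian, frame counting for A."""
    dim = len(utterances[0][0])
    theta = []
    for j in range(m):
        xs = [x for frames, path in zip(utterances, paths)
              for x, s in zip(frames, path) if s == j]
        mean = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
        # Floor the variance so a state holding a single frame stays well defined.
        var = [max(sum((x[d] - mean[d]) ** 2 for x in xs) / len(xs), var_floor)
               for d in range(dim)]
        theta.append((mean, var))
    counts = [[0] * m for _ in range(m)]
    for path in paths:
        for s, s_next in zip(path, path[1:]):
            counts[s][s_next] += 1
    # a_jk = (# transitions j -> k) / (# transitions out of j); a state with no
    # outgoing transitions (here only the last one) simply loops on itself.
    A = [[c / sum(row) for c in row] if sum(row) else
         [1.0 if k == j else 0.0 for k in range(m)]
         for j, row in enumerate(counts)]
    return A, theta


def train_chmm(utterances, m, max_iter=20):
    """Viterbi re-estimation of a left-to-right CHMM with m states."""
    # Initial guess by "equal division": frame t of a T-frame utterance -> state t*m//T.
    paths = [[t * m // len(f) for t in range(len(f))] for f in utterances]
    history = []                  # total Viterbi log-probability per iteration
    for _ in range(max_iter):
        A, theta = reestimate(utterances, paths, m)
        results = [viterbi_align(f, A, theta) for f in utterances]
        new_paths = [p for p, _ in results]
        history.append(sum(score for _, score in results))
        if new_paths == paths:    # same paths -> same A and theta: converged
            break
        paths = new_paths
    return A, theta, history


# Tiny hypothetical 1-D "feature" sequences, each drifting through levels 0 -> 5 -> 10.
utts = [
    [[0.1], [-0.2], [0.0], [5.1], [4.8], [10.2], [9.9], [10.1]],
    [[0.2], [5.0], [5.3], [4.9], [5.1], [9.8], [10.0]],
    [[-0.1], [0.0], [0.3], [-0.2], [4.7], [5.2], [10.3], [9.7], [10.0]],
]
A, theta, hist = train_chmm(utts, m=3)
print("state means:", [mean[0] for mean, _ in theta])
print("log-prob per iteration:", hist)
```

Each iteration re-estimates $A$ by frame counting and $\theta$ by MLE from the current paths, then re-aligns with Viterbi decoding; `hist` records the total Viterbi log-probability, which in this sketch never decreases from one iteration to the next. Replacing `gauss_logpdf` with a GMM log-density (and the per-state MLE with an EM fit) would move the sketch closer to the GMM-based CHMM described above.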