17-4 Digit Recognition: Changing Acoustic Models


In the previous sections, we used the syllable as the acoustic-model unit. In this section, we decompose each syllable into phones and use each phone as an acoustic model. These phones are called monophones, since they are modeled independently of the phones that follow them. If we have more training data and want to distinguish phone models in finer detail, we can use so-called biphones, which are right-context dependent (RCD for short): each phone model also depends on the phone that follows it.
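To make the monophone/biphone distinction concrete, here is a minimal sketch that relabels a monophone sequence as right-context-dependent biphones. The "x+y" label style follows HTK's context notation. The assumption that context is applied only inside one syllable, with the last phone left context-free, is made here for illustration and is not taken from this section.

```matlab
% Sketch: relabel monophones as right-context-dependent (RCD) biphones.
% Assumption: context does not cross the syllable boundary, so the last
% phone keeps its plain monophone label.
phones = {'j', 'i', 'o', 'u'};       % monophone sequence of the syllable "jiou"
biphones = phones;                   % start from the monophone labels
for k = 1:numel(phones)-1
    % phone k conditioned on the phone that follows it
    biphones{k} = [phones{k} '+' phones{k+1}];
end
fprintf('%s -> %s\n', strjoin(phones, ' '), strjoin(biphones, ' '));
% prints: j i o u -> j+i i+o o+u u
```

With monophones, the "i" in "jiou" and the "i" in "ling" share one model; with RCD biphones they become two separate models, "i+o" and "i+ng", which is why more training data is needed.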

The new pam file for using monophone acoustic models is digitMonophone.pam, as shown next:

Source file (htk/chineseDigitRecog/training/digitMonophone.pam):
ba	b a
er	er
jiou	j i o u
ling	l i ng
liou	l i o u
qi	q i
san	s a n
si	s i
sil	sil
wu	w u
i	i
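The mapping above can be read as a lookup table from syllables to phone sequences. The following sketch parses an inline copy of the listing and expands a syllable sequence into phones. It assumes each line has the form "syllable, tab, space-separated phones", as the listing suggests. The utterance is a hypothetical example, and this is not how the toolbox itself performs the expansion.

```matlab
% Sketch: use the pam listing as a syllable-to-phone lookup table.
% Assumption: "syllable<TAB>phone phone ..." on each line.
pamText = sprintf(['ba\tb a\n' 'er\ter\n' 'jiou\tj i o u\n' 'ling\tl i ng\n' ...
                   'liou\tl i o u\n' 'qi\tq i\n' 'san\ts a n\n' 'si\ts i\n' ...
                   'sil\tsil\n' 'wu\tw u\n' 'i\ti']);
pam = containers.Map();                          % char keys, values of any type
rows = strsplit(pamText, sprintf('\n'));
for k = 1:numel(rows)
    parts = strsplit(rows{k}, sprintf('\t'));    % {syllable, 'phone phone ...'}
    pam(parts{1}) = strsplit(parts{2}, ' ');     % store phones as a cell array
end

% Hypothetical utterance: "san liou" padded with silence on both sides.
syllables = {'sil', 'san', 'liou', 'sil'};
phones = {};
for k = 1:numel(syllables)
    phones = [phones, pam(syllables{k})];        % append this syllable's phones
end
disp(strjoin(phones, ' '));
% prints: sil s a n l i o u sil
```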

In fact, we only need to replace digitSyl.pam with digitMonophone.pam (and name the phone-level mlf file and the monophone list file accordingly). All the training and test procedures covered in the previous sections then apply unchanged, as shown in the following example:

Example 1: htk/chineseDigitRecog/training/goMonophone13.m

htkPrm=htkParamSet;
htkPrm.pamFile='digitMonophone.pam';
htkPrm.phoneMlfFile='digitMonophone.mlf';
htkPrm.mnlFile='digitMonophone.mnl';
disp(htkPrm)
[trainRR, testRR]=htkTrainTest(htkPrm);
fprintf('Inside test = %g%%, outside test = %g%%\n', trainRR, testRR);

Output:

pamFile: 'digitMonophone.pam'
feaCfgFile: 'mfcc.cfg'
waveDir: '..\waveFile'
sylMlfFile: 'digitSyl.mlf'
phoneMlfFile: 'digitMonophone.mlf'
mnlFile: 'digitMonophone.mnl'
grammarFile: 'digit.grammar'
feaType: 'MFCC_E'
feaDim: 13
mixtureNum: 3
stateNum: 3
streamWidth: 13
Pruning-Off
Pruning-Off
Pruning-Off
Pruning-Off
Pruning-Off
Inside test = 79.24%, outside test = 75.89%

The generated list of monophones is shown next:

Source file (htk/chineseDigitRecog/training/output/digitMonophone.mnl):
sil
l
i
ng
er
s
a
n
w
u
o
q
b
j
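As a sanity check, the monophone list should contain exactly the phones that appear on the right-hand side of digitMonophone.pam. The sketch below rebuilds that inventory from an inline copy of the pam entries and compares it with the list above as a set, since the order in the .mnl file differs. The variable names are illustrative only.

```matlab
% Sketch: derive the phone inventory from the pam entries and compare it
% with the generated monophone list (order is ignored).
pamPhones = {'b a', 'er', 'j i o u', 'l i ng', 'l i o u', 'q i', ...
             's a n', 's i', 'sil', 'w u', 'i'};        % right-hand sides of the pam file
allPhones = strsplit(strjoin(pamPhones, ' '), ' ');     % every phone occurrence
inventory = unique(allPhones);                          % distinct phones, sorted

mnl = {'sil', 'l', 'i', 'ng', 'er', 's', 'a', 'n', 'w', 'u', 'o', 'q', 'b', 'j'};
sameSet = isempty(setxor(inventory, mnl));              % true if both hold the same phones
fprintf('%d distinct phones, matches mnl: %d\n', numel(inventory), sameSet);
% prints: 14 distinct phones, matches mnl: 1
```

Eleven syllable-level entries are thus covered by 14 phone models, several of which (for example "i" and "l") are shared across digits.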

The corresponding mlf file is htk/chineseDigitRecog/training/output/digitMonophone.mlf.

The following example uses 26-dimensional MFCC_E_D_Z features:

Example 2: htk/chineseDigitRecog/training/goMonoPhone26.m

htkPrm=htkParamSet;
htkPrm.pamFile='digitMonophone.pam';
htkPrm.phoneMlfFile='digitMonophone.mlf';
htkPrm.mnlFile='digitMonophone.mnl';
htkPrm.feaCfgFile='mfcc26.cfg';
htkPrm.feaType='MFCC_E_D_Z';
htkPrm.feaDim=26;
htkPrm.streamWidth=[26];
disp(htkPrm)
[trainRR, testRR]=htkTrainTest(htkPrm);
fprintf('Inside test = %g%%, outside test = %g%%\n', trainRR, testRR);

Output:

pamFile: 'digitMonophone.pam'
feaCfgFile: 'mfcc26.cfg'
waveDir: '..\waveFile'
sylMlfFile: 'digitSyl.mlf'
phoneMlfFile: 'digitMonophone.mlf'
mnlFile: 'digitMonophone.mnl'
grammarFile: 'digit.grammar'
feaType: 'MFCC_E_D_Z'
feaDim: 26
mixtureNum: 3
stateNum: 3
streamWidth: 26
Pruning-Off
Pruning-Off
Pruning-Off
Pruning-Off
Pruning-Off
Inside test = 83.71%, outside test = 87.5%

The following example uses 39-dimensional MFCC_E_D_A_Z features:

Example 3: htk/chineseDigitRecog/training/goMonoPhone39.m

htkPrm=htkParamSet;
htkPrm.pamFile='digitMonophone.pam';
htkPrm.phoneMlfFile='digitMonophone.mlf';
htkPrm.mnlFile='digitMonophone.mnl';
htkPrm.feaCfgFile='mfcc39.cfg';
htkPrm.feaType='MFCC_E_D_A_Z';
htkPrm.feaDim=39;
htkPrm.streamWidth=[39];
disp(htkPrm)
[trainRR, testRR]=htkTrainTest(htkPrm);
fprintf('Inside test = %g%%, outside test = %g%%\n', trainRR, testRR);

Output:

pamFile: 'digitMonophone.pam'
feaCfgFile: 'mfcc39.cfg'
waveDir: '..\waveFile'
sylMlfFile: 'digitSyl.mlf'
phoneMlfFile: 'digitMonophone.mlf'
mnlFile: 'digitMonophone.mnl'
grammarFile: 'digit.grammar'
feaType: 'MFCC_E_D_A_Z'
feaDim: 39
mixtureNum: 3
stateNum: 3
streamWidth: 39
Pruning-Off
Pruning-Off
Pruning-Off
Pruning-Off
Pruning-Off
Inside test = 84.6%, outside test = 89.29%
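The feature dimensions 13, 26 and 39 used in the three examples follow from the HTK qualifiers in feaType. The sketch below reproduces that arithmetic. It assumes 12 cepstral coefficients per frame (an assumption, though it is consistent with feaDim=13 for MFCC_E) and uses the usual HTK meaning of the qualifiers: _E appends an energy term, _D appends delta coefficients, _A appends acceleration coefficients, and _Z subtracts the cepstral mean without changing the dimension.

```matlab
% Sketch: derive the feature dimension from an HTK-style feature type string.
numCeps = 12;    % assumed number of cepstral coefficients per frame
feaTypes = {'MFCC_E', 'MFCC_E_D_Z', 'MFCC_E_D_A_Z'};
feaDims = zeros(1, numel(feaTypes));
for k = 1:numel(feaTypes)
    quals = strsplit(feaTypes{k}, '_');               % e.g. {'MFCC','E','D','Z'}
    staticDim = numCeps + any(strcmp(quals, 'E'));    % _E appends one energy term
    % one block of static features, plus one block each for _D and _A
    blocks = 1 + any(strcmp(quals, 'D')) + any(strcmp(quals, 'A'));
    feaDims(k) = staticDim * blocks;                  % _Z adds no dimension
    fprintf('%-14s -> %d dimensions\n', feaTypes{k}, feaDims(k));
end
```

The computed values should agree with the feaDim and streamWidth settings passed to htkTrainTest in Examples 1 to 3.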


Audio Signal Processing and Recognition