17-2 HTK Example: Digit Recognition


To implement CHMM, we usually rely on HTK (Hidden Markov Model Toolkit) for corpus preparation and training. In this section, we shall use a simple example to demonstrate the use of HTK. In this example, we want to construct CHMMs for recognizing the Mandarin digits from 0 to 9. You will learn how to perform corpus training and how to compute recognition rates.

In this page, we intend to display files of several extensions (mlf, scp, template, cfg, pam, init, hmm, net, grammar, etc.) directly in the web browser via iframe tags. To this end, you need to cancel the application associations for these extensions. (Otherwise, popup windows will ask whether you want to download the files.) You can use the following batch file to cancel the extension-application associations:

assoc .mlf=
assoc .scp=
assoc .template=
assoc .cfg=
assoc .pam=
assoc .init=
assoc .hmm=
assoc .net=
assoc .grammar=

Please run the above batch file under the DOS prompt. Then you can reload this page so that all files are displayed correctly within iframe tags.

You can use the command "assoc" to specify or delete the application associated with a file extension. For instance, to delete the association for the extension "scp", type "assoc .scp=" under the DOS prompt.

Before moving on, you need to download the following files for this section:

Open a DOS window and change directory to chineseDigitRecog/training. Type "goSyl13.bat" in the DOS window to start training and performance evaluation. (You can also execute goSyl13.m within MATLAB to get the same results.) If the command runs smoothly until two confusion matrices are displayed on the screen, then everything is set up correctly.

Before starting corpus training, we need to prepare two files manually. The first file is "digitSyl.pam", which specifies how to decompose the phonetic alphabet of each digit into the corresponding acoustic models. For simplicity, our current approach uses each syllable as an acoustic model, as follows:

ba	ba
er	er
jiou	jiou
ling	ling
liou	liou
qi	qi
san	san
si	si
sil	sil
wu	wu
i	i

The above decomposition is a rough but simple way to define syllable-based acoustic models for this application. More delicate decompositions based on monophones or biphones will be explained later.

The extension "pam" stands for "phonetic alphabet to model". This file is also referred to as the dictionary file in HTK.

The second file is digitSyl.mlf, which defines the contents (in the form of acoustic models) of each utterance, as follows:


In the above file, we use sil for silence. We add a leading and a trailing sil around ling to indicate that the utterance of "0" is surrounded by silence.

  • The extension of "mlf" stands for master label file, which is used to record the contents of each utterance.
  • HTK uses feature file names to map to the contents in an mlf file. Hence the feature file names should be unique. (The wave file names do not have to be unique.)
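Because the actual contents of digitSyl.mlf are shown via an iframe on the original page, here is a hedged sketch of the general shape of an HTK master label file; the .lab base names below are hypothetical, not the tutorial's actual file names:

```
#!MLF!#
"*/ling_01.lab"
sil
ling
sil
.
"*/i_01.lab"
sil
i
sil
.
```

Each entry starts with a (pattern-matched) label file name, lists one model per line, and ends with a line containing a single period.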

In the following, we shall explain the contents of the MATLAB script goSyl13.m and the DOS batch file goSyl13.bat. Both files involve three major tasks:

  1. Extraction of acoustic features of MFCC.
  2. Corpus training based on EM to find the optimum parameters.
  3. Performance evaluation based on recognition rate.
We shall introduce these three steps in detail.

  1. Acoustic feature extraction of MFCC

    1. Create output directories
      We need to create 3 directories for holding output files:
      • output: For various intermediate output files
      • output\feature: For feature files of all utterances.
      • output\hmm: For HMM parameters during training
      The MATLAB commands are
      The batch command is:
      for %%i in (output output\feature output\hmm) do mkdir %%i > nul 2>&1
      If the directories exist, the batch command will not display any warning messages.

      The batch command listed above is copied directly from goSyl13.bat. If you want to execute the command directly at the DOS prompt, be sure to change %%i to %i. The same applies to the following discussion.
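For readers following along in another language, the directory creation can be sketched in Python (this is not part of the tutorial's MATLAB/batch toolchain; it only mirrors the batch command above):

```python
import os

# Create the three output directories; exist_ok=True suppresses errors for
# directories that already exist, mirroring the batch command's "> nul 2>&1".
for d in ["output", os.path.join("output", "feature"), os.path.join("output", "hmm")]:
    os.makedirs(d, exist_ok=True)
```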

    2. Generate digitSyl.mnl and digitSylPhone.mlf
      The MATLAB command for generating syl2phone.scp is:
      fid=fopen('output\syl2phone.scp', 'w'); fprintf(fid, 'EX'); fclose(fid);
      The corresponding batch command is:
      @echo EX > output\syl2phone.scp
      The contents of syl2phone.scp are shown next:


      The string "EX" represents "expand", which serves to expand syllables into acoustic models, to be used with the HTK command "HLEd". This command is used to generate digitSyl.mnl and digitSylPhone.mlf, as shown next:

      HLEd -n output\digitSyl.mnl -d digitSyl.pam -l * -i output\digitSylPhone.mlf output\syl2phone.scp digitSyl.mlf
      In the above command, the input files are digitSyl.pam, output\syl2phone.scp, and digitSyl.mlf, while the output files are output\digitSyl.mnl and output\digitSylPhone.mlf. The output file digitSyl.mnl lists all the used acoustic models:


      The extension "mnl" stands for "model name list".

      The file digitSylPhone.mlf contains the results of converting the syllable information in digitSyl.mlf into acoustic models for corpus training, as follows:


      In this example, since we are using syllable-based acoustic models, the contents of digitSylPhone.mlf are the same as those in digitSyl.mlf.

    3. Generate wav2fea.scp
      Before extracting acoustic features, we need to specify the file mapping between each utterance (with extension .wav) and its corresponding feature file (with extension .fea). This mapping is specified in the file wav2fea.scp, which can be generated by the following MATLAB commands:
      waveFiles=recursiveFileList(wavDir, 'wav');
      fid=fopen(outFile, 'w');
      for i=1:length(waveFiles)
      	wavePath=strrep(waveFiles(i).path, '/', '\');
      	[parentDir, b]=fileparts(wavePath);	% b is the base file name
      	fprintf(fid, '%s\t%s\r\n', wavePath, ['output\feature\', b, '.fea']);
      end
      fclose(fid);
      The corresponding batch command is much simpler:
      (for /f "delims=" %%i in ('dir/s/b wave\*.wav') do @echo %%i output\feature\%%~ni.fea)> output\wav2fea.scp
      The contents of wav2fea.scp are shown next:


      From the contents of wav2fea.scp, we know that all the feature files will be put under "output\feature" with a file extension of "fea".
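The mapping step can also be sketched in Python; the helper name make_wav2fea_scp and the demo paths are hypothetical, and this sketch only mirrors what the MATLAB/batch commands above do:

```python
import os

def make_wav2fea_scp(wav_dir, scp_path):
    """Write one 'wavePath<TAB>feaPath' line per .wav file found under
    wav_dir, mirroring the contents of wav2fea.scp."""
    with open(scp_path, "w") as fid:
        for root, _, files in os.walk(wav_dir):
            for f in sorted(files):
                if f.lower().endswith(".wav"):
                    base = os.path.splitext(f)[0]
                    fea = os.path.join("output", "feature", base + ".fea")
                    fid.write("%s\t%s\n" % (os.path.join(root, f), fea))
```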

    4. Use HCopy.exe for acoustic feature extraction
      Now we can use HTK command "HCopy" to generate MFCC feature files for all utterances:
      HCopy -C mfcc13.cfg -S output\wav2fea.scp
      In the above expression, mfcc13.cfg is a configuration file which specifies parameters for generating MFCC, with the following contents:

      TARGETRATE = 100000.0
      WINDOWSIZE = 200000.0
      PREEMCOEF = 0.975
      NUMCHANS = 26
      CEPLIFTER = 22
      NUMCEPS = 12
      DELTAWINDOW = 2
      ACCWINDOW = 2

      The meanings of these parameters can be found in the HTK manual.
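One detail worth noting is that HTK expresses time-related configuration values in units of 100 ns. A quick Python check (the helper name is ours, not HTK's):

```python
# HTK time values such as TARGETRATE and WINDOWSIZE are in units of 100 ns.
TARGETRATE = 100000.0   # frame shift
WINDOWSIZE = 200000.0   # frame duration

def htk_units_to_ms(v):
    """Convert HTK 100 ns units to milliseconds (10000 units = 1 ms)."""
    return v / 10000.0

print(htk_units_to_ms(TARGETRATE))   # frame shift in ms
print(htk_units_to_ms(WINDOWSIZE))   # window length in ms
```

So the configuration above corresponds to a 10 ms frame shift with a 20 ms analysis window.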

  2. Corpus training based on EM to find the optimum parameters

    1. Generate file lists in trainFea.scp and testFea.scp
      We need to generate file lists for training and test sets, with the following MATLAB commands:
      fid=fopen(outFile, 'w');	% outFile is 'output\trainFea.scp' here
      for i=1:460
      	wavePath=strrep(waveFiles(i).path, '/', '\');
      	[parentDir, b]=fileparts(wavePath);
      	fprintf(fid, '%s\r\n', ['output\feature\', b, '.fea']);
      end
      fclose(fid);
      fid=fopen(outFile, 'w');	% outFile is 'output\testFea.scp' here
      for i=461:length(waveFiles)
      	wavePath=strrep(waveFiles(i).path, '/', '\');
      	[parentDir, b]=fileparts(wavePath);
      	fprintf(fid, '%s\r\n', ['output\feature\', b, '.fea']);
      end
      fclose(fid);
      From the above program, it is obvious that the first 460 files are used for training, while all the others are used for testing. The corresponding batch commands are:
      for %%i in (train test) do (
      	for /f %%j in (%%i.list) do @echo output\feature\%%j.fea
      ) > output\%%iFea.scp
      Note that the above code segment reads its contents from the files train.list and test.list (which are prepared in advance), and generates the files trainFea.scp and testFea.scp for corpus training and recognition rate computation, respectively. The contents of trainFea.scp are:


    2. Generate HMM template file
      For corpus training, we need to generate an HMM template file to specify the model structure, such as how many states in an acoustic model, how many streams in a state, and how many Gaussian components in a stream, and so on. The HTK command is:
      outMacro.exe P D 3 1 MFCC_E 13 > output\template.hmm
      • P: HMM system type, which is fixed to "P" for the time being.
      • D: Type of the covariance matrix, which could be "InvDiagC", "DiagC", or "FullC". The "D" in the above command represents "DiagC", which is the most commonly used setting.
      • 3: Number of states in a model.
      • 1: Indicates each state has 1 stream with 1 Gaussian component. (For example, "5 3" indicates there are 2 streams in a state, with 5 and 3 Gaussian components, respectively.)
      • MFCC_E: The acoustic parameters are MFCC and energy.
      • 13: Dimension of the feature vector.
      If this is done with MATLAB, we need to invoke genTemplateHmmFile.m, as follows:
      genTemplateHmmFile(feaType, feaDim, stateNum, outFile, mixtureNum, streamWidth);
      The generated template.hmm specifies an HMM of 3 states, with 1 stream per state, and 1 component per stream, with the following contents:


      Since this file is used to specify the structure of the HMM, all the parameters are given preset but reasonable values. Moreover, the states are indexed from 2 to 4, since the first and last states are dummy states by HTK convention.
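Since the template contents are shown via an iframe on the original page, here is a hedged sketch of the general shape of such an HTK prototype (13-dimensional MFCC_E, 3 emitting states indexed 2 to 4 out of 5, 1 Gaussian per state); all numeric values below are illustrative placeholders, not the tutorial's actual presets:

```
~o <VecSize> 13 <MFCC_E>
~h "template"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 13
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 13
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  (states 3 and 4 take the same form)
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.6 0.4
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
```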

    3. Populate the HMM template using the whole corpus
      In the next step, we need to compute the initial HMM parameters from the corpus and put them into template.hmm to generate output\hcompv.hmm. By doing so, we obtain a set of parameters (for a single HMM of 3 states) that is a better guess than the preset values in template.hmm. Later on, we shall copy these parameters to all of the HMMs for EM training. The following command populates output\template.hmm to generate output\hcompv.hmm:
      HCompV -m -o hcompv.hmm -M output -I output\digitSylPhone.mlf -S output\trainFea.scp output\template.hmm
      The contents of the generated output\hcompv.hmm are:


      From the contents of output\hcompv.hmm, it can be observed that:

      • The transition probabilities are not changed.
      • The mean and variance of each Gaussian have been changed to the same values for all components. These values are obtained via MLE (maximum likelihood estimation) based on the whole corpus.
    4. Copy the contents of hcompv.hmm to generate macro.init
      In this step, we need to copy the contents of hcompv.hmm to each acoustic model, with the following MATLAB commands:
      % Read digitSyl.mnl
      models = textread(modelListFile,'%s','delimiter','\n','whitespace','');
      % Read hcompv.hmm
      fid=fopen(hmmFile, 'r');
      contents=fread(fid, inf, 'char');
      % Write macro.init
      fid=fopen(outFile, 'w');
      source='~h "hcompv.hmm"';
      for i=1:length(models)
      	target=sprintf('~h "%s"', models{i});
      	x=strrep(contents, source, target);
      	fprintf(fid, '%s', x);
      The corresponding DOS batch commands are:
        (for /f %%i in (output\digitSyl.mnl) do @sed 's/hcompv.hmm/%%i/g' output\hcompv.hmm) > output\macro.init
      The generated HMM parameter file is macro.init, with the following contents:


      This file contains the HMM parameters of all 11 acoustic models (sil, ling, i, er, san, si, wu, liou, qi, ba, jiou).
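The copy-and-rename idea behind the sed loop can be sketched in Python; the function name replicate_macro and the demo string are hypothetical:

```python
def replicate_macro(hcompv_text, model_names):
    """Mimic the sed loop above: for each model name, emit a copy of
    hcompv.hmm's contents with 'hcompv.hmm' replaced by the model name."""
    return "".join(hcompv_text.replace("hcompv.hmm", name) for name in model_names)

demo = '~h "hcompv.hmm"\n<BeginHMM> ... <EndHMM>\n'   # abbreviated contents
print(replicate_macro(demo, ["sil", "ling"]))
```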

    5. Use mxup.scp to modify macro.init to generate macro.0
      We copy output\macro.init to output\hmm\macro.0 first, and then use HHEd.exe to modify macro.0, with the following MATLAB commands:
      fid=fopen('output\mxup.scp', 'w'); fprintf(fid, 'MU 3 {*.state[2-4].mix}'); fclose(fid);
      copyfile('output/macro.init', 'output/hmm/macro.0');
      cmd='HHEd -H output\hmm\macro.0 output\mxup.scp output\digitSyl.mnl';
      dos(cmd);
      The corresponding batch commands are:
      copy /y output\macro.init output\hmm\macro.0
      (@echo MU 3 {*.state[2-4].mix}) > output\mxup.scp
      HHEd -H output\hmm\macro.0 output\mxup.scp output\digitSyl.mnl
      The contents of mxup.scp are shown next:

      MU 3 {*.state[2-4].mix}

      Its function is to increase the number of mixture components (of states 2 to 4) from 1 to 3. The contents of the generated macro.0 are:


      From the above contents, we can observe that the variances of the three mixtures of a given state are the same, but their mean vectors are different in order to better cover the dataset.

    6. Perform re-estimation to generate macro.1~macro.5
      Now we can start training to find the best parameters for each acoustic model, with the MATLAB commands:
      for i=1:emCount
      	sourceMacro=['output\hmm\macro.', int2str(i-1)];
      	targetMacro=['output\hmm\macro.', int2str(i)];
      	fprintf('%d/%d: %s...\n', i, emCount, targetMacro);
      	copyfile(sourceMacro, targetMacro);
      	cmd=sprintf('HERest -H %s -I output\\digitSylPhone.mlf -S output\\trainFea.scp output\\digitSyl.mnl', targetMacro);
      	dos(cmd);
      end
      The corresponding batch commands are:
      set current=0
      :loop
      	set /a prev=current
      	set /a current+=1
      	copy /y output\hmm\macro.%prev% output\hmm\macro.%current%
      	set cmd=HERest -H output\hmm\macro.%current% -I output\digitSylPhone.mlf -S output\trainFea.scp output\digitSyl.mnl
      	echo %cmd%
      	%cmd%
      if not %current%==5 goto :loop
      In the above commands, we use macro.0 as the initial guess for corpus training to generate macro.1, and then use macro.1 to perform re-estimation to generate macro.2. This re-estimation is repeated five times to generate macro.1 ~ macro.5 in "output\hmm\macro.*".

      Since we do not have phone-level transcriptions, HTK uses a flat start (equal division) to assign the frames of an utterance uniformly to its transcribed sequence of models and states.
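The idea of equal division can be sketched as follows; this Python function illustrates the concept of a flat start, not HTK's exact implementation:

```python
def flat_start_segments(frame_count, state_count):
    """Divide frame_count frames as evenly as possible among state_count
    units, returning (start, end) frame index pairs per unit."""
    bounds = [round(i * frame_count / state_count) for i in range(state_count + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(state_count)]

# E.g., 10 frames split across 3 states:
print(flat_start_segments(10, 3))
```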

  3. Performance evaluation based on recognition rate

    1. Use digit.grammar to generate digit.net
      After corpus training, we need to evaluate the recognition rate based on a test data set. First of all, we need to construct the lexicon net, as follows:
      HParse digit.grammar output\digit.net
      The contents of digit.grammar are:

      $syl=( ling | i | er | san | si | wu | liou | qi | ba | jiou );
      (sil $syl sil)

      The contents of the generated digit.net are:


      The schematic diagram of the net is shown next:

      Note that "!NULL" is a dummy node whose inputs can be connected to its fan-outs directly.

    2. Evaluate the recognition rates for both the training and test sets
      This is achieved by the following HTK commands:
      HVite -H output\macro -l * -i output\result_test.mlf -w output\digit.net -S output\testFea.scp digitSyl.pam output\digitSyl.mnl
      The contents of the output file result_test.mlf are:


      By using a similar command (with output\trainFea.scp instead), we can also compute the recognition rate of the training set.

    3. Generate the confusion matrices for both inside and outside tests
      Finally, we can use the following commands to generate the confusion matrices:
      findstr /v "sil" output\result_test.mlf > output\result_test_no_sil.mlf
      findstr /v "sil" digitSyl.mlf > output\answer.mlf
      HResults -p -I output\answer.mlf digitSyl.pam output\result_test_no_sil.mlf > output\outsideTest.txt
      type output\outsideTest.txt
      The confusion matrix for the outside test is:


      Similarly, the confusion matrix for the inside test is:


      As usual, the outside test is not as good as the inside test. One possible reason is that the training corpus is not big enough to cover the variety of accents from different individuals. Another possibility is the structure of the acoustic models. In the subsequent sections, we shall explore other model structures to improve the performance.
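For reference, the bookkeeping behind a confusion matrix can be sketched in Python; this is similar in spirit to what HResults reports, though the function and its interface are ours:

```python
def confusion_matrix(answers, predictions, labels):
    """Tally a confusion matrix (rows = answers, columns = predictions)
    and the overall recognition rate from paired label lists."""
    index = {lab: i for i, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(answers, predictions):
        matrix[index[a]][index[p]] += 1
    correct = sum(matrix[i][i] for i in range(len(labels)))
    return matrix, correct / len(answers)
```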

Audio Signal Processing and Recognition