20-2 ASRA for English

In this chapter, we shall cover the use of ASRA for spoken English on Windows. If you want to read the document for other platforms, simply change the platform option in the URL of this page. For instance:

ASRA document for English on Windows
ASRA document for English on Unix/Linux
To download the ASRA package for English, try the following links:

ASRA package for English on Windows
ASRA package for English on Unix/Linux
We shall cover the following items:

Directory/file structures of the package
Functionalities of ASRA
Compilation of the main programs with the libraries in the package
Other related issues of using ASRA
For further technical details, please contact Roger Jang at jang@cs.nthu.edu.tw.
For simplicity, we shall use {asraRoot} to denote the root directory of the ASRA package. Here is a list of directories/files under {asraRoot}:

Directories:

asraData/English: Fixed data files of ASRA for English, including acoustic models, pronunciation dictionary, etc.
testInputEnglish: Input data for testing English ASRA
docEnglish: Documents of ASRA
lib: Library files for ASRA (mainly *.h and asra.lib)
mainProgram: Main programs for linking ASRA library
output: Output directory for ASRA
script4winEnglish: Scripts (*.bat) for invoking the main programs

Files:

expDate.txt: Expiration date of the ASRA library
asra.dll, asra.lib, asra.exp: Dynamic library files for ASRA.
vcVersionSet.bat: Sets the Visual C++ version for compiling the main programs together with the libraries. The same setting is used to compile the shipped static libraries.

The first thing you can do is to enter {asraRoot}/mainProgram and check out all the main programs. These main programs can be divided into two groups, speech assessment (SA) and voice command (VC), which are the two major functions provided by ASRA:

Speech assessment (SA)

saLibFile.cpp: SA program compiled with LIB, which takes a wav file as the input
saDllFile.cpp: SA program using DLL, which takes a wav file as the input
saDllRecording.cpp: SA program using DLL, which takes microphone recording as the input
saLibFile_multipleInput.cpp: SA program compiled with LIB, which takes multiple wav files as the input (This is a typical example of allocating memory once to process multiple wav files.)

Voice command (VC)

vcLibFile.cpp: VC program compiled with LIB, which takes a wav file as the input
vcDllFile.cpp: VC program using DLL, which takes a wav file as the input
vcDllRecording.cpp: VC program using DLL, which takes microphone recording as the input
vcLibFile_multipleInput.cpp: VC program compiled with LIB, which takes multiple wav files as the input (This is a typical example of allocating memory once to process multiple wav files.)

To compile all the above main programs, you can simply type the following command within a DOS window:
goMainCompile.bat
This will set up the environment variables for Visual C++ (using the batch file {asraRoot}/vcVersionSet.bat) and compile all the main programs accordingly.
If there is no problem with compilation, we shall have an executable for each of the above main programs. These executables can then be used for SA or VC in different scenarios. To try out the executables, we can move to {asraRoot}/script4winEnglish to run some scripts that invoke these main programs:

For SA:

Run "goSaLibFile.bat" to invoke saLibFile.exe for SA using an input wave file (16 kHz, 16 bits, mono).
Run "goSaLibFile_multipleInput.bat" to invoke saLibFile_multipleInput.exe for SA using multiple input wave files (16 kHz, 16 bits, mono).
Run "goSaDllRecording.bat" to invoke saDllRecording.exe for SA using direct microphone input.
(The detailed scoring result is stored at {asraRoot}/output/output.xml)
For VC:

Run "goVcLibFile.bat" to invoke vcLibFile.exe for VC using an input wave file (16 kHz, 16 bits, mono).
Run "goVcLibFile_multipleInput.bat" to invoke vcLibFile_multipleInput.exe for VC using multiple input wave files (16 kHz, 16 bits, mono).
Run "goVcDllRecording.bat" to invoke vcDllRecording.exe for VC using microphone input.
(The recognizable vocabulary is located at {asraRoot}/testInputEnglish/english0200.txt)
After the execution of the above scripts, several output files are generated under {asraRoot}/output. You can reuse these files as inputs to save computation and speed up SA/VC significantly. For example:

For English SA, after running "goSaLibFile.bat", you can copy files as follows:

{asraRoot}/output/output.net ===> {asraRoot}/testInputEnglish/what_are_you_allergic_to.net
{asraRoot}/output/output.wpa ===> {asraRoot}/testInputEnglish/what_are_you_allergic_to.wpa
You can then run "goSaLibFile_fast.bat" to see the speed difference. (In fact, these files were put in the right place already. You can simply run "goSaLibFile.bat" and "goSaLibFile_fast.bat" to compare their speed difference.)
For English VC, after running "goVcLibFile.bat", you can copy files as follows:

{asraRoot}/output/output.syl ===> {asraRoot}/testInputEnglish/english0200.syl
{asraRoot}/output/output.net ===> {asraRoot}/testInputEnglish/english0200.net
{asraRoot}/output/output.wpa ===> {asraRoot}/testInputEnglish/english0200.wpa
You can then run "goVcLibFile_fast.bat" to see the speed difference. (In fact, these files were put in the right place already. You can simply run "goVcLibFile.bat" and "goVcLibFile_fast.bat" to compare their speed difference.)
Here is a list of the input files:

english.macb: Macro file containing all the HMM parameters for English acoustic models
english.wpa: WPA file with the phonetic alphabets for all English words. Note that you should not modify this file, since it is automatically generated from the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). The mapping table between the CMU phonetic alphabets and the commonly used KK phonetic alphabets can be found at http://blog.urdada.net/2005/07/17/17/ . If you need new entries (which correspond to new words in your recognizable texts), add them to english.wpaAddenda instead.
english.wpaAddenda: User-defined WPA file, which can be used to hold extra entries not in english.wpa.
english.qiYin: Phones of unvoiced sounds.
mfcc2.cfg: Configuration file for MFCC
phoneRank2scoreParam.txt: Parameters for converting phone ranking into scores.
scoreDiscountParam.txt: Parameters for score discount
english.tnm: Text normalization mapping. You can open this file with a text editor to see the mapping for text normalization.
You can also open the recognition parameter file using a text editor. A brief explanation of each entry in the parameter file is also given in the file.
If necessary, you can specify "outputDir" in the prm file to set the directory that stores the output of ASRA. Some of the output files may help with debugging:

output.xml: This is the major output file for SA/VC, which lists all the details for computing the final score of confidence measure. Some comments are interleaved into the file to make it self-explanatory.
output.wpa: Minimum wpa file containing the phonetic alphabets (PA) for the given txt file. You can use this file instead of the original comprehensive wpa file to shorten the loading time. (Of course, the minimum wpa file is only good for the corresponding txt file.)
output.syl: All possible PA sequences for the sentences in the txt file. The first column is the PA sequence; the second column is the index (0-based) into the txt file. If some of the PA sequences are unlikely, you can simply delete them and use the updated output.syl as the input to ASRA for VC. (Note that SA will not generate output.syl.)
output.net: Lexicon net (same format as HTK's net file)
Other technical issues:

For best results, ASRA takes .wav files in 16 kHz, 16-bit, mono, linear PCM as the standard format. Other formats may or may not be accepted. ASRA will sometimes attempt format conversion if the format mismatches slightly, but please stick to the standard format.
ASRA can also take a subset of .flv files recorded by Flash. However, you should avoid .flv files since they use lossy compression and are thus not good for SA/VC.
When invoking the main programs, we usually use the parameter files {asraRoot}/testInputEnglish/english.sa.prm or {asraRoot}/testInputEnglish/english.vc.prm. The command-line arguments always take precedence over the entries in these parameter files. Moreover, the entries in these parameter files have been optimized, so please do not change them unless you know the meaning of each entry.
The library files were compiled with the /MD option to enable multi-threading.
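Since the standard wav format matters for scoring quality, it can be useful to check a file before passing it to ASRA. The following C++17 sketch is not part of the ASRA package and does not use its API. It walks the RIFF chunks of a wav file held in memory and tests the "fmt " chunk against the standard format (16 kHz, 16 bits, mono, linear PCM). The names WavFormat, parseWavFormat, and isAsraStandardFormat are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

struct WavFormat {
    std::uint16_t formatTag;      // 1 = linear PCM
    std::uint16_t channels;       // 1 = mono
    std::uint32_t sampleRate;     // samples per second
    std::uint16_t bitsPerSample;
};

// RIFF stores all integers in little-endian byte order.
static std::uint16_t readLe16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t readLe32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Walk the RIFF chunks and return the fields of the "fmt " chunk, if present.
std::optional<WavFormat> parseWavFormat(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < 12 ||
        std::memcmp(bytes.data(), "RIFF", 4) != 0 ||
        std::memcmp(bytes.data() + 8, "WAVE", 4) != 0) {
        return std::nullopt;  // not a RIFF/WAVE file
    }
    std::size_t pos = 12;
    while (pos + 8 <= bytes.size()) {
        const std::uint32_t chunkSize = readLe32(bytes.data() + pos + 4);
        const std::size_t body = pos + 8;
        if (std::memcmp(bytes.data() + pos, "fmt ", 4) == 0) {
            if (chunkSize < 16 || bytes.size() - body < 16) return std::nullopt;
            const std::uint8_t* f = bytes.data() + body;
            return WavFormat{readLe16(f), readLe16(f + 2),
                             readLe32(f + 4), readLe16(f + 14)};
        }
        // Chunk bodies are padded to an even number of bytes.
        const std::size_t step = static_cast<std::size_t>(chunkSize) + (chunkSize & 1u);
        if (step > bytes.size() - body) break;  // truncated file
        pos = body + step;
    }
    return std::nullopt;
}

// True only for 16 kHz, 16-bit, mono, linear PCM.
bool isAsraStandardFormat(const WavFormat& f) {
    return f.formatTag == 1 && f.channels == 1 &&
           f.sampleRate == 16000 && f.bitsPerSample == 16;
}
```

To use it, read the whole file into a std::vector<std::uint8_t> (for example with std::ifstream in binary mode), call parseWavFormat, and pass the result to isAsraStandardFormat. Files that fail the check are better converted to the standard format first than left to automatic conversion.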
Usually there are four factors to consider for scoring in speech assessment:

Timbre for each phone
Intonation profile for the whole utterance
Volume profile for the whole utterance
Duration for each phone
The scoring mechanisms for these four factors differ. For timbre, the scoring is based on comparing the voice against acoustic models (AM). For the other three factors, the scoring is based on voice-to-voice comparison. It is almost impossible to score timbre by voice-to-voice comparison, since utterances of the same text from different speakers differ a lot in their acoustic features. As a result, we need statistical acoustic models for more objective timbre scoring. The current version provides scores for timbre, which is the most important factor in speech assessment. Scoring mechanisms for intonation, volume, and duration will be provided later, but there is no definite schedule at this moment.

Audio Signal Processing and Recognition (音訊處理與辨識)