In this chapter, we shall cover the use of ASRA for spoken English on Unix/Linux. If you want to read the document for other platforms, simply change the platform option in the URL of this page. For instance:
To download the ASRA package for English, try the following links:
We shall cover the following items:
- Directory/file structures of the package
- Functionalities of ASRA
- Compilation of the main programs with the libraries in the package
- Other related issues of using ASRA
For further technical details, please contact Roger Jang at jang@cs.nthu.edu.tw.
For simplicity, we shall use {asraRoot} to denote the root directory of the ASRA package. Here is a list of directories/files under {asraRoot}:
- Directories:
  - asraData/English: Fixed data files of ASRA for English, including acoustic models, pronunciation dictionary, etc.
  - docEnglish: Documents of ASRA
  - lib: Library files for ASRA (mainly *.h and asra.lib)
  - mainProgram: Main programs for linking ASRA library
  - output: Output directory for ASRA
  - script4unixEnglish: Scripts (*.sh) for invoking the main programs
  - testInputEnglish: Input data files for testing English ASRA
- Files:
  - expDate.txt: Expiration date of the ASRA library
The first thing you can do is to enter {asraRoot}/mainProgram and check out all the main programs. These main programs can be divided into two groups, speech assessment (SA) and voice command (VC), which are the two major functions provided by ASRA:
- Speech assessment (SA)
  - saLibFile.cpp: SA program, which takes a wav file as the input
  - saLibFile_multipleInput.cpp: SA program, which takes multiple wav files as the input (This is a typical example of allocating memory once to process multiple wav files.)
- Voice command (VC)
  - vcLibFile.cpp: VC program, which takes a wav file as the input
  - vcLibFile_multipleInput.cpp: VC program, which takes multiple wav files as the input (This is a typical example of allocating memory once to process multiple wav files.)
To compile all the above main programs, you can simply type the following command within a terminal window:
bash goMainCompile.sh
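A quick way to confirm that the build produced every executable is to look for the expected file names. The following is a minimal sketch and not part of ASRA: it assumes each main program compiles to an executable named after its source file with an .exe suffix (as in saLibFile.exe), and it takes the directory to check as an argument because this chapter does not state where goMainCompile.sh places its output. The helper name missing_executables is hypothetical.

```python
from pathlib import Path

# Source-file stems of the four main programs listed in this chapter.
MAIN_PROGRAMS = (
    "saLibFile",
    "saLibFile_multipleInput",
    "vcLibFile",
    "vcLibFile_multipleInput",
)


def missing_executables(bin_dir, suffix=".exe"):
    """Return the expected executable names that are absent from bin_dir.

    Assumption: each program is built as <stem><suffix>; pass a different
    suffix (or "") if your build names the files differently.
    """
    bin_dir = Path(bin_dir)
    expected = [stem + suffix for stem in MAIN_PROGRAMS]
    return [name for name in expected if not (bin_dir / name).is_file()]
```

An empty list means all four executables were found in the directory you passed in.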
If there is no problem with compilation, we shall have an executable for each of the above main programs. These executables can then be used for SA or VC in different scenarios. To try out the executables, we can move to {asraRoot}/script4unixEnglish to run some scripts that invoke these main programs:
- For SA:
  - Run "bash goSaLibFile.sh" to invoke saLibFile.exe for SA using an input wave file (16 kHz, 16 bits, mono).
  - Run "bash goSaLibFile_multipleInput.sh" to invoke saLibFile_multipleInput.exe for SA using multiple input wave files (16 kHz, 16 bits, mono).
  (The detailed scoring result is stored at {asraRoot}/output/output.xml)
- For VC:
  - Run "bash goVcLibFile.sh" to invoke vcLibFile.exe for VC using an input wave file (16 kHz, 16 bits, mono).
  - Run "bash goVcLibFile_multipleInput.sh" to invoke vcLibFile_multipleInput.exe for VC using multiple input wave files (16 kHz, 16 bits, mono).
  (The recognizable vocabulary is located at {asraRoot}/testInputEnglish/english0200.txt)
After the execution of the above scripts, several output files are generated under {asraRoot}/output. You can use these files to save computation and speed up SA/VC significantly. For example:
- For English SA, after running "bash goSaLibFile.sh", you can copy files as follows:
  - {asraRoot}/output/output.net ===> {asraRoot}/testInputEnglish/what_are_you_allergic_to.net
  - {asraRoot}/output/output.wpa ===> {asraRoot}/testInputEnglish/what_are_you_allergic_to.wpa
  You can then run "bash goSaLibFile_fast.sh" to see the speed difference. (In fact, these files were put in the right place already. You can simply run "bash goSaLibFile.sh" and "bash goSaLibFile_fast.sh" to compare their speed.)
- For English VC, after running "bash goVcLibFile.sh", you can copy files as follows:
  - {asraRoot}/output/output.syl ===> {asraRoot}/testInputEnglish/english0200.syl
  - {asraRoot}/output/output.net ===> {asraRoot}/testInputEnglish/english0200.net
  - {asraRoot}/output/output.wpa ===> {asraRoot}/testInputEnglish/english0200.wpa
  You can then run "bash goVcLibFile_fast.sh" to see the speed difference. (In fact, these files were put in the right place already. You can simply run "bash goVcLibFile.sh" and "bash goVcLibFile_fast.sh" to compare their speed.)
Here is a list of the input files:
- english.macb: Macro file containing all the HMM parameters for English acoustic models
- english.wpa: WPA file with the phonetic alphabets for all English words. Note that you should not modify this file since it is automatically generated from the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). The mapping table between CMU phonetic alphabets and the commonly used KK phonetic alphabets can be found at http://blog.urdada.net/2005/07/17/17/ If you want to add new entries (which correspond to new words in your recognizable texts), put them in english.wpaAddenda instead.
- english.wpaAddenda: User-defined WPA file, which can be used to hold extra entries not in english.wpa.
- english.qiYin: Phones of unvoiced sounds.
- mfcc2.cfg: Configuration file for MFCC
- phoneRank2scoreParam.txt: Parameters for converting phone ranking into scores.
- scoreDiscountParam.txt: Parameters for score discount
- english.tnm: Text normalization mapping. You can open this file with a text editor to see the mapping for text normalization.
You can also open the recognition parameter file using a text editor. A brief explanation of each entry is given in the file itself.
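The copy step used for the fast SA and VC runs can also be scripted. The sketch below is an illustration and not part of ASRA: it copies output.<ext> files from {asraRoot}/output to {asraRoot}/testInputEnglish under a given base name, following the file mappings listed earlier. The function name cache_outputs and its arguments are hypothetical, and asra_root stands for {asraRoot}.

```python
import shutil
from pathlib import Path


def cache_outputs(asra_root, base_name, extensions):
    """Copy output/output.<ext> to testInputEnglish/<base_name>.<ext>.

    Returns the list of destination paths that were written. Extensions
    whose output file does not exist are skipped (SA, for instance,
    does not generate output.syl).
    """
    root = Path(asra_root)
    copied = []
    for ext in extensions:
        src = root / "output" / f"output.{ext}"
        dst = root / "testInputEnglish" / f"{base_name}.{ext}"
        if src.is_file():
            shutil.copyfile(src, dst)
            copied.append(dst)
    return copied


# Mappings taken from the SA and VC lists in this chapter:
#   SA: cache_outputs(asra_root, "what_are_you_allergic_to", ["net", "wpa"])
#   VC: cache_outputs(asra_root, "english0200", ["syl", "net", "wpa"])
```

Skipping missing files instead of raising keeps one helper usable for both SA and VC, at the cost of having to inspect the returned list to see what was actually copied.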
If necessary, you can specify "outputDir" in the prm file to store the output of ASRA. Some of the output files may help your debugging:
- output.xml: This is the major output file for SA/VC, which lists all the details for computing the final score of confidence measure. Some comments are interleaved into the file to make it self-explanatory.
- output.wpa: Minimum wpa file containing phonetic alphabets (PA) for the given txt file. You can use this file in place of the original comprehensive wpa file to shorten the loading time. (Of course, the minimum wpa file is only good for the corresponding txt file.)
- output.syl: All possible PA sequences for sentences in the txt file. The first column is the PA sequence; the second column is the index (0-based) into the txt file. If some of the PA sequences are unlikely, you can simply delete them and use the updated output.syl as the input to ASRA for VC. (Note that SA will not generate output.syl.)
- output.net: Lexicon net (same format as HTK's net file)
Other technical issues:
- To achieve the best results, ASRA takes .wav files of 16 kHz, 16 bits, mono, linear PCM as the standard format. Other formats may or may not be accepted. ASRA will sometimes try to convert the format if it mismatches slightly. For best results, please stick to the standard format.
- ASRA can also take a subset of .flv files recorded by FLASH. However, you should avoid using .flv files since they are compressed in a lossy way and thus not good for SA/VC.
- When invoking the main programs, we usually use parameter files {asraRoot}/testInputEnglish/english.sa.prm or {asraRoot}/testInputEnglish/english.vc.prm. The command-line arguments always take precedence over the entries in these parameter files. Moreover, the entries of these parameter files are optimized; please do not change them unless you know the meaning of each entry.
- The library files were compiled with /MD options to enable multi-threading.
- Usually there are four factors to consider for scoring in speech assessment:
  - Timbre for each phone
  - Intonation profile for the whole utterance
  - Volume profile for the whole utterance
  - Duration for each phone
  The scoring mechanisms for these four factors differ. For timbre, the scoring is based on voice-to-AM (acoustic model) comparison. For the other three factors, the scoring is based on voice-to-voice comparison. It is almost impossible to base timbre scores on voice-to-voice comparison since utterances of the same text from different persons differ a lot in their acoustic features. As a result, we need statistical acoustic models for more objective scoring of timbre. Our current version provides scores for timbre, which is the most important factor in speech assessment. Scoring mechanisms for intonation, volume, and duration will be provided later, but there is no definite schedule at this moment.
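Since the standard input format matters for the results, it can help to verify a wave file before passing it to ASRA. The following minimal sketch uses Python's standard wave module to report any deviation from 16 kHz, 16 bits, mono. It is an independent check, not ASRA's own validation or conversion logic, and the function name wav_format_problems is hypothetical.

```python
import wave


def wav_format_problems(path, rate=16000, sample_bytes=2, channels=1):
    """Return a list of human-readable mismatches; an empty list means OK."""
    problems = []
    try:
        with wave.open(str(path), "rb") as wav:
            if wav.getframerate() != rate:
                problems.append(
                    f"sample rate is {wav.getframerate()} Hz, expected {rate}"
                )
            if wav.getsampwidth() != sample_bytes:
                problems.append(
                    f"sample width is {8 * wav.getsampwidth()} bits, "
                    f"expected {8 * sample_bytes}"
                )
            if wav.getnchannels() != channels:
                problems.append(
                    f"{wav.getnchannels()} channels, expected {channels}"
                )
    except (wave.Error, EOFError) as err:
        # Raised when the wave module cannot parse the file, for example
        # a non-RIFF file, a compressed format, or a truncated header.
        problems.append(f"not a readable PCM wav file: {err}")
    return problems
```

Files that the wave module cannot parse are reported as a single problem rather than raising, so the function can be run over a whole directory of test inputs.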
Audio Signal Processing and Recognition (音訊處理與辨識)