The goal of end-point detection (EPD for short) is to identify the portions of an audio segment that actually contain speech, so that only those portions are passed on for further processing. Hence EPD is also known as "voice activity detection" (VAD) or "speech detection". EPD plays an essential role since it is usually the first step in audio signal processing and recognition.
Based on the acoustic features used, we can classify EPD methods into two types:
- Time-domain methods:
The computational load of these methods is usually small, so they can be ported to low-end platforms such as microcontrollers.
- Volume only: Volume is the most commonly used feature for EPD. However, it is usually hard to have a single universal threshold for EPD. (In particular, a single volume threshold for EPD is likely to misclassify unvoiced sounds as silence for audio input from uni-directional microphones.)
- Volume and ZCR: ZCR can be used in conjunction with volume to identify unvoiced sounds more reliably, as explained in the next section (see the first sketch after this list).
- Frequency-domain methods:
These methods usually require more computing power and thus are not portable to low-end platforms.
- Variance in spectrum: Voiced sounds have more regular amplitude spectra, leading to smaller spectral variances.
- Entropy in spectrum: The regular amplitude spectra of voiced sounds also yield low entropy, which can be used as a criterion for EPD (see the second sketch after this list).
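To make the volume-plus-ZCR idea concrete, here is a minimal sketch in Python/NumPy (an illustration, not the text's reference implementation). The frame size, overlap, and the relative thresholds `vol_ratio` and `zcr_ratio` are assumed values, and `frame_signal` is a small hypothetical helper; a practical system would tune these thresholds or estimate them from a known leading-silence segment.

```python
import numpy as np

def frame_signal(x, frame_size=256, overlap=128):
    """Split a 1-D signal into overlapping frames (hypothetical helper)."""
    step = frame_size - overlap
    n_frames = 1 + (len(x) - frame_size) // step
    return np.stack([x[i * step : i * step + frame_size] for i in range(n_frames)])

def epd_volume_zcr(x, frame_size=256, overlap=128, vol_ratio=0.1, zcr_ratio=0.5):
    """Label each frame as speech (True) or silence (False) using volume and ZCR.

    Thresholds are set relative to the utterance itself (an illustrative
    choice): loud frames are taken as voiced speech, while quieter frames
    with a high zero-crossing rate are kept as likely unvoiced consonants.
    """
    frames = frame_signal(x, frame_size, overlap)
    frames = frames - frames.mean(axis=1, keepdims=True)   # remove DC offset per frame
    volume = np.abs(frames).sum(axis=1)                     # sum of absolute amplitude
    signs = np.signbit(frames).astype(int)
    zcr = np.abs(np.diff(signs, axis=1)).sum(axis=1)        # zero-crossing count per frame

    vol_th = vol_ratio * volume.max()                        # relative volume threshold
    zcr_th = zcr_ratio * zcr.max()                           # relative ZCR threshold
    voiced = volume > vol_th
    unvoiced = (~voiced) & (zcr > zcr_th)                    # low volume but high ZCR
    return voiced | unvoiced
```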
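For the frequency-domain side, a spectral-entropy detector could look like the following sketch (again only an illustration under stated assumptions): each frame's amplitude spectrum is normalized into a probability distribution, and frames whose entropy falls below a relative threshold are labeled as speech. The Hamming window and the threshold `ent_ratio` are assumed values, not figures from the text.

```python
import numpy as np

def epd_spectral_entropy(x, frame_size=256, overlap=128, ent_ratio=0.9):
    """Label each frame as speech (True) or silence (False) via spectral entropy.

    Regular (peaky) spectra of voiced sounds give low entropy, while
    noise-like frames approach the maximum entropy log2(n_bins).
    """
    step = frame_size - overlap
    n_frames = 1 + (len(x) - frame_size) // step
    window = np.hamming(frame_size)
    mask = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = x[i * step : i * step + frame_size] * window
        spectrum = np.abs(np.fft.rfft(frame))
        prob = spectrum / (spectrum.sum() + 1e-12)           # normalize to a distribution
        entropy = -(prob * np.log2(prob + 1e-12)).sum()       # spectral entropy in bits
        mask[i] = entropy < ent_ratio * np.log2(len(prob))    # low entropy -> speech
    return mask
```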
There are two types of errors in EPD, which affect speech recognition differently:
- False Rejection: Speech frames are erroneously identified as silence/noise, leading to decreased recognition rates.
- False Acceptance: Silence/noise frames are erroneously identified as speech frames, which will not cause too much trouble if the recognizer can take short leading/trailing silence into consideration.
The other sections of this chapter will introduce both time-domain and frequency-domain methods for EPD.