6-1 Introduction to End-Point Detection (端點偵測介紹)



The goal of end-point detection (EPD for short) is to identify the important part of an audio segment for further processing. Hence EPD is also known as "voice activity detection" (VAD) or "speech detection". EPD plays an essential role since it is usually the first step in audio signal processing and recognition.

Based on the acoustic features used, we can classify EPD methods into two types:

  1. Time-domain methods:
    1. Volume only: Volume is the most commonly used feature for EPD. However, it is usually hard to have a single universal threshold for EPD. (In particular, a single volume threshold for EPD is likely to misclassify unvoiced sounds as silence for audio input from uni-directional microphones.)
    2. Volume and ZCR: ZCR can be used in conjunction with volume to identify unvoiced sounds in a more reliable manner, as explained in the next section.
    The computational load of these methods is usually small, so they can be ported to low-end platforms such as microcontrollers.
  2. Frequency-domain methods:
    1. Variance in spectrum: Voiced sounds have more regular amplitude spectra, leading to smaller spectral variances.
    2. Entropy in spectrum: The regular amplitude spectra of voiced sounds also produce low entropy, which can be used as a criterion for EPD.
    These methods usually require more computing power and thus are not portable to low-end platforms.
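
To make the time-domain approach more concrete, here is a minimal sketch of frame-based EPD using volume, with an optional ZCR step. It is only an illustration under stated assumptions, not the method detailed in the next section: the function names, the frame size and hop, the choice of volume as the sum of absolute mean-removed samples, the threshold set as a fraction of the maximum frame volume, and the rule that extends the boundaries over neighboring high-ZCR frames are all hypothetical choices made for this example.

```python
import math

def split_frames(signal, frame_size=256, hop=128):
    """Cut a list of samples into overlapping frames of equal length."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, hop)]

def volume(frame):
    """Volume of a frame: sum of absolute values after removing the mean."""
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)

def zero_crossing_rate(frame):
    """Count sign changes between consecutive mean-removed samples."""
    mean = sum(frame) / len(frame)
    centered = [x - mean for x in frame]
    return sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)

def detect_endpoints(signal, frame_size=256, hop=128,
                     volume_ratio=0.1, zcr_threshold=None):
    """Return (start_sample, end_sample) of the detected sound, or None.

    Assumption: the volume threshold is volume_ratio times the largest
    frame volume. If zcr_threshold is given, the boundaries are extended
    over adjacent frames whose ZCR exceeds it (a guess at unvoiced sounds).
    """
    frames = split_frames(signal, frame_size, hop)
    if not frames:
        return None
    volumes = [volume(f) for f in frames]
    threshold = max(volumes) * volume_ratio
    active = [i for i, v in enumerate(volumes) if v > threshold]
    if not active:
        return None
    first, last = active[0], active[-1]
    if zcr_threshold is not None:
        zcrs = [zero_crossing_rate(f) for f in frames]
        while first > 0 and zcrs[first - 1] > zcr_threshold:
            first -= 1
        while last < len(frames) - 1 and zcrs[last + 1] > zcr_threshold:
            last += 1
    return first * hop, last * hop + frame_size

if __name__ == "__main__":
    fs = 8000  # assumed sample rate in Hz
    tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(2048)]
    signal = [0.0] * 1024 + tone + [0.0] * 1024
    print(detect_endpoints(signal))
```

The detected end-points are accurate only to within one frame, since every decision is made per frame. A fixed ratio of the maximum volume is also just one way to pick a threshold, which is exactly the difficulty noted above for volume-only methods.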

Hint
To put it simply, time-domain methods use only the waveform of the audio signal for EPD. If, on the other hand, we need the Fourier transform to analyze the waveform for EPD, then it is a frequency-domain method. More information on spectra and the Fourier transform is given in later chapters.
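
As a rough illustration of a frequency-domain feature, the sketch below computes the spectral entropy of a single frame. It is written under several assumptions that do not come from this chapter: the function names are hypothetical, a naive O(N^2) discrete Fourier transform stands in for an FFT to keep the example free of dependencies, the entropy is taken over the normalized power spectrum (normalizing the amplitude spectrum is another possible choice), and an all-zero frame is assigned the maximum entropy by convention.

```python
import cmath
import math

def amplitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (an FFT would be used in practice)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectral_entropy(frame):
    """Entropy (in bits) of the frame's normalized power spectrum."""
    power = [a * a for a in amplitude_spectrum(frame)]
    total = sum(power)
    if total == 0:
        # Convention for this sketch: treat a silent frame as a flat spectrum.
        return math.log2(len(power))
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0)

if __name__ == "__main__":
    import random
    rng = random.Random(0)
    tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    print("tone entropy:", spectral_entropy(tone))
    print("noise entropy:", spectral_entropy(noise))
```

A frame whose energy is concentrated in a few spectral peaks gives a low entropy, while a noise-like frame spreads its energy over many bins and gives a higher value. An EPD method along these lines would compare the per-frame entropy against a threshold, which would have to be chosen for the recording conditions at hand.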

There are two types of errors in EPD, which cause different effects in speech recognition, as follows.

The other sections of this chapter will introduce both time-domain and frequency-domain methods for EPD.
Audio Signal Processing and Recognition (音訊處理與辨識)