6-1 Introduction to End-Point Detection (I)

The goal of end-point detection (EPD for short) is to locate the start and end of the part of an audio segment that matters for further processing, which is typically the speech. For this reason EPD is also known as "voice activity detection" (VAD) or "speech detection". EPD plays an essential role because it is usually the first step in audio signal processing and recognition.

In other words, EPD determines where the audio of interest begins and where it ends, which is why it is also called speech detection or VAD (voice activity detection). It plays an important role in audio processing and recognition.

Based on the acoustic features used, we can classify EPD methods into two types:

Common EPD methods and their associated features fall into two broad categories:

Time-domain methods:
1. Volume only: Volume is the most commonly used feature for EPD. However, it is usually hard to have a single universal threshold for EPD. (In particular, a single volume threshold for EPD is likely to misclassify unvoiced sounds as silence for audio input from uni-directional microphones.)
2. Volume and ZCR: ZCR can be used in conjunction with volume to identify unvoiced sounds in a more reliable manner, as explained in the next section.
The computational load of these methods is usually small, so they can be ported to low-end platforms such as micro-controllers.
Frequency-domain methods:
1. Variance in spectrum: Voiced sounds have more regular amplitude spectra, leading to smaller spectral variances.
2. Entropy in spectrum: The regular amplitude spectra of voiced sounds also produce low entropy, which can be used as a criterion for EPD.
These methods usually require more computing power and are thus harder to port to low-end platforms.
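
The two time-domain features listed above can be sketched in a few lines. This is only a minimal illustration, not the method developed in the next section: the function names, the frame size of 256 samples, the volume definition (sum of absolute mean-removed samples) and the threshold of 10% of the maximum volume are all assumptions chosen for the demo. The faint alternating signal is a crude stand-in for an unvoiced sound.

```python
import math

def split_frames(signal, frame_size=256, hop=256):
    """Cut a list of samples into frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, hop)]

def volume(frame):
    """Sum of absolute values after removing the frame mean (one possible definition)."""
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)

def zero_crossing_rate(frame):
    """Count sign changes between consecutive mean-removed samples."""
    mean = sum(frame) / len(frame)
    centered = [x - mean for x in frame]
    return sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)

def volume_only_epd(signal, frame_size=256, hop=256, ratio=0.1):
    """Flag frames whose volume exceeds ratio * (maximum frame volume).

    The ratio is an assumed, hand-picked value, not a recommended setting.
    """
    volumes = [volume(f) for f in split_frames(signal, frame_size, hop)]
    threshold = ratio * max(volumes)
    return [v > threshold for v in volumes]

# Demo signal: silence, a loud 440 Hz tone (8 kHz sampling), a faint and
# rapidly alternating signal standing in for an unvoiced sound, then silence.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1024)]
hiss = [0.02 * (-1) ** n for n in range(256)]
signal = [0.0] * 512 + tone + hiss + [0.0] * 256

flags = volume_only_epd(signal)
zcrs = [zero_crossing_rate(f) for f in split_frames(signal)]
print(flags)  # per-frame decision from volume alone
print(zcrs)   # per-frame zero-crossing counts
```

In this demo the volume-only decision passes over the faint frame, while its zero-crossing count is far higher than that of the tone or the silence, which is the kind of evidence a combined volume and ZCR method can exploit.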

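Time-domain methods: The computational load is small, so these methods are easier to port to microcomputer platforms with limited computing power.
1. Volume: Using volume alone is the simplest approach to EPD, but it tends to misclassify unvoiced sounds. Different ways of computing volume also lead to different EPD results. There is no consensus on which computation is best, and the answer has to come from testing on a large amount of data.
2. Volume and ZCR: With volume as the primary feature and the zero-crossing rate (ZCR) as a secondary one, unvoiced sounds can be detected more precisely.
Frequency-domain methods: The computational load is larger, so these methods are harder to port to microcomputer platforms with limited computing power.
1. Variance of the spectrum: Voiced sounds have more regular spectra and hence lower variance, which can serve as a criterion for locating end points.
2. Entropy of the spectrum: Entropy can be used to achieve a similar effect.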

Hint

To put it simply, time-domain methods use only the waveform of the audio signal for EPD. On the other hand, if a method needs the Fourier transform to analyze the waveform for EPD, then it is a frequency-domain method. The spectrum and the Fourier transform are covered in detail in later chapters.
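
As a rough illustration of a frequency-domain feature, the sketch below computes the entropy of a frame's amplitude spectrum. It is written under several assumptions that do not come from this chapter: the function names are hypothetical, the spectrum is obtained with a naive O(N^2) discrete Fourier transform rather than an FFT (acceptable only for short demo frames), the spectrum is normalized to sum to 1 before taking the Shannon entropy in bits, and an all-zero frame is assigned the maximum entropy by convention.

```python
import cmath
import math

def amplitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (a real system would use an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectral_entropy(frame):
    """Shannon entropy (in bits) of the amplitude spectrum normalized to sum to 1."""
    spectrum = amplitude_spectrum(frame)
    total = sum(spectrum)
    if total == 0:
        # Assumed convention: an all-zero frame counts as a flat spectrum.
        return math.log2(len(spectrum))
    probs = [s / total for s in spectrum]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A pure tone with exactly 8 cycles in 64 samples: energy sits in one bin.
tone_frame = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
# A single impulse: the amplitude spectrum is perfectly flat.
impulse_frame = [1.0] + [0.0] * 63

tone_entropy = spectral_entropy(tone_frame)
impulse_entropy = spectral_entropy(impulse_frame)
print(tone_entropy, impulse_entropy)
```

The tone, whose spectrum is concentrated in one bin, yields an entropy close to zero, while the flat spectrum of the impulse yields the largest value possible for this number of bins. A threshold between such values would be one way to turn the feature into an EPD decision, although a suitable threshold depends on the data.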

Hint

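Put simply, a method that applies only relatively simple operations to the audio waveform is a time-domain method. On the other hand, any method that uses the Fourier transform to produce the spectrum of the audio is a frequency-domain method. This distinction is often used to classify audio processing methods, but there are sometimes gray areas. The spectrum and the Fourier transform are explained in later chapters.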

There are two types of errors in EPD, which cause different effects in speech recognition, as follows.

False Rejection: Speech frames are erroneously identified as silence/noise, leading to decreased recognition rates.
False Acceptance: Silence/noise frames are erroneously identified as speech frames, which will not cause too much trouble if the recognizer can take short leading/trailing silence into consideration.
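
The two error types can be counted frame by frame when a reference labeling is available. The following is a minimal sketch under assumed conventions: both inputs are per-frame boolean lists with `True` meaning speech, the function name `epd_error_rates` is hypothetical, and each rate is normalized by the number of reference frames of the corresponding class.

```python
def epd_error_rates(reference, predicted):
    """Return frame-level false rejection and false acceptance rates.

    reference, predicted: equal-length sequences of booleans, True = speech.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    speech = sum(1 for r, _ in pairs if r)
    non_speech = len(pairs) - speech
    # False rejection: a speech frame labeled as silence/noise.
    false_rejections = sum(1 for r, p in pairs if r and not p)
    # False acceptance: a silence/noise frame labeled as speech.
    false_acceptances = sum(1 for r, p in pairs if not r and p)
    return {
        "false_rejection_rate": false_rejections / speech if speech else 0.0,
        "false_acceptance_rate": false_acceptances / non_speech if non_speech else 0.0,
    }

# Hypothetical labels: the detector starts one frame early and stops one frame early.
reference = [False, False, True, True, True, True, False, False]
predicted = [False, True, True, True, True, False, False, False]
rates = epd_error_rates(reference, predicted)
print(rates)
```

Reporting the two rates separately is useful because, as noted above, they do not affect a recognizer equally.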

The other sections of this chapter will introduce both time-domain and frequency-domain methods for EPD.

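Erroneous end-point detection has two kinds of effects on speech recognition:

False Rejection: Speech is mistaken for silence/noise, which lowers the recognition rate.
False Acceptance: Silence/noise is mistaken for speech. The recognition rate also drops in this case, but when designing the recognizer we can add optional silence acoustic models before and after the utterance, so that the drop is milder than with false rejection.

The following sections introduce these two classes of EPD methods.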

Audio Signal Processing and Recognition (音訊處理與辨識)