7-1 Introduction to Pitch Tracking

From the previous chapter, you should already know how to find the pitch of a signal by visual inspection and some simple computation. To have a computer identify the pitch automatically from a stream of audio signals, we need reliable methods for the task known as "pitch tracking". Once the pitch vector has been identified, it can be used in various applications of audio signal processing, including:

In summary, pitch tracking is a fundamental step toward other important tasks in audio signal processing. Research on pitch tracking has been going on for decades, and it remains a hot topic in the literature. We therefore need to understand the basic concepts of pitch tracking as a stepping stone to more advanced audio processing techniques.

Pitch tracking follows the general procedure of short-term analysis for audio signals:

  1. Chop the audio signal into frames of about 20 ms each. Neighboring frames may overlap.
  2. Compute the pitch of each frame.
  3. Eliminate the pitch values of silent or unvoiced frames, which are unreliable. This can be done by volume thresholding or pitch range thresholding.
  4. Smooth the pitch curve, usually with a median filter or a similar method.
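Steps 3 and 4 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method used in this chapter: the function names, the convention of marking a rejected frame with a pitch of 0, the pitch unit (Hz), and all threshold values are hypothetical choices made for the example.

```python
from statistics import median


def remove_unreliable_pitch(pitch, volume, volume_threshold, pitch_range):
    """Step 3: set pitch to 0.0 for frames that are too quiet or out of range.

    Using 0.0 to mark a rejected frame is an assumption of this sketch.
    """
    low, high = pitch_range
    return [
        p if v >= volume_threshold and low <= p <= high else 0.0
        for p, v in zip(pitch, volume)
    ]


def median_smooth(values, width=3):
    """Step 4: median filter with an odd window width.

    Values near the two ends, where a full window does not fit,
    are copied unchanged.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    smoothed = list(values)
    for i in range(half, len(values) - half):
        smoothed[i] = median(values[i - half:i + half + 1])
    return smoothed


# Hypothetical per-frame pitch (Hz) and volume values.
pitch = [95.0, 220.0, 221.0, 440.0, 222.0, 223.0, 1200.0]
volume = [0.01, 0.50, 0.60, 0.55, 0.60, 0.50, 0.40]

cleaned = remove_unreliable_pitch(pitch, volume, 0.1, (60.0, 1000.0))
smoothed = median_smooth(cleaned, width=3)
print(cleaned)   # [0.0, 220.0, 221.0, 440.0, 222.0, 223.0, 0.0]
print(smoothed)  # [0.0, 220.0, 221.0, 222.0, 223.0, 222.0, 0.0]
```

In this example the first frame is rejected for low volume and the last for an out-of-range pitch, and the median filter removes the isolated jump to 440 Hz in the fourth frame.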

In the process of frame blocking, we allow overlap between neighboring frames to reduce the discontinuity between them. We define the "frame rate" as the number of frames per second in our analysis. For instance, if fs = 11025 Hz, frame size = 256 samples, and overlap = 84 samples, then the frame rate is fs/(frameSize-overlap) = 11025/(256-84) ≈ 64. In other words, for real-time pitch tracking (for instance, on a micro-controller platform), the computer must be able to process about 64 frames per second. A smaller overlap leads to a lower frame rate. The process of frame blocking is shown next.
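The frame rate formula and the frame blocking step can also be written out as a small Python sketch. The function names are hypothetical, and dropping trailing samples that do not fill a complete frame is an assumption of this example; other conventions (such as zero-padding the last frame) are equally possible.

```python
def frame_rate(fs, frame_size, overlap):
    """Frames per second: the sample rate divided by the hop size."""
    hop = frame_size - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    return fs / hop


def frame_blocking(signal, frame_size, overlap):
    """Split a sequence of samples into overlapping frames.

    Trailing samples that do not fill a complete frame are dropped
    (an assumption of this sketch).
    """
    hop = frame_size - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    return [
        signal[start:start + frame_size]
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]


fs, frame_size, overlap = 11025, 256, 84
print(round(frame_rate(fs, frame_size, overlap), 2))  # 64.1

one_second = [0.0] * fs  # placeholder samples: one second of silence
frames = frame_blocking(one_second, frame_size, overlap)
print(len(frames))  # 63
```

One second of audio yields 63 complete frames here, one fewer than the rounded frame rate, because the last partial frame is discarded.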

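We overlap frames only to keep the change between neighboring frames small, so that the resulting pitch curve is more continuous. In practice, however, the overlap should not be too large, or the amount of computation becomes excessive. When choosing the frame size and the overlap, we need to consider the following factors.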

There are a number of methods for deriving a pitch value from a single frame. Generally, these methods can be classified into time-domain and frequency-domain methods, as follows.
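To give a feel for what a time-domain method looks like, here is a minimal sketch based on the autocorrelation of a single frame, a commonly used time-domain approach. It is an illustration only, not necessarily the formulation used later in this chapter: the function name, the pitch search range of 60 to 1000 Hz, and the use of an unnormalized autocorrelation are all assumptions of the example.

```python
import math


def acf_pitch(frame, fs, min_pitch=60.0, max_pitch=1000.0):
    """Estimate the pitch (Hz) of one frame from its autocorrelation peak.

    Only lags corresponding to pitches between min_pitch and max_pitch
    are searched. Returns 0.0 if no positive peak is found.
    """
    min_lag = int(fs / max_pitch)
    max_lag = min(int(fs / min_pitch), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag if best_lag > 0 else 0.0


# Synthetic test frame: a sine wave with a period of exactly 50 samples,
# i.e. 11025 / 50 = 220.5 Hz.
fs, frame_size = 11025, 256
frame = [math.sin(2 * math.pi * i / 50) for i in range(frame_size)]
print(acf_pitch(frame, fs))
```

The frame must be long enough to contain the lags being searched, which is one reason the choice of frame size matters for pitch tracking.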

These methods are covered in the remaining sections of this chapter.


Audio Signal Processing and Recognition (音訊處理與辨識)