5-4 Pitch

Pitch is an important feature of audio signals, especially for quasi-periodic signals such as voiced sounds in human speech or singing and monophonic music from most musical instruments. Intuitively, pitch represents the vibration frequency of the sound source, so it indicates how high or low a sound is. More precisely, pitch is the fundamental frequency of the audio signal, which is the reciprocal of the fundamental period.

Generally speaking, as long as the sound is stable, it is not difficult to observe the fundamental period of a quasi-periodic audio signal directly from its waveform. Take a 3-second clip of a tuning fork, tuningFork01.wav, for example. We can plot a frame of 256 sample points and identify the fundamental period easily, as shown in the following example.

Example 1: framePitchDisp4tuningFork01.m

    waveFile='tuningFork01.wav';
    au=myAudioRead(waveFile); y=au.signal; fs=au.fs;
    index1=11000;
    frameSize=256;
    index2=index1+frameSize-1;
    frame=y(index1:index2);
    subplot(2,1,1); plot(y); grid on
    xlabel('Sample index'); ylabel('Amplitude'); title(['Waveform of ', waveFile]);
    axis([1, length(y), -1 1]);
    subplot(2,1,2); plot(frame, '.-'); grid on
    xlabel('Sample index within frame'); ylabel('Amplitude');
    point=[7, 226];    % Peaks
    axis([1, length(frame), -1 1]);
    periodCount=6;
    fp=((point(2)-point(1))/periodCount);    % fundamental period
    ff=fs/fp;                                % fundamental frequency
    pitch=69+12*log2(ff/440);
    fprintf('Fundamental period (fp) = (%g-%g)/%g = %g points\n', point(2), point(1), periodCount, fp);
    fprintf('Fundamental frequency (ff) = %g/%g = %g Hz\n', fs, fp, ff);
    fprintf('Pitch = %g semitone\n', pitch);
    % === For plotting arrows, etc
    % ====== Frame boundary
    subplot(211);
    line(index1*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    line(index2*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    % ====== FP coverage
    subplot(212);
    line(point, frame(point), 'marker', 'o', 'color', 'red');
    % ====== Axis locations
    subplot(211); loc1=get(gca, 'position');
    subplot(212); loc2=get(gca, 'position');
    % ====== arrow 1
    x1=[loc1(1)+(index1(1)-1)/(length(y)-1)*loc1(3), loc2(1)];
    y1=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x1, y1, 'color', 'r', 'linewidth', 1);
    % ====== arrow 2
    x2=[loc1(1)+(index2-1)/(length(y)-1)*loc1(3), loc2(1)+loc2(3)];
    y2=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x2, y2, 'color', 'r', 'linewidth', 1);
    % ====== Texts indicating start/end indices
    h1=text(point(1), frame(point(1)), [' \leftarrow index=', int2str(point(1))], 'rotation', 30);
    h2=text(point(2), frame(point(2)), [' \leftarrow index=', int2str(point(2))], 'rotation', 30);

Output:

    Fundamental period (fp) = (226-7)/6 = 36.5 points
    Fundamental frequency (ff) = 16000/36.5 = 438.356 Hz
    Pitch = 68.9352 semitone

In the above example, the two red lines in the first plot mark the start and end of the frame under analysis. The second plot shows the 256-point frame together with two visually identified peaks (at indices 7 and 226) that span 6 fundamental periods. Since the distance between these two points is 226 - 7 = 219 sample points, the fundamental period is 219/6 = 36.5 points and the fundamental frequency is fs/(219/6) = 16000/36.5 = 438.356 Hz, which is equivalent to 68.9352 semitones. The formula for converting the fundamental frequency (in Hz) to semitones is shown next.

semitone = 69 + 12*log₂(frequency/440)

In other words, a fundamental frequency of 440 Hz corresponds to a pitch of 69 semitones, which is A4 (referred to here as "central la") on the piano keyboard shown in the following figure.
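As a minimal sketch of the conversion formula above, the following MATLAB snippet wraps it in an anonymous function and evaluates it for a few frequencies. The name freq2semitone is hypothetical (it is not part of the toolbox used in this section); the reference point of 440 Hz = 69 semitones is taken from the formula.

```matlab
% Hypothetical helper (not from the original toolbox): frequency in Hz -> semitones
freq2semitone = @(freq) 69 + 12*log2(freq/440);

freq = [220 440 880];             % A3, A4, A5 in Hz
pitch = freq2semitone(freq);      % element-wise conversion
% fprintf consumes the matrix column by column: freq(1), pitch(1), freq(2), ...
fprintf('%g Hz -> %g semitones\n', [freq; pitch]);
```

Under these assumptions, 220 Hz, 440 Hz, and 880 Hz map to 57, 69, and 81 semitones respectively.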

Hint

The fundamental frequency of a standard tuning fork is designed to be 440 Hz, and in practice it is very close to that value. Hence tuning forks are commonly used to tune the pitch of a piano.

In fact, the semitone is also the unit used to specify pitch in MIDI files. From the conversion formula, we can also observe the following facts:

Each octave contains 12 semitones, corresponding to 7 white keys and 5 black keys on a piano.
Going up one octave doubles the frequency. For instance, A4 (central la) is 440 Hz (69 semitones), while A5 is 880 Hz (81 semitones).
Pitch in semitones correlates (more or less) linearly with human "perceived pitch". That is, perceived pitch is roughly proportional to the logarithm of the fundamental frequency.
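The octave relationship in the list above follows directly from the conversion formula. As a short derivation, write s(f) for the pitch in semitones of a frequency f in Hz (a notation introduced here only for illustration):

```latex
\begin{aligned}
s(f)  &= 69 + 12\log_2\!\left(\frac{f}{440}\right),\\
s(2f) &= 69 + 12\log_2\!\left(\frac{2f}{440}\right)
       = 69 + 12\log_2\!\left(\frac{f}{440}\right) + 12\log_2 2
       = s(f) + 12 .
\end{aligned}
```

So doubling the frequency always adds exactly 12 semitones. By the same argument, going up one semitone multiplies the frequency by 2^(1/12), which is about 1.0595.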

The waveform of the tuning fork is very "clean": it is very close to a sinusoid, so the fundamental period is obvious. In the following example, we use human speech to demonstrate visual determination of pitch. The clip is a recording of my voice saying "清華大學資訊系" (csNthu.wav). If we zoom in on a frame around the character "華", we can again identify the fundamental period visually, as shown in the following example.

Example 2: framePitchDisp4speech01.m

    waveFile='csNthu.wav';
    au=myAudioRead(waveFile); y=au.signal; fs=au.fs;
    index1=11050;
    frameSize=512;
    index2=index1+frameSize-1;
    frame=y(index1:index2);
    subplot(2,1,1); plot(y); grid on
    xlabel('Sample index'); ylabel('Amplitude'); title(['Waveform of ', waveFile]);
    axis([1, length(y), -1 1]);
    subplot(2,1,2); plot(frame, '.-'); grid on
    xlabel('Sample index within frame'); ylabel('Amplitude');
    point=[83, 485];    % Peaks
    point=[75, 477];    % Valleys
    axis([1, length(frame), -1 1]);
    periodCount=3;
    fp=((point(2)-point(1))/periodCount);    % fundamental period
    ff=fs/fp;                                % fundamental frequency
    pitch=69+12*log2(ff/440);
    fprintf('Fundamental period (fp) = (%g-%g)/%g = %g points\n', point(2), point(1), periodCount, fp);
    fprintf('Fundamental frequency (ff) = %g/%g = %g Hz\n', fs, fp, ff);
    fprintf('Pitch = %g semitone\n', pitch);
    % === For plotting arrows, etc
    % ====== Frame boundary
    subplot(211);
    line(index1*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    line(index2*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    % ====== FP coverage
    subplot(212);
    line(point, frame(point), 'marker', 'o', 'color', 'red');
    % ====== Axis locations
    subplot(211); loc1=get(gca, 'position');
    subplot(212); loc2=get(gca, 'position');
    % ====== arrow 1
    x1=[loc1(1)+(index1(1)-1)/(length(y)-1)*loc1(3), loc2(1)];
    y1=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x1, y1, 'color', 'r', 'linewidth', 1);
    % ====== arrow 2
    x2=[loc1(1)+(index2-1)/(length(y)-1)*loc1(3), loc2(1)+loc2(3)];
    y2=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x2, y2, 'color', 'r', 'linewidth', 1);
    % ====== Texts indicating start/end indices
    h1=text(point(1), frame(point(1)), [' \leftarrow index=', int2str(point(1))], 'rotation', -10);
    h2=text(point(2), frame(point(2)), [' \leftarrow index=', int2str(point(2))], 'rotation', -10);

Output:

    Fundamental period (fp) = (477-75)/3 = 134 points
    Fundamental frequency (ff) = 16000/134 = 119.403 Hz
    Pitch = 46.42 semitone

In the above example, we select a 512-point frame around the vowel of the character "華". In particular, we choose two valleys (at indices 75 and 477) that span 3 complete fundamental periods. Since the distance between these two points is 402 sample points, the fundamental frequency is fs/(402/3) = 16000/134 = 119.403 Hz and the pitch is 46.420 semitones. This is 22.58 semitones below A4 (central la), which is close to but slightly less than two octaves (24 semitones).

Conceptually, the most prominent sample point within a fundamental period, which can be taken as the start of that period, is referred to as a pitch mark (PM). Pitch marks are usually selected as local maxima or local minima of the audio waveform. In the tuning fork example, we used two pitch marks that are local maxima. In the speech example, we used two pitch marks that are local minima instead, since they are more prominent than the local maxima. Pitch marks are commonly used to modify the pitch of an utterance, so reliable identification of pitch marks is essential for text-to-speech synthesis.
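To make the idea of pitch marks concrete, here is a minimal sketch that picks local maxima as candidate pitch marks and derives the pitch from their spacing. It uses a synthetic 440 Hz sinusoid rather than the recordings above, so it runs without any audio file. The variable names and the simple three-point peak test are assumptions for illustration only; real speech would need a more robust method.

```matlab
fs = 16000;                        % sample rate (Hz), same as in the examples above
n = 0:255;                         % sample offsets of one 256-point frame
frame = sin(2*pi*440*n/fs);        % synthetic stand-in for a tuning-fork frame

% A sample is a candidate pitch mark if it is a local maximum of the frame
inner = frame(2:end-1);
pm = find(inner > frame(1:end-2) & inner >= frame(3:end)) + 1;

periodCount = length(pm) - 1;            % periods between first and last mark
fp = (pm(end) - pm(1)) / periodCount;    % fundamental period (points)
ff = fs / fp;                            % fundamental frequency (Hz)
pitch = 69 + 12*log2(ff/440);            % pitch (semitones)
fprintf('Pitch marks at indices: %s\n', mat2str(pm));
fprintf('fp = %g points, ff = %g Hz, pitch = %g semitones\n', fp, ff, pitch);
```

Averaging over several periods, as the two examples above also do, reduces the error caused by pitch marks falling on integer sample indices.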

Due to differences in physiology, the pitch ranges of male and female voices are different. Generally speaking:

The pitch range for males is about 35 ~ 72 semitones, or roughly 62 ~ 523 Hz.
The pitch range for females is about 45 ~ 83 semitones, or roughly 110 ~ 1000 Hz.
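The frequency limits above can be checked by inverting the conversion formula, which gives frequency = 440*2^((semitone-69)/12). The following is a small sketch of that check; semitone2freq is a hypothetical helper name introduced here, not a function from the toolbox used in this section.

```matlab
% Hypothetical helper (not from the original toolbox): semitones -> frequency in Hz
semitone2freq = @(semitone) 440 * 2.^((semitone - 69)/12);

maleRange   = semitone2freq([35 72]);    % lower and upper limits for males
femaleRange = semitone2freq([45 83]);    % lower and upper limits for females
fprintf('Male:   %.1f ~ %.1f Hz\n', maleRange);
fprintf('Female: %.1f ~ %.1f Hz\n', femaleRange);
```

Under this formula the limits come out at about 61.7 Hz, 523.3 Hz, 110.0 Hz, and 987.8 Hz, so the frequencies quoted in the text are rounded values.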

However, it should be emphasized that we do not rely on pitch alone to tell male voices from female voices. We also use information from timbre (or more precisely, formants) for this task. More information will be covered in later chapters.

As shown in this section, visually identifying the fundamental frequency is not a difficult task for humans. However, if we want to write a program that identifies pitch automatically, there is much more to take into consideration. The various methods for pitch tracking will be covered in detail in the next few chapters.

Audio Signal Processing and Recognition (音訊處理與辨識)