5-4 Pitch

Pitch is an important feature of audio signals, especially for quasi-periodic signals such as voiced sounds from human speech/singing and monophonic music from most musical instruments. Intuitively speaking, pitch represents the vibration frequency of the sound source of an audio signal. In other words, pitch is the fundamental frequency of the audio signal, which is equal to the reciprocal of the fundamental period.

Generally speaking, it is not too difficult to observe the fundamental period within a quasi-periodic audio signal. Take a 3-second clip of a tuning fork, tuningFork01.wav, for example. We can plot a frame of 256 sample points and identify the fundamental period easily, as shown in the following example.

Example 1: framePitchDisp4tuningFork01.m

    waveFile='tuningFork01.wav';
    au=myAudioRead(waveFile); y=au.signal; fs=au.fs;
    index1=11000;
    frameSize=256;
    index2=index1+frameSize-1;
    frame=y(index1:index2);
    subplot(2,1,1); plot(y); grid on
    xlabel('Sample index'); ylabel('Amplitude'); title(['Waveform of ', waveFile]);
    axis([1, length(y), -1 1]);
    subplot(2,1,2); plot(frame, '.-'); grid on
    xlabel('Sample index within frame'); ylabel('Amplitude');
    point=[7, 226];    % Peaks
    axis([1, length(frame), -1 1]);
    periodCount=6;
    fp=((point(2)-point(1))/periodCount);    % fundamental period
    ff=fs/fp;                                % fundamental frequency
    pitch=69+12*log2(ff/440);
    fprintf('Fundamental period (fp) = (%g-%g)/%g = %g points\n', point(2), point(1), periodCount, fp);
    fprintf('Fundamental frequency (ff) = %g/%g = %g Hz\n', fs, fp, ff);
    fprintf('Pitch = %g semitone\n', pitch);
    % === For plotting arrows, etc
    % ====== Frame boundary
    subplot(211);
    line(index1*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    line(index2*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    % ====== FP coverage
    subplot(212);
    line(point, frame(point), 'marker', 'o', 'color', 'red');
    % ====== Axis locations
    subplot(211); loc1=get(gca, 'position');
    subplot(212); loc2=get(gca, 'position');
    % ====== arrow 1
    x1=[loc1(1)+(index1(1)-1)/(length(y)-1)*loc1(3), loc2(1)];
    y1=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x1, y1, 'color', 'r', 'linewidth', 1);
    % ======= arrow 2
    x2=[loc1(1)+(index2-1)/(length(y)-1)*loc1(3), loc2(1)+loc2(3)];
    y2=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x2, y2, 'color', 'r', 'linewidth', 1);
    % ====== Texts indicating start/end indices
    h1=text(point(1), frame(point(1)), [' \leftarrow index=', int2str(point(1))], 'rotation', 30);
    h2=text(point(2), frame(point(2)), [' \leftarrow index=', int2str(point(2))], 'rotation', 30);

Output:

    Fundamental period (fp) = (226-7)/6 = 36.5 points
    Fundamental frequency (ff) = 16000/36.5 = 438.356 Hz
    Pitch = 68.9352 semitone

In the above example, the two red lines in the first plot mark the start and end of the frame under analysis. The second plot shows the waveform of the frame together with two visually identified peaks (indices 7 and 226) that span 6 fundamental periods. Since the distance between these two points is 219 samples, the fundamental period is 219/6 = 36.5 samples and the fundamental frequency is fs/(219/6) = 16000/36.5 = 438.356 Hz, which is equal to 68.9352 semitones. The formula for converting a frequency in Hz to a pitch in semitones is shown next.

semitone = 69 + 12*log2(frequency/440)

In other words, when the fundamental frequency is 440 Hz, we have a pitch of 69 semitones, which corresponds to "central la" or A4 in the following piano roll.
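As a quick check of the conversion formula, the following minimal sketch evaluates it for a few frequencies and also inverts it. The names freq2semitone and semitone2freq are hypothetical helpers introduced only for this illustration; they are not functions from any toolbox used in this section.

```matlab
% Hypothetical helpers (illustration only), defined as anonymous functions.
freq2semitone = @(freq) 69 + 12*log2(freq/440);          % Hz -> semitones
semitone2freq = @(semitone) 440*2.^((semitone-69)/12);   % semitones -> Hz

freq = [220 440 880];               % each step doubles the frequency
semitone = freq2semitone(freq);
for i = 1:length(freq)
    fprintf('%g Hz = %g semitones\n', freq(i), semitone(i));
end
% One semitone above A4, converted back to Hz
fprintf('70 semitones = %g Hz\n', semitone2freq(70));
```

Because of the 12*log2 term, doubling the frequency always raises the pitch by exactly 12 semitones, that is, by one octave.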

Hint
The fundamental frequency of the tuning fork is designed to be 440 Hz. Hence tuning forks are commonly used to tune the pitch of a piano.

In fact, the semitone is also the unit used to specify pitch in MIDI files. From the conversion formula, we can also notice the following facts:

The waveform of the tuning fork is very "clean": it is very close to a sinusoidal signal, so the fundamental period is obvious. In the following example, we use human speech as an example of visual determination of pitch. The clip is my recording of "清華大學資訊系" (csNthu.wav). If we take a frame around the character "華", we can still identify the fundamental period visually, as shown in the following example.

Example 2: framePitchDisp4speech01.m

    waveFile='csNthu.wav';
    au=myAudioRead(waveFile); y=au.signal; fs=au.fs;
    index1=11050;
    frameSize=512;
    index2=index1+frameSize-1;
    frame=y(index1:index2);
    subplot(2,1,1); plot(y); grid on
    xlabel('Sample index'); ylabel('Amplitude'); title(['Waveform of ', waveFile]);
    axis([1, length(y), -1 1]);
    subplot(2,1,2); plot(frame, '.-'); grid on
    xlabel('Sample index within frame'); ylabel('Amplitude');
    point=[83, 485];    % Peaks
    point=[75, 477];    % Valleys
    axis([1, length(frame), -1 1]);
    periodCount=3;
    fp=((point(2)-point(1))/periodCount);    % fundamental period
    ff=fs/fp;                                % fundamental frequency
    pitch=69+12*log2(ff/440);
    fprintf('Fundamental period (fp) = (%g-%g)/%g = %g points\n', point(2), point(1), periodCount, fp);
    fprintf('Fundamental frequency (ff) = %g/%g = %g Hz\n', fs, fp, ff);
    fprintf('Pitch = %g semitone\n', pitch);
    % === For plotting arrows, etc
    % ====== Frame boundary
    subplot(211);
    line(index1*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    line(index2*[1 1], [-1 1], 'color', 'r', 'linewidth', 1);
    % ====== FP coverage
    subplot(212);
    line(point, frame(point), 'marker', 'o', 'color', 'red');
    % ====== Axis locations
    subplot(211); loc1=get(gca, 'position');
    subplot(212); loc2=get(gca, 'position');
    % ====== arrow 1
    x1=[loc1(1)+(index1(1)-1)/(length(y)-1)*loc1(3), loc2(1)];
    y1=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x1, y1, 'color', 'r', 'linewidth', 1);
    % ======= arrow 2
    x2=[loc1(1)+(index2-1)/(length(y)-1)*loc1(3), loc2(1)+loc2(3)];
    y2=[loc1(2), loc2(2)+loc2(4)];
    ah=annotation('arrow', x2, y2, 'color', 'r', 'linewidth', 1);
    % ====== Texts indicating start/end indices
    h1=text(point(1), frame(point(1)), [' \leftarrow index=', int2str(point(1))], 'rotation', -10);
    h2=text(point(2), frame(point(2)), [' \leftarrow index=', int2str(point(2))], 'rotation', -10);

Output:

    Fundamental period (fp) = (477-75)/3 = 134 points
    Fundamental frequency (ff) = 16000/134 = 119.403 Hz
    Pitch = 46.42 semitone

In the above example, we select a 512-point frame around the vowel of the character "華". In particular, we choose two points (with indices 75 and 477) that cover 3 complete fundamental periods. Since the distance between these two points is 402 samples, the fundamental frequency is fs/(402/3) = 16000/(402/3) = 119.403 Hz and the pitch is 46.420 semitones. This is 22.58 semitones below A4, slightly less than two octaves (24 semitones).
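The arithmetic used in both examples can be collected into a few vectorized lines. The sketch below is only an illustration: the variable name pmInfo is hypothetical, and the indices, period counts, and sampling rate are the values that appear in the listings and printed output of Example 1 and Example 2.

```matlab
fs = 16000;    % sampling rate (Hz), as printed in the output of both examples
% Each row: [first pitch mark, last pitch mark, number of periods in between]
pmInfo = [ 7 226 6;     % tuning fork (Example 1)
          75 477 3];    % speech (Example 2)
fp = (pmInfo(:,2) - pmInfo(:,1)) ./ pmInfo(:,3);   % fundamental period (samples)
ff = fs ./ fp;                                     % fundamental frequency (Hz)
pitch = 69 + 12*log2(ff/440);                      % pitch (semitones)
for i = 1:size(pmInfo, 1)
    fprintf('fp = %g samples, ff = %g Hz, pitch = %g semitones\n', fp(i), ff(i), pitch(i));
end
```

Measuring across several periods and then dividing, rather than measuring a single period, reduces the effect of the one-sample uncertainty in locating each point.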

Conceptually, the most obvious sample point within a fundamental period is often referred to as the pitch mark. Usually pitch marks are selected as the local maxima or minima of the audio waveform. In the previous example of pitch determination for the tuning fork, we used two pitch marks that are local maxima. On the other hand, in the example of pitch determination for human speech, we used two pitch marks that are local minima instead since they are more obvious than local maxima. Reliable identification of pitch marks is an essential task for text-to-speech synthesis.
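To make the idea of pitch marks concrete, here is a minimal sketch that picks local maxima as pitch marks and derives the pitch from their average spacing. It assumes a synthetic 440-Hz sinusoid sampled at 16000 Hz as a stand-in for a tuning-fork frame; the variable names and the "larger than both neighbors" rule are illustrative choices, not the method used elsewhere in this book.

```matlab
fs = 16000;                     % assumed sampling rate (Hz)
f0 = 440;                       % assumed frequency of the synthetic tone (Hz)
frameSize = 256;
n = 0:frameSize-1;
frame = sin(2*pi*f0*n/fs);      % synthetic stand-in for a tuning-fork frame

% A sample is taken as a pitch mark if it is larger than both of its neighbors.
inner = frame(2:end-1);
isPeak = (inner > frame(1:end-2)) & (inner > frame(3:end));
pm = find(isPeak) + 1;          % +1 because "inner" starts at the 2nd sample

fp = (pm(end) - pm(1)) / (length(pm) - 1);   % average spacing of pitch marks
ff = fs / fp;                                % fundamental frequency (Hz)
pitch = 69 + 12*log2(ff/440);                % pitch (semitones)
fprintf('%d pitch marks, fp = %g samples, ff = %g Hz, pitch = %g semitones\n', ...
    length(pm), fp, ff, pitch);
```

This simple rule works here only because the signal is a clean sinusoid. A speech frame typically has several local maxima within each fundamental period, so such a rule would pick up spurious marks without further processing.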

Due to differences in physiology, the pitch ranges for males and females are different:

However, it should be emphasized that we do not rely on pitch alone to distinguish male voices from female voices. We also use information from timbre (or more precisely, formants) for this task. More information will be covered in later chapters.

As shown in this section, visual identification of the fundamental frequency is not a difficult task for humans. However, if we want to write a program to identify the pitch automatically, there is much more to take into consideration. More details will follow in the next few chapters.


Audio Signal Processing and Recognition