Pitch is an important feature of audio signals, especially for quasi-periodic signals such as voiced sounds from human speech or singing and monophonic music from most musical instruments. Intuitively speaking, pitch represents the vibration frequency of the sound source of an audio signal. In other words, pitch is the fundamental frequency of the audio signal, which is equal to the reciprocal of the fundamental period.
Generally speaking, it is not too difficult to observe the fundamental period within a quasi-periodic audio signal. Take a 3-second clip of a tuning fork (tuningFork01.wav), for example. We can first plot a frame of 256 sample points and identify the fundamental period easily, as shown in the following example.
In the above example, the two red lines in the first plot define the start and end of the frame for our analysis. The second plot shows the waveform of the frame as well as two points (identified visually) which cover 5 fundamental periods. Since the distance between these two points is 182 units, the fundamental frequency is fs/(182/5) = 16000/(182/5) = 439.56 Hz, which is equal to 68.9827 semitones. The formula for the conversion from pitch frequency to semitone is shown next.
semitone = 69 + 12*log2(frequency/440)
In other words, when the fundamental frequency is 440 Hz, we have a pitch of 69 semitones, which corresponds to "central la" or A4 in the following piano roll.
In fact, the semitone is also used as the unit for specifying pitch in MIDI files. From the conversion formula, we can also notice the following facts:
- Each octave contains 12 semitones, including 7 white keys and 5 black ones.
- Going up one octave doubles the frequency. For instance, A4 (central la) is 440 Hz (69 semitones) while A5 is 880 Hz (81 semitones).
- Pitch in terms of semitones correlates (more or less) linearly with human "perceived pitch".
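The conversion formula and the octave relationship listed above can be checked with a few lines of Python. This is only a minimal sketch; the function name `freq_to_semitone` is chosen here for illustration and does not come from the text.

```python
import math

def freq_to_semitone(freq_hz):
    """Convert a frequency in Hz to a semitone number (A4 = 440 Hz = 69)."""
    return 69 + 12 * math.log2(freq_hz / 440.0)

# A4 and A5 are one octave apart, so they should differ by 12 semitones.
a4 = freq_to_semitone(440.0)
a5 = freq_to_semitone(880.0)
print(f"A4: {a4:.1f} semitones")
print(f"A5: {a5:.1f} semitones")
print(f"Octave span: {a5 - a4:.1f} semitones")
```

Because the formula uses a base-2 logarithm, every doubling of the frequency adds exactly 12 semitones, which is why equal steps in semitones match equal steps in perceived pitch more closely than equal steps in Hz.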
The waveform of the tuning fork is very "clean" since it is very close to a sinusoidal signal and the fundamental period is very obvious. Next, we shall use human speech as an example of visual determination of pitch. The clip is a recording of my voice saying "清華大學資訊系" (csNthu.wav). If we take a frame around the character "華", we can visually identify the fundamental period easily, as shown in the following example.
In the above example, we select a 512-point frame around the vowel of the character "華". In particular, we choose two points (with indices 75 and 477) that cover 3 complete fundamental periods. Since the distance between these two points is 402, the fundamental frequency is fs/(402/3) = 16000/(402/3) = 119.403 Hz and the pitch is 46.420 semitones.
Conceptually, the most obvious sample point within a fundamental period is often referred to as the pitch mark. Usually pitch marks are selected as the local maxima or minima of the audio waveform. In the previous example of pitch determination for the tuning fork, we used two pitch marks that are local maxima. On the other hand, in the example of pitch determination for human speech, we used two pitch marks that are local minima, since they are more obvious than the local maxima. Reliable identification of pitch marks is an essential task for text-to-speech synthesis.
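The arithmetic behind both manual measurements (tuning fork and speech) is the same: divide the distance between two pitch marks by the number of periods they span, then take the sample rate over that period. A minimal Python sketch follows, assuming the pitch marks have already been picked by eye; the helper name `pitch_from_span` and its parameters are hypothetical and not part of the original examples.

```python
import math

def pitch_from_span(span_samples, num_periods, fs):
    """Estimate pitch from the distance (in samples) between two pitch marks.

    span_samples: distance between the two pitch marks, in sample points
    num_periods:  number of complete fundamental periods between them
    fs:           sample rate in Hz
    Returns (frequency in Hz, pitch in semitones).
    """
    period_samples = span_samples / num_periods   # fundamental period
    freq_hz = fs / period_samples                 # fundamental frequency
    semitone = 69 + 12 * math.log2(freq_hz / 440.0)
    return freq_hz, semitone

# Tuning fork: 5 periods over 182 samples at fs = 16000 Hz.
fork_freq, fork_semi = pitch_from_span(182, 5, 16000)
# Speech frame: pitch marks at indices 75 and 477, covering 3 periods.
speech_freq, speech_semi = pitch_from_span(477 - 75, 3, 16000)

print(f"Tuning fork: {fork_freq:.2f} Hz, {fork_semi:.4f} semitones")
print(f"Speech:      {speech_freq:.3f} Hz, {speech_semi:.3f} semitones")
```

Spanning several periods rather than one reduces the effect of a one-sample error in placing either pitch mark, which is presumably why both examples measure across multiple periods.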
Due to differences in physiology, the pitch ranges for males and females are different:
- The pitch range for males is 35 ~ 72 semitones, or 62 ~ 523 Hz.
- The pitch range for females is 45 ~ 83 semitones, or 110 ~ 1000 Hz.
However, it should be emphasized that we do not use pitch alone to identify male or female voices. We also use information from timbre (or more precisely, formants) for this task. More information will be covered in later chapters.
As shown in this section, visual identification of the fundamental frequency is not a difficult task for humans. However, if we want to write a program to identify the pitch automatically, there is much more that we need to take into consideration. More details will follow in the next few chapters.
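The Hz figures quoted for the male and female ranges follow from inverting the semitone formula, frequency = 440 * 2^((semitone - 69)/12). The short Python sketch below reproduces them; the function name `semitone_to_freq` is an illustrative choice, not something defined in the text.

```python
def semitone_to_freq(semitone):
    """Convert a semitone number back to frequency in Hz (69 -> 440 Hz)."""
    return 440.0 * 2 ** ((semitone - 69) / 12)

# Pitch ranges in semitones, as listed above.
ranges = {"male": (35, 72), "female": (45, 83)}

for label, (low, high) in ranges.items():
    low_hz = semitone_to_freq(low)
    high_hz = semitone_to_freq(high)
    print(f"{label}: {low_hz:.1f} Hz to {high_hz:.1f} Hz")
```

The upper female bound of 83 semitones comes out at roughly 988 Hz, so the 1000 Hz quoted above is a rounded value.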
Audio Signal Processing and Recognition (音訊處理與辨識)