7-1 Introduction to Pitch Tracking

From the previous chapter, you should already know how to find the pitch by visual inspection and some simple computation. For the computer to identify the pitch automatically from a stream of audio signals, we need reliable methods for this task, which is known as "pitch tracking". Once the pitch vector is identified, it can be used for various applications in audio signal processing, including:

In summary, pitch tracking is a fundamental step toward other important tasks in audio signal processing. Research on pitch tracking has been going on for decades, and it remains a hot topic in the literature. Therefore we need to know the basic concepts of pitch tracking as a stepping stone to more advanced audio processing techniques.

Pitch tracking follows the general procedure of short-term analysis for audio signals:

  1. Chop the audio signal into frames of 20 ms or so. Overlap is allowed between neighboring frames.
  2. Compute the pitch of each frame.
  3. Eliminate pitch values from silent or unvoiced frames. This can be done by volume thresholding or pitch range thresholding.
  4. Smooth the pitch curve using median filters or other similar methods.
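
The four steps above can be sketched in Python with NumPy. This is only an illustrative sketch, not the method used in this chapter: the per-frame pitch estimator here is a plain autocorrelation peak picker, and the helper names (frame_signal, acf_pitch, median_smooth, track_pitch), the search range of 60 to 1000 Hz, the volume ratio of 0.1, and the median width of 5 are all assumptions chosen for the example.

```python
import numpy as np

def frame_signal(signal, frame_size, overlap):
    """Step 1: split a 1-D signal into overlapping frames, one frame per row.
    Trailing samples that do not fill a whole frame are dropped."""
    signal = np.asarray(signal, dtype=float)
    hop = frame_size - overlap
    n_frames = 1 + (len(signal) - frame_size) // hop
    idx = np.arange(frame_size)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def acf_pitch(frame, fs, fmin=60.0, fmax=1000.0):
    """Step 2: estimate the pitch (Hz) of one frame from the largest
    autocorrelation value among lags corresponding to [fmin, fmax]."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(frame) - 1)
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag

def median_smooth(values, width=5):
    """Step 4: running median with edge padding (width should be odd)."""
    pad = width // 2
    padded = np.pad(values, pad, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(values))])

def track_pitch(signal, fs, frame_size=256, overlap=84, volume_ratio=0.1):
    """Run the four steps and return one pitch value (Hz) per frame.
    Frames judged silent are reported as 0."""
    frames = frame_signal(signal, frame_size, overlap)
    pitch = np.array([acf_pitch(f, fs) for f in frames])
    # Step 3: volume thresholding. Volume is the sum of absolute samples
    # after removing the frame mean; the threshold is relative to the loudest frame.
    volume = np.abs(frames - frames.mean(axis=1, keepdims=True)).sum(axis=1)
    pitch[volume < volume_ratio * volume.max()] = 0.0
    return median_smooth(pitch)

if __name__ == "__main__":
    fs = 11025
    t = np.arange(fs) / fs
    # Half a second of silence followed by one second of a 220 Hz tone.
    test_signal = np.concatenate([np.zeros(fs // 2), np.sin(2 * np.pi * 220 * t)])
    print(track_pitch(test_signal, fs))
```

Because the lag is an integer number of samples, the estimate is quantized: at fs = 11025 Hz a 220 Hz tone falls between lags 50 and 51, so the reported pitch is close to, but not exactly, 220 Hz.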

In the process of frame blocking, we allow overlap between neighboring frames to reduce discontinuity between them. We define the "frame rate" as the number of frames per second in our analysis. For instance, if fs = 11025 Hz, frame size = 256, and overlap = 84, then the frame rate is fs/(frameSize-overlap) = 11025/(256-84) ≈ 64. In other words, if we wish to have real-time pitch tracking (for instance, on a micro-controller platform), then the computer must be able to handle about 64 frames per second. A small overlap leads to a low frame rate. The process of frame blocking is shown next.
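
The frame-rate arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function names frame_rate and frame_starts are hypothetical helpers introduced for illustration, and only the numbers fs = 11025, frame size = 256, and overlap = 84 come from the text.

```python
def frame_rate(fs, frame_size, overlap):
    """Frames per second: fs divided by the hop size (frame_size - overlap)."""
    hop = frame_size - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    return fs / hop

def frame_starts(n_samples, frame_size, overlap):
    """Start index of every complete frame in a signal of n_samples samples."""
    hop = frame_size - overlap
    return list(range(0, n_samples - frame_size + 1, hop))

if __name__ == "__main__":
    rate = frame_rate(11025, 256, 84)
    print(rate)                           # about 64.1 frames per second
    print(1000.0 / rate)                  # about 15.6 ms available per frame in real time
    print(frame_rate(11025, 256, 128))    # a larger overlap gives a higher rate, about 86.1
    print(frame_starts(1000, 256, 84))    # [0, 172, 344, 516, 688]
```

The second printed value is the time budget per frame: a real-time tracker must finish each frame in roughly that many milliseconds, and raising the overlap shrinks the budget.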

The purpose of overlapping frames is to keep the variation between neighboring frames small, so that the extracted pitch curve is more continuous. In practice, however, the overlap cannot be too large, or the computational load becomes excessive. When we choose the frame size and the overlap, we need to consider the following factors.

There are a number of methods to derive a pitch value from a single frame. Generally, these methods can be classified into time-domain and frequency-domain methods, as follows.

These methods will be covered in the rest of this chapter.


Audio Signal Processing and Recognition (音訊處理與辨識)