3-2 Basic Acoustic Features

When we analyze audio signals, we usually adopt short-term analysis, since most audio signals are more or less stable within a short period of time, say 20 ms or so. When we perform frame blocking, there may be some overlap between neighboring frames to capture subtle changes in the audio signals. Each frame is the basic unit of our analysis. Within each frame, we can observe the three most distinct acoustic features, as follows.

These three acoustic features can be related to the waveform of audio signals, as follows:

Taking human voices as an example, the above three acoustic features correlate with some physical quantities:

We shall explain methods for extracting these acoustic features in other chapters of this book. It should be noted that these acoustic features mostly correspond to human "perception" and therefore cannot be represented exactly by mathematical formulas or quantities. However, we still try to "quantify" these features for further computer-based analysis, in the hope that the formulas or quantities used can emulate human perception as closely as possible.

The basic approach to the extraction of audio acoustic features can be summarized as follows:

  1. Perform frame blocking such that a stream of audio signals is converted to a set of frames. The time duration of each frame is about 20~30 ms. If the frame duration is too long, we cannot catch the time-varying characteristics of the audio signals. On the other hand, if the frame duration is too short, we cannot extract valid acoustic features. In general, a frame should contain several fundamental periods of the given audio signals. Usually the frame size (in terms of sample points) is a power of 2 (such as 256, 512, 1024, etc.) so that it is suitable for the fast Fourier transform; if it is not, the frame is zero-padded to a power of 2 before the transform.
  2. If we want to reduce the difference between neighboring frames, we can allow overlap between them. Usually the overlap is 1/2 to 2/3 of the frame size. The more overlap, the more computation is needed.
  3. Assuming the audio signals within a frame are stationary, we can extract acoustic features such as zero-crossing rate, volume, pitch, MFCC, LPC, etc.
  4. We can perform endpoint detection based on zero crossing rate and volume, and keep non-silence frames for further analysis.
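
The four steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the method used later in this book: the function names, the volume definition (sum of absolute values after removing the mean), the sign-change count used for the zero-crossing rate, and the 10% volume threshold are all choices made here for the example.

```python
import math

def frame_blocking(signal, frame_size, overlap):
    """Split samples into frames of `frame_size` points; neighboring frames
    share `overlap` points. Trailing samples that do not fill a frame are dropped."""
    if not 0 <= overlap < frame_size:
        raise ValueError("need 0 <= overlap < frame_size")
    hop_size = frame_size - overlap
    last_start = len(signal) - frame_size
    return [signal[i:i + frame_size] for i in range(0, last_start + 1, hop_size)]

def volume(frame):
    """Assumed definition: sum of absolute values after removing the mean."""
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)

def zero_crossing_rate(frame):
    """Count sign changes between consecutive samples within one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)

def keep_non_silence(frames, ratio=0.1):
    """Crude endpoint detection: keep frames whose volume exceeds
    `ratio` times the largest frame volume (hypothetical threshold)."""
    if not frames:
        return []
    volumes = [volume(f) for f in frames]
    threshold = ratio * max(volumes)
    return [f for f, v in zip(frames, volumes) if v > threshold]

# Synthetic test signal: 0.5 s of silence followed by 0.5 s of a 200 Hz sine.
fs = 16000
signal = [0.0] * (fs // 2) + [
    math.sin(2 * math.pi * 200 * n / fs) for n in range(fs // 2)
]

# 512 points at 16 kHz is 32 ms; an overlap of 256 points is 1/2 of the frame.
frames = frame_blocking(signal, frame_size=512, overlap=256)
kept = keep_non_silence(frames)
print("total frames:", len(frames))
print("non-silence frames:", len(kept))
print("zero crossings in last frame:", zero_crossing_rate(frames[-1]))
```

A volume-only threshold is the simplest possible endpoint detector; a practical one would typically combine volume with zero-crossing rate, as step 4 suggests.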

When performing the above procedures, several terms are used often:

Hint
Note that these terms are not standardized. For example, some papers use frame step to refer to hop size, or to frame rate instead. You should be cautious when reading papers that use these terms.

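For instance, if we have a stream of audio signals with sampling frequency fs = 16000 Hz, a frame duration of 25 ms, and an overlap of 15 ms, then the frame size is 16000 × 0.025 = 400 sample points, the overlap is 16000 × 0.015 = 240 sample points, the hop size is 400 - 240 = 160 sample points, and the frame rate is 16000/160 = 100 frames per second.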
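
The quantities in this example follow directly from the sampling frequency, and the arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `frame_parameters` and its argument names are illustrative assumptions, and hop size is taken here to mean frame size minus overlap.

```python
def frame_parameters(fs, frame_ms, overlap_ms):
    """Convert frame duration and overlap from milliseconds to sample points,
    then derive hop size (in points) and frame rate (in frames per second)."""
    frame_size = fs * frame_ms // 1000    # sample points per frame
    overlap = fs * overlap_ms // 1000     # sample points shared by neighbors
    hop_size = frame_size - overlap       # points between frame starts
    frame_rate = fs / hop_size            # frames per second
    return frame_size, overlap, hop_size, frame_rate

frame_size, overlap, hop_size, frame_rate = frame_parameters(16000, 25, 15)
print(frame_size, overlap, hop_size, frame_rate)  # 400 240 160 100.0
```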


Audio Signal Processing and Recognition