14-1 Introduction

Slides: qbshMain.ppt, qbshMethod.ppt, qbshDemo.ppt.

This chapter introduces methods for melody recognition. There are many types of melody recognition, but in this chapter we shall focus on melody recognition for query by singing/humming (QBSH), an intuitive interface for karaoke when one cannot recall the title or artist of the song one wants to sing. Other applications include intelligent toys, on-the-fly scoring for karaoke, edutainment tools for singing and language learning, and so on.

A QBSH system consists of three essential components:

  1. Input data: The system can take two types of inputs:
    1. Acoustic input: This includes singing, humming, or whistling from the user. For such an input, the system needs to compute the corresponding pitch contour and, optionally, convert it into a note vector for further comparison (see the pitch-to-note sketch after this list).
    2. Symbolic input: This includes music note representations entered by the user. Strictly speaking, this is not part of QBSH since no singing or humming is involved, but the computation procedure is much the same, as detailed later.
  2. Song database: This hosts the collection of songs to be compared against. For simplicity, most song databases contain symbolic music whose music scores can be extracted reliably. Typical examples of symbolic music are MIDI files, which can be classified into two types:
    1. Monophonic music: At any given time, only a single voice of an instrument can be heard.
    2. Polyphonic music: Multiple voices can be heard at the same time. Most pop music is of this type.
    Most of the time, we use monophonic symbolic music for melody recognition systems.

    Theoretically, it is possible to construct the song database from polyphonic audio music (such as MP3 files). However, reliably extracting the dominant pitch (for instance, the vocal line of a pop song) from polyphonic audio remains a challenging problem. (An analogy: it is like trying to identify the objects swimming in a pond from the waveforms observed at only two points.)

  3. Methods for comparison: There are at least three types of methods for comparing the input pitch representation with the songs in the database:
    1. Note-based methods: The input is converted into a note vector and compared with each song in the database. Such methods are efficient since a note vector is much shorter than the corresponding pitch vector. However, note segmentation itself may introduce errors, degrading recognition performance. A typical method of this type is edit distance between music note vectors (a minimal sketch follows this list).
    2. Frame-based methods: The input and the database songs are compared as frame-based pitch vectors, where the pitch rate (or frame rate) can vary from 8 to 64 points per second. The major advantage of these methods is effectiveness (better recognition), at the cost of more computation. Typical methods include linear scaling (LS) and dynamic time warping (DTW, types 1 and 2); a linear-scaling sketch follows this list.
    3. Hybrid methods: The input is frame-based while each database song is note-based. Typical methods include type-3 DTW and HMM (hidden Markov models); a frame-versus-note DTW sketch follows this list.
We shall introduce these methods in the subsequent sections.
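
Below is a minimal sketch (in Python; the function names and the 31.25-points-per-second frame rate are our illustrative choices, not fixed by the chapter) of the pitch-to-note conversion mentioned in item 1: frame-based pitch in Hz is mapped to the semitone scale, and consecutive frames with the same rounded semitone are merged into (pitch, duration) notes.

    import math

    def hz_to_semitone(freq):
        # Standard MIDI semitone scale: A4 = 440 Hz maps to 69.
        return 69 + 12 * math.log2(freq / 440.0)

    def pitch_to_notes(pitch_hz, frame_rate=31.25):
        # Naive note segmentation: merge consecutive frames whose rounded
        # semitone values agree into a single (semitone, duration) note.
        notes = []
        for f in pitch_hz:
            s = round(hz_to_semitone(f))
            if notes and notes[-1][0] == s:
                notes[-1][1] += 1.0 / frame_rate      # extend the current note
            else:
                notes.append([s, 1.0 / frame_rate])   # start a new note
        return [(s, round(d, 3)) for s, d in notes]

    # Eight frames of a synthetic pitch contour in Hz.
    contour = [262.1, 261.8, 262.5, 294.0, 293.6, 329.9, 330.2, 330.0]
    print(pitch_to_notes(contour))   # roughly C4, D4, E4 with durations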
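For the note-based comparison in item 3.1, here is a minimal edit-distance sketch. Comparing pitch-difference sequences rather than raw pitches is a common trick for key invariance; the chapter does not prescribe this particular setup, and the example data are made up.

    def edit_distance(a, b):
        # Classic dynamic-programming edit distance between two vectors.
        # dp[j] holds the distance between the prefixes a[:i] and b[:j].
        dp = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            prev, dp[0] = dp[0], i
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
                prev, dp[j] = dp[j], min(dp[j] + 1,       # deletion
                                         dp[j - 1] + 1,   # insertion
                                         prev + cost)     # match/substitute
        return dp[-1]

    # Compare pitch-difference sequences so the result is key-invariant.
    query_notes = [60, 62, 64, 65]        # C4 D4 E4 F4
    song_notes  = [62, 64, 66, 66, 67]    # same tune up 2 semitones, one extra note
    diff = lambda v: [y - x for x, y in zip(v, v[1:])]
    print(edit_distance(diff(query_notes), diff(song_notes)))   # small distance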
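For the frame-based comparison in item 3.2, the sketch below illustrates linear scaling: the query pitch vector is stretched or compressed to several candidate lengths via linear interpolation, and the best mean absolute distance to the same-length prefix of a reference pitch vector is kept. The scaling factors, mean-subtraction for key invariance, and the distance measure are illustrative choices, not prescribed by the chapter.

    def resample(v, m):
        # Linearly interpolate vector v to length m (m >= 2).
        n = len(v)
        out = []
        for i in range(m):
            x = i * (n - 1) / (m - 1)
            lo = int(x)
            hi = min(lo + 1, n - 1)
            out.append(v[lo] + (x - lo) * (v[hi] - v[lo]))
        return out

    def linear_scaling(query, ref, factors=(0.5, 0.75, 1.0, 1.25, 1.5, 2.0)):
        # Try several global tempo factors; compare the scaled query with the
        # same-length prefix of the reference pitch vector.
        best = float('inf')
        for f in factors:
            m = int(round(len(query) * f))
            if m < 2 or m > len(ref):
                continue
            q = resample(query, m)
            r = ref[:m]
            qm = sum(q) / m     # mean subtraction gives crude
            rm = sum(r) / m     # transposition (key) invariance
            d = sum(abs((a - qm) - (b - rm)) for a, b in zip(q, r)) / m
            best = min(best, d)
        return best

    # Query sung at a different tempo and key than the reference (semitones).
    query = [60, 60, 62, 62, 64, 64]
    ref   = [57, 57, 57, 59, 59, 59, 61, 61, 61, 61]
    print(round(linear_scaling(query, ref), 3))   # small distance = good match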
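For the hybrid comparison in item 3.3, the following sketch captures the general idea of type-3 DTW: the query is a frame-based pitch vector, the reference is a note vector, and each reference note absorbs one or more consecutive query frames. The exact local path constraints are detailed in later sections; the names and penalty choices here are ours.

    def frame_vs_note_dtw(frames, notes):
        # frames: query pitch vector in semitones, one value per frame
        # notes:  reference note pitches in semitones (durations ignored here)
        # D[i][j] is the cost of aligning frames[:i+1] with notes[:j+1],
        # where each note must absorb at least one consecutive frame.
        INF = float('inf')
        n, m = len(frames), len(notes)
        D = [[INF] * m for _ in range(n)]
        D[0][0] = abs(frames[0] - notes[0])
        for i in range(1, n):
            for j in range(m):
                stay = D[i - 1][j]                           # frame i stays on note j
                advance = D[i - 1][j - 1] if j > 0 else INF  # frame i starts note j
                if min(stay, advance) < INF:
                    D[i][j] = abs(frames[i] - notes[j]) + min(stay, advance)
        return D[n - 1][m - 1]   # all frames consumed, last note reached

    frames = [60, 60, 60, 62, 62, 64, 64, 64]   # sung contour (semitones per frame)
    notes  = [60, 62, 64]                       # reference notes C4 D4 E4
    print(frame_vs_note_dtw(frames, notes))     # 0 for a perfect match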
