14-1 Introduction (簡介)

(Note: the Chinese version has not been updated in sync with the English version!) Slides: qbshMain.ppt, qbshMethod.ppt, qbshDemo.ppt.
This chapter introduces methods for melody recognition. Melody recognition comes in many forms; this chapter focuses on melody recognition for query by singing/humming (QBSH), which is an intuitive interface for karaoke when one cannot recall the title or artist of the song to sing. Other applications include intelligent toys, on-the-fly scoring for karaoke, edutainment tools for singing and language learning, and so on.
A QBSH system consists of three essential components:

Input data: The system can take two types of inputs:

Acoustic input: This includes singing, humming, or whistling from the user. For such input, the system needs to compute the corresponding pitch contour and, optionally, convert it into a note vector for further comparison.
Symbolic input: This includes a music-note representation entered by the user. This is usually not considered part of QBSH since no singing or humming is involved, but the computation procedure is much the same, as detailed later.
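To make the acoustic-input step concrete, here is a minimal sketch of turning per-frame fundamental frequencies into a pitch contour in semitones and then, optionally, into a note vector. The Hz-to-semitone mapping is the standard MIDI convention (A4 = 440 Hz = note 69). Everything else is an assumption for illustration: the function names are hypothetical, a frequency of 0 is taken to mark an unvoiced frame, and the "note segmentation" here is just run-length grouping of rounded pitches, which is far cruder than what a real system would use.

```python
import math

def hz_to_semitone(freq_hz, ref_hz=440.0, ref_midi=69):
    """Map a frequency in Hz to a fractional MIDI note number.

    Returns None for non-positive input (assumed to mean an unvoiced frame).
    """
    if freq_hz <= 0:
        return None
    return ref_midi + 12.0 * math.log2(freq_hz / ref_hz)

def pitch_contour(freqs_hz):
    """Convert a list of per-frame frequencies into a pitch contour in semitones."""
    return [hz_to_semitone(f) for f in freqs_hz]

def contour_to_notes(contour):
    """Naive note segmentation (illustrative assumption only).

    Consecutive frames that round to the same semitone are merged into one
    note; unvoiced frames (None) separate notes and are then dropped.
    Returns a list of (semitone, duration_in_frames) pairs.
    """
    runs = []
    for p in contour:
        key = None if p is None else round(p)
        if runs and runs[-1][0] == key:
            runs[-1][1] += 1
        else:
            runs.append([key, 1])
    return [(key, count) for key, count in runs if key is not None]

# Hypothetical per-frame frequencies: two frames of A4, a pause, three frames of A5.
frames = [440.0, 440.0, 0.0, 880.0, 880.0, 880.0]
notes = contour_to_notes(pitch_contour(frames))
```

The pitch contour keeps one value per frame, while the note vector collapses it into a handful of (pitch, duration) pairs, which is why note-based comparison is cheaper but sensitive to segmentation errors.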

Song database: This hosts the collection of songs to be compared against. For simplicity, most song databases contain symbolic music, whose music scores can be extracted reliably. Typical examples of symbolic music are MIDI files, which can be classified into two types:

Monophonic music: At any given time, only a single voice of an instrument can be heard.
Polyphonic music: Multiple voices can be heard at the same time. Most pop music is of this type.
Most of the time, we use monophonic symbolic music for melody recognition systems.
Theoretically, it is possible to use polyphonic audio music (such as MP3) to construct the song database. However, reliably extracting the dominant pitch (from the vocals of a pop song, for instance) from polyphonic audio music remains a tough and challenging problem. (An analogy is identifying swimming objects on a pond from the observed waveforms coming from two channels.)
Methods for comparison: There are at least three types of methods for comparing the input pitch representation with songs in the database:

Note-based methods: The input is converted into a note vector and compared with each song in the database. This method is efficient since a note vector is much shorter than the corresponding pitch vector. However, note segmentation itself may introduce errors, which degrades recognition performance. Typical methods of this type include edit distance for music note vectors.
Frame-based methods: The input and database songs are compared in the format of frame-based pitch vectors, where the pitch rate (or frame rate) can vary from 8 to 64 points per second. The major advantage of this method is better recognition accuracy, at the cost of more computation. Typical methods include linear scaling (LS) and dynamic time warping (DTW, including type-1 and type-2).
Hybrid methods: The input is frame-based while the database song is note-based. Typical methods include type-3 DTW and HMM (Hidden Markov Models).
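As a rough illustration of the note-based family, the sketch below ranks database songs by the edit distance between note vectors. The edit distance itself is the standard Levenshtein dynamic program with unit costs. The rest is assumed for the example: the function names and the tiny database are hypothetical, durations are ignored, and comparing pitch intervals (differences between successive notes) instead of absolute pitches is one common way to tolerate singing in a different key, not necessarily the approach used later in this chapter.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]

def intervals(notes):
    """Pitch differences between successive notes (key-invariant representation)."""
    return [b - a for a, b in zip(notes, notes[1:])]

def rank_songs(query_notes, database):
    """Return song names sorted by interval-based edit distance to the query."""
    q = intervals(query_notes)
    scored = [(edit_distance(q, intervals(notes)), name)
              for name, notes in database.items()]
    return [name for _, name in sorted(scored)]

# Hypothetical database of note vectors (MIDI note numbers, durations omitted).
database = {
    "song_a": [60, 62, 64, 65, 67],
    "song_b": [67, 65, 64, 62, 60],
}
query = [62, 64, 66, 67, 69]  # song_a sung two semitones higher
ranking = rank_songs(query, database)
```

Because each song is reduced to a few notes, the dynamic-programming table stays small, which reflects the efficiency argument made for note-based methods above.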

We shall introduce these methods in the subsequent sections.
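As a preview of the frame-based family, the following is a simplified sketch of linear scaling (LS): the query pitch vector is stretched or compressed by several candidate factors, and each scaled version is compared with the start of a song's pitch vector. Several details are assumptions made for this sketch rather than the chapter's definition: the set of scaling factors, the use of linear interpolation for resampling, matching only from the beginning of the song, shifting the query by the difference of means to handle key transposition, and the mean absolute difference as the distance. All names are hypothetical.

```python
def resample(vec, new_len):
    """Resample a pitch vector to new_len points by linear interpolation."""
    if new_len == 1 or len(vec) == 1:
        return [float(vec[0])] * new_len
    out = []
    for i in range(new_len):
        pos = i * (len(vec) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out

def linear_scaling_distance(query, song,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5, 2.0)):
    """Smallest mean absolute pitch difference over all scaling factors.

    Scaled queries longer than the song are skipped. Returns infinity if
    no factor yields a usable length.
    """
    best = float("inf")
    for factor in factors:
        n = round(len(query) * factor)
        if n < 1 or n > len(song):
            continue
        scaled = resample(query, n)
        target = song[:n]
        # Assumed key transposition: align the means of the two vectors.
        shift = sum(target) / n - sum(scaled) / n
        dist = sum(abs(q + shift - t) for q, t in zip(scaled, target)) / n
        best = min(best, dist)
    return best

# Hypothetical pitch vectors in semitones, one value per frame.
song = [60, 60, 62, 62, 64, 64, 65, 65]
other = [60, 67, 60, 67, 60, 67, 60, 67]
query = [p + 3 for p in song]  # same melody, three semitones higher

d_match = linear_scaling_distance(query, song)
d_other = linear_scaling_distance(query, other)
```

Every frame of every scaled query is compared against the song, so the cost grows with the frame rate and the number of factors, which is the extra computation that frame-based methods trade for accuracy.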
This chapter introduces various methods for melody recognition. A melody recognition system consists of the following three parts:

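Input: This is the input received by the system, such as the user's singing or humming, or notes entered by the user. In general, the system must first convert this input into a comparable format, such as a pitch vector or a note vector, before passing it to the next stage for comparison.
Database: The database contains the songs available for comparison within the system. Likewise, these songs must be preprocessed into a comparable format. The simplest format is monophonic data, which contains only pitch and duration information and has at most one sound at any given time. This is the so-called "monophonic music"; single-track MIDI and unaccompanied singing both belong to this type. In contrast, the MP3 pop music or classical symphonies that we commonly hear usually have multiple instruments sounding at the same time, so they are "polyphonic music".
Comparison methods: Methods for comparing the input vector with the songs in the database generally fall into three categories: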

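Methods with note segmentation: The input signal and the database songs are both compared in units of notes (containing pitch and duration information). The advantage of this approach is faster comparison, but note segmentation itself may introduce errors, which lowers the recognition rate. Typical methods include edit distance.
Methods without note segmentation: The input signal and the database songs are both represented as pitch vectors, with 8 to 32 pitch points per second. The advantage of this approach is a higher recognition rate, at the cost of more computation. Typical methods include linear scaling and type-1 and type-2 DTW (Dynamic Time Warping).
Hybrid methods: The input signal is not segmented into notes, but the database songs are stored in units of notes. Typical methods include type-3 DTW and HMM (Hidden Markov Models).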

This chapter explains these commonly used methods for melody recognition.
Audio Signal Processing and Recognition (音訊處理與辨識)