14-7 LCS and Edit Distance

We can use a note-based approach for melody recognition. First, we need to segment the input query pitch vector into music notes. Assume the input pitch vector is represented by pv(i), i=1~n, and let q be a pitch-difference threshold. The most straightforward method for note segmentation can then be described by the following pseudocode:
Start the first note with pv(1)
for i=2:n
	if |pv(i)-pv(i-1)|>q & the current note is long enough
		Finish the current note and start a new note with pv(i)
	else
		Add pv(i) to the current note
	end
end
Finish the last note
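The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the code of pv2note.m: the function name `segment_notes`, the default threshold `q=0.8` semitone, the minimum note length `min_len=3` frames, and the choice to represent each note by the mean pitch of its frames are all assumptions made for the example.

```python
def segment_notes(pv, q=0.8, min_len=3):
    """Split a pitch vector (one semitone value per frame) into notes.

    Returns a list of (pitch, duration_in_frames) pairs. The pitch of a note
    is taken as the mean of its frames (an assumption of this sketch).
    """
    if not pv:
        return []
    notes = []
    current = [pv[0]]  # frames of the note being built
    for i in range(1, len(pv)):
        jump = abs(pv[i] - pv[i - 1]) > q
        if jump and len(current) >= min_len:
            # Finish the current note and start a new one with pv[i].
            notes.append((sum(current) / len(current), len(current)))
            current = [pv[i]]
        else:
            # Add pv[i] to the current note.
            current.append(pv[i])
    # Finish the last note.
    notes.append((sum(current) / len(current), len(current)))
    return notes


pv = [60.1, 59.9, 60.0, 60.2, 62.1, 61.9, 62.0, 64.0, 64.1, 63.9, 64.0]
for pitch, duration in segment_notes(pv):
    print(round(pitch, 2), duration)
```

Note that, as in the pseudocode, a pitch jump that occurs before the current note is long enough is absorbed into the current note, so very short notes are merged with their neighbors.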
The following example demonstrates the use of pv2note.m for note segmentation:
Example 1: noteSegment01.m

The original pitch vector: pitchVector.wav
Notes segmented from the pitch vector: noteFromPv.wav


The method implemented by pv2note.m is the simplest approach to note segmentation. Possible enhancements include:

Finish the current note whenever there is a rest (a region of very low volume).
If the singer has good absolute pitch, we can round each note to the nearest integer semitone. If the singer has good relative pitch instead, we can shift all the notes up or down by a common amount and choose the shift that minimizes the total absolute rounding error.
We can also use dynamic programming (DP) to find the best note segmentation, such that the difference between the original pitch vector and the segmented notes is minimized.
Finish the current note whenever there is a breathy sound.
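The rounding enhancement for a singer with good relative pitch can be sketched as below. The function name `quantize_notes` and the grid search over 100 candidate shifts are assumptions of this sketch; only shifts within half a semitone are tried, because a shift by a whole semitone does not change the rounding error.

```python
import math


def quantize_notes(pitches, steps=100):
    """Round note pitches to integer semitones after the best common shift.

    Tries `steps` shifts in [-0.5, 0.5) semitone and keeps the one that
    minimizes the total absolute rounding error.
    Returns (best_shift, rounded_pitches).
    """
    best = None
    for k in range(steps):
        shift = -0.5 + k / steps
        shifted = [p + shift for p in pitches]
        # floor(x + 0.5) rounds halves upward, avoiding Python's
        # round-half-to-even behavior.
        rounded = [math.floor(p + 0.5) for p in shifted]
        error = sum(abs(s - r) for s, r in zip(shifted, rounded))
        if best is None or error < best[0]:
            best = (error, shift, rounded)
    return best[1], best[2]


# A singer who is consistently about 0.4 semitone sharp.
shift, rounded = quantize_notes([60.4, 62.4, 64.5, 65.4])
print(shift, rounded)
```

For a singer with good absolute pitch, the shift search is unnecessary and each pitch can simply be rounded.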

Once we finish note segmentation, we can use the note-based representation for melody recognition, as follows.

In the first method, we only consider the pitch difference between two neighboring notes, regardless of each note's absolute pitch and duration. For instance, if the note pitch is [69, 60, 58, 62, 62, 67, 69], we can convert it into a ternary string [D, D, U, S, U, U], where D (down), U (up) and S (same) describe the relationship between two neighboring notes. Once we have converted both the input query and the reference song into ternary strings, we can invoke LCS (longest common subsequence) or ED (edit distance) to compute their distance.
In the second method, we use the note pitch directly, regardless of the note duration. For instance, if the note pitch is [69, 60, 58, 62, 62, 67, 69], we can simply invoke type-1 or type-2 DTW to compute the distance. Since we are using the note pitch directly, we need to perform key transposition before invoking DTW. On the other hand, if we use the difference of the note pitch for comparison, then key transposition is not needed.
In the third method, we consider both note pitch and note duration. We can still use a DTW-like method for comparison, except that note pitch and note duration are handled separately. For note pitch, we can take the difference between neighboring notes to deal with key variation. For note duration, we can take the ratio of neighboring note durations to deal with tempo variation. We can then invoke DTW with a cost function that is a weighted average of the pitch cost and the duration cost.
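The first method can be sketched as follows, using the textbook dynamic-programming formulations of LCS and edit distance. The function names are chosen for this example, and the sketch assumes the note pitches have already been rounded to integers, so that an exact comparison is enough to detect "same".

```python
def contour(pitches):
    """Convert note pitches to a U/D/S string of neighboring-note moves."""
    moves = []
    for prev, cur in zip(pitches, pitches[1:]):
        moves.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(moves)


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions (unit cost)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i
    for j in range(len(b) + 1):
        table[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + substitution)
    return table[-1][-1]


reference = contour([69, 60, 58, 62, 62, 67, 69])  # "DDUSUU"
# Same melody sung 5 semitones lower, with the repeated note merged into one.
query = contour([64, 55, 53, 57, 62, 64])          # "DDUUU"
print(lcs_length(reference, query), edit_distance(reference, query))  # 5 1
```

Because only the direction of each move is kept, the transposed query needs no key transposition, and the merged note costs a single deletion.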

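The second and third methods can be sketched with a plain DTW that allows the three standard predecessor steps; it is not claimed to match the type-1 or type-2 DTW definitions exactly. Several choices here are assumptions of this sketch: key transposition is done by matching the mean pitch of the query to that of the reference, the duration feature is the base-2 logarithm of the neighboring duration ratio (so that doubling and halving cost the same), and the weight `w` defaults to 0.5.

```python
import math


def dtw(a, b, cost):
    """Total cost of the cheapest DTW alignment between sequences a and b."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
            d[i][j] = cost(a[i - 1], b[j - 1]) + step
    return d[-1][-1]


def pitch_distance(query, ref):
    """Method 2: DTW on note pitch after mean-based key transposition."""
    shift = sum(ref) / len(ref) - sum(query) / len(query)
    transposed = [p + shift for p in query]
    return dtw(transposed, ref, lambda x, y: abs(x - y))


def relative_features(notes):
    """Method 3 features: (pitch difference, log2 duration ratio) per pair."""
    return [(p1 - p0, math.log2(d1 / d0))
            for (p0, d0), (p1, d1) in zip(notes, notes[1:])]


def note_distance(query_notes, ref_notes, w=0.5):
    """Method 3: DTW with a weighted average of pitch and duration cost."""
    def cost(x, y):
        return w * abs(x[0] - y[0]) + (1 - w) * abs(x[1] - y[1])
    return dtw(relative_features(query_notes),
               relative_features(ref_notes), cost)


ref = [69, 60, 58, 62, 62, 67, 69]
print(pitch_distance([p - 5 for p in ref], ref))  # close to 0

ref_notes = [(69, 4), (60, 2), (58, 2), (62, 4)]      # (pitch, duration)
query_notes = [(64, 8), (55, 4), (53, 4), (57, 8)]    # lower key, half tempo
print(note_distance(query_notes, ref_notes))          # 0.0
```

The relative features make the third method insensitive to both key and tempo, which is why the query sung in a lower key at half the tempo still has zero distance to the reference.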

Audio Signal Processing and Recognition (音訊處理與辨識)