14-7 LCS and Edit Distance

[chinese][all]
We can use note-based approach for melody recognition. First of all, we need to segment the input query pitch vector into music notes. Assume the input pitch vector is represented by pv(i), i=1~n, then the most straightforward method for note segmentation can be described as the pseudo code shown next:
for i=1:n
	if |pv(i)-pv(i-1)|>q & the current note is long enough
		Finish the current note and start a new note
	else
		Add pv(i) to the current note
	end
end
The following example demonstrates the use of pv2note.m for note segmentation:
Example 1: noteSegment01.m

The original pitch vector: pitchVector.wav
Notes segmented from the pitch vector: noteFromPv.wav

In fact, the method implemented by pv2note.m is the simplest way for note segmentation. Possible enhancements include:

Finish the current note whenever there is a rest.
If the singer has a good absolute pitch, we can round off each note to the nearest integer. On the other hand, if the singer has a good relative pitch, we can shift up and down to find a best shift amount that minimizes the absolute error.
We can also employ the concept of DP to find the best way for note segmentation, such that the difference between the original pitch vector and the segmented notes is minimized.

Once we finish note segmentation, we can use the note-based representation for melody recognition, as follows.

In the first method, we only consider the pitch difference between two neighboring notes, regardless of the note's absolute pitch and duration. For instance, if the note pitch is [69, 60, 58, 62, 62, 67, 69], we can convert it into a ternary string [D, D, U, S, U, U] where D (down), U (up) and S (same) are used to described the relationship between two neighboring notes. Once we have converted both the input query and the reference song into two ternary string vectors, we can invoke LCS (longest common subsequence) or ED (edit distance) to compute their distance.
In the second method, we use the note pitch directly, regardless of the note duration. For instance, if the note pitch is [69, 60, 58, 62, 62, 67, 69], we can simply invoke type-1 or type-2 DTW to compute the distance. Since we are using the note pitch directly, we need to perform key transposition before invoking DTW. On the other hand, if we are using the difference of the note pitch for comparison, then we do not need to invoke key transposition.
In the third method, we need to consider both note pitch and note duration. We can still use DTW-like method for comparison, except that we need to consider note pitch and note duration separately. For note pitch, we can take the difference to deal with key variation. For note duration, we can take the ratio of neighboring note duration to take care of tempo variation. We can then invoke DTW that compute the cost function as a weighted average of pitch and duration cost.

Audio Signal Processing and Recognition (音訊處理與辨識)