14-9 Exercise: QBSH via Optimum Assignment of Singing Pitch to Music Notes
Note that the description of this homework may be lengthy, but the concept is pretty straightforward. If you have questions about this homework, please come to our TA's office hours.
QBSH via Optimum Assignment of Singing Pitch to Music Notes
Dynamic programming (DP) is a very important and useful method for sequence comparision. It has been used extensively in speech recognition, natural language processing, music analysis, and a variety of other areas. In this homework, we shall explore the use of DP for query by singing/humming (QBSH), which is an interesting paradigm of music information retrieval. (For simplicity, we shall assume our QBSH system always starts the comparison from the beginning of a song.)
In particular, you need to find the alignment between the pitch vector of human's singing and the note vector of a given melody, both in the unit of semitone or MIDI number. The correspondence between MIDI numbers and piano keyboard can be found here. Moreover,
The note vector is the music notes corresponging to the intended song of the singing. For simplicity, there no duration information in the vector. Therefore for the song of "Twinkle twinkle little star", here is the music note vector for the first two phrases: [60 60 67 67 69 69 67 65 65 64 64 62 62 60].
Given a pitch vector and a note vector, your mission is to find an optimum mapping to assign each pitch point to a music note, such that the total distance is minimized. To be more specific, let's use the following notation:
$p(i)$ is element $i$ of the pitch vector, where $1 \leq i \leq m$.
$q(j)$ is element $j$ of the note vector, where $1 \leq j \leq n$.
The alignment path can be denoted by its coordinates, that is, $\{(1, u_1), (2, u_2), ..., (m, u_{m})\}$, where $u_k, k=1, 2, \cdots, m$ are indices into the note vector with the following constraints:
$u_1=1$.
$u_1 \leq u_2 \leq u_3 \cdots \leq u_{m}$.
For instance, the alignment path of the above figure is ${(1, 1),(2, 1),(3, 2),(4, 2),(5, 2),(6, 2),(7, 3),(8, 3),(9, 4),(10, 4),(11, 4),(12, 4)}$, with $m=12$ (the length of $p$) and $u_{m}=3$ (the last index of music note that have at least one pitch point being assigned to). In other words:
p(1) and p(2) are assigned to q(1).
p(3), p(4), p(5) and p(6) are assigned to q(2).
p(7) and p(8) are assigned to q(3).
p(9), p(10), p(11) and p(12) are assigned to q(4).
The distance between a pitch vector $p$ and note vector $q$ can then be defined as follows:
$$
dist(p, q)=\min_{u_1, u_2, ... u_{m}}\sum_{i=1}^{m} |p(i)-q(u_i)|,
$$
subject to $u_1=1$ and $u_1 \leq u_2 \leq u_3 \cdots \leq u_{m}$.
Approach
Based on the above description, we can solve the task using DP. The 3-step formula for DP can be described as follows:
Optimum-value function $D(i,j)$ is defined as the minimum distance between $p(1:i)$ and $q(1:j)$.
Recurrent equation for $D(i,j)$ is shown next, with $i>0$ and $j>0$:
$$
D(i,j)=|p(i)-q(j)|
+\min
\left\{
\begin{matrix}
D(i-1,j)\\
D(i-1, j-1)\\
\end{matrix}
\right.
$$
(See the local constraint in the previous figure.) Please derive the recurrent equation when $i=1$ or $j=1$ by yourself.
The boundary condition is $D(1, 1)=|p(1)-q(1)|$.
Answer to the original task:
$$
dist(p, q) = \min_{1\leq j \leq n}D(m, j).
$$
Input/output formats
You need to write a function myPvAlign.m with the following input/output format:
[minDistance, dpPath]=myPvAlign(p, q);
Note that
p is the pitch vector
q is the music note vector
minDistance is the min. distance after alignment
dpPath is a 2-by-m arrays indicating the best alignment path, where each column is a point in the path.
Requirements & suggestions
You need to use DP for this homework, otherwise it will be too slow.
If there are repeated notes, the optimal path might not be unique. As a result, TA's judge system will check your output based on the returned minimum distance. If the deviation between your distance and TA's distance is less than 1, it'll be considered as correct.
The alignment path shown in the previous plot is based on X-Y mode, which is commonly used in the Cartesian coordinate system. If you want to adopt I-J mode (commonly used in matrix indexing), you can simply rotate the plot by 90-degree clockwise.
The process of converting a singing clip into its pitch vector is call pitch tracking, which is out of the scope of this class.