Linear scaling (LS for short), also known as uniform scaling or global scaling, is the most straightforward frame-based method for melody recognition. Typical linear scaling involves the following three steps:
Use interpolation to expand or compress the input pitch vector linearly. The scaling factor, defined as the length ratio between the scaled and the original vectors, could range from 0.5 to 2.0. If we take the step of 0.1 for the scaling factor, it leads to 16 expanded or compressed versions of the original pitch vector. Of course, the scaling-factor bounds ([0.5, 2.0] for instance) and the resolution (the number of scaled pitch vectors, 16 for instance) are all adjustable parameters.
Compare these 16 time-scaled versions with each song in the database. The minimum distance is then defined as the distance of the input pitch vector to a song in the database. You can simply use L1 or L2 norm to compute the distance, with appropriate processing for key transposition, to be detailed later.
For a given input pitch vector, compute the distances to all songs in the database. The song with the minimum distance is the most likely song for the input pitch vector.
The following plot is a typical example of linear scaling, with a scaling-factor bounds of [0.5, 2] and a resolution of 5. When the scaling factor is 1.5, the minimum distance is achieved.
Simple as it is, in practice, we still need to consider the following detailed issues:
Method for interpolation: Simple linear interpolation should suffice. Other advanced interpolations may be tried only if they will not make the computation prohibitive.
Distance measures: We can use L1 norm (the sum of absolute difference of elements, also known as taxicab distance or Manhattan distance) or L2 norm (square root of the sum of squared difference of elements, also known as Euclidean distance) to compute the distance.
Distance normalization: Usually we need to normalize the distance by the length of the vector to eliminate the biased introduced via expansion or compression.
Key transposition: To achieve invariance in key transposition, we need to shift the input pitch vector to achieve a minimum distance when compared to the songs in the database. For different distance measures, we have different schemes for key transposition:
For L1 norm, we can shift the input pitch vector to have the same median as that of each song in the database.
For L2 norm, we can shift the input pitch vector to have the same mean as that of each song in the database.
Rest handling: In order to preserve the timing information, we usually replace the rest with previous non-rest pitch for both input pitch vector and songs in the database. One typical example is "Row, Row, Row Your Boat" (original site, local copy).
Characteristics of LS for melody recognition can be summarized as follows:
If the user's singing or humming is at a constant pace, LS usually gives satisfactory performance.
LS is also very efficient both in its computation and the one-shot way to handle key transposition.
The following example is a typical result of using LS to find the best scaling ratio of the input query:
LS is rather robust to the use of different distance metrics, as shown in the following example where L1 and L2 norms are used to obtain the same result.