14-4 Linear Scaling (線性伸�

[chinese][english]

(½Ðª`·N¡G¤¤¤åª©¥»¨Ã¥¼ÀH­^¤åª©¥»¦P¨B§ó·s¡I)

Linear scaling (LS for short), also known as uniform scaling or global scaling, is the most straightforward frame-based method for melody recognition. Typical linear scaling involves the following three steps:

¦b¤£¤Á­µ²Åªº±¡ªp¤U¡A³Ì²³æªº¿ëÃѤèªk¡A´N¬O½u©Ê¦ùÁY¡]Linear Scaling¡^¡A¬yµ{¦p¤U¡G

  1. Use interpolation to expand or compress the input pitch vector linearly. The scaling factor, defined as the length ratio between the scaled and the original vectors, could range from 0.5 to 2.0. If we take the step of 0.1 for the scaling factor, it leads to 16 expanded or compressed versions of the original pitch vector. Of course, the scaling-factor bounds ([0.5, 2.0] for instance) and the resolution (the number of scaled pitch vectors, 16 for instance) are all adjustable parameters.
  2. Compare these 16 time-scaled versions with each song in the database. The minimum distance is then defined as the distance of the input pitch vector to a song in the database. You can simply use L1 or L2 norm to compute the distance, with appropriate processing for key transposition, to be detailed later.
  3. For a given input pitch vector, compute the distances to all songs in the database. The song with the minimum distance is the most likely song for the input pitch vector.
  1. ¨Ï¥Î¤º®tªk¡A±N¨Ï¥ÎªÌ¿é¤Jªº­µ°ª¦V¶q¶i¦æ½u©Ê©Ôªø©ÎÀ£ÁY¡A¨Ò¦p¦ùÁY¤ñ¨Ò¥i¥H¬O±q 0.5 ¨ì 2.0¡A¸õ¶Z¬O 0.1¡A¦@¥Í¥X 16 ­Óª©¥»¡C
  2. ±N³o16­Óª©¥»©M¸ê®Æ®w¤¤ªº¨C¤@­ººq¦±¶i¦æ¤ñ¹ï¡A±o¨ì16­Ó¶ZÂ÷¡A¨ä¤¤ªº³Ì¤p­È¡A§Y¬O¿é¤J¦V¶q©M¦¹­ººqªº¶ZÂ÷¡C
  3. ¹ï©Ò¦³¸ê®Æ®wºq¦±¶i¦æ¤ñ¹ï¡A³Ìµu¶ZÂ÷ªÌ¡A§Y¬O¨Ï¥ÎªÌ©Ò°Ûªººq¡C

The following plot is a typical example of linear scaling, with a scaling-factor bounds of [0.5, 2] and a resolution of 5. When the scaling factor is 1.5, the minimum distance is achieved.

¤U­±¬O¤@­Ó½u©Ê¦ùÁYªº¥Ü·N¹Ï¡A¦@¦ùÁY¤­¦¸¡A·í¦ùÁY¤ñ¨Ò¬O 1.5 ®É¡A¥i¥H¹F¨ì³Ì¨Îªº¤ñ¹ï®ÄªG¡G

Simple as it is, in practice, we still need to consider the following detailed issues:

¦b¹ê°µ¤W¡AÁÙ¦³¤U¦C²Ó¸`­n¦Ò¼{¡G

  1. Method for interpolation: Simple linear interpolation should suffice. Other advanced interpolations may be tried only if they will not make the computation prohibitive.
  2. Distance measures: We can use L1 norm (the sum of absolute difference of elements, also known as taxicab distance or Manhattan distance) or L2 norm (square root of the sum of squared difference of elements, also known as Euclidean distance) to compute the distance.

    Hint
    Lp norm of vectors x and y = [S|xi-yi|p]1/p

  3. Distance normalization: Usually we need to normalize the distance by the length of the vector to eliminate the biased introduced via expansion or compression.
  4. Key transposition: To achieve invariance in key transposition, we need to shift the input pitch vector to achieve a minimum distance when compared to the songs in the database. For different distance measures, we have different schemes for key transposition:
    • For L1 norm, we can shift the input pitch vector to have the same median as that of each song in the database.
    • For L2 norm, we can shift the input pitch vector to have the same mean as that of each song in the database.
  5. Rest handling: In order to preserve the timing information, we usually replace the rest with previous non-rest pitch for both input pitch vector and songs in the database. One typical example is "Row, Row, Row Your Boat" (original site, local copy).
  1. ¤º®tªkªº¿ï¥Î¡G¥i¥H¨Ï¥Î²³æªº½u©Ê¤º®t¡C
  2. ¶ZÂ÷ªº¿ï¾Ü¡G³q±`§Ú­Ì¨Ï¥Î L1 norm¡A¤]´N¬O­pºâ¨C­Ó¹ïÀ³¤¸¯Àµ´¹ï®t­Èªº©M¡A©Î¬O¨Ï¥Î L2 norm¡A¤SºÙ¬°¼Ú°ò²z¼w¶ZÂ÷¡A¤]´N¬O­pºâ¨C­Ó¹ïÀ³¤¸¯À®t­Èªº¥­¤è©M¡A¦A¶}¥­¤è¡A¦ý¦b¹ê°µ¤W¡A§Ú­Ì³q±`¥u¦b¤ñ¸û¶ZÂ÷ªº¤j¤p¡A¦]¦¹±`±`¬Ù²¤¶}¥­¤èªº°Ê§@¡A¥H¸`¬Ù­pºâ¡C

    Hint
    Lp norm of vectors x and y = [S|xi-yi|p]1/p

  3. ¶ZÂ÷ªº¥¿³W¤Æ¡G§Ú­Ì·|±NÁ`¶ZÂ÷°£¥HÂI¼Æ¡A±o¨ì¥¿³W¤Æªº¶ZÂ÷¡A¥H®ø°£¦]¦ùÁY³y¦¨ÂI¼Æ¤£¦P©Ò±a¨Óªº¼vÅT¡C
  4. ­µ°ªªº®Õ¥¿¡G¨C¤@­Ó¤H°Ûºqªºkey¤£¦P¡]³q±`¤k¥Íªºkey¤ñ¸û°ª¡A¨k¥Íªºkey¤ñ¸û§C¡^¡A¦]¦¹¦b¶i¦æ¤ñ¹ï¤§«e¡A­n¥ý¶i¦æ®Õ¥¿¡C¤@¯ë¦Ó¨¥¡A®Õ¥¿ªº¥Øªº¬O­n¹F¨ì¨â­Ó¦V¶q¤§¶¡¶ZÂ÷ªº³Ì¤p­È¡A¦]¦¹¹ï©ó¤£¦Pªº¶ZÂ÷­pºâ¤è¦¡¡A§Ú­Ì´N¦³¤£¦Pªº®Õ¥¿ªk«h¡G
    • ­Y¨Ï¥Î L1 norm¡A§Ú­Ì¥i¥H±Ä¥Î¡u¤¤¦ì¼Æ®Õ¥¿¡v¡A¥ç§Y±N¿é¤J¦V¶qªº¤¤¦ì¼Æ®Õ¥¿¦¨¹ïÀ³¸ê®Æ®w¦V¶qªº¤¤¦ì¼Æ¡C
    • ­Y¨Ï¥Î L2 norm¡A§Ú­Ì¥i¥H±Ä¥Î¡u¥­§¡­È®Õ¥¿¡v¡A¥ç§Y±N¿é¤J¦V¶qªº¥­§¡­È®Õ¥¿¦¨¹ïÀ³¸ê®Æ®w¦V¶qªº¥­§¡­È¡C
  5. ¹ï©ó¥ð¤î²Åªº³B²z¡G¬°¤F«O«ù­µ²Åªº¯S©Ê¡A§Ú­Ì³q±`·|±N¥ð¤î²Å¡]¥]§t¨Ï¥ÎªÌªº¿é¤J©M¸ê®Æ®wªººq¦±¡^¥N´«¦¨«e¤@­Ó­µ¡C

Characteristics of LS for melody recognition can be summarized as follows:

½u©Ê¦ùÁY¥Î©ó±Û«ß¿ëÃѪº¯S©Ê¡A¥i¥H»¡©ú¦p¤U¡G

The following example is a typical result of using LS to find the best scaling ratio of the input query:

¥H¤U¬O¨Ï¥Î¯u¹ê­µ°ª¦V¶qªº½d¨Ò¡G

Example 1: linScaling01.minputPitch=[48.044247 48.917323 49.836778 50.154445 50.478049 50.807818 51.143991 51.486821 51.486821 51.486821 51.143991 50.154445 50.154445 50.154445 49.218415 51.143991 51.143991 50.807818 49.524836 49.524836 49.524836 49.524836 51.143991 51.143991 51.143991 51.486821 51.836577 50.807818 51.143991 52.558029 51.486821 51.486821 51.486821 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 49.218415 50.807818 50.807818 50.154445 50.478049 48.044247 49.524836 52.193545 51.486821 51.486821 51.143991 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 55.788268 55.349958 54.922471 54.922471 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 53.699915 58.163541 59.213095 59.762739 59.762739 59.762739 59.762739 58.163541 57.661699 58.163541 58.680365 58.680365 58.680365 58.163541 55.788268 54.505286 55.349958 55.788268 55.788268 55.788268 54.922471 54.505286 56.237965 55.349958 55.349958 55.349958 55.349958 54.505286 54.505286 55.349958 48.917323 50.478049 50.807818 51.143991 51.143991 51.143991 50.807818 50.807818 50.478049 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 51.486821 52.558029 52.558029 52.558029 52.558029 52.193545 51.836577 52.193545 53.310858 53.310858 53.310858 52.930351 52.930351 53.310858 52.930351 52.558029 52.193545 52.930351 53.310858 52.930351 51.836577 52.558029 53.699915 52.930351 52.930351 52.558029 52.930351 52.930351 52.558029 52.558029 52.558029 53.310858 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 52.558029 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 53.310858 53.310858 53.310858 52.193545 52.193545 52.193545 54.097918 52.930351 52.930351 52.930351 52.930351 52.930351 51.143991 51.143991 51.143991 48.917323 49.524836 49.524836 49.836778 49.524836 48.917323 49.524836 49.218415 48.330408 48.330408 48.330408 48.330408 48.330408 49.524836 49.836778 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 53.310858 52.930351 52.930351 52.558029 52.558029 51.143991 52.930351 49.218415 49.836778 50.154445 49.836778 49.524836 48.621378 48.621378 48.621378 49.836778 49.836778 49.836778 49.836778 46.680365 46.680365 46.680365 46.163541 45.661699 45.661699 45.910801 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 50.807818 51.486821 51.486821 51.143991]; dbPitch =[60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 55 55 55 55 55 55 55 55 55 55 55 55 55 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 67 67 67 67 67 67 67 67 67 67 67 67 65 65 65 65 65 65 65 65 65 65 65 65 65 64 64 64 64 64 64 64 64 64 64 64 64 62 62 62 62 62 62 62 62 62 62 62 62 62 60 60 60 60 60 60 60 60 60 60 60 60 60]; resolution=21; sfBounds=[0.5, 1.5]; % Scaling-factor bounds distanceType=1; % L1-norm [minDist1, scaledPitch, allDist]=linScaling(inputPitch, dbPitch, sfBounds(1), sfBounds(2), resolution, distanceType); axisLimit=[0 370 45 70]; subplot(3,1,1); plot(1:length(dbPitch), dbPitch, '.-', 1:length(inputPitch), inputPitch, '.-'); title('Database and input pitch vectors'); ylabel('Semitones'); legend('Database pitch', 'Input pitch', 'location', 'SouthEast'); axis(axisLimit); subplot(3,1,2); plot(1:length(dbPitch), dbPitch, '.-', 1:length(scaledPitch), scaledPitch, '.-'); legend('Database pitch', 'Scaled pitch', 'location', 'SouthEast'); title('Database and scaled pitch vectors'); ylabel('Semitones'); axis(axisLimit); subplot(3,1,3); ratio=linspace(sfBounds(1), sfBounds(2), resolution); plot(ratio, allDist, '.-'); xlabel('Scaling factor'); ylabel('Distance'); title(sprintf('Normalized distance of L_%d norm', distanceType));

LS is rather robust to the use of different distance metrics, as shown in the following example where L1 and L2 norms are used to obtain the same result.

¤U¦C½d¨Ò±Ä¥Î¨âºØ¶ZÂ÷¨ç¼Æ L1-norm ©M L2-norm ¨Ó¶i¦æ LS¡A©Ò±o¨ìªºµ²ªG«Ü±µªñ¡G

Example 2: linScaling02.minputPitch=[48.044247 48.917323 49.836778 50.154445 50.478049 50.807818 51.143991 51.486821 51.486821 51.486821 51.143991 50.154445 50.154445 50.154445 49.218415 51.143991 51.143991 50.807818 49.524836 49.524836 49.524836 49.524836 51.143991 51.143991 51.143991 51.486821 51.836577 50.807818 51.143991 52.558029 51.486821 51.486821 51.486821 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 49.218415 50.807818 50.807818 50.154445 50.478049 48.044247 49.524836 52.193545 51.486821 51.486821 51.143991 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 55.788268 55.349958 54.922471 54.922471 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 53.699915 58.163541 59.213095 59.762739 59.762739 59.762739 59.762739 58.163541 57.661699 58.163541 58.680365 58.680365 58.680365 58.163541 55.788268 54.505286 55.349958 55.788268 55.788268 55.788268 54.922471 54.505286 56.237965 55.349958 55.349958 55.349958 55.349958 54.505286 54.505286 55.349958 48.917323 50.478049 50.807818 51.143991 51.143991 51.143991 50.807818 50.807818 50.478049 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 51.486821 52.558029 52.558029 52.558029 52.558029 52.193545 51.836577 52.193545 53.310858 53.310858 53.310858 52.930351 52.930351 53.310858 52.930351 52.558029 52.193545 52.930351 53.310858 52.930351 51.836577 52.558029 53.699915 52.930351 52.930351 52.558029 52.930351 52.930351 52.558029 52.558029 52.558029 53.310858 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 52.558029 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 53.310858 53.310858 53.310858 52.193545 52.193545 52.193545 54.097918 52.930351 52.930351 52.930351 52.930351 52.930351 51.143991 51.143991 51.143991 48.917323 49.524836 49.524836 49.836778 49.524836 48.917323 49.524836 49.218415 48.330408 48.330408 48.330408 48.330408 48.330408 49.524836 49.836778 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 53.310858 52.930351 52.930351 52.558029 52.558029 51.143991 52.930351 49.218415 49.836778 50.154445 49.836778 49.524836 48.621378 48.621378 48.621378 49.836778 49.836778 49.836778 49.836778 46.680365 46.680365 46.680365 46.163541 45.661699 45.661699 45.910801 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 50.807818 51.486821 51.486821 51.143991]; dbPitch =[60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 55 55 55 55 55 55 55 55 55 55 55 55 55 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 67 67 67 67 67 67 67 67 67 67 67 67 65 65 65 65 65 65 65 65 65 65 65 65 65 64 64 64 64 64 64 64 64 64 64 64 64 62 62 62 62 62 62 62 62 62 62 62 62 62 60 60 60 60 60 60 60 60 60 60 60 60 60]; resolution=21; sfBounds=[0.5, 1.5]; % Scaling-factor bounds distanceType=1; % L1-norm [minDist1, scaledPitch1, allDist1]=linScaling(inputPitch, dbPitch, sfBounds(1), sfBounds(2), resolution, distanceType); distanceType=2; % L2-norm [minDist1, scaledPitch2, allDist2]=linScaling(inputPitch, dbPitch, sfBounds(1), sfBounds(2), resolution, distanceType); allDist2=sqrt(allDist2); % To reduce computation, the L2-distance returned by linScalingMex is actually the square distance, so we need to take the square root. axisLimit=[0 370 45 70]; subplot(3,1,1); plot(1:length(dbPitch), dbPitch, '.-', 1:length(inputPitch), inputPitch, '.-'); title('Database and input pitch vectors'); ylabel('Semitones'); legend('Database pitch', 'Input pitch', 'location', 'SouthEast'); axis(axisLimit); subplot(3,1,2); plot(1:length(dbPitch), dbPitch, '.-', 1:length(scaledPitch1), scaledPitch1, '.-', 1:length(scaledPitch2), scaledPitch2, '.-'); legend('Database pitch', 'Scaled pitch via L_1 norm', 'Scaled pitch via L_2 norm', 'location', 'SouthEast'); title('Database and scaled pitch vectors'); ylabel('Semitones'); axis(axisLimit); subplot(3,1,3); ratio=linspace(sfBounds(1), sfBounds(2), resolution); plot(ratio, allDist1, '.-', ratio, allDist2, '.-'); xlabel('Scaling factor'); ylabel('Distances'); title('Normalized distances via L_1 & L_2 norm'); legend('L_1 norm', 'L_2 norm', 'location', 'SouthEast');

Hint
Note that in order to save computation, linScalingMex uses the square of L2 norm instead of L2 norm literally. Therefore we need to do square root in order to get the correct distance based on L2 norm.

Hint
½Ðª`·N¡A¬°¤F¸`¬Ù­pºâ¡A¤W­z½d¨Ò¤¤ linScalingMex ªº¶ZÂ÷­pºâ¡A¬O¨Ï¥Î L2 norm ªº¥­¤è¡A¦Ó¤£¬O¨Ï¥Î L2 norm¡C©Ò¥H³Ì«á¦bµe¹Ï«e¡A§Ú­ÌÁÙ­n¶i¦æ¶}¥­¤è¡A¥H«K±o¨ì¥¿½Tªº L2 norm¡C


Audio Signal Processing and Recognition (­µ°T³B²z»P¿ëÃÑ)