14-2 Linear Scaling

[english][all]

(請注意:中文版本並未隨英文版本同步更新!)

在不切音符的情況下,最簡單的辨識方法,就是線性伸縮(Linear Scaling),流程如下:

  1. 使用內差法,將使用者輸入的音高向量進行線性拉長或壓縮,例如伸縮比例可以是從 0.5 到 2.0,跳距是 0.1,共生出 16 個版本。
  2. 將這16個版本和資料庫中的每一首歌曲進行比對,得到16個距離,其中的最小值,即是輸入向量和此首歌的距離。
  3. 對所有資料庫歌曲進行比對,最短距離者,即是使用者所唱的歌。

下面是一個線性伸縮的示意圖,共伸縮五次,當伸縮比例是 1.5 時,可以達到最佳的比對效果:

在實做上,還有下列細節要考慮:

  1. 內差法的選用:可以使用簡單的線性內差。
  2. 距離的選擇:通常我們使用 L1 norm,也就是計算每個對應元素絕對差值的和,或是使用 L2 norm,又稱為歐基理德距離,也就是計算每個對應元素差值的平方和,再開平方,但在實做上,我們通常只在比較距離的大小,因此常常省略開平方的動作,以節省計算。

    Hint
    Lp norm of vectors x and y = [S|xi-yi|p]1/p

  3. 距離的正規化:我們會將總距離除以點數,得到正規化的距離,以消除因伸縮造成點數不同所帶來的影響。
  4. 音高的校正:每一個人唱歌的key不同(通常女生的key比較高,男生的key比較低),因此在進行比對之前,要先進行校正。一般而言,校正的目的是要達到兩個向量之間距離的最小值,因此對於不同的距離計算方式,我們就有不同的校正法則:
    • 若使用 L1 norm,我們可以採用「中位數校正」,亦即將輸入向量的中位數校正成對應資料庫向量的中位數。
    • 若使用 L2 norm,我們可以採用「平均值校正」,亦即將輸入向量的平均值校正成對應資料庫向量的平均值。
  5. 對於休止符的處理:為了保持音符的特性,我們通常會將休止符(包含使用者的輸入和資料庫的歌曲)代換成前一個音。

線性伸縮用於旋律辨識的特性,可以說明如下:

以下是使用真實音高向量的範例:

Example 1: linScaling01.minputPitch=[48.044247 48.917323 49.836778 50.154445 50.478049 50.807818 51.143991 51.486821 51.486821 51.486821 51.143991 50.154445 50.154445 50.154445 49.218415 51.143991 51.143991 50.807818 49.524836 49.524836 49.524836 49.524836 51.143991 51.143991 51.143991 51.486821 51.836577 50.807818 51.143991 52.558029 51.486821 51.486821 51.486821 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 49.218415 50.807818 50.807818 50.154445 50.478049 48.044247 49.524836 52.193545 51.486821 51.486821 51.143991 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 55.788268 55.349958 54.922471 54.922471 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 53.699915 58.163541 59.213095 59.762739 59.762739 59.762739 59.762739 58.163541 57.661699 58.163541 58.680365 58.680365 58.680365 58.163541 55.788268 54.505286 55.349958 55.788268 55.788268 55.788268 54.922471 54.505286 56.237965 55.349958 55.349958 55.349958 55.349958 54.505286 54.505286 55.349958 48.917323 50.478049 50.807818 51.143991 51.143991 51.143991 50.807818 50.807818 50.478049 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 51.486821 52.558029 52.558029 52.558029 52.558029 52.193545 51.836577 52.193545 53.310858 53.310858 53.310858 52.930351 52.930351 53.310858 52.930351 52.558029 52.193545 52.930351 53.310858 52.930351 51.836577 52.558029 53.699915 52.930351 52.930351 52.558029 52.930351 52.930351 52.558029 52.558029 52.558029 53.310858 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 52.558029 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 53.310858 53.310858 53.310858 52.193545 52.193545 52.193545 54.097918 52.930351 52.930351 52.930351 52.930351 52.930351 51.143991 51.143991 51.143991 48.917323 49.524836 49.524836 49.836778 49.524836 48.917323 49.524836 49.218415 48.330408 48.330408 48.330408 48.330408 48.330408 49.524836 49.836778 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 53.310858 52.930351 52.930351 52.558029 52.558029 51.143991 52.930351 49.218415 49.836778 50.154445 49.836778 49.524836 48.621378 48.621378 48.621378 49.836778 49.836778 49.836778 49.836778 46.680365 46.680365 46.680365 46.163541 45.661699 45.661699 45.910801 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 50.807818 51.486821 51.486821 51.143991]; dbPitch =[60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 55 55 55 55 55 55 55 55 55 55 55 55 55 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 67 67 67 67 67 67 67 67 67 67 67 67 65 65 65 65 65 65 65 65 65 65 65 65 65 64 64 64 64 64 64 64 64 64 64 64 64 62 62 62 62 62 62 62 62 62 62 62 62 62 60 60 60 60 60 60 60 60 60 60 60 60 60]; resolution=21; sfBounds=[0.5, 1.5]; % Scaling-factor bounds distanceType=1; % L1-norm [minDist1, scaledPitch, allDist]=linScaling(inputPitch, dbPitch, sfBounds(1), sfBounds(2), resolution, distanceType); axisLimit=[0 370 45 70]; subplot(3,1,1); plot(1:length(dbPitch), dbPitch, '.-', 1:length(inputPitch), inputPitch, '.-'); title('Database and input pitch vectors'); ylabel('Semitones'); legend('Database pitch', 'Input pitch', 'location', 'SouthEast'); axis(axisLimit); subplot(3,1,2); plot(1:length(dbPitch), dbPitch, '.-', 1:length(scaledPitch), scaledPitch, '.-'); legend('Database pitch', 'Scaled pitch', 'location', 'SouthEast'); title('Database and scaled pitch vectors'); ylabel('Semitones'); axis(axisLimit); subplot(3,1,3); ratio=linspace(sfBounds(1), sfBounds(2), resolution); plot(ratio, allDist, '.-'); xlabel('Scaling factor'); ylabel('Distance'); title(sprintf('Normalized distance of L_%d norm', distanceType));

下列範例採用兩種距離函數 L1-norm 和 L2-norm 來進行 LS,所得到的結果很接近:

Example 2: linScaling02.minputPitch=[48.044247 48.917323 49.836778 50.154445 50.478049 50.807818 51.143991 51.486821 51.486821 51.486821 51.143991 50.154445 50.154445 50.154445 49.218415 51.143991 51.143991 50.807818 49.524836 49.524836 49.524836 49.524836 51.143991 51.143991 51.143991 51.486821 51.836577 50.807818 51.143991 52.558029 51.486821 51.486821 51.486821 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 51.143991 49.218415 50.807818 50.807818 50.154445 50.478049 48.044247 49.524836 52.193545 51.486821 51.486821 51.143991 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 55.788268 55.349958 54.922471 54.922471 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 55.349958 53.699915 58.163541 59.213095 59.762739 59.762739 59.762739 59.762739 58.163541 57.661699 58.163541 58.680365 58.680365 58.680365 58.163541 55.788268 54.505286 55.349958 55.788268 55.788268 55.788268 54.922471 54.505286 56.237965 55.349958 55.349958 55.349958 55.349958 54.505286 54.505286 55.349958 48.917323 50.478049 50.807818 51.143991 51.143991 51.143991 50.807818 50.807818 50.478049 50.807818 51.486821 51.486821 51.486821 51.486821 51.486821 51.486821 52.558029 52.558029 52.558029 52.558029 52.193545 51.836577 52.193545 53.310858 53.310858 53.310858 52.930351 52.930351 53.310858 52.930351 52.558029 52.193545 52.930351 53.310858 52.930351 51.836577 52.558029 53.699915 52.930351 52.930351 52.558029 52.930351 52.930351 52.558029 52.558029 52.558029 53.310858 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 52.558029 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 52.930351 53.310858 53.310858 53.310858 52.193545 52.193545 52.193545 54.097918 52.930351 52.930351 52.930351 52.930351 52.930351 51.143991 51.143991 51.143991 48.917323 49.524836 49.524836 49.836778 49.524836 48.917323 49.524836 49.218415 48.330408 48.330408 48.330408 48.330408 48.330408 49.524836 49.836778 53.310858 53.310858 53.310858 52.930351 52.930351 52.930351 53.310858 52.930351 52.930351 52.558029 52.558029 51.143991 52.930351 49.218415 49.836778 50.154445 49.836778 49.524836 48.621378 48.621378 48.621378 49.836778 49.836778 49.836778 49.836778 46.680365 46.680365 46.680365 46.163541 45.661699 45.661699 45.910801 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 46.163541 50.807818 51.486821 51.486821 51.143991]; dbPitch =[60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 62 62 62 62 62 62 62 62 62 62 62 62 59 59 59 59 59 59 59 59 59 59 59 59 59 55 55 55 55 55 55 55 55 55 55 55 55 55 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 64 64 64 64 64 64 64 64 64 64 64 64 64 67 67 67 67 67 67 67 67 67 67 67 67 64 64 64 64 64 64 64 64 64 64 64 64 64 60 60 60 60 60 60 60 60 60 60 60 60 60 67 67 67 67 67 67 67 67 67 67 67 67 65 65 65 65 65 65 65 65 65 65 65 65 65 64 64 64 64 64 64 64 64 64 64 64 64 62 62 62 62 62 62 62 62 62 62 62 62 62 60 60 60 60 60 60 60 60 60 60 60 60 60]; resolution=21; sfBounds=[0.5, 1.5]; % Scaling-factor bounds distanceType=1; % L1-norm [minDist1, scaledPitch1, allDist1]=linScaling(inputPitch, dbPitch, sfBounds(1), sfBounds(2), resolution, distanceType); distanceType=2; % L2-norm [minDist1, scaledPitch2, allDist2]=linScaling(inputPitch, dbPitch, sfBounds(1), sfBounds(2), resolution, distanceType); allDist2=sqrt(allDist2); % To reduce computation, the L2-distance returned by linScalingMex is actually the square distance, so we need to take the square root. axisLimit=[0 370 45 70]; subplot(3,1,1); plot(1:length(dbPitch), dbPitch, '.-', 1:length(inputPitch), inputPitch, '.-'); title('Database and input pitch vectors'); ylabel('Semitones'); legend('Database pitch', 'Input pitch', 'location', 'SouthEast'); axis(axisLimit); subplot(3,1,2); plot(1:length(dbPitch), dbPitch, '.-', 1:length(scaledPitch1), scaledPitch1, '.-', 1:length(scaledPitch2), scaledPitch2, '.-'); legend('Database pitch', 'Scaled pitch via L_1 norm', 'Scaled pitch via L_2 norm', 'location', 'SouthEast'); title('Database and scaled pitch vectors'); ylabel('Semitones'); axis(axisLimit); subplot(3,1,3); ratio=linspace(sfBounds(1), sfBounds(2), resolution); plot(ratio, allDist1, '.-', ratio, allDist2, '.-'); xlabel('Scaling factor'); ylabel('Distances'); title('Normalized distances via L_1 & L_2 norm'); legend('L_1 norm', 'L_2 norm', 'location', 'SouthEast');

Hint
請注意,為了節省計算,上述範例中 linScalingMex 的距離計算,是使用 L2 norm 的平方,而不是使用 L2 norm。所以最後在畫圖前,我們還要進行開平方,以便得到正確的 L2 norm。


Audio Signal Processing and Recognition (音訊處理與辨識)