Tutorial on Vibrato Detection (by Roger Jang)

This tutorial explains how to use HMM (Hidden Markov Models) for VD (vibrato detection) in human's singing voice. Here the human's singing voice is referred to pure vocals without any accompaniment. Such vibrato detection can be used in singing voice scoring, especially in karaoke machines.

Contents

Preprocessing

Before we start, let's add necessary toolboxes to the search path of MATLAB:

addpath d:/users/jang/matlab/toolbox/utility
addpath d:/users/jang/matlab/toolbox/sap
addpath d:/users/jang/matlab/toolbox/machineLearning

All the above toolboxes can be downloaded from the author's toolbox page. Make sure you are using the latest toolboxes to work with this script.

For compatibility, here we list the platform and MATLAB version that we used to run this script:

fprintf('Platform: %s\n', computer);
fprintf('MATLAB version: %s\n', version);
fprintf('Date & time: %s\n', char(datetime));
scriptStartTime=tic;	% Timing for the whole script
Platform: PCWIN64
MATLAB version: 9.4.0.813654 (R2018a)
Date & time: 30-Oct-2018 03:50:40

Most of the modifiable options for this vibrato detection task are set in vdOptSet.m:

type vdOptSet
function vdOpt=vdOptSet
% vdOptSet: Set options for VD (vibrato detection)
%
%	Usage:
%		vdOpt=vdOptSet;
%
%	Description:
%		vdOpt=vdOptSet returns the default options for vibrato detection.
%
%	Example:
%		vdOpt=vdOptSet

%	Category: Options for vibrato detection
%	Roger Jang, 20130114

%% === Function for feature extraction and plotting
vdOpt.feaExtractFcn=@vdFeaExtractFromFile;
%vdOpt.feaExtractFcn=@vibratoFeaExtract;
vdOpt.hmmPlotFcn=@vdHmmPlot;
%% === Folder for wave files
vdOpt.audioDir='D:\dataset\vibrato\TeresaTeng\waveAndPitch';
%% === Parameters for VD
vdOpt.frameSize=512;
vdOpt.overlap=0;
vdOpt.pfType=1;		% For pitch tracking, 0 for AMDF, 1 for ACF
vdOpt.sFrameSizeInSec=0.3;	% Super frame size
vdOpt.sOverlapInSec=0;		% Super overlap
vdOpt.featureName={'aPitch', 'bPitch', 'distPitch', 'aVol', 'bVol', 'distVol'};
%vdOpt.featureName={'aPitch', 'bPitch', 'distPitch'};
vdOpt.outputName={'nonvibrato', 'vibrato'};
vdOpt.classNum=length(vdOpt.outputName);
vdOpt.gaussianNum=3;	% No. of Gaussians for each class

If you want to run this script, you need to change vdOpt.audioDir such that it points to a folder of sound files containing singing voices with vibrato. The dataset used in the script can be downloaded from <http://mirlab.org/dataset/public>.

Dataset collection

First of all, we can collect all the sound files once they are downloaded. We can use the commmand "mmDataCollect" to collect all the file information:

vdOpt=vdOptSet;
opt=mmDataCollect('defaultOpt');
opt.extName='wav';
auData=mmDataCollect(vdOpt.audioDir, opt, 1);
Collecting 12 files with extension "wav" from "D:\dataset\vibrato\TeresaTeng\waveAndPitch"...

We can now read all the wave files (that have been labeled with vibrato segments), perform feature extraction, and store the result in a big structure varaible auSet:

myTic=tic;
auSet=auSetFeaExtract(auData, vdOpt, 1);
fprintf('Saving vdAuSet.mat...\n');
save vdAuSet auSet
fprintf('time=%g sec\n', toc(myTic));
1/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/一水隔天涯.wav, duration=31.6071 sec, time=7.24358 sec
2/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/何日君再來.wav, duration=68.7801 sec, time=13.5505 sec
3/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/你怎麼說.wav, duration=56.1401 sec, time=7.15232 sec
4/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/再見我的愛人.wav, duration=65.3401 sec, time=7.95378 sec
5/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/夜來香.wav, duration=61.0001 sec, time=9.34774 sec
6/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/小媳婦回娘家.wav, duration=61.6901 sec, time=11.5003 sec
7/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/帝女花.wav, duration=76.8601 sec, time=5.80488 sec
8/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/梅花.wav, duration=242.012 sec, time=23.9273 sec
9/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/獨上西樓.wav, duration=52.9102 sec, time=7.36321 sec
10/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/相似淚.wav, duration=91.3201 sec, time=17.3875 sec
11/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/郊道.wav, duration=122.16 sec, time=28.3073 sec
12/12: file=D:\dataset\vibrato\TeresaTeng\waveAndPitch/高山青.wav, duration=60.2945 sec, time=7.59473 sec
Deleting 0 wave file (due to no tOutput)...
Total time=147.961 sec
Saving vdAuSet.mat...
time=148.017 sec

We can also create DS for all kinds of data visualization tools:

feature=[auSet.feature];
output=[auSet.tOutput];
temp=[auSet.other]; invalidIndex=[temp.invalidIndex];
feature(:, invalidIndex)=[];
output(:, invalidIndex)=[];
DS.input=feature;
DS.output=output;
DS.inputName=vdOpt.featureName;
DS.outputName=vdOpt.outputName;
DS2=DS; DS2.input=inputNormalize(DS2.input);	% input normalization

Data analysis and visualization

Display data am ount:

[classSize, classLabel]=dsClassSize(DS, 1);
6 features
1353 instances
2 classes

Display feature distribution among different classes:

figure; dsBoxPlot(DS);
Index in position 1 exceeds array bounds (must not exceed 6).

Error in goTutorial (line 62)
figure; dsBoxPlot(DS);

Scatter plot of the original DS to 2D:

figure; dsProjPlot2(DS); figEnlarge;

Scatter plot of the input-normalized DS to 2D:

figure; dsProjPlot2(DS2); figEnlarge;

Scatter plot of the original DS to 3D:

figure; dsProjPlot3(DS); figEnlarge;

Scatter plot of the input-normalized DS to 3D:

figure; dsProjPlot3(DS2); figEnlarge;

Input selection based on KNNC, using the original dataset:

myTic=tic;
inputSelectExhaustive(DS); figEnlarge;
fprintf('time=%g sec\n', toc(myTic));

Input selection based on KNNC, using the input-normalizd dataset:

clf;
myTic=tic;
inputSelectExhaustive(DS2); figEnlarge;
fprintf('time=%g sec\n', toc(myTic));

LDA evaluation of approximate LOO

figure;
myTic=tic;
opt=ldaPerfViaKnncLoo('defaultOpt');
opt.mode='approximate';
recogRate1=ldaPerfViaKnncLoo(DS, opt);
recogRate2=ldaPerfViaKnncLoo(DS2, opt);
featureNum=size(DS.input, 1);
plot(1:featureNum, 100*recogRate1, 'o-', 1:featureNum, 100*recogRate2, '^-'); grid on
legend('Raw data', 'Normalized data', 'location', 'northOutside', 'orientation', 'horizontal');
xlabel('No. of projected features based on LDA');
ylabel('LOO recognition rates using KNNC (%)');
fprintf('time=%g sec\n', toc(myTic));

LDA evaluation of exact LOO

figure
myTic=tic;
opt=ldaPerfViaKnncLoo('defaultOpt');
opt.mode='exact';
recogRate1=ldaPerfViaKnncLoo(DS, opt);
recogRate2=ldaPerfViaKnncLoo(DS2, opt);
[featureNum, dataNum] = size(DS.input);
plot(1:featureNum, 100*recogRate1, 'o-', 1:featureNum, 100*recogRate2, '^-'); grid on
legend('Raw data', 'Normalized data', 'location', 'northOutside', 'orientation', 'horizontal');
xlabel('No. of projected features based on LDA');
ylabel('LOO recognition rates using KNNC (%)');
fprintf('time=%g sec\n', toc(myTic));

HMM training

Using the collected auSet, we can start HMM training for vibrato detection:

figure;
myTic=tic;
vdHmmModel=hmmTrain4audio(auSet, vdOpt, 1);
fprintf('time=%g sec\n', toc(myTic));

HMM test

After the training, we can test the HMM using a wave file:

figure;
myTic=tic;
auFile='D:\dataset\vibrato\female\combined-female.wav';
wObj=hmmEval4audio(auFile, vdOpt, vdHmmModel, 1);
fprintf('time=%g sec\n', toc(myTic));

Performance evaluation of HMM via LOO

To evaluate the performance objectively, we can test the LOO accuracy by using "leave-one-file-out":

myTic=tic;
showPlot=1;
[outsideRr, cvData]=hmmPerfLoo4audio(auSet, vdOpt, showPlot);
fprintf('time=%g sec\n', toc(myTic));

Our previous analysis indicates that input normalization can improve the accuracy. So here we shall try the normalized input for HMM training and test:

myTic=tic;
[~, mu, sigma]=inputNormalize(DS.input);
for i=1:length(auSet)
	auSet(i).feature=inputNormalize(auSet(i).feature, mu, sigma);
end
[outsideRr, cvData]=hmmPerfLoo4audio(auSet, vdOpt, 1);
fprintf('time=%g sec\n', toc(myTic));

Summary

This is a brief tutorial on using HMM for vibrato detection. There are several directions for further improvement:

Appendix

List of functions, scripts, and datasets used in this script:

Date and time when finishing this script:

fprintf('Date & time: %s\n', char(datetime));

Overall elapsed time:

toc(scriptStartTime)

Jyh-Shing Roger Jang, created on

datetime

If you are interested in the original MATLAB code for this page, you can type "grabcode(URL)" under MATLAB, where URL is the web address of this page.