ABSTRACT
This paper presents a new scheme for developing a voice conversion system that modifies the utterance of a source speaker to sound like speech from a target speaker. We refer to the method as Speaker Transformation Algorithm using Segmental Codebooks (STASC). Two new methods are described to perform the transformation of vocal tract and glottal excitation characteristics across speakers. In addition, the source speaker's general prosodic characteristics are modified using time-scale and pitch-scale modification algorithms. Informal listening tests suggest that convincing voice conversion is achieved while maintaining high speech quality. The performance of the proposed system is also evaluated on a standard speaker identification system based on Gaussian mixture models, and the results show that the transformed speech is assigned a higher likelihood by the target speaker model than by the source speaker model.
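The objective evaluation described above compares the likelihood of the same converted speech under two speaker models. A minimal sketch of that comparison follows, assuming one-dimensional features and hand-picked mixture parameters; the function name, the feature values and both models are hypothetical and are not taken from the paper.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of 1-D features under a Gaussian mixture."""
    total = 0.0
    for x in frames:
        # Log of each weighted component density at x.
        comp = [
            math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Log-sum-exp over components for numerical stability.
        peak = max(comp)
        total += peak + math.log(sum(math.exp(c - peak) for c in comp))
    return total / len(frames)

# Hypothetical speaker models: (weights, means, variances).
source_model = ([0.5, 0.5], [-3.0, -1.0], [1.0, 1.0])
target_model = ([0.5, 0.5], [1.0, 3.0], [1.0, 1.0])

# Hypothetical features extracted from converted speech.
converted = [0.9, 1.2, 2.8, 3.1]

ll_target = gmm_log_likelihood(converted, *target_model)
ll_source = gmm_log_likelihood(converted, *source_model)
conversion_succeeded = ll_target > ll_source
```

A real system would use multidimensional cepstral features and models trained on each speaker's data; only the comparison of the two scores carries over.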
ABSTRACT
In this paper we propose a theoretical framework that extends the classical continuous-density HMM so that different spectral representations can be used depending on the state. We stress the need for a reference space and for spectral transformations between the spectral representation spaces of the model and the reference space. We show that this framework makes it possible to obtain more precise pdfs in the reference space. Preliminary speech recognition experiments with two spectral representations, MFCC and linear-frequency-scale cepstral coefficients, show no improvement; however, they indicate that the choice of spectral representations is crucial and that determining the transformations between spaces is a complex problem.
ABSTRACT
In this paper, the AM-FM modulation model is applied to speech analysis, synthesis and coding. A multiband demodulation pitch tracking algorithm is proposed that produces smooth and accurate fundamental frequency contours. The AM-FM modulation vocoder represents speech as the sum of resonance signals, each modeled by its amplitude envelope and instantaneous frequency signals. Efficient modeling and coding algorithms (at 4.8-9.6 kbits/sec) are proposed for the amplitude envelope and instantaneous frequency signals. Amplitude and frequency modulations of the speech resonances are shown to be perceptually important for natural speech synthesis.
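To illustrate what separating a resonance signal into an amplitude envelope and an instantaneous frequency can look like, here is a minimal sketch. The abstract does not say which demodulation algorithm is used, so the sketch assumes the Teager-Kaiser energy operator with the DESA-2 energy separation formulas, one common choice for AM-FM demodulation. The function names and the test signal are hypothetical.

```python
import math

def teager_energy(x):
    """Discrete Teager-Kaiser energy psi[x](n) = x(n)^2 - x(n-1) * x(n+1), interior samples only."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def desa2(x):
    """Estimate amplitude envelope and instantaneous frequency (rad/sample) with DESA-2.

    Returns two lists covering samples n = 2 .. len(x) - 3 of the input.
    """
    # Symmetric difference y(n) = x(n+1) - x(n-1), defined for n = 1 .. len(x) - 2.
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager_energy(x)[1:-1]  # trim so both energies refer to the same samples
    psi_y = teager_energy(y)
    amplitude, frequency = [], []
    for px, py in zip(psi_x, psi_y):
        if px <= 0.0 or py <= 0.0:
            # The operator can go non-positive on noisy input; report zeros there.
            amplitude.append(0.0)
            frequency.append(0.0)
            continue
        cos_2w = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        frequency.append(0.5 * math.acos(cos_2w))
        amplitude.append(2.0 * px / math.sqrt(py))
    return amplitude, frequency

# Hypothetical test signal: a constant-amplitude, constant-frequency cosine.
signal = [2.0 * math.cos(0.3 * n + 0.1) for n in range(50)]
envelope, inst_freq = desa2(signal)
```

For a pure cosine the estimates are exact up to rounding, so the envelope stays at 2.0 and the frequency at 0.3 rad/sample. For speech, each resonance would first have to be isolated by bandpass filtering.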
ABSTRACT
We present results of audio-visual-to-articulatory inversion for French fricatives embedded in VCVs. The inversion technique is evaluated using both experimental and synthetic data. The final synthesis is assessed by a perceptual categorisation test. Synthetic stimuli obtain scores similar to those of natural ones.
ABSTRACT
This paper proposes a method to transform acoustic models (HMM Gaussian mixtures) that have been trained on a certain group of speakers for use on speech from a different group of speakers. Cepstral features are transformed on the basis of assumptions regarding the difference in vocal tract length (VTL) between the groups of speakers (VTL normalisation, VTLN). Firstly, the VTL of these groups has been estimated based on the average third formant F3. Secondly, the linear acoustic theory of speech production has been applied to warp the spectral characteristics of the existing models so as to match the incoming speech. The mapping is composed of a sequence of non-linear submappings. By locally linearizing it, a linear approximation was obtained which is accurate as long as the warping is reasonably small. The method has been tested on the TI digits database, which contains adult and children's speech and consists of isolated digits and digit strings of different lengths. The word error rate when training on adults and testing on children with transformed adult models is decreased by more than a factor of 2 compared to the non-transformed case.
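The link between the average third formant and vocal tract length can be sketched as follows. This is an illustration only, assuming the simplest model from linear acoustic theory: a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / (4L). The function names, the speed-of-sound constant and the formant values are assumptions, not figures from the paper.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed approximate speed of sound in the vocal tract

def vtl_from_f3(f3_hz, c=SPEED_OF_SOUND_CM_S):
    """Length in cm of a uniform quarter-wave tube whose third resonance is f3_hz."""
    # F_3 = 5c / (4L), so L = 5c / (4 F_3).
    return 5.0 * c / (4.0 * f3_hz)

def warp_factor(f3_train_hz, f3_test_hz):
    """Linear frequency scale factor that maps training-group formants onto test-group formants."""
    return f3_test_hz / f3_train_hz

# Hypothetical group averages of F3, in Hz.
f3_adults = 2500.0
f3_children = 3200.0

vtl_adults = vtl_from_f3(f3_adults)
vtl_children = vtl_from_f3(f3_children)
alpha = warp_factor(f3_adults, f3_children)
```

Under this tube model the warp factor equals the ratio of the two vocal tract lengths, which is why a single formant average per group is enough to set it.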
ABSTRACT
This paper presents an effort to provide a more efficient speech signal representation, intended for incorporation into an automatic speech recognition system. Modified cepstral coefficients, derived from a multiresolution auditory spectrum, are proposed. The multiresolution spectrum was obtained using sliding single-point discrete Fourier transformations. It is shown that the obtained spectrum values are similar to the results of a nonuniform filtering operation. The presented cepstral features are evaluated by introducing them into a simple phone recognition system.
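A sliding single-point DFT updates one frequency bin recursively as each new sample arrives, instead of recomputing a full transform. The sketch below shows the standard recurrence for one bin; the function names and the test signal are hypothetical, and the paper's actual window lengths and bin layout are not given in the abstract.

```python
import cmath
import math

def sliding_dft_bin(samples, window_len, k):
    """Track DFT bin k over a sliding window of window_len samples.

    Uses X_k(n) = (X_k(n-1) - x(n-N) + x(n)) * exp(j*2*pi*k/N).
    Returns one complex value per input sample; the window starts out filled with zeros.
    """
    twiddle = cmath.exp(2j * math.pi * k / window_len)
    history = [0.0] * window_len  # circular buffer holding the last N samples
    value = 0j
    out = []
    for n, x in enumerate(samples):
        slot = n % window_len
        oldest = history[slot]  # sample that is leaving the window
        history[slot] = x
        value = (value - oldest + x) * twiddle
        out.append(value)
    return out

def direct_dft_bin(window, k):
    """Reference: bin k of the DFT of one window, computed directly."""
    n_points = len(window)
    return sum(x * cmath.exp(-2j * math.pi * k * m / n_points) for m, x in enumerate(window))

# Hypothetical test signal: a cosine that falls exactly on bin 3 of a 16-point window.
signal = [math.cos(2.0 * math.pi * 3 * n / 16) for n in range(48)]
track = sliding_dft_bin(signal, 16, 3)
```

Once the window is full, each value in `track` matches the direct DFT of the most recent 16 samples. Running several such bins with different window lengths, for example longer windows at low frequencies, is one way to obtain a multiresolution spectrum.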