ABSTRACT
This paper evaluates, from a speech signal processing perspective, a non-linear pitch detection method based on detecting the zero-crossings of the signal (ZC method) under adverse interference conditions. First, F0 identification is evaluated as a function of the relative energy level of the components in mixtures of pure tones or pairs of vowels; then, within the double-vowel paradigm, we introduce a confidence measure based on the standard deviation of inter-zero intervals. Finally, the robustness of this confidence measure is tested in two cases of interference: pure tones + noise, and vowels + noise. We show that the method makes it possible to detect periodicity without any knowledge of the nature of the interfering sources, and then to identify their fundamental frequency.
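As a rough illustration of the idea, the sketch below estimates F0 from the intervals between zero-crossings and uses the spread of those intervals as a confidence indicator. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the use of upward crossings only, the linear interpolation, the normalisation of the standard deviation by the mean interval, and the names `upward_zero_crossings` and `zc_pitch` are all hypothetical choices made for the example.

```python
import math
import statistics

def upward_zero_crossings(signal, sample_rate):
    """Return interpolated times (in seconds) of negative-to-positive crossings."""
    times = []
    for n in range(1, len(signal)):
        prev, cur = signal[n - 1], signal[n]
        if prev < 0.0 <= cur:
            # Linear interpolation between the two samples around the crossing.
            frac = -prev / (cur - prev)
            times.append((n - 1 + frac) / sample_rate)
    return times

def zc_pitch(signal, sample_rate):
    """Return (f0_estimate, relative_spread), or (None, None) if too few crossings.

    relative_spread is the standard deviation of the inter-crossing intervals
    divided by their mean (an assumed normalisation); small values suggest a
    strongly periodic signal.
    """
    times = upward_zero_crossings(signal, sample_rate)
    intervals = [b - a for a, b in zip(times, times[1:])]
    if len(intervals) < 2:
        return None, None
    mean_interval = statistics.mean(intervals)
    spread = statistics.stdev(intervals)
    return 1.0 / mean_interval, spread / mean_interval

# Usage: a clean 200 Hz pure tone sampled at 16 kHz.
sr = 16000
tone = [math.sin(2 * math.pi * 200 * n / sr + 0.3) for n in range(1600)]
f0, rel_spread = zc_pitch(tone, sr)
```

For a clean tone the intervals are nearly identical, so the relative spread is close to zero; interference would be expected to increase it.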
ABSTRACT
This paper presents the use of a simulated annealing technique for estimating the parameters of a Hidden Markov Model (HMM) in a speech recognition system. This technique makes it possible to escape the local optima that characterize the classical Expectation Maximization (EM) algorithm, and thus to achieve a better estimate with a limited amount of training data. We choose here the Simulated Annealing Expectation Maximization (SAEM) algorithm, which introduces a simulated annealing technique into the EM method. The SAEM algorithm is compared to the classical EM algorithm for both task-independent and task-dependent Viterbi training. The evaluation shows a significant improvement in recognition performance.
ABSTRACT
Sub- and supraglottal pressures, Psub and Psup, have been recorded during glissando phonation of sustained vowels, isolated vowels, and continuous speech including a one-minute-long reading of a novel. Studies of the covariation of Psub, F0, voice excitation amplitude Ee and overall sound pressure level reveal systematic relations, some of which can be expressed in closed form by regression equations. Systematic differences in Psub with respect to position in a breathgroup, vowel and consonant category and the degree of stress have been observed. The domain of subglottal increase with stress is of the order of one or a few words rather than a single syllable. The global contour of Psub within a breathgroup and the local fine structures of Psub and transglottal pressure, Ptr=Psub-Psup, associated with specific articulatory events are described and discussed.
ABSTRACT
Investigation of the fractal behaviour of unvoiced plosive consonants leads to interesting observations relevant to their classification. Experimental evidence of the fractal nature of the speech signals themselves, as well as of their derivatives and cumulative sums, prompts the use of the associated fractal dimensions to form a discriminative feature set. The obtained feature set is compact in representation and easy to compute. At the same time, the discriminating capability of this feature set is seen to be promising even for speech signals sampled at 8 kHz.
ABSTRACT
Speech rate is one of the important prosodic parameters essential for the naturalness of an utterance, yet comparatively little is known about the fine structure of speech rate variation in natural utterances. On the basis of the authors' definition of the relative local speech rate, the present paper describes an analysis of the changes in the local rate of speech units, each produced in isolation, when they are embedded in connected speech. The results, together with those already obtained by the authors, will lead to a complete scheme for speech rate control in speech synthesis by concatenation.
ABSTRACT
A quantitative model for the process of F0 contour generation, originally developed for Japanese by Fujisaki and his co-workers, has already been shown to be valid for several other languages. The present study aims at testing its applicability to F0 contours of Greek utterances. Analysis of F0 contours of 200 utterances by two native speakers of Greek, produced by reading texts of narrations and conversations, has shown that the model is essentially valid, and suggests the model's usefulness for Text-to-Speech synthesis of Greek.
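For readers unfamiliar with the model, the sketch below generates an F0 contour from the commonly cited form of the Fujisaki command-response model, in which log F0 is a baseline plus the responses to phrase commands (impulses) and accent commands (pedestals). The abstract does not state the equations, so the formulation, the function names, and all numerical values here (alpha, beta, gamma, baseline, command timings and amplitudes) are illustrative assumptions rather than values fitted to Greek.

```python
import math

def phrase_response(t, alpha):
    """Impulse response of the phrase control mechanism: alpha^2 * t * exp(-alpha*t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0

def accent_response(t, beta, gamma):
    """Step response of the accent control mechanism, with ceiling gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 (Hz) at time t (s).

    fb      -- baseline frequency in Hz
    phrases -- list of (onset_time, amplitude) phrase commands
    accents -- list of (start_time, end_time, amplitude) accent commands
    """
    log_f0 = math.log(fb)
    for onset, amp in phrases:
        log_f0 += amp * phrase_response(t - onset, alpha)
    for start, end, amp in accents:
        log_f0 += amp * (accent_response(t - start, beta, gamma)
                         - accent_response(t - end, beta, gamma))
    return math.exp(log_f0)

# Usage: one phrase command at 0 s and one accent command from 0.3 s to 0.6 s,
# sampled every 10 ms over 2 s (hypothetical values).
contour = [fujisaki_f0(0.01 * i, 100.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
           for i in range(200)]
```

The contour rises above the baseline after the commands and decays back towards it once they have passed.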
ABSTRACT
In this paper we report the characteristics of slow, average and fast speech. The study has been done using the TRESVEL Spanish database. It is composed of 3200 sentences uttered at three different speech rates and contains speech material from 20 male and 20 female speakers. This database has been designed to study, evaluate and compensate for the effect of speech rate in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. We report a new measure for the rate of speech (ROS). The ROS is normalised using an appropriate set of constants that depend on the expected duration of each phone. Finally, we report the degradation in performance of a continuous speech recognition system when the speech rate is low or high, and the evaluation of two compensation techniques. Adaptation of the language weight, insertion penalties and HMM state-transition probabilities for slow speech provides a 21.5% reduction of the word error rate (WER).
ABSTRACT
Magnitude and variability of duration, pitch and formant frequencies are computed for speech collected from children aged five to eighteen. The study confirmed that reductions in magnitude and variability are the primary indicators of speech development. Specifically, children below age ten exhibit a wider dynamic range of vowel duration, longer suprasegmental duration, and larger temporal and spectral variations. These trends diminish around age twelve. The acoustic characteristics of children's speech reach adult levels in both magnitude and variability around age fifteen. Change of formant frequencies in male speakers parallels the growth of the vocal tract, while for female speakers the presence of such a linear trend is not clear. We conclude that the primary factors governing the acoustic patterns during speech development are anatomical maturation of the speech apparatus and speech motor control in terms of agility and precision.
ABSTRACT
Accurate measurement of formant frequencies is important in many studies of speech perception and production. Errors in formant frequency estimation by eye, using a spectrogram, or automatically, using linear prediction, have been reported to be as high as 60 Hz at F0 < 300 Hz. This exceeds the typical auditory difference limens (DLs) for formant frequencies and is also greater than some of the variation that one would like to study, e.g., the acoustic effects of varying vocal effort. The problem becomes substantially worse when F0 is as high as 500 to 600 Hz, which is not uncommon in the speech of women and children at high vocal efforts. In comparison with ordinary linear predictive analysis, the method described here drastically reduces measurement errors, given that the formant frequency is not below or only slightly above F0 (which rarely happens in speech). It thus becomes possible to study formant frequency variation in speech material that hitherto could not be analysed meaningfully since the effects of interest were no larger than the probable errors in measurement.
ABSTRACT
This paper analyses speaking rate variations in English and Danish and relates them to problems encountered in speech recognition. Intra-speaker variability in speech rate is explained with reference to time equalisation of stress groups and utterances. Further, it is shown that certain natural classes of phonemes are more affected by speaking rate variations than others. Keywords: Phoneme modeling, rate of speech, phone duration, time equalisation of stress groups and phrases.
ABSTRACT
The identification of phoneme boundaries in continuous speech is an important problem in the areas of speech recognition and synthesis. The use of robust parameters to allow a trained data set obtained from one language to be used for boundary identification in another language is being investigated. In particular, the use of mixed time-frequency rate parameters, and training on the change of the rate parameters at acoustic boundaries, are reported.
ABSTRACT
For some tasks (e.g., forensic applications) it is vitally important to know the true pitch. The problem is then how to verify the correctness of a given pitch contour over long speech recordings, whatever pitch detection method is used. It would also be useful to see the degree of signal periodicity without making a hard voice/noise decision. A new homomorphic method for detecting the degree of signal periodicity, with an analysis frame length proportional to the time lag, is described. A powerful practical approach to the visual analysis of speech signal periodicity is proposed. This Foicograms method of representing speech periodicity ensures the practical correctness of pitch estimation and makes it possible to determine the degree of periodicity of poor-quality signals.
ABSTRACT
A number of studies have shown that a pair of perceptual effective formants can be defined to capture most of the phonetic information present in vowels. Various methods of computing the effective formant values were proposed. However, many of them depend on the accuracy of conventional formant estimation. In this work, we study methods of automatically estimating perceptual effective formants without estimating the actual formant values and compare the results with the perceptually measured effective formant values. The preliminary results show that the method is effective in estimating the perceptual effective formants. Classification experiments using perceptual effective formants as explicit features do not demonstrate any advantages. However, using the perceptual effective second formant value as input to our formant estimation algorithm can help to correct up to 44% of the formant tracking errors.
ABSTRACT
In this paper a method to decompose a conventional feature space (LPC-cepstrum) into sub-spaces which carry information about the linguistic and speaker variability is presented. Principal component analysis is used to study the correlation between these sub-spaces. Oriented principal component analysis (OPCA) is then used to estimate a sub-space which is relatively speaker-independent. A method to estimate the dimensionality of the speaker-independent sub-space is also presented. Original features can now be projected into the speaker-independent sub-space to make them less sensitive to speaker variations. Finally, the effectiveness of the proposed method in suppressing the speaker dependence is studied by experiments conducted on two different databases.
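The general OPCA idea can be sketched as a generalized eigenvalue problem: find directions that maximise the variance of a wanted component relative to the variance of an unwanted one. The example below is a minimal sketch assuming NumPy is available and that covariance matrices for the linguistic ("signal") and speaker ("noise") variability have already been estimated; how the paper estimates them is not stated in the abstract, and the function name `opca` and the toy matrices are hypothetical.

```python
import numpy as np

def opca(signal_cov, noise_cov, n_components):
    """Return (ratios, directions) maximising w^T S w / w^T N w.

    Solves S w = lambda N w by whitening with the Cholesky factor of N,
    which must be symmetric positive definite.
    """
    chol = np.linalg.cholesky(noise_cov)          # N = L L^T
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ signal_cov @ chol_inv.T  # L^-1 S L^-T
    eigvals, eigvecs = np.linalg.eigh(whitened)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    directions = chol_inv.T @ eigvecs[:, order]    # map back: w = L^-T y
    return eigvals[order], directions

# Toy example: axis 0 has the largest signal variance but also large noise
# variance, so the oriented component is expected to favour axis 1.
signal_cov = np.array([[4.0, 0.0], [0.0, 3.0]])
noise_cov = np.array([[4.0, 0.0], [0.0, 1.0]])
ratios, directions = opca(signal_cov, noise_cov, n_components=1)
```

Plain PCA on the signal covariance alone would pick axis 0 here; taking the unwanted variability into account changes the preferred direction.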
ABSTRACT
In this paper, an integrated approach to vector dynamic feature extraction is proposed in the design of a hidden Markov model (HMM) based speech recognizer. The integrated model we developed in this study generalizes the conventional, currently widely used dynamic-parameter technique, which has been confined strictly to the preprocessing domain only, in two significant ways. First, the new model contains state-dependent, vector-valued weighting functions responsible for transforming static speech features into the dynamic ones in a slowly time-varying manner. Second, a novel maximum-likelihood based training algorithm is developed for the model that allows joint optimization of the state-dependent, vector-valued weighting functions and the remaining conventional HMM parameters. The experimental results on alphabet classification demonstrate the effectiveness of the new model relative to standard HMM using dynamic features that have not been subject to optimization during training.
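For context, the conventional dynamic-parameter technique that this abstract generalizes is usually a fixed-weight regression over neighbouring frames, computed once in preprocessing. The sketch below shows that baseline only, not the state-dependent model proposed in the paper. The regression formula is the widely used delta-coefficient form, and the window size, the edge handling, and the name `delta_features` are assumptions made for illustration.

```python
def delta_features(frames, window=2):
    """Compute delta (dynamic) features with fixed regression weights.

    frames -- list of static feature vectors (lists of floats), one per frame
    window -- number of frames on each side used in the regression

    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), for k = 1..window
    """
    norm = 2.0 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        acc = [0.0] * len(frames[t])
        for k in range(1, window + 1):
            ahead = frames[min(t + k, last)]   # replicate the last frame at the edge
            behind = frames[max(t - k, 0)]     # replicate the first frame at the edge
            for d in range(len(acc)):
                acc[d] += k * (ahead[d] - behind[d])
        deltas.append([a / norm for a in acc])
    return deltas

# Usage: features that grow linearly in time have constant interior deltas.
ramp = [[float(t), 2.0 * t] for t in range(10)]
deltas = delta_features(ramp)
```

In this baseline the weights `k / norm` are the same for every frame and every HMM state, which is the restriction the integrated model removes.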
ABSTRACT
An algorithm is presented which allows non-parametric representations of speech to be automatically segmented into units comparable in duration and character to manually defined phonemes. The consistency of this segmentation across speakers, and across telephone channels, is investigated and the implications of adopting such forms of data for automatic speech recognition are discussed.
ABSTRACT
In this paper a new robust non-recursive algorithm for estimating the parameters of an AR model of the speech signal is proposed. The proposed algorithm takes into account the quasi-periodic excitation of voiced speech and assumes that the excitation signal follows a t-distribution with a small number of degrees of freedom a. The method is based on covariance linear prediction with a sliding window. Experiments on both synthesized and natural speech have shown that the proposed robust algorithm gives estimates with smaller variance and bias than the conventional non-robust algorithm. The choice a=3 yields the most efficient estimation.
ABSTRACT
This paper deals with the improvement of autosegmentation algorithms by establishing and implementing a simple energy model. This model consists of rules which describe the variation of the phoneme energy at phoneme boundaries due to the phoneme context. The efficient estimation of phoneme boundaries improves the accuracy of phoneme-based, large-vocabulary speech recognition systems, as shown by experiments on the Greek language.
ABSTRACT
Macroscopic analysis of a corpus of emotional Standard Southern British speech signals has been performed to measure any changes in average fundamental frequency, speech rate, energy and first formant frequency. Seven acted emotional states were recorded and analysed for one male and one female speaker. Differences between neutral and emotional speech were found which agree with changes others have mentioned in the literature. Only the emotion sadness was found to be consistently and obviously different from neutral, while high activity emotions (such as elation and hot anger) could be distinguished from sadness. Additional measures are being developed which will further discriminate the emotions from one another. Results obtained to date are being evaluated for use in a speech synthesiser system.
ABSTRACT
In this paper a model-based approach for restoring a continuous fundamental frequency (F0) contour from the noisy output of an F0 extractor is investigated. In contrast to conventional pitch trackers based on numerical curve-fitting, the proposed method estimates the continuous F0 pattern with a quantitative pitch generation model of the kind often used for synthesizing F0 contours from prosodic event commands. An inverse filtering technique is introduced for obtaining the initial candidates for the prosodic commands. In order to find the optimal command sequence among these candidates efficiently, a beam-search algorithm and an N-best technique are employed. Preliminary experiments on a male speaker from the ATR B-set database showed promising results both in the quality of the restored pattern and in the estimation of the prosodic events.
ABSTRACT
For many years, the K-Nearest Neighbours method (K-NN) has been known as one of the best probability density function (pdf) estimators. A fast K-NN algorithm has been developed and tested on the TIMIT database, with a gain in computational time of 99.8%. The K-NN decision principle has been assessed on frame-by-frame phonetic identification. A method for integrating the K-NN pdf estimator into an HMM-based system is proposed and tested on an acoustic-phonetic decoding task. Finally, preliminary experiments are performed on HMM topology inference.
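The basic K-NN density estimate behind this approach can be sketched as p(x) ≈ K / (N · V), where V is the volume of the smallest region around x containing its K nearest training samples. The example below is a minimal one-dimensional sketch of that textbook estimator, using an exhaustive search; it is not the fast algorithm described in the abstract, and the function name `knn_density` and the test data are hypothetical.

```python
def knn_density(x, samples, k):
    """Estimate the pdf at x from one-dimensional samples with the K-NN rule.

    In one dimension the "volume" containing the k nearest samples is the
    interval of length 2 * r, where r is the distance to the k-th neighbour.
    """
    distances = sorted(abs(x - s) for s in samples)
    radius = distances[k - 1]
    if radius == 0.0:
        # Degenerate case: k samples coincide with x.
        return float("inf")
    return k / (len(samples) * 2.0 * radius)

# Usage: samples spread evenly over [0, 1] approximate a uniform density of 1.
samples = [i / 1000 for i in range(1001)]
estimate = knn_density(0.5, samples, k=101)
```

Larger values of K smooth the estimate at the cost of resolution; the exhaustive sort used here is the cost that a fast K-NN search is meant to avoid.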
ABSTRACT
A spectral approach is proposed for voice source parameter representation and estimation. Parameter estimation is based on decomposition of the periodic and the aperiodic components of the speech signal, and on spectral modelling of the periodic component. The paper focusses on parameter estimation for the periodic component of the glottal flow. A new anticausal all-pole model of the glottal flow is derived. Glottal flow is seen as an anticausal 2-pole filter followed by a spectral tilt filter. The anticausal filter has complex poles, instead of the real poles that are usually assumed. Time-domain and frequency-domain parameters are linked by analytic formulas. Two spectral-domain algorithms are proposed for estimating the open quotient. The first one is based on measurement of the first harmonics, and the second one is based on spectral modelling. Experimental results demonstrate the accuracy of the estimation procedures.