Authors:
James Droppo, Microsoft Research (USA)
Alex Acero, Microsoft Research (USA)
Paper number 71
Abstract:
A maximum a posteriori (MAP) framework for computing pitch tracks as
well as voicing decisions is presented. The proposed algorithm creates
a time-pitch energy distribution, based on predictable energy, that
improves on the normalized cross-correlation. A large database
is used to evaluate the algorithm's performance against two standard
solutions, using glottal closure instants (GCI) obtained from electroglottogram
(EGG) signals as a reference. The new MAP algorithm exhibits higher
pitch accuracy and better voiced/unvoiced discrimination.
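As context for the comparison above, the following is a minimal Python
sketch (not the authors' code; all parameter values are illustrative) of
the normalized cross-correlation score that MAP-style pitch trackers
typically build on:

import numpy as np

def ncc_frame(frame, lag_min, lag_max):
    """Normalized cross-correlation of one frame over candidate pitch lags."""
    scores = np.zeros(lag_max - lag_min + 1)
    for i, lag in enumerate(range(lag_min, lag_max + 1)):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        scores[i] = np.dot(a, b) / denom
    return scores

fs = 16000
t = np.arange(0, 0.04, 1.0 / fs)
frame = np.sin(2 * np.pi * 120 * t)             # synthetic 120 Hz "voiced" frame
scores = ncc_frame(frame, fs // 400, fs // 60)  # search the 60-400 Hz range
f0 = fs / (np.argmax(scores) + fs // 400)
print("estimated F0: %.1f Hz" % f0)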
Authors:
Dekun Yang, Keele University (U.K.)
Georg F. Meyer, Keele University (U.K.)
William A. Ainsworth, Keele University (U.K.)
Paper number 511
Abstract:
This paper presents a method for segregating and recognizing concurrent
vowels based on the amplitude modulation spectrum. Vowel segregation
is accomplished by F0-guided grouping of harmonic components encoded
in the amplitude modulation spectrum while vowel recognition is achieved
by classifying the segregated vowel spectrum. The main features of the
method are (1) the reassignment technique is employed to obtain a
high-resolution amplitude modulation spectrum, and (2) Fisher's linear discriminant
analysis is used to improve the performance of vowel classification.
The method is tested on a double-vowel identification task and some
preliminary results are provided.
Authors:
Eloi Batlle, Universitat Politecnica de Catalunya (Spain)
Climent Nadeu, Universitat Politecnica de Catalunya (Spain)
José A.R. Fonollosa, Universitat Politecnica de Catalunya (Spain)
Paper number 473
Abstract:
In this paper we study various decorrelation methods for the features
used in speech recognition and we compare the performance of each one
by running several tests with a speech database. First we study
Principal Components Analysis (PCA). PCA extracts the dimensions
along which the data vary the most, and thus it allows us to reduce
the dimension of the data points without significant loss of performance.
The second transform we study is the Discrete Cosine Transform (DCT).
As will be shown, it is an approximation of PCA. By applying this
transform to filter-bank energy (FBE) parameters we obtain the MFCC coefficients.
A further step is taken with the Linear Discriminant Analysis (LDA),
which not only reduces the dimensionality of the problem but also
discriminates among classes to reduce the confusion error. The last
method we study is Frequency Filtering (FF). This method consists of
a linear filtering of the frequency sequence of the log FBE that both
decorrelates and equalizes the variance of the coefficients.
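To make the DCT step above concrete, here is a minimal Python sketch
(illustrative toy data, not the paper's experimental setup) of
decorrelating log filter-bank energies with a type-II DCT to obtain
cepstral coefficients:

import numpy as np
from scipy.fftpack import dct

def fbe_to_mfcc(log_fbe, num_ceps=13):
    # A DCT along the frequency axis approximately decorrelates the bands.
    return dct(log_fbe, type=2, axis=-1, norm='ortho')[..., :num_ceps]

log_fbe = np.log(np.random.rand(100, 24) + 1e-6)  # 100 frames, 24 bands (toy)
mfcc = fbe_to_mfcc(log_fbe)
print(mfcc.shape)  # (100, 13)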
Authors:
Marie-José Caraty, Laboratoire d'Informatique de Paris 6 (France)
Claude Montacié, Laboratoire d'Informatique de Paris 6 (France)
Paper number 1142
Abstract:
To deal with artifacts in the observation measurements produced by
usual speech processing, we propose to extend the representation of the
speech signal by taking a sequence of sets of observations instead of
a simple sequence of observations. A set of observations is computed
from temporal Multi-Resolution (MR) analysis. This method is designed
to adapt to any usual mode and technique of analysis. Its originality
is to take into account two main variations in the analysis: the center
of the frame and the duration of the frame. In speech processing,
multi-resolution analysis has many applications. MR analysis is a basic
representation for locating the stationary and non-stationary parts of
speech from the inertia computation, and for selecting the best
representative observation from a centroid or generalized centroid.
Preliminary experiments are presented. The first consists of MR analysis
of portions of French and American-English speech databases (BREF80 and
TIMIT, respectively), using inertia as a criterion for locating
stationary and non-stationary parts of the speech signal. The second
concerns the computation of phoneme prototypes for the two databases.
Finally, some perspectives are discussed.
Authors:
Steve Cassidy, SHLRC, Macquarie University (Australia)
Catherine Watson, SHLRC, Macquarie University (Australia)
Paper number 664
Abstract:
As part of a long-term project to develop speech recognition systems
for young computer users, specifically children aged between 6 and
11 years, this paper presents a preliminary investigation into the
classification of children's vowels. In earlier studies of adult speech
we found that dynamic or time-varying cues were useful in classifying
diphthongal vowels but provided no advantage for monophthongs if duration
is included as an additional cue. In this study we investigate whether
dynamic cues (modelled by Discrete Cosine Transform coefficients) are
present to a greater or lesser extent in children's vowels. Our hypothesis
is that some of the observed variability in children's vowels may be
due to systematic time-varying features. We found that the children's
monophthong data was better separated by a combination of DCT coefficients
and vowel duration than by the formant data sampled at the vowel midpoint
plus duration. This result contrasts with our finding on Australian
adult data, where modelling the formant trajectory was necessary only
for separating the diphthongs.
Authors:
Johan de Veth, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)
Louis Boves, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)
Paper number 359
Abstract:
Phase-corrected RASTA is a new technique for channel normalization
that consists of classical RASTA filtering followed by a phase correction
operation. In this manner, the channel bias is as effectively removed
as with classical RASTA, without introducing a left context dependency.
The performance of the phase-corrected RASTA channel normalization
technique was evaluated for a continuous speech recognition task.
Using context-independent hidden Markov models we found that phase-corrected
RASTA reduces the best-sentence word error rate (WER) by 23% compared
to classical RASTA. For context-dependent models, phase-corrected RASTA
reduces WER by 15%.
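For reference, a minimal Python sketch of the classical RASTA band-pass
filtering that the phase correction builds on (standard published
coefficients; the paper's phase correction operation itself is not
reproduced here, and the causal form below delays the output by several
frames, which is the kind of phase effect the paper addresses):

import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spec):
    # Classical RASTA: band-pass each log-energy trajectory along time.
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, log_spec, axis=0)

log_spec = np.log(np.random.rand(200, 20) + 1e-6)  # 200 frames x 20 bands (toy)
print(rasta_filter(log_spec).shape)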
Authors:
Satya Dharanipragada, IBM TJ Watson Research Center (USA)
Ramesh A. Gopinath, IBM TJ Watson Research Center (USA)
Bhaskar D. Rao, University of California, San Diego (USA)
Paper number 590
Abstract:
Fixed-rate feature extraction which is used in most current speech
recognizers is equivalent to sampling the feature trajectories at a
uniform rate. Often this sampling rate is well below the Nyquist rate
and thus leads to distortions in the sampled feature stream due to
aliasing. In this paper we explore various techniques, ranging from
simple cepstral and spectral smoothing to filtering and data-driven
dimensionality expansion using Linear Discriminant Analysis (LDA),
to counter aliasing and the variable rate nature of information in
speech signals. Smoothing in the spectral domain results in a reduction
in the variance of the short term spectral estimates which directly
translates to reduction in the variances of the Gaussians in the acoustic
models. With these techniques we obtain modest improvements, both in
word error rate and robustness to noise, on large vocabulary speech
recognition tasks.
Authors:
Limin Du, Institute of Acoustics, Chinese Academy of Sciences (China)
Kenneth Noble Stevens, Dept Electrical Engineering and Computer Science, Massachusetts Institute of Technology (USA)
Paper number 302
Abstract:
A knowledge-based approach to automatically detecting nasal landmarks
(/m/, /n/, and /ng/) from the speech waveform is developed. Two acoustic
characteristics are combined to construct the nasal landmark detector:
the Fn1 locus, calculated on each frame of the speech waveform as the
mass center of the spectrum amplitude in the vicinity of the lowest
spectral prominence between 150 and 1000 Hz, and the A23 locus,
calculated on the same frame as a band energy between 1000 and 3000 Hz.
The detector fires at the instants of closure and release of the nasal
murmur. Experimental observations on the acoustic characteristics of
Fn1 and A23, together with nasal consonant landmark detection results
on a VCV database, are also presented.
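A hedged Python sketch of the two per-frame measures named above; it
simplifies Fn1 to a mass centre over the whole 150-1000 Hz band rather
than around the lowest spectral prominence, and all settings are
illustrative rather than the authors':

import numpy as np

def frame_measures(frame, fs):
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    low = (freqs >= 150) & (freqs <= 1000)
    mid = (freqs >= 1000) & (freqs <= 3000)
    fn1 = np.sum(freqs[low] * spec[low]) / (np.sum(spec[low]) + 1e-12)  # Fn1-like
    a23 = 10 * np.log10(np.sum(spec[mid] ** 2) + 1e-12)                 # A23-like
    return fn1, a23

fs = 16000
frame = np.random.randn(512)  # stand-in for one 32 ms speech frame
print(frame_measures(frame, fs))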
Authors:
Thierry Dutoit, Faculte Polytechnique de Mons (Belgium)
Juergen Schroeter, AT&T Labs - Research (USA)
Paper number 520
Abstract:
Software engineering for research and development in the area of signal
processing is by no means unimportant. For speech processing, in particular,
it should be a priority: given the intrinsic complexity of text-to-speech
or recognition systems, there is little hope of doing state-of-the-art
research without solid and extensible code. This paper describes a
simple and efficient methodology for the design of maximally reusable
and extensible software components for speech and signal processing.
The resulting programming paradigm allows software components to be
advantageously combined with each other in a way that recalls the concept
of hardware plug-and-play, without the need for incorporating complex
schedulers to control data flows. It has been successfully used for
the design of a software library for high-level speech processing systems
at AT&T Labs, as well as for several other large-scale software
projects.
Authors:
Alexandre Girardi, NAIST - Nara Institute of Science and Technology (Japan)
Kiyohiro Shikano, NAIST - Nara Institute of Science and Technology (Japan)
Satoshi Nakamura, NAIST - Nara Institute of Science and Technology (Japan)
Paper number 687
Abstract:
In speaker-independent speech recognition, one problem we often face
is an insufficient training database. This problem is even more serious
for children's databases. Moreover, when adult data is used in place of
children's data, it is affected by differences in pitch and spectral
frequency stretch that degrade recognition. In this paper, as an
approach to this problem, we applied the STRAIGHT-TEMPO algorithm to
morph adult data towards children's data, both to construct more robust
HMM acoustic models and to study the effect of a combined change in the
pitch and spectral frequency stretch of the original utterances in the
database. Using the morphed database, we analyzed the improvement in
recognition rate that can be obtained compared with non-morphed data.
Authors:
Laure Charonnat, ENSSAT (France)
Michel Guitton, ENSSAT (France)
Joel Crestel, ENSSAT (France)
Gerome Allée, ENSSAT (France)
Paper number 1119
Abstract:
This paper describes a hyperbaric speech processing algorithm combining
a restoration of the formant positions with a correction of the pitch.
The pitch is corrected using a time-scale modification algorithm
combined with an oversampling module. This operation not only shifts
the fundamental frequency but also shifts the other frequencies of the
signal. This shift, as well as the formant shift due to the hyperbaric
environment, is corrected by the formant restoration module, which is
based on the linear speech production model.
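To illustrate why a combined time-scale/oversampling operation shifts
all frequencies alike, here is a small Python sketch (illustrative, not
the paper's implementation) of the resampling half; the time-scale
modification stage that restores the original duration is omitted:

import numpy as np
from scipy.signal import resample

def frequency_scale(x, factor):
    # Scale every frequency of x by `factor`; duration changes by 1/factor.
    return resample(x, int(round(len(x) / factor)))

fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
x = np.sin(2 * np.pi * 200 * t)   # 200 Hz tone
y = frequency_scale(x, 0.8)       # now a 160 Hz tone, but 25% longer
print(len(x), len(y))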
Authors:
Juana M. Gutiérrez-Arriola, Grupo de Tecnología del Habla- IEL- UPM (Spain)
Yung-Sheng Hsiao, Mind Machine Interaction Center. Electronic and Computer Engineer Department. University of Florida (USA)
Juan Manuel Montero, Grupo de Tecnología del Habla- IEL- UPM (Spain)
José Manuel Pardo, Grupo de Tecnología del Habla- IEL- UPM (Spain)
Donald G. Childers, Mind Machine Interaction Center. Electronic and Computer Engineer Department. UF (USA)
Paper number 468
Abstract:
This paper describes a voice conversion system based on parameter transformation.
Voice conversion is the process of making one person's voice (the
"source") sound like another person's (the "target"). We present a
voice conversion scheme consisting of three stages. First, an analysis
is performed on the natural speech to obtain the acoustical parameters:
the voiced and unvoiced regions, the glottal source model, pitch,
energy, formants, and bandwidths. Once these parameters
have been obtained for two different speakers they are transformed
using linear functions. Finally the transformed parameters are synthesized
by means of a formant synthesizer. Experiments will show that this
scheme is effective in transforming speaker individuality. It will
also be shown that the transformation from one speaker to another
cannot be a single function; it has to be divided into several
functions, each transforming a certain part of the speech signal.
Segmentation based on spectral stability divides the sentence into
parts, and a transformation function is applied to each segment.
Authors:
Jilei Tian, Nokia Research Center (Finland)
Ramalingam Hariharan, Nokia Research Center (Finland)
Kari Laurila, Nokia Research Center (Finland)
Paper number 325
Abstract:
Part of the problem in noise-robust speech recognition can be attributed
to poor acoustic modeling and the use of inappropriate features. It is
known that the human auditory system is superior to the best speech
recognizer currently available. Hence, in this paper, we propose a
new two-stream feature extractor that incorporates some of the key
functions of the peripheral auditory subsystem. To enhance noise robustness,
the input is divided into low-pass and high-pass channels to form so-called
static and dynamic streams. These two streams are independently processed
and recombined to produce a single stream, containing 13 feature vector
components, with improved linguistic information. Speaker-dependent
isolated-word recognition tests using the proposed front-end produced
average error rate reductions of 39% and 17% over all noisy environments,
compared to standard Mel Frequency Cepstral Coefficient (MFCC) front-ends
with 13 (statics only) and 26 (statics and deltas) feature vector
components, respectively.
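One plausible reading of the static/dynamic stream split, sketched in
Python with illustrative filters (the paper's actual front-end differs
in its details): low-pass and high-pass each feature trajectory along
time before the streams are processed and recombined.

import numpy as np
from scipy.signal import butter, filtfilt

def split_streams(traj, frame_rate=100.0, cutoff_hz=8.0, order=4):
    b_lo, a_lo = butter(order, cutoff_hz / (frame_rate / 2), btype='low')
    b_hi, a_hi = butter(order, cutoff_hz / (frame_rate / 2), btype='high')
    return filtfilt(b_lo, a_lo, traj, axis=0), filtfilt(b_hi, a_hi, traj, axis=0)

traj = np.random.randn(300, 13)   # 3 s of 13-dim features at 100 frames/s (toy)
static, dynamic = split_streams(traj)
print(static.shape, dynamic.shape)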
Authors:
Andrew K. Halberstadt, MIT Laboratory for Computer Science (USA)
James R. Glass, MIT Laboratory for Computer Science (USA)
Paper number 396
Abstract:
This paper addresses the problem of acoustic phonetic modeling. First,
heterogeneous acoustic measurements are chosen in order to maximize
the acoustic-phonetic information extracted from the speech signal
in preprocessing. Second, classifier systems are presented for successfully
utilizing high-dimensional acoustic measurement spaces. The techniques
used for achieving these two goals can be broadly categorized as hierarchical,
committee-based, or a hybrid of these two. This paper presents committee-based
and hybrid approaches. In context-independent classification and context-dependent
recognition on the TIMIT core test set using 39 classes, the system
achieved error rates of 18.3% and 24.4%, respectively. These error
rates are the lowest we have seen reported on these tasks. In addition,
experiments with a telephone-based weather information word recognition
task led to word error rate reductions of 10-16%.
Authors:
Naomi Harte, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Ben Milner, BT Research Laboratories (U.K.)
Paper number 259
Abstract:
This paper encompasses the approaches of segmental modelling and the
use of dynamic features in addressing the constraints of the IID assumption
in standard HMMs. Phonetic features are introduced which capture the
transitional dynamics across a phoneme unit via a DCT transformation
of a variable length segment. Alongside this, the use of a hybrid
phoneme model is proposed. Classification experiments demonstrate
the potential of these features and this model to match the performance
of standard HMMs. The extension of these features to full recognition
is explored, and details of a novel recognition framework are presented
alongside preliminary results. Lattice rescoring based on these models
and features is also explored. This reduces the set of segmentations
considered, allowing a more detailed exploration of the nature of the
model and features and the challenges in using the proposed recognition
strategy.
Authors:
Hynek Hermansky, Oregon Graduate Institute Of Science And Technology (USA)
Sangita Sharma, Oregon Graduate Institute Of Science And Technology (USA)
Paper number 615
Abstract:
This work proposes a radically different set of features for ASR, in which
TempoRAl Patterns of spectral energies are used in place of the conventional
spectral patterns. The approach has several inherent advantages, among
them robustness to stationary or slowly varying disturbances.
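A minimal sketch, under stated assumptions, of what such a temporal
pattern (TRAP-style) feature looks like in Python; the 101-frame
(roughly one second) context window is illustrative:

import numpy as np

def trap_features(band_energy, context=50):
    # One long temporal trajectory of a single band, centred on each frame.
    padded = np.pad(band_energy, context, mode='edge')
    return np.stack([padded[i:i + 2 * context + 1]
                     for i in range(len(band_energy))])

band_energy = np.random.randn(400)       # one band's log-energy trajectory (toy)
print(trap_features(band_energy).shape)  # (400, 101)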
Authors:
John N. Holmes, Consultant (U.K.)
Paper number 351
Abstract:
Both for robust fundamental frequency (F0) measurement and to provide
a degree of voicing indication, a new algorithm has been developed
based on multi-channel autocorrelation analysis. The speech is filtered
into eight separate frequency bands, representing the lowest 500 Hz
and seven overlapping band-pass channels each about 1000 Hz wide.
The outputs of all the band-pass channels are full-wave rectified and
band-pass filtered between 50 Hz and 500 Hz. Autocorrelation functions
are calculated for the signals from all eight channels, and these functions
are used both for the F0 measurement and for the voicing indication.
Optional dynamic programming is provided to maximize the continuity
of position of the correlation peaks selected for fundamental period
measurement. The algorithm has been designed for implementation on a
16-bit integer DSP, using less than 4 MIPS of processing power and 1500
words of data memory.
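A hedged Python sketch of the per-channel processing described above
(band-pass, full-wave rectify, band-limit the envelope to 50-500 Hz,
autocorrelate); the filter designs are illustrative stand-ins, not the
algorithm's fixed-point DSP implementation:

import numpy as np
from scipy.signal import butter, lfilter

def channel_acf(x, fs, band, max_lag):
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype='band')
    env = np.abs(lfilter(b, a, x))                       # full-wave rectified
    b2, a2 = butter(2, [50 / (fs / 2), 500 / (fs / 2)], btype='band')
    env = lfilter(b2, a2, env)                           # envelope band-limiting
    acf = np.correlate(env, env, mode='full')[len(env) - 1:]
    return acf[:max_lag] / (acf[0] + 1e-12)

fs = 8000
x = np.random.randn(fs // 4)   # 250 ms of toy "speech"
print(channel_acf(x, fs, (1000, 2000), max_lag=fs // 50).shape)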
Authors:
John F. Holzrichter, Lawrence Livermore National Laboratory (USA)
Gregory C. Burnett, Lawrence Livermore National Laboratory (USA)
Todd J. Gable, Lawrence Livermore National Laboratory (USA)
Lawrence C. Ng, Lawrence Livermore National Laboratory (USA)
Paper number 1064
Abstract:
Experiments have been conducted using a variety of very low power EM
sensors that measure articulator motions occurring in two frequency
bands, 1 Hz to 20 Hz and 70 Hz to 7 kHz. They enable noise-free estimates
of a voiced excitation function, accurate pitch measurements, generalized
transfer function descriptions, and detection of vocal articulator
motions.
Authors:
Jia-Lin Shen, Institute of Information Science, Academia Sinica (Taiwan)
Jeih-Weih Hung, Institute of Information Science, Academia Sinica (Taiwan)
Lin-Shan Lee, Institute of Information Science, Academia Sinica (Taiwan)
Paper number 232
Abstract:
This paper presents an entropy-based algorithm for accurate and robust
endpoint detection for speech recognition under noisy environments.
Instead of using conventional energy-based features, a spectral entropy
measure is developed to identify the speech segments accurately. Experimental
results show that this algorithm outperforms the energy-based algorithms
in both detection accuracy and recognition performance under noisy
environments, with an average error rate reduction of more than 16%.
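A minimal Python sketch of the spectral-entropy feature (thresholding
and smoothing logic omitted): a flat, noise-like spectrum yields high
entropy, while a frame with clear spectral structure yields low entropy.

import numpy as np

def spectral_entropy(frame):
    spec = np.abs(np.fft.rfft(frame)) ** 2
    p = spec / (np.sum(spec) + 1e-12)        # normalize to a distribution
    return -np.sum(p * np.log(p + 1e-12))

noise = np.random.randn(256)
tone = np.sin(2 * np.pi * 0.05 * np.arange(256))  # strongly structured frame
print(spectral_entropy(noise), spectral_entropy(tone))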
Authors:
Jia-Lin Shen, Institute of Information Science, Academia Sinica (Taiwan)
Wen-Liang Hwang, Institute of Information Science, Academia Sinica (Taiwan)
Paper number 447
Abstract:
This paper presents a study on statistical integration of temporal
filter banks for robust speech recognition using linear discriminant
analysis (LDA). The temporal properties of stationary features were
first captured and represented using a bank of well-defined temporal
filters. These derived temporal features are then integrated and
compressed using the LDA technique. Experimental results show that
the recognition performance can be significantly improved both in clean
and in noisy environments.
Authors:
Dorota J. Iskra, University of Birmingham (U.K.)
William H. Edmondson, University of Birmingham (U.K.)
Paper number 778
Abstract:
The alternative approach to speech recognition proposed here is based
on pseudo-articulatory representations (PARs), which can be described
as approximations of distinctive features, and aims to establish a
mapping between them and their acoustic specifications (in this case
cepstral coefficients). This mapping, which is used as the basis for
recognition, is first established for vowels. It is obtained using multiple
regression analysis after all the vowels have been described in terms
of phonetic features and an average cepstral vector has been calculated
for each of them. Based on this vowel model, the PAR values are calculated
for consonants. At this point recognition is performed using a brute-force
search mechanism to derive PAR trajectories and subsequently dynamic
programming to obtain a phone sequence. The results are not as good
as when hidden Markov modelling is used, but they are very promising
given the early stage of the experiments and the novelty of the approach.
Authors:
Hiroyuki Kamata, Meiji University (Japan)
Akira Kaneko, Meiji University (Japan)
Yoshihisa Ishida, Meiji University (Japan)
Paper number 1016
Abstract:
We propose a new method for emphasizing the periodicity of the voice
waveform using chaotic neurons, and a practical method to detect the
fundamental frequency of the human voice. The chaotic neuron is a kind
of nonlinear recursive mapping proposed in the field of nonlinear
theory and is usually used to generate chaotic signals. Considered from
the standpoint of linear signal processing, however, the chaotic neuron
can be interpreted as a first-order positive-feedback IIR digital
filter, and it therefore imposes a spectral slope on the spectrum of
the input speech signal. In this study, we tune the chaotic neuron to
amplify the low-frequency components so as to emphasize the fundamental
frequency component. As a result, the spectral peaks due to the formants
are cancelled, and the spectral peak corresponding to the fundamental
frequency of voiced speech can be detected easily. In addition, a
nonlinear function with a dead band is included in the feedback loop
of the chaotic neuron; as a consequence, the noise components
of unvoiced speech are not amplified.
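A sketch of the linear-filter reading of the chaotic neuron given
above: a first-order positive-feedback recursion with a dead-band
nonlinearity in the loop. The gain and dead-band width are illustrative
values, not the authors' tuning:

import numpy as np

def dead_band(v, width=0.05):
    # Zero out small feedback values so low-level noise is not amplified.
    return np.where(np.abs(v) > width, v, 0.0)

def chaotic_neuron_filter(x, k=0.9):
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = x[n] + k * dead_band(y[n - 1])   # first-order positive feedback
    return y

fs = 8000
t = np.arange(0, 0.2, 1.0 / fs)
x = 0.1 * np.sin(2 * np.pi * 100 * t)   # low-frequency (F0-like) input
print(np.max(np.abs(chaotic_neuron_filter(x))))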
Authors:
Simon King, University of Edinburgh (U.K.)
Todd Stephenson, University of Edinburgh (U.K.)
Stephen Isard, University of Edinburgh (U.K.)
Paul Taylor, University of Edinburgh (U.K.)
Alex Strachan, University of Edinburgh (U.K.)
Paper number 557
Abstract:
We describe a speech recogniser which uses a speech production-motivated
phonetic-feature description of speech. We argue that this is a natural
way to describe the speech signal and offers an efficient intermediate
parameterisation for use in speech recognition. We also propose to
model this description at the syllable rather than phone level. The
ultimate goal of this work is to generate syllable models whose parameters
explicitly describe the trajectories of the phonetic features of the
syllable. We hope to move away from Hidden Markov Models (HMMs) of
context-dependent phone units. As a step towards this, we present a
preliminary system which consists of two parts: recognition of the
phonetic features from the speech signal using a neural network; and
decoding of the feature-based description into phonemes using HMMs.
Authors:
Jacques Koreman, University of the Saarland, Institute of Phonetics (Germany)
Bistra Andreeva, University of the Saarland, Institute of Phonetics (Germany)
William J. Barry, University of the Saarland, Institute of Phonetics (Germany)
Paper number 549
Abstract:
The hidden Markov modelling experiments presented in this paper show
that consonant identification results can be improved substantially
if a neural network is used to extract linguistically relevant information
from the acoustic signal before applying hidden Markov modelling. The
neural network - or in this case a combination of two Kohonen networks
- takes 12 mel-frequency cepstral coefficients, overall energy and
the corresponding delta parameters as input and outputs distinctive
phonetic features, like [±uvular] and [±plosive].
Not only does this preprocessing of the data lead to better consonant
identification rates, the confusions that occur between the consonants
are less severe from a phonetic viewpoint, as is demonstrated. One
reason for the improved consonant identification is that the acoustically
variable consonant realisations can be mapped onto identical phonetic
features by the neural network. This makes the input to hidden Markov
modelling more homogeneous and improves consonant identification. Furthermore,
by using phonetic features the neural network helps the system to focus
on linguistically relevant information in the acoustic signal.
[Figure 0549_01: Consonant confusion matrix for the hidden Markov modelling experiment using the mapping of acoustic parameters onto phonetic features.]
[Figure 0549_02: Consonant confusion matrix for the hidden Markov modelling experiment using acoustic parameters directly as input.]
Authors:
Hisao Kuwabara, Teikyo University of Science & Technology (Japan)
Paper number 34
Abstract:
Investigations have been made into the perceptual properties of CV
syllables excised from continuous speech spoken at three different
speaking rates. Fifteen short Japanese sentences were spoken by a male
speaker at 1) a fast speaking rate, 2) a normal rate, and 3) a slow
rate. The results reveal that individual syllables do not carry enough
phonetic information to be correctly identified, especially in the fast
speech. The average syllable identification rate is 35% for the fast
speech, 59% for the normal, and 86% for the slow. It has been found
that syllable perception depends almost entirely on consonant
identification.
Authors:
Joohun Lee, Dong-Ah Broadcasting College (Korea)
Ki Yong Lee, Soongsil University (Korea)
Paper number 296
Abstract:
In this paper, to estimate time-varying speech parameters with a
non-Gaussian excitation source, we use a robust sequential estimator
(RSE) based on the t-distribution and introduce a forgetting factor.
By using the RSE based on a t-distribution with a small number of
degrees of freedom, we can efficiently alleviate the effects of
outliers and obtain better parameter estimation performance. Moreover,
thanks to the forgetting factor, the proposed algorithm can estimate
accurate parameters under rapid variation of the speech signal.
Authors:
Christopher John Long, Loughborough University (U.K.)
Sekharajit Datta, Loughborough University (U.K.)
Paper number 802
Abstract:
In this paper, a new feature extraction methodology based on Wavelet
Transforms is examined which, unlike some conventional parameterisation
techniques, is flexible enough to cope with the broadly differing characteristics
of typical speech signals. A training phase is involved during which
the final classifier is invoked to associate a cost function (a proxy
for misclassification) with a given resolution. The subspaces are
then searched and pruned to provide a wavelet basis best suited to
the classification problem. Comparative results are given illustrating
some improvement over the Short-Time Fourier Transform using two differing
subclasses of speech.
Authors:
Hiroshi Matsumoto, Dept. of Electrical & Electronic Eng., Faculty of Eng., Shinshu University (Japan)
Yoshihisa Nakatoh, Multimedia Development Center, Matsushita Electric Industrial Co., Ltd. (Japan)
Yoshinori Furuhata, Dept. of Electrical & Electronic Eng., Faculty of Eng., Shinshu University (Japan)
Paper number 47
Abstract:
This paper proposes a simple and efficient time domain technique to
estimate an all-pole model on a mel-frequency axis (Mel-LPC), i.e.,
the bilinear-transformed all-pole model of Strube. Autocorrelation
coefficients on the mel-frequency axis are derived exactly, without any
approximation, by computing cross-correlation coefficients between the
speech signal and an all-pass-filtered version of it. This method
requires only twice the computational cost of conventional linear
prediction analysis. The recognition performance of mel-cepstral
parameters obtained by Mel-LPC analysis
is compared with those of conventional LP mel-cepstra and the mel-frequency
cepstrum coefficients (MFCC) through gender-dependent phoneme and word
recognition tests. The results show that the Mel-LPC cepstrum attains
a significant improvement in recognition accuracy over conventional
LP mel-cepstrum, and gives slightly higher accuracy for male speakers
and slightly lower accuracy for female speakers than MFCC.
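A hedged Python sketch of the warped-autocorrelation idea described
above: lag m is obtained by correlating the signal with its m-fold
all-pass-filtered copy, and the warped LP coefficients follow from the
usual normal equations. The warping factor alpha depends on the
sampling rate; the value below is illustrative:

import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def warped_autocorr(x, order, alpha=0.35):
    r = np.zeros(order + 1)
    y = x.copy()
    r[0] = np.dot(x, y)
    for m in range(1, order + 1):
        y = lfilter([-alpha, 1.0], [1.0, -alpha], y)   # one all-pass stage
        r[m] = np.dot(x, y)
    return r

def mel_lpc(x, order=12, alpha=0.35):
    r = warped_autocorr(x, order, alpha)
    return solve_toeplitz((r[:-1], r[:-1]), r[1:])     # warped LP coefficients

x = np.random.randn(400)   # one analysis frame (toy data)
print(mel_lpc(x))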
Authors:
Philip McMahon, The Queen's University of Belfast (Ireland)
Paul McCourt, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Paper number 315
Abstract:
This paper explores possible strategies for the recombination of independent
multi-resolution sub-band based recognisers. The multi-resolution
approach is based on the premise that additional cues for phonetic
discrimination may exist in the spectral correlates of a particular
sub-band, but not in another. Weights are derived via discriminative
training using the 'Minimum Classification Error' (MCE) criterion on
log-likelihood scores. Using this criterion the weights for correct
and competing classes are adjusted in opposite directions, thus conveying
the sense of enforcing separation of confusable classes. Discriminative
re-combination is shown to provide significant improvements for both phone
classification and continuous recognition tasks on the TIMIT database.
Weighted recombination of independent multi-resolution sub-band models
is also shown to provide robustness improvements in broadband noise.
Authors:
Yoram Meron, University of Tokyo (Japan)
Keikichi Hirose, University of Tokyo (Japan)
Paper number 416
Abstract:
Our goal is to develop a singing synthesis system, in which "singing
units" are automatically extracted from existing musical recordings
of a singer accompanied by a musical instrument (piano). This paper
concentrates on the problem of separating the singer from the accompaniment.
Existing separation methods require the knowledge of the exact frequencies
of the signals to be separated, and are prone to degrading the quality
of the separated signals. In this paper, we use the framework of the
sinusoidal modeling approach. We suggest the use of further sources
of information, available to the specific task: advance knowledge of
the music score, knowledge about the piano sound, and a relatively
large database of the piano and singer signals, which is used to build
a model of the piano sound. Results show that using musical score
information and piano note modeling can improve separation quality.
Authors:
Nobuaki Minematsu, Toyohashi Univ. of Tech. (Japan)
Seiichi Nakagawa, Toyohashi Univ. of Tech. (Japan)
Paper number 52
Abstract:
First, the correlation between spectral variations and F0 changes
within a vowel is analyzed, and the variations are compared to VQ
distortions calculated in a five-vowel space. It is shown that an F0
change of approximately half an octave produces a spectral variation
comparable to the VQ distortion when the codebook size equals the
number of vowels. Next, a model to predict the variations of the
cepstral coefficients caused by F0 changes is built using multivariate
regression analysis. Experiments show that the frame generated by the
model has a remarkably small distance to the target frame. Furthermore,
the model is evaluated separately as a spectral envelope predictor for
a given F0 and as a mapping function between feature subspaces. While
the models must be built phoneme- and speaker-dependently for the
former role, adequate selection of parameters enables speaker- and
phoneme-independent models to work effectively in the latter role.
Authors:
Partha Niyogi, Bell Labs - Lucent Technologies (USA)
Partha Mitra, Bell Labs - Lucent Technologies (USA)
Man Mohan Sondhi, Bell Labs - Lucent Technologies (USA)
Paper number 665
Abstract:
We consider the problem of detecting stop consonants in continuously
spoken speech. We pose the problem as one of finding the optimal filter
(linear or non-linear) that operates on a particular appropriately
chosen representation. We discuss the performance of several variants
of a canonical stop detector and consider its implications for human
and machine speech recognition.
Authors:
Climent Nadeu, Universitat Politècnica de Catalunya (Spain)
Fèlix Galindo, Universitat Politècnica de Catalunya (Spain)
Jaume Padrell, Universitat Politècnica de Catalunya (Spain)
Paper number 1135
Abstract:
Many speech recognition systems use logarithmic filter-bank energies
or a linear transformation of them to represent the speech signal.
Usually, each of those energies is routinely computed as a weighted
average of the periodogram samples that lie in the corresponding frequency
band. In this work, we attempt to gain insight into the statistical
properties of the frequency-averaged periodogram (FAP), of which those
energies are samples. We show that the FAP is statistically and
asymptotically equivalent to a multiwindow estimator that arises from
Thomson's optimization approach and uses orthogonal sinusoids as
windows. The FAP and other multiwindow estimators are tested in a
speech recognition application, observing the influence of several
design factors. In particular, a technique that is computationally as
simple as the FAP, and which is equivalent to using multiple cosine
windows, appears to be an alternative worth taking into consideration.
Authors:
Munehiro Namba, Meiji University (Japan)
Yoshihisa Ishida, Meiji University (Japan)
Paper number 55
Abstract:
In this paper, a wavelet-transform-domain realization of the blind
equalization technique termed EVA is applied to speech analysis. The
conventional linear prediction problem can be viewed as a constrained
blind equalization problem. Because EVA does not impose any restriction
on the probability distribution of the input (the glottal excitation),
the principal features of speech can be effectively separated from a
short stretch of speech. Computational complexity could be a problem,
but the proposed implementation in the wavelet transform domain
promotes faster convergence in the analysis of speech signals.
Authors:
Steve Pearson, Panasonic Technologies, Inc./Speech Technology Lab (USA)
Paper number 647
Abstract:
This paper presents a class of methods for automatically extracting
formant parameters from speech. The methods rely on an iterative optimization
algorithm. It was found that formant parameter data derived with these
methods were less prone to discontinuity errors than data from conventional
methods.
Also, experiments were conducted that demonstrated that these methods
are capable of better accuracy in formant estimation than LPC, especially
for the first formant. In some cases, the analytic (non-iterative)
solution has been derived, making real time applications feasible.
The main target that we have been pursuing is text-to-speech (TTS)
conversion. These methods are being used to automatically analyze a
concatenation database, without the need for a tuning phase to fix
errors. In addition, they are instrumental in realizing high quality
pitch tracking, and pitch epoch marking.
Authors:
António J. Araújo, FEUP/INESC (Portugal)
Vitor C. Pera, FEUP (Portugal)
Marcio N. Souza, UFRJ (Brazil)
Paper number 319
Abstract:
For a real-time application of an automatic speech recognition system,
hardware acceleration can be the key to reducing execution time.
Vector quantization is an important task that a recognizer based on
discrete hidden Markov models must perform. Due to the amount of floating
point operations executed, the vector quantizer is an excellent candidate
to be accelerated by customized hardware. The design, implementation,
and results of a hardware solution based on field-programmable gate
array devices are presented.
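For concreteness, the computation being accelerated is the nearest-codeword
search of vector quantization; a toy-sized Python sketch:

import numpy as np

def quantize(features, codebook):
    # Squared Euclidean distances, (frames x codewords), then nearest index.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d, axis=1)

codebook = np.random.randn(256, 12)   # 256 codewords of dimension 12
features = np.random.randn(50, 12)    # 50 input frames
print(quantize(features, codebook)[:10])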
Authors:
Hartmut R. Pfitzinger, Department of Phonetics, University of Munich (Germany)
Paper number 523
Abstract:
This investigation focuses on deriving local speech rate, which
differs from syllable rate and from phone rate, directly from the
speech signal. Since local speech rate modifies acoustic cues (e.g.
transitions), phones, and even words, it is one of the most important
prosodic cues. Our local speech rate estimation method is based on a
linear combination of the syllable rate and the phone rate, since this
investigation strongly suggests that neither the syllable rate nor the
phone rate on its own represents the speech rate sufficiently. Our
results show
(a) that perceptual local speech rate correlates better with local
syllable rate than with local phone rate (r=0.81>r=0.73), (b) that
the linear combination of both is well-correlated with perceptual local
speech rate (r=0.88), and (c) that it is now possible to calculate
the perceptual local speech rate with the aid of automatic phone boundary
detectors and syllable nuclei detectors directly from the speech signal.
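The linear combination can be fitted by ordinary least squares; a small
Python sketch with synthetic stand-in data (the coefficients below are
invented for illustration, not the paper's):

import numpy as np

syl_rate = np.random.rand(100) * 6        # syllables/s, toy values
phone_rate = np.random.rand(100) * 15     # phones/s, toy values
perceived = 0.6 * syl_rate + 0.2 * phone_rate + np.random.randn(100) * 0.3

A = np.column_stack([syl_rate, phone_rate, np.ones_like(syl_rate)])
coef, *_ = np.linalg.lstsq(A, perceived, rcond=None)
print("a=%.2f b=%.2f c=%.2f" % tuple(coef))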
Authors:
Solange Rossato, Institut de la Communication Parlée de Grenoble (France)
Gang Feng, Institut de la Communication Parlée de Grenoble (France)
Rafaël Laboissière, Institut de la Communication Parlée de Grenoble (France)
Paper number 540
Abstract:
For nasal vowels, a gesture as simple as the lowering of the velum
produces complex acoustic spectra. However, we still find a relative
simplicity in the perceptual space; nasality is perceived easily. In
this preliminary study, we use a statistical method to recover the
gesture of the velum. In order to reduce the extreme variability of
nasal vowels,
we introduced a simulation based on Maeda's model instead of using
a natural speech signal. In previous studies, nasality was supposed to
increase either with the size of the nasal area or with the area ratio
between the nasal and oral tracts at the extremity of the velum. In
this work, both types of data are considered and analyzed with linear
and non-linear tools. Finally, statistical inference is described and
results
are given for various areas of the nasal tract entrance and for various
area ratios. The results show that velar port area is correctly estimated
for small values while area ratio is a better parameter when velar
port area increases.
Authors:
Guenther Ruske, Institute for Human-Machine-Communication, Technical University of Munich (Germany)
Robert Faltlhauser, Institute for Human-Machine-Communication, Technical University of Munich (Germany)
Thilo Pfau, Institute for Human-Machine-Communication, Technical University of Munich (Germany)
Paper number 100
Abstract:
Speech recognition systems based on hidden Markov models (HMM) favourably
apply a linear discriminant analysis transform (LDA) which yields low-dimensional
and uncorrelated feature components. However, since the distributions
in the HMM states are usually modeled by Gaussian mixture densities,
a description by second-order moments is no longer adequate. To address
this, we introduce a new "extended linear discriminant analysis"
transform (ELDA), which starts from conventional LDA. The ELDA transform
is derived by use of a gradient descent optimization procedure based
on a "minimum classification error" (MCE) principle, which is applied
to the original high-dimensional pattern space. The transform matrix,
the best fitting prototype of the correct class (i.e. HMM state) and
the nearest rival are adapted. We developed a method which additionally
updates all prototypes by a separate maximum likelihood (ML) estimation
step. This prevents means and covariances that remain mostly unaffected
by the MCE procedure from diverging step by step.
Authors:
Ara Samouelian, University Of Wollongong (Australia)
Jordi Robert-Ribes, Digital Media Information Systems, CSIRO Mathematical and Information Sciences (Australia)
Mike Plumpe, Microsoft Corporation (USA)
Paper number 620
Abstract:
Speech processing can be of great help for indexing and archiving TV
broadcast material. Broadcasting station standards will soon be
digital, and there will be a huge increase in the use of speech
processing techniques
for maintaining the archives as well as accessing them. We present
an application of information theory to the classification and automatic
labelling of TV broadcast material into speech, music and noise. We
use information theory to construct a decision tree from several different
TV programs and then apply it to a different set of TV programs. We
present classification results on training and test data sets. The
frame-level correct classification rate for training data was 95.5%,
while for test data it ranged from 60.4% to 84.5%, depending on TV
program type. At the segment level, the correct recognition rate and
accuracy on training data were 100% and 95.1%, respectively, while for
test data the percent correct ranged from 80% to 100% and the percent
accuracy ranged from 64.7% to 100%.
Authors:
Jean Schoentgen, Université Libre de Bruxelles (Belgium)
Alain Soquet, Université Libre de Bruxelles (Belgium)
Véronique Lecuit, Université Libre de Bruxelles (Belgium)
Sorin Ciocea, Université Libre de Bruxelles (Belgium)
Paper number 1104
Abstract:
The objective is to present a formalism which offers a framework for
several articulatory models and notions such as targets, gestures and
the quantal principle of speech production. The formalism is based
on coupled differential equations that relate the vocal tract shape
to its eigenfrequencies. The shape of the vocal tract is described
either directly by means of an area function model or indirectly via
an articulatory model. Possible synergetic relations between phonetic
gestures or targets and the quantal principle of speech production
are discussed.
Authors:
Youngjoo Suh, ETRI (Korea)
Kyuwoong Hwang, ETRI (Korea)
Oh-Wook Kwon, ETRI (Korea)
Jun Park, ETRI (Korea)
Paper number 638
Abstract:
We propose a new approach to improving the performance of speech
recognizers by utilizing acoustic-phonetic knowledge sources. We use
the unvoiced, voiced, and silence (UVS) group information of the input
speech signal in a conventional speech recognizer. We extract the UVS
information using a recurrent neural network (RNN), generate a
rule-based score, and then add the score representing the UVS
information to the conventional spectral feature-driven score in the
search module. Experimental results showed that the approach reduces
errors by 9% in a 5000-word Korean spontaneous speech recognition
domain.
Authors:
C. William Thorpe, National Voice Centre (Australia)
Paper number 244
Abstract:
A subtractive deconvolution algorithm is described which allows one
to separate a voiced speech signal into two components, representing
the time-invariant and dynamic parts of the signal respectively. The
resulting dynamic component can be encoded at a lower data rate than
can the original speech signal. Results are presented which validate
the utility of decomposing the speech waveform into these two components,
and demonstrate the ability of the algorithm to represent speech signals
at a reduced data rate.
Authors:
Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)
Paper number 343
Abstract:
This study presents a novel technique to reconstruct the missing frequency
bands of bandlimited telephone speech signals. This technique is based
on the Amplitude and Frequency Modulation (AM-FM) model, which models
the speech signal as the sum of N successive AM-FM signals. Based on
a least-mean-square error criterion, each AM-FM signal is modified
using an iterative algorithm in order to regenerate the high-frequency
AM-FM signals. These modified signals are then combined in order to
reconstruct the broadband speech signal. Experiments were conducted
using speech signals extracted from the NTIMIT database. These
experiments demonstrate the ability of the algorithm to recover speech,
as assessed by a comparison between the original and synthesized speech
and by informal listening tests.
Authors:
Chang-Sheng Yang, Utsunomiya University (Japan)
Hideki Kasuya, Utsunomiya University (Japan)
Paper number 1143
Abstract:
In this paper, a high quality pole-zero speech analysis technique is
proposed. The speech production process is represented by a source-filter
model. A Rosenberg-Klatt model is used to approximate a voicing source
waveform for voiced speech, whereas white noise is assumed for unvoiced speech.
The vocal tract transfer function is represented by a pole-zero filter.
For voiced speech, parameters of the source model are jointly estimated
with those of the vocal tract filter. A combined algorithm is developed
to estimate the vocal tract parameters, i.e., formants and anti-formants
which are calculated from the poles and zeros of the filter. In this
algorithm, poles are estimated using a subspace method, while
zeros are estimated from the amplitude spectrum. For unvoiced speech,
an AR model is assumed, which can be solved by LPC analysis. An experiment
using synthesized nasal sounds shows that the poles and zeros are estimated
quite accurately.
Authors:
Fang Zheng, Tsinghua University (China)
Zhanjiang Song, Tsinghua University (China)
Ling Li, Tsinghua University (China)
Wenjian Yu, Tsinghua University (China)
Fengzhou Zheng, Tsinghua University (China)
Wenhu Wu, Tsinghua University (China)
Paper number 171
Abstract:
The Line Spectrum Pair (LSP) representation, based on the principle of
linear predictive coding (LPC), plays a very important role in speech
synthesis and has many interesting properties. Several well-known
speech compression/decompression algorithms, including code-excited
linear predictive coding (CELP), are based on LSP analysis, where the
information loss or prediction errors are often very small due to the
LSP's characteristics. Unfortunately, no satisfactory distance measure
has so far been available for LSPs that would allow these features to
be used in speech recognition applications. In this paper, the
principle of LSP analysis is first studied, and then several distance
measures for LSPs are proposed which describe very well the difference
between two groups of different LSP parameters. Experimental results
are also given to show the efficiency of the proposed distance
measures.
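As a hedged illustration of the kind of measure at issue (not one of
the paper's proposed measures), a weighted Euclidean LSP distance that
emphasises closely spaced LSP pairs, which correspond to spectral peaks:

import numpy as np

def lsp_distance(lsp_a, lsp_b):
    gaps = np.diff(lsp_a, prepend=0.0)     # narrow gaps flag formant regions
    w = 1.0 / (gaps + 1e-3)
    return np.sqrt(np.sum(w * (lsp_a - lsp_b) ** 2))

lsp_a = np.sort(np.random.rand(10)) * np.pi   # toy 10th-order LSP frequencies
lsp_b = np.sort(np.random.rand(10)) * np.pi
print(lsp_distance(lsp_a, lsp_b))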