Authors:
Shin Suzuki, NTT Basic Research Laboratories (Japan)
Takesi Okadome, NTT Basic Research Laboratories (Japan)
Masaaki Honda, NTT Basic Research Laboratories (Japan)
Page (NA) Paper number 130
Abstract:
A method for determining articulatory parameters from speech acoustics
is presented. The method is based on a search of an articulatory-acoustic
codebook which is designed from simultaneous observation data of articulatory
motions and speech acoustics. The codebook search employs dynamic constraints
on acoustic as well as articulatory behavior. There are two constraints:
the use of spectral segments in the codebook search, and the use of the
smoothness of articulatory trajectories in the articulatory parameter
path search. The articulatory parameters are determined by selecting the
articulatory code vector in the codebook that minimizes a weighted
distance measure combining the segmental spectral distance and the
squared distance between successive articulatory parameters. Experimental
results show that the rms error between the estimated and observed
articulatory parameters was about 2.0 mm on average and that the
articulatory features for vowels and consonants were recovered well.
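As a rough illustration of the kind of path search described above, the
following Python sketch selects, for each frame, the codebook entry that
minimizes a spectral distance plus a weighted squared distance between
successive articulatory code vectors. The function name, the plain
Euclidean distances, and the single smoothness weight are illustrative
assumptions, not the authors' implementation.

import numpy as np

def search_articulatory_path(spectral_frames, code_spectra, code_artic, w_smooth=1.0):
    # spectral_frames: (T, F) observed spectra; code_spectra: (K, F) codebook spectra;
    # code_artic: (K, D) codebook articulatory vectors.
    spectral_frames = np.asarray(spectral_frames)
    code_spectra = np.asarray(code_spectra)
    code_artic = np.asarray(code_artic)
    T, K = len(spectral_frames), len(code_spectra)
    # Acoustic cost of assigning codebook entry k to frame t (squared spectral distance).
    acoustic = ((spectral_frames[:, None, :] - code_spectra[None, :, :]) ** 2).sum(axis=2)
    # Smoothness cost between every pair of articulatory code vectors.
    trans = ((code_artic[:, None, :] - code_artic[None, :, :]) ** 2).sum(axis=2)
    cost = acoustic[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + w_smooth * trans          # (previous entry, next entry)
        backptr[t] = np.argmin(total, axis=0)
        cost = total[backptr[t], np.arange(K)] + acoustic[t]
    # Trace back the minimum-cost path and return the articulatory trajectory.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return code_artic[path[::-1]]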
Authors:
Yang Li, University of Illinois at Urbana-Champaign (USA)
Yunxin Zhao, University of Illinois at Urbana-Champaign (USA)
Page (NA) Paper number 379
Abstract:
The acoustic characteristics of speech are influenced by speakers'
emotional status. In this study, we attempted to recognize the emotional
status of individual speakers by using speech features that were extracted
from short-time analysis frames as well as speech features that represented
entire utterances. Principal component analysis was used to analyze
the importance of individual features in representing emotional categories.
Three classification methods were used: vector quantization, artificial
neural networks, and Gaussian mixture density models. Classifications
were conducted using short-term features only, long-term features only,
and both short-term and long-term features. The best recognition performance
of 62% accuracy was achieved by using the Gaussian mixture density
method with both short-term and long-term features.
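The following Python sketch illustrates the general classification scheme
(one Gaussian mixture model per emotion over combined short-term and
long-term features, with PCA applied to the feature set). The use of
scikit-learn and the specific feature layout are assumptions made for
illustration; they are not the paper's setup.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_emotion_models(features_by_class, n_components=4, n_pca=10):
    # features_by_class: dict mapping emotion label -> (N, D) matrix; each row
    # concatenates short-term feature statistics and long-term utterance features.
    all_feats = np.vstack(list(features_by_class.values()))
    pca = PCA(n_components=n_pca).fit(all_feats)      # analyze / reduce feature dimensions
    gmms = {label: GaussianMixture(n_components=n_components).fit(pca.transform(X))
            for label, X in features_by_class.items()}
    return pca, gmms

def classify_utterance(pca, gmms, feature_vector):
    # Choose the emotion whose mixture model gives the highest log-likelihood.
    z = pca.transform(np.asarray(feature_vector).reshape(1, -1))
    return max(gmms, key=lambda label: gmms[label].score(z))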
Authors:
Arnaud Robert, CIRC Group, Swiss Federal Institute of Technology, Lausanne (Switzerland)
Jan Eriksson, Physiology Department, University of Lausanne (Switzerland)
Page (NA) Paper number 748
Abstract:
This paper describes a phenomenological model of the auditory periphery
which consists of a bank of nonlinear time-varying parallel filters.
Realistic filter shapes are obtained with the all-pole gammatone filter
(APGF) which provides both a good approximation of the far more complex
wave-propagation or cochlear mechanics models and a very simple implementation.
The model also includes an active, distributed feedback that controls
the damping parameter of the APGF. As a result, the model reproduces
several observed phenomena including compression and two-tone suppression.
It is now used to study responses to complex stimuli in models of the
auditory nerve and cochlear nucleus neurons, and to provide a physiologically
plausible front-end for speech analysis.
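A minimal sketch of one all-pole gammatone channel is given below: a
cascade of identical two-pole resonators whose damping parameter is the
quantity a feedback loop would adjust. The pole placement, the
damping-to-bandwidth scaling, and the gain normalization are illustrative
choices rather than the authors' model, and the active feedback itself is
omitted.

import numpy as np
from scipy.signal import lfilter

def apgf(signal, fs, fc, damping=0.2, order=4):
    # 'order' cascaded identical two-pole resonators at centre frequency fc;
    # larger 'damping' gives a broader, lower-gain response.
    bw = 2 * np.pi * damping * fc                      # pole bandwidth in rad/s (illustrative scaling)
    pole = np.exp((-bw + 1j * 2 * np.pi * fc) / fs)    # digital pole via impulse invariance
    a = [1.0, -2.0 * pole.real, abs(pole) ** 2]        # denominator of one all-pole section
    gain = abs(np.polyval(a, np.exp(1j * 2 * np.pi * fc / fs)))  # unity gain at fc per section
    y = np.asarray(signal, dtype=float)
    for _ in range(order):
        y = lfilter([gain], a, y)
    return y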
Authors:
Padma Ramesh, Bell Labs - Lucent Technologies (USA)
Partha Niyogi, Bell Labs - Lucent Technologies (USA)
Page (NA) Paper number 881
Abstract:
We examine the distinctive feature [voice] that separates the voiced
from the unvoiced sounds for the case of stop consonants. We conduct
acoustic phonetic analyses on a large database and demonstrate the
superior separability obtained with a temporal measure (voice onset time,
VOT) rather than with spectral measures. We describe several algorithms
for automatically estimating the VOT from continuous speech and compare
them on a speech recognition task, where they reduce error rates by as
much as 53 percent over a baseline HMM-based system.
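As a rough illustration only (not one of the paper's algorithms), a VOT
estimate can be formed by locating the burst as a sharp rise in
high-frequency energy and the voicing onset as the first sustained rise in
low-frequency energy after it. The band edges, threshold, and frame length
in the sketch below are arbitrary assumptions.

import numpy as np

def estimate_vot(x, fs, frame_dur=0.005):
    # Frame the signal, then track energy in a high band (burst) and a low band (voicing).
    hop = int(frame_dur * fs)
    frames = np.lib.stride_tricks.sliding_window_view(np.asarray(x, float), hop)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(hop), axis=1))
    freqs = np.fft.rfftfreq(hop, 1.0 / fs)
    hi = spec[:, freqs > 3000].sum(axis=1)             # burst energy band
    lo = spec[:, freqs < 1000].sum(axis=1)             # voicing energy band
    burst = int(np.argmax(np.diff(hi)))                # frame with the sharpest high-band rise
    strong = lo[burst:] > 0.5 * lo.max()               # frames with strong low-band energy
    voicing = burst + int(np.argmax(strong))           # first such frame after the burst
    return (voicing - burst) * frame_dur               # VOT in seconds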
Authors:
Sankar Basu, IBM T.J. Watson Research Center (USA)
Stéphane Maes, IBM T.J. Watson Research Center (USA)
Page (NA) Paper number 982
Abstract:
Speech production models, coding methods, and text-to-speech technology
often lead to the introduction of modulation models that represent speech
signals by primary components, namely amplitude- and phase-modulated sine
functions. Parallels between properties of the wavelet transform of
primary components and algorithmic representations of speech signals
derived from auditory nerve models such as the EIH lead to the
introduction of synchrosqueezing measures. In automatic speech (and
speaker) recognition, on the other hand, cepstral features have become the
quasi-universal acoustic characterization of speech utterances. This paper
analyses the cepstral representation in the context of the synchrosqueezed
representation, the wastrum. It discusses energy-accumulation-derived
wastra as opposed to classical MEL- and LPC-derived cepstra; in the
former, the primary components and formants play a central role.
Recognition results are presented on the Wall Street Journal database
using the IBM continuous decoder.
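The wastrum itself is not specified in this abstract. As a loose,
STFT-based analogue only, the sketch below reassigns spectral energy to
instantaneous-frequency bins (a synchrosqueezing-like step) and then takes
a DCT of the log energies to form cepstrum-like coefficients; all
parameter choices are illustrative and this is not the authors' method.

import numpy as np
from scipy.fft import dct

def squeezed_cepstrum(x, fs, n_fft=512, hop=128, n_cep=13):
    # Short-time Fourier transform of the signal.
    win = np.hanning(n_fft)
    frames = np.lib.stride_tricks.sliding_window_view(np.asarray(x, float), n_fft)[::hop]
    X = np.fft.rfft(frames * win, axis=1)                        # (frames, bins)
    k = np.arange(X.shape[1])
    # Instantaneous frequency per bin from the phase advance between frames.
    expected = 2 * np.pi * k * hop / n_fft                       # nominal phase advance per hop
    dev = np.angle(X[1:] * np.conj(X[:-1]) * np.exp(-1j * expected))
    inst_f = np.clip(k * fs / n_fft + dev * fs / (2 * np.pi * hop), 0, fs / 2)
    # Reassign ("squeeze") magnitudes onto instantaneous-frequency bins.
    centres = np.linspace(0, fs / 2, X.shape[1])
    S = np.zeros((X.shape[0] - 1, X.shape[1]))
    for t in range(S.shape[0]):
        idx = np.clip(np.searchsorted(centres, inst_f[t]), 0, S.shape[1] - 1)
        np.add.at(S[t], idx, np.abs(X[t + 1]))
    # Cepstrum-like coefficients of the log squeezed spectrum.
    return dct(np.log(S + 1e-8), axis=1, norm='ortho')[:, :n_cep]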
Authors:
Carlos Silva, Dept. de Electrónica Industrial - Universidade do Minho (Portugal)
Samir Chennoukh, Center for Computer Aids for Industrial Productivity (CAIP), Rutgers University (USA)
Page (NA) Paper number 899
Abstract:
Fundamental to the success of articulatory-based speech coding is the
mapping from acoustics to an articulatory description. As the mapping is
not unique, the non-uniqueness of the articulatory trajectories is
resolved using a forward dynamic network based on articulatory continuity
criteria. In this paper, we present new results on the forward dynamic
network used to estimate articulatory trajectories with an improved
articulatory codebook for acoustic-to-articulatory mapping. The
improvement in the codebook design is based on a new model that provides
more detail on the vocal tract area function, and on articulatory
parameter sampling that better follows the articulatory-acoustic relation.
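As a hedged illustration of sampling a codebook according to the
articulatory-acoustic relation (the paper's actual design procedure is not
given in this abstract), the Python sketch below draws coarse articulatory
samples, maps them through a hypothetical forward model standing in for an
area-function-based synthesizer, and adds extra entries where small
articulatory perturbations cause large acoustic changes.

import numpy as np

def build_codebook(forward_model, low, high, n_init=1000, n_refine=1000, seed=0):
    # forward_model: hypothetical articulatory-vector -> spectral-vector function
    # (an assumption for illustration, not the paper's model).
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, float), np.asarray(high, float)
    artic = rng.uniform(low, high, size=(n_init, len(low)))     # coarse articulatory samples
    acoust = np.array([forward_model(a) for a in artic])
    # Crude sensitivity threshold: typical acoustic spread of the initial samples.
    threshold = np.median(np.linalg.norm(acoust - acoust.mean(axis=0), axis=1))
    step = 0.05 * (high - low)
    for _ in range(n_refine):
        i = rng.integers(len(artic))
        probe = np.clip(artic[i] + rng.normal(0.0, step), low, high)
        probe_acoust = forward_model(probe)
        # Densify where a small articulatory step causes a large acoustic change.
        if np.linalg.norm(probe_acoust - acoust[i]) > threshold:
            artic = np.vstack([artic, probe])
            acoust = np.vstack([acoust, probe_acoust])
    return artic, acoust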
0899_01.WAV
(was: 0899_1.wav)
| Speech file of the original sentence "Where are you?"
File type: Sound File
Format: Sound File: WAV
Tech. description: 16 kHz, 16 bits, mono, signed linear encoding.
Creating Application: sox
Creating OS: Linux
|
0899_02.WAV
(was: 0899_2.wav)
| Speech file of the mimic result of the sentence "Where are you?"
using our improved codebook.
File type: Sound File
Format: Sound File: WAV
Tech. description: 16 kHz, 16 bits, mono, signed linear encoding.
Creating Application: sox
Creating OS: Linux
|
0899_03.WAV
(was: 0899_3.wav)
| Speech file of the mimic result of the sentence "Where are you?"
using our old codebook.
File type: Sound File
Format: Sound File: WAV
Tech. description: 16 kHz, 16 bits, mono, signed linear encoding.
Creating Application: sox
Creating OS: Linux
|
0899_04.PDF
(was: 0899.gif)
| Spectrogram of the sentence "Where are you?" using our old codebook.
File type: Image File
Format: Image : GIF
Tech. description: None
Creating Application: XV
Creating OS: Linux
|