Full List of Titles
1: ICSLP'98 Proceedings
Keynote Speeches
Text-To-Speech Synthesis 1
Spoken Language Models and Dialog 1
Prosody and Emotion 1
Hidden Markov Model Techniques 1
Speaker and Language Recognition 1
Multimodal Spoken Language Processing 1
Isolated Word Recognition
Robust Speech Processing in Adverse Environments 1
Spoken Language Models and Dialog 2
Articulatory Modelling 1
Talking to Infants, Pets and Lovers
Robust Speech Processing in Adverse Environments 2
Spoken Language Models and Dialog 3
Speech Coding 1
Articulatory Modelling 2
Prosody and Emotion 2
Neural Networks, Fuzzy and Evolutionary Methods 1
Utterance Verification and Word Spotting 1 / Speaker Adaptation 1
Text-To-Speech Synthesis 2
Spoken Language Models and Dialog 4
Human Speech Perception 1
Robust Speech Processing in Adverse Environments 3
Speech and Hearing Disorders 1
Prosody and Emotion 3
Spoken Language Understanding Systems 1
Signal Processing and Speech Analysis 1
Spoken Language Generation and Translation 1
Spoken Language Models and Dialog 5
Segmentation, Labelling and Speech Corpora 1
Multimodal Spoken Language Processing 2
Prosody and Emotion 4
Neural Networks, Fuzzy and Evolutionary Methods 2
Large Vocabulary Continuous Speech Recognition 1
Speaker and Language Recognition 2
Signal Processing and Speech Analysis 2
Prosody and Emotion 5
Robust Speech Processing in Adverse Environments 4
Segmentation, Labelling and Speech Corpora 2
Speech Technology Applications and Human-Machine Interface 1
Large Vocabulary Continuous Speech Recognition 2
Text-To-Speech Synthesis 3
Language Acquisition 1
Acoustic Phonetics 1
Speaker Adaptation 2
Speech Coding 2
Hidden Markov Model Techniques 2
Multilingual Perception and Recognition 1
Large Vocabulary Continuous Speech Recognition 3
Articulatory Modelling 3
Language Acquisition 2
Speaker and Language Recognition 3
Text-To-Speech Synthesis 4
Spoken Language Understanding Systems 4
Human Speech Perception 2
Large Vocabulary Continuous Speech Recognition 4
Spoken Language Understanding Systems 2
Signal Processing and Speech Analysis 3
Human Speech Perception 3
Speaker Adaptation 3
Spoken Language Understanding Systems 3
Multimodal Spoken Language Processing 3
Acoustic Phonetics 2
Large Vocabulary Continuous Speech Recognition 5
Speech Coding 3
Language Acquisition 3 / Multilingual Perception and Recognition 2
Segmentation, Labelling and Speech Corpora 3
Text-To-Speech Synthesis 5
Spoken Language Generation and Translation 2
Human Speech Perception 4
Robust Speech Processing in Adverse Environments 5
Text-To-Speech Synthesis 6
Speech Technology Applications and Human-Machine Interface 2
Prosody and Emotion 6
Hidden Markov Model Techniques 3
Speech and Hearing Disorders 2 / Speech Processing for the Speech and Hearing Impaired 1
Human Speech Production
Segmentation, Labelling and Speech Corpora 4
Speaker and Language Recognition 4
Speech Technology Applications and Human-Machine Interface 3
Utterance Verification and Word Spotting 2
Large Vocabulary Continuous Speech Recognition 6
Neural Networks, Fuzzy and Evolutionary Methods 3
Speech Processing for the Speech-Impaired and Hearing-Impaired 2
Prosody and Emotion 7
2: SST Student Day
SST Student Day - Poster Session 1
SST Student Day - Poster Session 2
Authors:
Qing Guo, Tsinghua University (China)
Fang Zheng, Tsinghua University (China)
Jian Wu, Tsinghua University (China)
Wenhu Wu, Tsinghua University (China)
Page (NA) Paper number 124
Abstract:
In this paper we present a novel method for incorporating temporal correlation
into an HMM-based speech recognition system. An obvious way to incorporate
temporal correlation is to condition the probability of the current
observation on the current state as well as on the previous observation
and the previous state. However, using this method directly leads to
unreliable parameter estimates, because the number of parameters to be
estimated grows excessively relative to the limited training data. In this
paper, we approximate the joint conditional probability distribution by a
non-linear estimation method. The HMM that incorporates temporal correlation
through this non-linear estimation method, which we call the FC HMM, does
not need any additional parameters and adds only a small amount of
computation. The experimental results show that the top-1 recognition rate
of the FC HMM is raised by 6 percent compared to the traditional HMM method.
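As a sketch of the dependency structure described above (the notation is assumed here, not taken from the paper), the standard HMM output density b_{q_t}(o_t) is replaced by one conditioned on the previous observation and the previous state:

\[
P(O, Q \mid \lambda) \;=\; \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, P(o_t \mid q_t, q_{t-1}, o_{t-1}).
\]

Estimating the joint conditional density P(o_t | q_t, q_{t-1}, o_{t-1}) directly is what inflates the parameter count; the FC HMM instead approximates it with a non-linear estimate, presumably built from the existing output densities, which would explain why no additional parameters are needed.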
Authors:
Shuri Kumagai, Department of Linguistic Science, University of Reading (U.K.)
Page (NA) Paper number 222
Abstract:
It is widely claimed that close vowels in Japanese are devoiced when
they occur between voiceless consonants. In this paper, voiceless
vowels are represented symbolically as [V-] and voiced vowels as [V+].
The patterns of linguopalatal contact during C[V-]C units and the corresponding
C[V+]C units are examined using the method of electropalatography (EPG).
Our results show that C[V-]C units and the corresponding C[V+]C units
often differ with respect to: (1) the amount (patterns) of tongue-palate
contact from C1 (the preceding consonant) to C2 (the following consonant)
and (2) the articulatory time interval from C1 to C2. Generally, the
amount of linguopalatal contact is significantly greater at the front
part of the palate in C[V-]C units compared to the corresponding C[V+]C
units. The articulatory time interval from C1 to C2 is generally shorter
in C[V-]C units compared to the corresponding C[V+]C units, though
this is not always the case for all consonantal types. However, the
articulatory gesture of the vowel appears to be present between voiceless
consonants regardless of whether the vowel is voiced or devoiced. Devoiced
vowels have often been examined from the aspect of the opening gesture
of the glottis since a turbulent noise during devoiced vowels is expected
to be made at the glottis. However, our study seems to suggest that
a turbulent noise can also be produced in the oral cavity - as well
as at the glottis - by increasing the degree of tongue-palate contact.
In principle, it is expected that the larger the tongue-palate contact
is, the greater the turbulent noise will become due to the increased
rate of airflow. This kind of linguopalatal contact appears to be an
active effort on the part of the speaker rather than simply a consequence
of the shorter articulatory time interval in C[V-]C units: both factors
seem to be related to the production of vowel devoicing, suggesting that
aerodynamic effects are involved.
Authors:
Xiao Yu, Shanghai Jiao Tong University (China)
Guangrui Hu, Shanghai Jiao Tong University (China)
Page (NA) Paper number 286
Abstract:
In this paper, the speech separation task is treated as a convolutive-mixture
Blind Source Separation (BSS) problem. The Maximum Entropy (ME) algorithm,
the Minimum Mutual Information (MMI) algorithm and the Maximum Likelihood
(ML) algorithm are the main approaches to solving the BSS problem, and the
relationship between these three algorithms is analyzed in this paper.
Based on a feedback network architecture, a new speech separation algorithm
is proposed that uses a Gaussian Mixture Model (GMM) for pdf estimation.
The computer simulation results show that the proposed algorithm achieves
a faster convergence rate and a lower output mean square error than the
conventional ME algorithm.
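For context, a minimal sketch of the update-rule family involved, written for the instantaneous-mixture case (the paper treats the convolutive case with a feedback network, and this is the standard natural-gradient form of the ME algorithm rather than necessarily the paper's exact rule):

\[
W \leftarrow W + \eta\left(I - \varphi(\mathbf{y})\,\mathbf{y}^{\top}\right) W,
\qquad
\varphi_i(y_i) = -\frac{d}{dy_i}\log \hat p_i(y_i),
\qquad
\hat p_i(y) = \sum_k w_{ik}\,\mathcal{N}\!\left(y;\mu_{ik},\sigma_{ik}^2\right),
\]

where \(\mathbf{y} = W\mathbf{x}\) are the separated outputs. A fixed nonlinearity such as \(\varphi(y) = \tanh(y)\) gives the conventional ME algorithm; fitting each \(\hat p_i\) as a GMM adapts the score function to the actual source densities, which is the role that GMM pdf estimation plays here.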
Authors:
Xiaoqiang Luo, Center for Language and Speech Processing, Dept. of ECE, Johns Hopkins University (USA)
Page (NA) Paper number 364
Abstract:
Gopalakrishnan et al. described a method called the "growth transform" for
optimizing rational functions over a domain, which has been found useful
for discriminative training of Hidden Markov Models (HMMs) in speech
recognition. A sum of rational functions is encountered when the
contributions from other HMM states are weighted in estimating the Gaussian
parameters of a state, and the weights are optimized using cross-validation.
We show that the growth transform of a sum of rational functions can be
obtained by computing term-wise gradients and term-wise function values,
as opposed to first forming a single rational function and then applying
the result of [Gopal91]. This is computationally advantageous when the
objective function consists of many rational terms and the dimensionality
of the domain is high. We also propose a gradient-directed search algorithm
to find an appropriate transform constant C.
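For reference, a sketch of the machinery involved (notation assumed; the domain is a set of probability vectors \(\lambda_i\)). The growth transform of [Gopal91] for a single rational function \(R(\lambda) = N(\lambda)/D(\lambda)\) updates

\[
\hat\lambda_{ij} \;=\;
\frac{\lambda_{ij}\left(\left.\frac{\partial P}{\partial \lambda_{ij}}\right|_{\lambda^{0}} + C\right)}
{\sum_{k} \lambda_{ik}\left(\left.\frac{\partial P}{\partial \lambda_{ik}}\right|_{\lambda^{0}} + C\right)},
\qquad
P(\lambda) \;=\; N(\lambda) - R(\lambda^{0})\,D(\lambda).
\]

For a sum \(F(\lambda) = \sum_m N_m(\lambda)/D_m(\lambda)\), the gradient needed above decomposes term-wise by the quotient rule,

\[
\nabla F(\lambda^{0}) \;=\; \sum_m \frac{\nabla N_m(\lambda^{0}) - R_m(\lambda^{0})\,\nabla D_m(\lambda^{0})}{D_m(\lambda^{0})},
\]

which is the sense in which term-wise gradients and term-wise function values suffice, without ever forming \(F\) as a single rational function with denominator \(\prod_m D_m\).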
Authors:
Mirjam Wester, A2RT, University of Nijmegen (The Netherlands)
Judith M. Kessens, A2RT, University of Nijmegen (The Netherlands)
Helmer Strik, A2RT, University of Nijmegen (The Netherlands)
Page (NA) Paper number 373
Abstract:
This paper describes two automatic approaches used to study connected
speech processes (CSPs) in Dutch. The first approach was from a linguistic
point of view - the top-down method. This method can be used for verification
of hypotheses about CSPs. The second approach - the bottom-up method
- uses a constrained phone recognizer to generate phone transcriptions.
An alignment was carried out between the two transcriptions and a reference
transcription. A comparison between the two methods showed that 68%
agreement was achieved on the CSPs. Although phone accuracy is only
63%, the bottom-up approach is useful for studying CSPs. From the data
generated using the bottom-up method, indications of which CSPs are
present in the material can be found. These indications can be used
to generate hypotheses which can then be tested using the top-down
method.
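A minimal sketch of the alignment step mentioned above (hypothetical Python, not the authors' code; the phone symbols are illustrative): dynamic-programming alignment of a phone transcription against a reference, from which per-phone agreement can be read off.

def align(ref, hyp):
    """Edit-distance alignment; returns (distance, list of aligned pairs)."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # dp[i][j]: dist(ref[:i], hyp[:j])
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    pairs, i, j = [], n, m                        # traceback; '-' marks ins/del
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", hyp[j - 1])); j -= 1
    return dp[n][m], pairs[::-1]

dist, pairs = align("m a s t @ r".split(), "m s t @ r".split())
agreement = sum(r == h for r, h in pairs) / len(pairs)   # 5 of 6 positions agree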
Authors:
Johan W. Koolwaaij, A2RT, KUN (The Netherlands)
Johan de Veth, A2RT, KUN (The Netherlands)
Page (NA) Paper number 380
Abstract:
We investigate the use of broad phonetic class (BPC) models in a
text-independent speaker recognition task. These models can be used to
reduce the variability due to the intrinsic differences between the
phonetic classes in the speech material used for training the speaker
models. Combining BPC recognition with text-independent speaker recognition
moves a step in the direction of text-dependent speaker recognition, a
task which is known to reach better performance. The performance of BPC
modelling is compared to our baseline system, which uses ergodic 5-state
HMMs. The question of which BPC contains the most speaker-specific
information is addressed. We also investigate whether and how the BPC
alignment is correlated with the state alignment from the baseline system,
to check the assumption that the states of an ergodic HMM can model broad
phonetic classes.
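As an illustration of the combination described above, a hypothetical sketch under assumed interfaces (not the paper's implementation): each frame is first labelled with a BPC and then scored only against that class's speaker and background models, so inter-class variability does not enter the score.

def bpc_score(frames, bpc_labels, speaker_models, background_models):
    """Average per-frame log-likelihood ratio, pooled over BPCs.

    frames: list of feature vectors; bpc_labels: per-frame BPC labels;
    *_models: dicts mapping a BPC label to a model with a logpdf() method.
    """
    llr = 0.0
    for x, c in zip(frames, bpc_labels):
        llr += speaker_models[c].logpdf(x) - background_models[c].logpdf(x)
    return llr / len(frames)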
Authors:
Jorge Miquélez, Public University of Navarra (Spain)
Rocío Sesma, Public University of Navarra (Spain)
Yolanda Blanco, Public University of Navarra (Spain)
Page (NA) Paper number 592
Abstract:
This paper summarizes an analysis of esophageal speech and the development
of a method for improving its intelligibility through speech synthesis.
Esophageal speech is characterized by a low average fundamental frequency,
while its formant patterns are found to be similar to those of normal
speakers. The treatment is different for voiced and unvoiced frames of the
signal: the unvoiced frames are kept as in the original speech, while the
voiced frames are re-synthesized using linear prediction. Various vocal
source models have been tested, and the results were best with a polynomial
model. The fundamental frequency is raised to normal values while the
intonation is preserved.
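A minimal sketch of the voiced-frame processing described above (hypothetical Python; the polynomial pulse shape, filter order and target f0 are illustrative assumptions, not the authors' exact model): LPC analysis of a voiced frame, followed by re-synthesis from a polynomial glottal pulse train at a raised fundamental frequency.

import numpy as np
from scipy.signal import lfilter

def lpc(frame, order=16):
    """LPC polynomial [1, -a1, ..., -ap] via the autocorrelation method."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate(([1.0], -a))

def resynth_voiced(frame, sr, f0=120.0, order=16):
    """Re-synthesize one voiced frame: keep the vocal tract (LPC filter),
    replace the source with a polynomial (Rosenberg-like) pulse train at f0."""
    a = lpc(frame, order)
    period = int(sr / f0)
    t = np.linspace(0.0, 1.0, period, endpoint=False)
    # polynomial opening phase, abrupt closure at 60% of the cycle
    pulse = np.where(t < 0.6, 3 * (t / 0.6) ** 2 - 2 * (t / 0.6) ** 3, 0.0)
    source = np.tile(np.diff(pulse, prepend=0.0), len(frame) // period + 1)[: len(frame)]
    y = lfilter([1.0], a, source)                             # all-pole synthesis
    y *= np.sqrt(np.sum(frame**2) / (np.sum(y**2) + 1e-12))   # match frame energy
    return y

Unvoiced frames would simply be passed through unchanged, as the abstract states.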
0592_01.PDF
(was: S0592_01.BMP)
Linear model of speech production.
File type: Image File
Format: OTHER
Tech. description: 524x204, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_02.PDF
(was: S0592_02.BMP)
Fundamental frequencies for an esophageal speaker and a normal speaker saying the Spanish word "martes".
File type: Image File
Format: OTHER
Tech. description: 514x177, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_03.PDF
(was: S0592_03.BMP)
Central frequencies of the first three formants (F1, F2 and F3) for the Spanish vowels. The values are similar for both alaryngeal and laryngeal speakers.
File type: Image File
Format: OTHER
Tech. description: 524x394, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_04.PDF
(was: S0592_04.BMP)
Spectrograms of the word "martes": a) said by an esophageal speaker, and b) re-synthesized according to the proposed method.
File type: Image File
Format: OTHER
Tech. description: 703x284, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_01.WAV
(was: S0592_01.WAV)
The Spanish word "lunes" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_02.WAV
(was: S0592_02.WAV)
The Spanish word "martes" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_03.WAV
(was: S0592_03.WAV)
The Spanish word "miércoles" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_04.WAV
(was: S0592_04.WAV)
The Spanish word "jueves" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_05.WAV
(was: S0592_05.WAV)
The Spanish word "viernes" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_06.WAV
(was: S0592_06.WAV)
The Spanish word "sábado" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_07.WAV
(was: S0592_07.WAV)
The Spanish phrase "y domingo" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_08.WAV
(was: S0592_08.WAV)
The Spanish word "lunes" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_09.WAV
(was: S0592_09.WAV)
The Spanish word "martes" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_10.WAV
(was: S0592_10.WAV)
The Spanish word "miércoles" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_11.WAV
(was: S0592_11.WAV)
The Spanish word "jueves" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_12.WAV
(was: S0592_12.WAV)
The Spanish word "viernes" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_13.WAV
(was: S0592_13.WAV)
The Spanish word "sábado" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_14.WAV
(was: S0592_14.WAV)
The Spanish phrase "y domingo" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_15.WAV
(was: S0592_15.WAV)
The Spanish word "martes" said by an esophageal speaker.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_16.WAV
(was: S0592_16.WAV)
The Spanish word "martes" re-synthesized from an esophageal speaker recording.
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
Authors:
Fernando Lacunza, Universidad Publica de Navarra (Spain)
Yolanda Blanco, Universidad Publica de Navarra (Spain)
Page (NA) Paper number 596
Abstract:
This paper describes a high-quality Text-to-Speech system for Spanish,
based on the concatenation of diphones with the MBR-PSOLA algorithm.
Since it was designed as a substitute for the natural voice of handicapped
people, it must offer easy-to-hear speech with emotional and emphatic
information embedded in it. This is obtained with the prosody generator,
which uses a series of phonological patterns for phonic groups and
a grammatical database to vary three speech parameters: pitch, amplitude
and duration. The system accepts plain text, which can be complemented
with data about emotions and emphasis.
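A hypothetical sketch of the prosody-generator idea (the names, pattern values and per-diphone scaling are illustrative assumptions, not the authors' data): a phonological pattern for a phonic group scales the three parameters the paper mentions before concatenation.

from dataclasses import dataclass

@dataclass
class Diphone:
    name: str          # e.g. "m-a"
    pitch_hz: float
    amplitude: float
    duration_ms: float

# one illustrative pattern: a declarative phonic group with final lowering,
# given as (pitch, amplitude, duration) factors per position in the group
DECLARATIVE = [(1.10, 1.0, 1.0), (1.00, 1.0, 1.0), (0.85, 0.9, 1.3)]

def apply_pattern(group, pattern):
    """Scale pitch, amplitude and duration of each diphone in a phonic group."""
    return [Diphone(d.name, d.pitch_hz * fp, d.amplitude * fa, d.duration_ms * fd)
            for d, (fp, fa, fd) in zip(group, pattern)]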
Authors:
Corinna Ng, Department of Computer Science, RMIT (Australia)
Ross Wilkinson, CSIRO, Division of Mathematical and Information Science (Australia)
Justin Zobel, Department of Computer Science, RMIT (Australia)
Page (NA) Paper number 740
Abstract:
Collections of speech documents can be searched using speech retrieval,
where the documents are processed by a speech recogniser to give text
that can be searched by text retrieval techniques. We investigated
the use of a phoneme-based recogniser to obtain phoneme sequences.
We found that phoneme recognition is worse than word recognition,
because of the lack of context and the difficulty of phoneme boundary
detection. Comparing the transcriptions of two different phoneme-based
recognisers, we found that training on well-defined phoneme data, the
lack of a language model, and the lack of a context-dependent model all
affected recognition performance. Retrieval using trigrams performed
better than quadgrams, because the longer n-gram features contained
too many transcription errors. Comparing the phonetic transcriptions
from a word recogniser to those from a phoneme recogniser, we found
that using 61 phones modelled with an algorithmic approach was better
than using 40 phones modelled with a dictionary approach.
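A small sketch of the indexing step implied above (hypothetical Python): a recognised phoneme sequence is turned into overlapping n-gram terms so that ordinary text-retrieval machinery can index and match them. Trigrams (n=3) outperformed quadgrams in the experiments reported here.

def phone_ngrams(phones, n=3):
    """Overlapping phoneme n-grams, joined so each behaves like one term."""
    return ["_".join(phones[i : i + n]) for i in range(len(phones) - n + 1)]

doc_phones = "dh ax k ae t s ae t".split()
terms = phone_ngrams(doc_phones, n=3)
# ['dh_ax_k', 'ax_k_ae', 'k_ae_t', 'ae_t_s', 't_s_ae', 's_ae_t']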
Authors:
Johan Frid, Dept. of Linguistics and Phonetics, Lund University (Sweden)
Page (NA) Paper number 743
Abstract:
This study deals with listeners' ability to identify linguistic units
from linguistically incomplete stimuli and relates this to the potential
for vowel reduction in a word. Synthetic speech was used to produce
stimuli that were similar to real words, but where the vowel in the
pre-stress syllable was excluded. Listeners then performed a lexical
decision test, where they had to decide whether a stimulus sounded
like a word or not. The effects of the identity of the removed vowel
and of features of the consonants adjacent to the removed vowel were
then examined, as well as syllabic features. For type of vowel, lower
word rates were found for words with the vowels /a/ and /o/, whereas
words with nasals after the reduced vowel tended to result in higher
word rates. Furthermore, words that still conformed to the phonotactic
structure of Swedish after reduction got lower word rates than words
that violated it, possibly because the conforming words are more
amenable to resyllabification, which renders them phonotactically
legal nonsense words rather than real words.