SST Student Day - Poster Session 1


Non-Linear Probability Estimation Method Used in HMM for Modeling Frame Correlation

Authors:

Qing Guo, Tsinghua University (China)
Fang Zheng, Tsinghua University (China)
Jian Wu, Tsinghua University (China)
Wenhu Wu, Tsinghua University (China)

Page (NA) Paper number 124

Abstract:

In this paper we present a novel method to incorporate temporal correlation into an HMM-based speech recognition system. An obvious way to incorporate temporal correlation is to condition the probability of the current observation on the current state as well as on the previous observation and the previous state. Used directly, however, this approach leads to unreliable parameter estimates, because the number of parameters to be estimated grows far too large for the limited training data. In this paper, we approximate the joint conditional probability distribution with a non-linear estimation method. The resulting model, which we call the FC HMM, needs no additional parameters and adds only a small computational cost. Experimental results show that the top-1 recognition rate of the FC HMM is 6 percent higher than that of the traditional HMM.
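The parameter blow-up the abstract refers to is easy to make concrete for a discrete-observation HMM. The sketch below is illustrative only (the function name and the discrete-HMM setting are our own, not the paper's): conditioning each emission on the previous state and previous observation multiplies the free emission parameters by a factor of N·M, which quickly outstrips any realistic amount of training data.

```python
def emission_param_count(num_states, num_symbols, conditioned=False):
    """Free emission parameters of a discrete HMM.

    Standard HMM:  b_j(o_t)                      -> N * (M - 1)
    Conditioned:   b_j(o_t | s_{t-1}, o_{t-1})   -> N * N * M * (M - 1)

    Each conditional distribution over M symbols has M - 1 free values.
    """
    free = num_symbols - 1
    if conditioned:
        # one distribution per (current state, previous state, previous symbol)
        return num_states * num_states * num_symbols * free
    return num_states * free


# A modest model (10 states, 256 symbols) already jumps from
# thousands to millions of parameters when conditioned directly.
print(emission_param_count(10, 256))                    # standard
print(emission_param_count(10, 256, conditioned=True))  # conditioned
```

This is why the paper approximates the joint conditional distribution instead of estimating it directly.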

SL980124.PDF (From Author) SL980124.PDF (Rasterized)



Patterns Of Linguopalatal Contact During Japanese Vowel Devoicing

Authors:

Shuri Kumagai, Department of Linguistic Science, University of Reading (U.K.)

Page (NA) Paper number 222

Abstract:

It is widely claimed that close vowels in Japanese are devoiced when they occur between voiceless consonants. In this paper, voiceless vowels are represented symbolically as [V-] and voiced vowels as [V+]. The patterns of linguopalatal contact during C[V-]C units and the corresponding C[V+]C units are examined using electropalatography (EPG). Our results show that C[V-]C units and the corresponding C[V+]C units often differ with respect to: (1) the amount and pattern of tongue-palate contact from C1 (the preceding consonant) to C2 (the following consonant), and (2) the articulatory time interval from C1 to C2. Generally, the amount of linguopalatal contact is significantly greater at the front part of the palate in C[V-]C units than in the corresponding C[V+]C units. The articulatory time interval from C1 to C2 is generally shorter in C[V-]C units than in the corresponding C[V+]C units, though this is not the case for all consonantal types. However, the articulatory gesture of the vowel appears to persist between voiceless consonants regardless of whether the vowel is voiced or devoiced. Devoiced vowels have often been examined from the perspective of the opening gesture of the glottis, since the turbulent noise during devoiced vowels is expected to be generated at the glottis. However, our study suggests that turbulent noise can also be produced in the oral cavity - as well as at the glottis - by increasing the degree of tongue-palate contact. In principle, the greater the tongue-palate contact, the greater the turbulent noise, owing to the increased rate of airflow. This kind of linguopalatal contact appears to be an active effort by the speaker rather than simply a consequence of the shorter articulatory time interval in C[V-]C units: both factors seem to be related to the production of vowel devoicing, which suggests that aerodynamic effects are involved.

SL980222.PDF (From Author) SL980222.PDF (Rasterized)



Speech Separation Based on the GMM PDF Estimation

Authors:

Xiao Yu, Shanghai Jiao Tong University (China)
Guangrui Hu, Shanghai Jiao Tong University (China)

Page (NA) Paper number 286

Abstract:

In this paper, the speech separation task is treated as a convolutive-mixture Blind Source Separation (BSS) problem. The Maximum Entropy (ME), Minimum Mutual Information (MMI) and Maximum Likelihood (ML) algorithms are the main approaches to the BSS problem, and we analyze the relationship among the three. Based on a feedback network architecture, a new speech separation algorithm is proposed that uses Gaussian Mixture Model (GMM) pdf estimation. Computer simulations show that the proposed algorithm achieves a faster convergence rate and a lower output mean square error than the conventional ME algorithm.
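The abstract does not give the update equations, but the role of a GMM pdf estimate in ME/ML-style separation can be sketched: the mixture density supplies the score function phi(y) = -d/dy log p(y), the nonlinearity that drives the separation update. The function names below are our own illustration, not the paper's implementation.

```python
import math

def gmm_pdf(y, weights, means, stds):
    """Gaussian-mixture density estimate p(y) = sum_k w_k N(y; mu_k, sigma_k)."""
    total = 0.0
    for w, m, s in zip(weights, means, stds):
        total += w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return total

def gmm_score(y, weights, means, stds, eps=1e-12):
    """Score function phi(y) = -d/dy log p(y): the nonlinearity an
    ME/ML-style separation rule would apply to each network output
    when the source pdf is modelled by a GMM."""
    p = gmm_pdf(y, weights, means, stds)
    dp = 0.0
    for w, m, s in zip(weights, means, stds):
        g = w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        dp += g * (-(y - m) / (s * s))     # derivative of each component
    return -dp / (p + eps)
```

As a sanity check, with a single standard Gaussian component the score reduces to phi(y) = y, the classical linear case; a multi-component fit to speech yields the heavier-tailed nonlinearity that makes the separation effective.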

SL980286.PDF (From Author) SL980286.PDF (Rasterized)



Growth Transform of A Sum of Rational Functions and Its Application in Estimating HMM Parameters

Authors:

Xiaoqiang Luo, Center for Language and Speech Processing, Dept. of ECE, Johns Hopkins University (USA)

Page (NA) Paper number 364

Abstract:

Gopalakrishnan et al. described a method called the "growth transform" for optimizing rational functions over a domain, which has proved useful for the discriminative training of Hidden Markov Models (HMMs) in speech recognition. A sum of rational functions arises when the contributions from other HMM states are weighted in estimating the Gaussian parameters of a state and the weights are optimized using cross-validation. We show that the growth transform of a sum of rational functions can be obtained by computing term-wise gradients and term-wise function values, as opposed to first forming a single rational function and then applying the result in [Gopal91]. This is computationally advantageous when the objective function consists of many rational terms and the dimensionality of the domain is high. We also propose a gradient-directed search algorithm to find an appropriate transform constant C.
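The term-wise computation can be illustrated on a toy problem. This is a sketch under stated assumptions, not the paper's algorithm: gradients are taken by finite differences, the constant C is a fixed illustrative value rather than the gradient-directed search the abstract proposes, and the domain is a single probability simplex.

```python
def growth_transform_step(x, terms, C=5.0, h=1e-6):
    """One growth-transform update for S(x) = sum_r N_r(x)/D_r(x) over the
    probability simplex.  The gradient is accumulated term by term:
        grad P = sum_r (grad N_r - R_r(x) * grad D_r),  R_r = N_r/D_r at x,
    so no single combined rational function is ever formed.
    `terms` is a list of (N, D) callables on a point of the simplex;
    C must be large enough to keep all numerators positive."""
    def grad(f):
        g = []
        for i in range(len(x)):
            xp = list(x); xp[i] += h
            xm = list(x); xm[i] -= h
            g.append((f(xp) - f(xm)) / (2 * h))
        return g

    gP = [0.0] * len(x)
    for N, D in terms:
        R = N(x) / D(x)                      # term value at the current point
        gN, gD = grad(N), grad(D)
        for i in range(len(x)):
            gP[i] += gN[i] - R * gD[i]       # term-wise gradient contribution
    num = [xi * (gi + C) for xi, gi in zip(x, gP)]
    Z = sum(num)
    return [n / Z for n in num]              # renormalized back onto the simplex
```

On a small two-term example the update stays on the simplex and increases the objective, which is the defining property of the transform.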

SL980364.PDF (From Author) SL980364.PDF (Rasterized)



Two Automatic Approaches for Analyzing Connected Speech Processes in Dutch

Authors:

Mirjam Wester, A2RT, University of Nijmegen (The Netherlands)
Judith M. Kessens, A2RT, University of Nijmegen (The Netherlands)
Helmer Strik, A2RT, University of Nijmegen (The Netherlands)

Page (NA) Paper number 373

Abstract:

This paper describes two automatic approaches used to study connected speech processes (CSPs) in Dutch. The first approach, the top-down method, takes a linguistic point of view and can be used to verify hypotheses about CSPs. The second approach, the bottom-up method, uses a constrained phone recognizer to generate phone transcriptions. An alignment was carried out between the two sets of transcriptions and a reference transcription. A comparison between the two methods showed 68% agreement on the CSPs. Although the phone accuracy is only 63%, the bottom-up approach is useful for studying CSPs: the data it generates indicate which CSPs are present in the material, and these indications can be used to generate hypotheses which can then be tested with the top-down method.
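The alignment step both methods rely on can be sketched with a standard edit-distance alignment; deletions of the hypothesis relative to the reference are then candidate CSPs such as schwa deletion. This is a generic sketch (function name and phone labels are illustrative), not the paper's aligner.

```python
def align(ref, hyp):
    """Dynamic-programming (Levenshtein) alignment of a hypothesis phone
    string against a reference transcription; returns a list of
    ('match'|'sub'|'del'|'ins', ref_phone, hyp_phone) operations.
    Deletions relative to the reference are candidate connected-speech
    processes."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # trace back the cheapest path to recover the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            ops.append(('match' if ref[i - 1] == hyp[j - 1] else 'sub', ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(('del', ref[i - 1], None)); i -= 1
        else:
            ops.append(('ins', None, hyp[j - 1])); j -= 1
    return ops[::-1]
```

For example, aligning a recognized ['d', 'r'] against a reference ['d', 'ax', 'r'] reveals one schwa deletion.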

SL980373.PDF (From Author) SL980373.PDF (Rasterized)



The Use Of Broad Phonetic Class Models In Speaker Recognition

Authors:

Johan W. Koolwaaij, A2RT, KUN (The Netherlands)
Johan de Veth, A2RT, KUN (The Netherlands)

Page (NA) Paper number 380

Abstract:

We investigate the use of broad phonetic class (BPC) models in a text-independent speaker recognition task. These models can be used to reduce the variability due to intrinsic differences between phonetic classes in the speech material used to train the speaker models. Combining BPC recognition with text-independent speaker recognition moves partway toward text-dependent speaker recognition, a task known to achieve better performance. The performance of BPC modelling is compared to that of our baseline system, which uses ergodic 5-state HMMs. We address the question of which BPC contains the most speaker-specific information. We also investigate whether and how the BPC alignment is correlated with the state alignment from the baseline system, to check the assumption that the states of an ergodic HMM can model broad phonetic classes.

SL980380.PDF (From Author) SL980380.PDF (Rasterized)



Analysis and Treatment of Esophageal Speech for the Enhancement of its Comprehension

Authors:

Jorge Miquélez, Public University of Navarra (Spain)
Rocío Sesma, Public University of Navarra (Spain)
Yolanda Blanco, Public University of Navarra (Spain)

Page (NA) Paper number 592

Abstract:

This paper presents an analysis of esophageal speech and the development of a method for improving its intelligibility through speech synthesis. Esophageal speech is characterized by a low average fundamental frequency, while its formant patterns are similar to those of normal speakers. The treatment differs for voiced and unvoiced frames of the signal: the unvoiced frames are kept as in the original speech, while the voiced frames are re-synthesized using linear prediction. Several vocal-source models were tested, and the best results were obtained with a polynomial model. The fundamental frequency is raised to normal values while the original intonation is preserved.
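The voiced/unvoiced split the treatment depends on can be sketched with a classic short-time energy and zero-crossing-rate decision. The abstract does not say which detector the authors used, so this is a generic stand-in with illustrative thresholds, not their implementation.

```python
import math

def frame_features(frame):
    """Short-time energy and zero-crossing rate of one analysis frame."""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)
    return energy, zcr

def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Crude voiced/unvoiced decision: voiced frames carry high energy and
    few zero crossings.  Voiced frames would then be re-synthesized by
    linear prediction with a raised F0; unvoiced frames pass through
    unchanged, as in the paper's treatment."""
    energy, zcr = frame_features(frame)
    return energy > energy_thresh and zcr < zcr_thresh


# A 100 Hz sinusoid (voiced-like) vs. a rapidly alternating signal
# (fricative-like) at an 8 kHz sampling rate, 30 ms frames:
voiced = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(240)]
fricative_like = [0.5 * (-1) ** t for t in range(240)]
```

A real system would of course run this on overlapping windows of the recorded signal, but the frame-level decision is the part the abstract's pipeline hinges on.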

SL980592.PDF (From Author) SL980592.PDF (Rasterized)

0592_01.PDF
(was: S0592_01.BMP)
Linear model of speech production.
File type: Image File
Format: OTHER
Tech. description: 524x204, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_02.PDF
(was: S0592_02.BMP)
Fundamental frequencies for an esophageal speaker and a normal speaker saying the Spanish word "martes".
File type: Image File
Format: OTHER
Tech. description: 514x177, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_03.PDF
(was: S0592_03.BMP)
Central frequencies of the first three formants (F1, F2 and F3) for the Spanish vowels. The values are similar for both alaryngeal and laryngeal speakers.
File type: Image File
Format: OTHER
Tech. description: 524x394, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_04.PDF
(was: S0592_04.BMP)
Spectrograms of the word "martes": a) said by an esophageal speaker, and b) re-synthesized according to the proposed method.
File type: Image File
Format: OTHER
Tech. description: 703x284, 24 bits per pixel
Creating Application: MS Paint
Creating OS: Windows 95
0592_01.WAV
(was: S0592_01.WAV)
The Spanish word "lunes" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_02.WAV
(was: S0592_02.WAV)
The Spanish word "martes" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_03.WAV
(was: S0592_03.WAV)
The Spanish word "miércoles" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_04.WAV
(was: S0592_04.WAV)
The Spanish word "jueves" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_05.WAV
(was: S0592_05.WAV)
The Spanish word "viernes" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_06.WAV
(was: S0592_06.WAV)
The Spanish word "sábado" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_07.WAV
(was: S0592_07.WAV)
The Spanish phrase "y domingo" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_08.WAV
(was: S0592_08.WAV)
The Spanish word "lunes" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_09.WAV
(was: S0592_09.WAV)
The Spanish word "martes" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_10.WAV
(was: S0592_10.WAV)
The Spanish word "miércoles" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_11.WAV
(was: S0592_11.WAV)
The Spanish word "jueves" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_12.WAV
(was: S0592_12.WAV)
The Spanish word "viernes" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_13.WAV
(was: S0592_13.WAV)
The Spanish word "sábado" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_14.WAV
(was: S0592_14.WAV)
The Spanish phrase "y domingo" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_15.WAV
(was: S0592_15.WAV)
The Spanish word "martes" said by an esophageal speaker
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95
0592_16.WAV
(was: S0592_16.WAV)
The Spanish word "martes" re-synthesized from an esophageal speaker recording
File type: Sound File
Format: Sound File: WAV
Tech. description: Sampling rate: 22050, Bits-per-sample: 16, Mono
Creating Application: Soundo'LE by Creative
Creating OS: Windows 95



High Quality Text-to-Speech System in Spanish for Handicapped People

Authors:

Fernando Lacunza, Universidad Publica de Navarra (Spain)
Yolanda Blanco, Universidad Publica de Navarra (Spain)

Page (NA) Paper number 596

Abstract:

This paper describes a high-quality Text-to-Speech system for Spanish, based on the concatenation of diphonemes with the MBR-PSOLA algorithm. Since it was designed as a substitute for natural voice for handicapped people, it must produce speech that is easy to listen to, with emotional and emphatic information embedded in it. This is achieved by the prosody generator, which uses a set of phonological patterns for phonic groups and a grammatical database to vary three speech parameters: pitch, amplitude and duration. The system accepts plain text, which can be complemented with data about emotion and emphasis.

SL980596.PDF (From Author) SL980596.PDF (Rasterized)



Factors Affecting Speech Retrieval

Authors:

Corinna Ng, Department of Computer Science, RMIT (Australia)
Ross Wilkinson, CSIRO, Division of Mathematical and Information Science (Australia)
Justin Zobel, Department of Computer Science, RMIT (Australia)

Page (NA) Paper number 740

Abstract:

Collections of speech documents can be searched using speech retrieval, where the documents are processed by a speech recogniser to give text that can be searched by text retrieval techniques. We investigated the use of a phoneme-based recogniser to obtain phoneme sequences. We found that phoneme recognition is worse than word recognition, because of the lack of context and the difficulty of phoneme boundary detection. Comparing the transcriptions of two different phoneme-based recognisers, we found that training on well-defined phoneme data, the lack of a language model, and the lack of a context-dependent model all affected recognition performance. Retrieval using trigrams performed better than quadgrams because the longer n-gram features contained too many transcription errors. Comparing the phonetic transcriptions from a word recogniser with those from a phoneme recogniser, we found that using 61 phones modelled with an algorithmic approach was better than using 40 phones modelled with a dictionary approach.
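The trigram-versus-quadgram trade-off has a simple mechanical explanation that a short sketch makes visible. This is a generic illustration (function names and ARPAbet-style phone labels are ours): a single phone substitution corrupts n of the overlapping n-grams that cover it, so longer n-grams lose more matchable features per transcription error.

```python
def phone_ngrams(phones, n=3):
    """All overlapping n-grams of a phone sequence (trigrams by default)."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def ngram_score(query, doc, n=3):
    """Count query n-grams that occur in the document transcription.
    Shorter n-grams are more robust to transcription errors, since one
    wrong phone corrupts at most n of them."""
    doc_set = set(phone_ngrams(doc, n))
    return sum(1 for g in phone_ngrams(query, n) if g in doc_set)


query = ['s', 'p', 'iy', 'ch']
doc = ['dh', 'ax', 's', 'p', 'iy', 'ch', 's']   # "...speech..." correctly recognized
```

Scoring the query against a correct transcription matches both of its trigrams; against a transcription with one substituted phone in the middle of the word, every trigram spanning the error is lost, and quadgrams fare even worse.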

SL980740.PDF (From Author) SL980740.PDF (Rasterized)



Perception Of Words With Vowel Reduction

Authors:

Johan Frid, Dept. of linguistics and phonetics, Lund University (Sweden)

Page (NA) Paper number 743

Abstract:

This study deals with listeners' ability to identify linguistic units from linguistically incomplete stimuli and relates this ability to the potential for vowel reduction in a word. Synthetic speech was used to produce stimuli that were similar to real words but in which the vowel of the pre-stress syllable was excluded. Listeners then performed a lexical decision test in which they had to decide whether a stimulus sounded like a word or not. The effects of the identity of the removed vowel, of features of the consonants adjacent to the removed vowel, and of syllabic features were then examined. With respect to vowel type, lower word rates were found for words with the vowels /a/ and /o/, whereas words with nasals after the reduced vowel tended to yield higher word rates. Furthermore, words that still conformed to the phonotactic structure of Swedish after reduction received lower word rates than words that violated it, possibly because the conforming words are more amenable to resyllabification, which renders them phonotactically legal nonsense words rather than real words.

SL980743.PDF (From Author) SL980743.PDF (Rasterized)
