Text-To-Speech Synthesis 6

How To Handle "Foreign" Sounds in Swedish Text-to-Speech Conversion: Approaching the 'Xenophone' Problem

Authors:

Robert Eklund, Telia Research AB (Sweden)
Anders Lindström, Telia Research AB (Sweden)

Paper number 514

Abstract:

This paper discusses the problem of handling 'foreign' speech sounds in Swedish speech technology systems, in particular speech synthesis. A production study shows that Swedish speakers add foreign speech sounds, here termed 'xenophones', to their phone repertoires when reading Swedish sentences with embedded English names and words. Based on these observations, the phone set of a Swedish concatenative synthesizer is extended, and it is shown by example that this produces more natural-sounding synthetic speech.
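The remedy the abstract describes, extending a synthesizer's phone set with xenophones, implies a front-end lookup of roughly this shape. A minimal sketch: the phone symbols, fallback pairs, and inventories below are invented for illustration, not the authors' actual Swedish set.

```python
# Hypothetical xenophone fallback table: if the voice inventory was
# extended with a xenophone, keep it; otherwise substitute the nearest
# native Swedish phone. All symbols here are illustrative only.

XENOPHONE_FALLBACK = {
    "w":  "v",    # English /w/ -> Swedish /v/
    "z":  "s",    # English /z/ -> Swedish /s/
    "dZ": "j",    # English affricate (as in "John") -> Swedish /j/
    "T":  "t",    # English dental fricative -> Swedish /t/
}

def map_phones(phones, inventory):
    """Replace any phone missing from the voice inventory."""
    out = []
    for p in phones:
        if p in inventory:
            out.append(p)                              # xenophone available
        else:
            out.append(XENOPHONE_FALLBACK.get(p, p))   # nearest native phone
    return out

# A baseline Swedish inventory vs. one extended with xenophones.
basic    = {"s", "t", "v", "j", "o", "n"}
extended = basic | {"w", "z", "dZ", "T"}

print(map_phones(["dZ", "o", "n"], basic))     # ['j', 'o', 'n']
print(map_phones(["dZ", "o", "n"], extended))  # ['dZ', 'o', 'n']
```

With the extended inventory the English name is rendered with its xenophones; with the baseline inventory it falls back to the nativized pronunciation.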

SL980514.PDF (From Author) SL980514.PDF (Rasterized)

0514_01.WAV
(was: 0514_01.WAV)
Speech synthesis example.
File type: Sound File
Format: OTHER
Tech. description: Sampling rate: 16 kHz, bits-per-sample: 16, mono, encoding: linear PCM
Creating Application: Unknown
Creating OS: Unix
0514_02.WAV
(was: 0514_02.WAV)
Speech synthesis example.
File type: Sound File
Format: OTHER
Tech. description: Sampling rate: 16 kHz, bits-per-sample: 16, mono, encoding: linear PCM
Creating Application: Unknown
Creating OS: Unix


Multi-lingual Concatenative Speech Synthesis

Authors:

Nick Campbell, ATR-ITL (Japan)

Paper number 24

Abstract:

This paper describes a method of concatenative speech synthesis that makes use of 3-dimensional labelling of speech, and shows how this can be applied to the synthesis of both mono-lingual and foreign-language speech. The dimensions encode phonetic, prosodic, and voice-quality information in order to fully describe the acoustic characteristics of each speech segment.
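The 3-dimensional labelling the abstract describes can be pictured as unit records indexed on all three dimensions. A minimal sketch with invented field names and label values; the actual CHATR labels and selection costs are far more elaborate:

```python
# Each stored segment carries phonetic, prosodic, and voice-quality
# labels, and synthesis selects units matching a target on all three.
# Labels and values here are illustrative, not CHATR's actual scheme.

from dataclasses import dataclass

@dataclass
class Unit:
    phone: str      # phonetic label
    prosody: str    # e.g. "accented" / "unaccented"
    quality: str    # e.g. "modal" / "breathy"

DATABASE = [
    Unit("a", "accented", "modal"),
    Unit("a", "unaccented", "modal"),
    Unit("a", "accented", "breathy"),
]

def candidates(phone, prosody, quality):
    """Return database units matching all three label dimensions."""
    return [u for u in DATABASE
            if (u.phone, u.prosody, u.quality) == (phone, prosody, quality)]

print(len(candidates("a", "accented", "modal")))  # 1
```

Cross-language synthesis then amounts to mapping the target language's predicted phone labels onto the phone set used to label the source speaker's database, as the sound-file captions below describe.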

SL980024.PDF (From Author) SL980024.PDF (Rasterized)

0024_01.WAV
(was: 0024_01.wav)
Since CHATR produces speech in the recognisable voice of a known person, it offers the potential to extend that person's apparent abilities into the realm of multi-linguality. By offering this ability to the voice of a young child, we are perhaps meeting Furui's expectations [3]. [SOUND 0024.01.WAV][SOUND 0024.02.WAV]
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_02.WAV
(was: 0024_02.wav)
Speech synthesis example in a young child's voice (second of two); see the 0024_01.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_03.WAV
(was: 0024_03.wav)
By mapping from the phone sequence predicted for synthesis in one language to the phone-set used to label the speech of another, we can produce foreign-language speech using the voice of any speaker. In these examples we use the voice of a small Japanese child to speak in English ([SOUND 0024.03.WAV][SOUND 0024.04.WAV] greeting) and Korean ([SOUND 0024.05.WAV] [SOUND 0024.06.WAV] explaining the technical processing within CHATR).
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_04.WAV
(was: 0024_04.wav)
English greeting example (second of two); see the 0024_03.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_05.WAV
(was: 0024_05.wav)
Korean example (explaining the technical processing within CHATR); see the 0024_03.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_06.WAV
(was: 0024_06.wav)
Korean example (second of two); see the 0024_03.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_07.PDF
(was: 0024_01.GIF)
Section 5.2: To reduce the 'accent', we adopt the following two-stage process ([IMAGE 0024_01.GIF] schematic).
File type: Image File
Format: GIF
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown


On The Use Of F0 Features In Automatic Segmentation For Speech Synthesis

Authors:

Takashi Saito, Tokyo Research Laboratory, IBM Japan Ltd. (Japan)

Paper number 1044

Abstract:

This paper focuses on a method for automatically dividing speech utterances into phonemic segments, which are used to construct synthesis unit inventories for speech synthesis. We propose a new segmentation parameter called "F0 dynamics" (DF0). In the fine structure of F0 contours, phonemic events appear as local dips at phonemic transition regions, especially around voiced consonants. We apply this observation to speech segmentation: the DF0 parameter is used in the final stage of the procedure to refine the phonemic boundaries roughly obtained by DP alignment. We conduct experiments with a speech database prepared for unit inventory construction, and compare the obtained boundaries with those of manual segmentation to show the effectiveness of the proposed method. We also discuss the effects of the boundary refinement on the synthesized speech.
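The refinement step described above, snapping a DP-aligned boundary to a nearby local F0 dip, can be sketched as follows. The search window and the contour values are illustrative, not taken from the paper:

```python
# F0 contours show local dips at phonemic transitions (especially
# around voiced consonants), so a rough boundary from DP alignment
# can be refined by moving it to the deepest nearby local minimum.

def refine_boundary(f0, rough_idx, window=3):
    """Move a rough boundary index to the deepest local F0 dip nearby."""
    lo = max(1, rough_idx - window)
    hi = min(len(f0) - 1, rough_idx + window + 1)
    best = rough_idx
    for i in range(lo, hi):
        # A dip is lower than both neighbours and any candidate so far.
        if f0[i] < f0[i - 1] and f0[i] < f0[i + 1] and f0[i] < f0[best]:
            best = i
    return best

contour = [120, 118, 112, 105, 111, 119, 121, 120]  # Hz per frame
print(refine_boundary(contour, 5))  # 3: the dip at 105 Hz
```

If no dip exists in the window, the DP-aligned boundary is kept unchanged.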

SL981044.PDF (From Author) SL981044.PDF (Rasterized)


A Linguistic and Prosodic Database for Data-Driven Japanese TTS Synthesis

Authors:

Atsuhiro Sakurai, Dep. of Information and Communication Engineering, The Univ. of Tokyo and Tsukuba R&D Center, Texas Instruments (Japan)
Takashi Natsume, Dep. of Information and Communication Engineering, The Univ. of Tokyo (Japan)
Keikichi Hirose, Dep. of Information and Communication Engineering, The Univ. of Tokyo (Japan)

Paper number 735

Abstract:

We propose a method to generate a database that contains a parametric representation of F0 contours associated with linguistic and acoustic information, to be used by data-driven Japanese text-to-speech (TTS) systems. The configuration of the database includes recorded speech, F0 contours and their parametric labels, phonetic transcription with durations, and other linguistic information such as orthographic transcription, part-of-speech (POS) tags, and accent types. All information that is not available by dictionary lookup is obtained automatically. In this paper, we propose a method to automatically obtain parametric labels that describe F0 contours based on a superpositional model. Preliminary tests on a small data set show that the method can find the parametric representation of F0 contours with acceptable accuracy, and that accuracy can be improved by introducing additional linguistic information.
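A superpositional F0 model of the general kind referenced above expresses log F0 as a baseline plus overlapping accent components, so the "parametric label" for each accent reduces to a few command parameters. A simplified, Fujisaki-style sketch; the smoothing constant and command shape are illustrative, not the authors' exact parametrization:

```python
# Simplified superpositional F0 model: log F0 = log(baseline) plus a
# sum of smoothed rectangular accent commands, each parametrized by
# (onset, offset, amplitude). Constants below are made up.

import math

def _step_response(t, beta=20.0):
    """Smoothed response to a command edge at t = 0."""
    if t < 0.0:
        return 0.0
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t)

def f0(t, base_hz, accents):
    """accents: list of (onset, offset, amplitude) parameter triples."""
    log_f0 = math.log(base_hz)
    for onset, offset, amp in accents:
        # Rectangular accent command = rising edge minus falling edge.
        log_f0 += amp * (_step_response(t - onset) - _step_response(t - offset))
    return math.exp(log_f0)

print(round(f0(0.0, 120.0, [(0.1, 0.3, 0.5)])))  # 120: before the accent
```

Labelling an F0 contour then means estimating the triples that best reproduce it, which is the inverse problem the abstract's automatic method addresses.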

SL980735.PDF (From Author) SL980735.PDF (Rasterized)


Text-to-Speech Voice Adaptation from Sparse Training Data

Authors:

Alexander Kain, Oregon Graduate Institute of Science and Technology (USA)
Michael W. Macon, Oregon Graduate Institute of Science and Technology (USA)

Paper number 902

Abstract:

Voice adaptation is the process of converting the output voice of a text-to-speech synthesizer to sound like a different voice, after a training process in which only a small amount of the desired target speaker's speech is seen. We employ a locally linear conversion function based on Gaussian mixture models to map bark-scaled line spectral frequencies. We compare the performance of three different estimation methods while varying the number of mixture components and the amount of training data. An objective evaluation revealed that all three methods yield similar test results. In perceptual tests, listeners judged the converted speech quality as acceptable and fairly successful in adapting to the target speaker.
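A locally linear GMM conversion function of the kind the abstract names has the general form F(x) = Σᵢ pᵢ(x)·(biasᵢ + slopeᵢ·(x − meanᵢ)), a posterior-weighted sum of per-component linear maps. A one-dimensional sketch with made-up parameters; the real features are vectors of bark-scaled line spectral frequencies:

```python
# One-dimensional sketch of GMM-based locally linear conversion: the
# converted feature is a posterior-weighted mixture of linear maps.
# All parameter values below are invented for illustration.

import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def convert(x, components):
    """components: (weight, mean, var, bias, slope) per mixture component."""
    likes = [w * gaussian(x, m, v) for (w, m, v, _, _) in components]
    total = sum(likes)
    posteriors = [l / total for l in likes]
    return sum(p * (b + s * (x - m))
               for p, (_, m, _, b, s) in zip(posteriors, components))

# With a single component the posterior is 1 and the map is purely linear.
print(convert(1.0, [(1.0, 0.0, 1.0, 5.0, 2.0)]))  # 7.0
```

Training estimates the bias and slope terms from aligned source/target frames; the paper's comparison concerns three ways of doing that estimation.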

SL980902.PDF (From Author) SL980902.PDF (Rasterized)


Describing Intonation with a Parametric Model

Authors:

Gregor Möhler, University of Stuttgart (Germany)

Paper number 205

Abstract:

In this study a data-based approach to intonation modeling is presented. The model incorporates knowledge from intonation theories, such as the expected types of F0 movements and syllable anchoring, which is integrated into the model by an appropriate approximation function for F0 parametrization. The resulting F0 parameters are predicted from a set of features using neural nets. The quality of the generated contours is assessed by means of numerical measures and perception tests, which show that the basic hypotheses about intonation description and modeling are in principle correct and have the potential to be successfully applied to speech synthesis. We argue for a clear interface with the linguistic description (using pitch-accent and boundary labels as input) and discourse structure (using pitch-range-normalized F0 parameters).
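The parametrize-then-predict pipeline works in two steps: each accent's F0 stretch is reduced to the coefficients of an approximation function, and those coefficients become the regression targets a neural net learns to predict from linguistic features. A sketch using a plain quadratic fit as a stand-in for the paper's approximation function:

```python
# Reduce one accent's F0 samples to three coefficients via a
# least-squares quadratic fit, f0(t) = a*t^2 + b*t + c. The quadratic
# is a stand-in; the paper uses its own approximation function.

def fit_quadratic(ts, f0s):
    """Solve the 3x3 normal equations directly (no external libraries)."""
    s = [sum(t ** k for t in ts) for k in range(5)]                 # sums of t^k
    r = [sum(f * t ** k for t, f in zip(ts, f0s)) for k in range(3)]
    # Augmented matrix; unknown order is (c, b, a).
    m = [[s[0], s[1], s[2], r[0]],
         [s[1], s[2], s[3], r[1]],
         [s[2], s[3], s[4], r[2]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                f = m[row][col] / m[col][col]
                m[row] = [a - f * b for a, b in zip(m[row], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c

a, b, c = fit_quadratic([0, 1, 2, 3], [1, 2, 5, 10])  # data follows t**2 + 1
# a, b, c come back as approximately 1, 0, 1
```

At synthesis time the direction is reversed: the predicted coefficients regenerate an F0 movement anchored to its syllable.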

SL980205.PDF (From Author) SL980205.PDF (Rasterized)
