Text-To-Speech Synthesis 6

How To Handle "Foreign" Sounds in Swedish Text-to-Speech Conversion: Approaching the 'Xenophone' Problem

Authors:

Robert Eklund, Telia Research AB (Sweden)
Anders Lindström, Telia Research AB (Sweden)

Paper number 514

Abstract:

This paper discusses the problem of handling 'foreign' speech sounds in Swedish speech technology systems, in particular speech synthesis. A production study shows that Swedish speakers add foreign speech sounds, here termed 'xenophones', to their phone repertoires when reading Swedish sentences with embedded English names and words. Based on these observations, the phone set of a Swedish concatenative synthesizer is extended, and it is shown by example that this produces more natural-sounding synthetic speech.
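The remedy the abstract describes, extending a synthesizer's phone set with xenophones, implies a front-end lookup of roughly this shape. A minimal sketch: the phone symbols, fallback pairs, and inventories below are invented for illustration, not the authors' actual Swedish set.

```python
# Hypothetical xenophone fallback table: if the voice inventory was
# extended with a xenophone, keep it; otherwise substitute the nearest
# native Swedish phone. All symbols here are illustrative only.

XENOPHONE_FALLBACK = {
    "w":  "v",    # English /w/ -> Swedish /v/
    "z":  "s",    # English /z/ -> Swedish /s/
    "dZ": "j",    # English affricate (as in "John") -> Swedish /j/
    "T":  "t",    # English dental fricative -> Swedish /t/
}

def map_phones(phones, inventory):
    """Replace any phone missing from the voice inventory."""
    out = []
    for p in phones:
        if p in inventory:
            out.append(p)                              # xenophone available
        else:
            out.append(XENOPHONE_FALLBACK.get(p, p))   # nearest native phone
    return out

# A baseline Swedish inventory vs. one extended with xenophones.
basic    = {"s", "t", "v", "j", "o", "n"}
extended = basic | {"w", "z", "dZ", "T"}

print(map_phones(["dZ", "o", "n"], basic))     # ['j', 'o', 'n']
print(map_phones(["dZ", "o", "n"], extended))  # ['dZ', 'o', 'n']
```

With the extended inventory the English name is rendered with its xenophones; with the baseline inventory it falls back to the nativized pronunciation.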

SL980514.PDF (From Author) SL980514.PDF (Rasterized)

0514_01.WAV
(was: 0514_01.WAV)
Speech synthesis example.
File type: Sound File
Format: OTHER
Tech. description: Sampling rate: 16 kHz, bits-per-sample: 16, mono, encoding: linear PCM
Creating Application: Unknown
Creating OS: Unix
0514_02.WAV
(was: 0514_02.WAV)
Speech synthesis example.
File type: Sound File
Format: OTHER
Tech. description: Sampling rate: 16 kHz, bits-per-sample: 16, mono, encoding: linear PCM
Creating Application: Unknown
Creating OS: Unix


Multi-lingual Concatenative Speech Synthesis

Authors:

Nick Campbell, ATR-ITL (Japan)

Paper number 24

Abstract:

This paper describes a method of concatenative speech synthesis that makes use of 3-dimensional labelling of speech, and shows how this can be applied to the synthesis of both mono-lingual and foreign-language speech. The dimensions encode phonetic, prosodic, and voice-quality information in order to fully describe the acoustic characteristics of each speech segment.
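The 3-dimensional labelling the abstract describes can be pictured as unit records indexed on all three dimensions. A minimal sketch with invented field names and label values; the actual CHATR labels and selection costs are far more elaborate:

```python
# Each stored segment carries phonetic, prosodic, and voice-quality
# labels, and synthesis selects units matching a target on all three.
# Labels and values here are illustrative, not CHATR's actual scheme.

from dataclasses import dataclass

@dataclass
class Unit:
    phone: str      # phonetic label
    prosody: str    # e.g. "accented" / "unaccented"
    quality: str    # e.g. "modal" / "breathy"

DATABASE = [
    Unit("a", "accented", "modal"),
    Unit("a", "unaccented", "modal"),
    Unit("a", "accented", "breathy"),
]

def candidates(phone, prosody, quality):
    """Return database units matching all three label dimensions."""
    return [u for u in DATABASE
            if (u.phone, u.prosody, u.quality) == (phone, prosody, quality)]

print(len(candidates("a", "accented", "modal")))  # 1
```

Cross-language synthesis then amounts to mapping the target language's predicted phone labels onto the phone set used to label the source speaker's database, as the sound-file captions below describe.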

SL980024.PDF (From Author) SL980024.PDF (Rasterized)

0024_01.WAV
(was: 0024_01.wav)
Since CHATR produces speech in the recognisable voice of a known person, it offers the potential to extend that person's apparent abilities into the realm of multi-linguality. By offering this ability to the voice of a young child, we are perhaps meeting Furui's expectations [3]. [SOUND 0024.01.WAV][SOUND 0024.02.WAV]
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_02.WAV
(was: 0024_02.wav)
Speech synthesis example in a young child's voice (second of two); see the 0024_01.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_03.WAV
(was: 0024_03.wav)
By mapping from the phone sequence predicted for synthesis in one language to the phone-set used to label the speech of another, we can produce foreign-language speech using the voice of any speaker. In these examples we use the voice of a small Japanese child to speak in English ([SOUND 0024.03.WAV][SOUND 0024.04.WAV] greeting) and Korean ([SOUND 0024.05.WAV] [SOUND 0024.06.WAV] explaining the technical processing within CHATR).
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_04.WAV
(was: 0024_04.wav)
English greeting example (second of two); see the 0024_03.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_05.WAV
(was: 0024_05.wav)
Korean example (explaining the technical processing within CHATR); see the 0024_03.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_06.WAV
(was: 0024_06.wav)
Korean example (second of two); see the 0024_03.WAV description.
File type: Sound File
Format: WAV
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown
0024_07.PDF
(was: 0024_01.GIF)
Section 5.2: To reduce the 'accent', we adopt the following two-stage process ([IMAGE 0024_01.GIF] schematic).
File type: Image File
Format: GIF
Tech. description: Unknown
Creating Application: Unknown
Creating OS: Unknown


On The Use Of F0 Features In Automatic Segmentation For Speech Synthesis

Authors:

Takashi Saito, Tokyo Research Laboratory, IBM Japan Ltd. (Japan)

Paper number 1044

Abstract:

This paper focuses on a method for automatically dividing speech utterances into phonemic segments, which are used to construct synthesis unit inventories for speech synthesis. We propose a new segmentation parameter called "F0 dynamics" (DF0). In the fine structure of F0 contours, phonemic events appear as local dips at phonemic transition regions, especially around voiced consonants. We apply this observation to speech segmentation: the DF0 parameter is used in the final stage of the procedure to refine the phonemic boundaries roughly obtained by DP alignment. We conduct experiments with a speech database prepared for unit inventory construction, and compare the obtained boundaries with those of manual segmentation to show the effectiveness of the proposed method. We also discuss the effects of the boundary refinement on the synthesized speech.
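The refinement step described above, snapping a DP-aligned boundary to a nearby local F0 dip, can be sketched as follows. The search window and the contour values are illustrative, not taken from the paper:

```python
# F0 contours show local dips at phonemic transitions (especially
# around voiced consonants), so a rough boundary from DP alignment
# can be refined by moving it to the deepest nearby local minimum.

def refine_boundary(f0, rough_idx, window=3):
    """Move a rough boundary index to the deepest local F0 dip nearby."""
    lo = max(1, rough_idx - window)
    hi = min(len(f0) - 1, rough_idx + window + 1)
    best = rough_idx
    for i in range(lo, hi):
        # A dip is lower than both neighbours and any candidate so far.
        if f0[i] < f0[i - 1] and f0[i] < f0[i + 1] and f0[i] < f0[best]:
            best = i
    return best

contour = [120, 118, 112, 105, 111, 119, 121, 120]  # Hz per frame
print(refine_boundary(contour, 5))  # 3: the dip at 105 Hz
```

If no dip exists in the window, the DP-aligned boundary is kept unchanged.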

SL981044.PDF (From Author) SL981044.PDF (Rasterized)


A Linguistic and Prosodic Database for Data-Driven Japanese TTS Synthesis

Authors:

Atsuhiro Sakurai, Dep. of Information and Communication Engineering, The Univ. of Tokyo and Tsukuba R&D Center, Texas Instruments (Japan)
Takashi Natsume, Dep. of Information and Communication Engineering, The Univ. of Tokyo (Japan)
Keikichi Hirose, Dep. of Information and Communication Engineering, The Univ. of Tokyo (Japan)

Paper number 735

Abstract:

We propose a method to generate a database that contains a parametric representation of F0 contours associated with linguistic and acoustic information, to be used by data-driven Japanese text-to-speech (TTS) systems. The configuration of the database includes recorded speech, F0 contours and their parametric labels, phonetic transcription with durations, and other linguistic information such as orthographic transcription, part-of-speech (POS) tags, and accent types. All information that is not available by dictionary lookup is obtained automatically. In this paper, we propose a method to automatically obtain parametric labels that describe F0 contours based on a superpositional model. Preliminary tests on a small data set show that the method can find the parametric representation of F0 contours with acceptable accuracy, and that accuracy can be improved by introducing additional linguistic information.
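A superpositional F0 model of the general kind referenced above expresses log F0 as a baseline plus overlapping accent components, so the "parametric label" for each accent reduces to a few command parameters. A simplified, Fujisaki-style sketch; the smoothing constant and command shape are illustrative, not the authors' exact parametrization:

```python
# Simplified superpositional F0 model: log F0 = log(baseline) plus a
# sum of smoothed rectangular accent commands, each parametrized by
# (onset, offset, amplitude). Constants below are made up.

import math

def _step_response(t, beta=20.0):
    """Smoothed response to a command edge at t = 0."""
    if t < 0.0:
        return 0.0
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t)

def f0(t, base_hz, accents):
    """accents: list of (onset, offset, amplitude) parameter triples."""
    log_f0 = math.log(base_hz)
    for onset, offset, amp in accents:
        # Rectangular accent command = rising edge minus falling edge.
        log_f0 += amp * (_step_response(t - onset) - _step_response(t - offset))
    return math.exp(log_f0)

print(round(f0(0.0, 120.0, [(0.1, 0.3, 0.5)])))  # 120: before the accent
```

Labelling an F0 contour then means estimating the triples that best reproduce it, which is the inverse problem the abstract's automatic method addresses.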

SL980735.PDF (From Author) SL980735.PDF (Rasterized)


Text-to-Speech Voice Adaptation from Sparse Training Data

Authors:

Alexander Kain, Oregon Graduate Institute of Science and Technology (USA)
Michael W. Macon, Oregon Graduate Institute of Science and Technology (USA)

Paper number 902

Abstract:

Voice adaptation is the process of converting the output voice of a text-to-speech synthesizer to sound like a different voice, after a training process in which only a small amount of the desired target speaker's speech is seen. We employ a locally linear conversion function based on Gaussian mixture models to map bark-scaled line spectral frequencies. We compare the performance of three different estimation methods while varying the number of mixture components and the amount of training data. An objective evaluation revealed that all three methods yield similar test results. In perceptual tests, listeners judged the converted speech quality as acceptable and fairly successful in adapting to the target speaker.
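A locally linear GMM conversion function of the kind the abstract names has the general form F(x) = Σᵢ pᵢ(x)·(biasᵢ + slopeᵢ·(x − meanᵢ)), a posterior-weighted sum of per-component linear maps. A one-dimensional sketch with made-up parameters; the real features are vectors of bark-scaled line spectral frequencies:

```python
# One-dimensional sketch of GMM-based locally linear conversion: the
# converted feature is a posterior-weighted mixture of linear maps.
# All parameter values below are invented for illustration.

import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def convert(x, components):
    """components: (weight, mean, var, bias, slope) per mixture component."""
    likes = [w * gaussian(x, m, v) for (w, m, v, _, _) in components]
    total = sum(likes)
    posteriors = [l / total for l in likes]
    return sum(p * (b + s * (x - m))
               for p, (_, m, _, b, s) in zip(posteriors, components))

# With a single component the posterior is 1 and the map is purely linear.
print(convert(1.0, [(1.0, 0.0, 1.0, 5.0, 2.0)]))  # 7.0
```

Training estimates the bias and slope terms from aligned source/target frames; the paper's comparison concerns three ways of doing that estimation.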

SL980902.PDF (From Author) SL980902.PDF (Rasterized)


Describing Intonation with a Parametric Model

Authors:

Gregor Möhler, University of Stuttgart (Germany)

Paper number 205

Abstract:

In this study a data-based approach to intonation modeling is presented. The model incorporates knowledge from intonation theories, such as the expected types of F0 movements and syllable anchoring, which is integrated into the model by an appropriate approximation function for F0 parametrization. The resulting F0 parameters are predicted from a set of features using neural nets. The quality of the generated contours is assessed by means of numerical measures and perception tests, which show that the basic hypotheses about intonation description and modeling are in principle correct and have the potential to be successfully applied to speech synthesis. We argue for a clear interface with the linguistic description (using pitch-accent and boundary labels as input) and discourse structure (using pitch-range-normalized F0 parameters).
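The parametrize-then-predict pipeline works in two steps: each accent's F0 stretch is reduced to the coefficients of an approximation function, and those coefficients become the regression targets a neural net learns to predict from linguistic features. A sketch using a plain quadratic fit as a stand-in for the paper's approximation function:

```python
# Reduce one accent's F0 samples to three coefficients via a
# least-squares quadratic fit, f0(t) = a*t^2 + b*t + c. The quadratic
# is a stand-in; the paper uses its own approximation function.

def fit_quadratic(ts, f0s):
    """Solve the 3x3 normal equations directly (no external libraries)."""
    s = [sum(t ** k for t in ts) for k in range(5)]                 # sums of t^k
    r = [sum(f * t ** k for t, f in zip(ts, f0s)) for k in range(3)]
    # Augmented matrix; unknown order is (c, b, a).
    m = [[s[0], s[1], s[2], r[0]],
         [s[1], s[2], s[3], r[1]],
         [s[2], s[3], s[4], r[2]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                f = m[row][col] / m[col][col]
                m[row] = [a - f * b for a, b in zip(m[row], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c

a, b, c = fit_quadratic([0, 1, 2, 3], [1, 2, 5, 10])  # data follows t**2 + 1
# a, b, c come back as approximately 1, 0, 1
```

At synthesis time the direction is reversed: the predicted coefficients regenerate an F0 movement anchored to its syllable.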

SL980205.PDF (From Author) SL980205.PDF (Rasterized)
