Text-To-Speech Synthesis 4

ICSLP'98 Proceedings

A Mixed-Excitation Frequency Domain Model for Time-Scale Pitch-Scale Modification of Speech

Authors:

Alex Acero, Microsoft Research (USA)

Page (NA) Paper number 72

Abstract:

This paper presents a time-scale and pitch-scale modification technique for concatenative speech synthesis. The method is based on a frequency-domain source-filter model in which the source is modeled as a mixed excitation. The model is tightly coupled with a compression scheme that results in compact acoustic inventories. Compared to the approach in the Whistler system, which uses no mixed excitation, the new method shows improvements on voiced fricatives and over-stretched voiced sounds. In addition, it allows for spectral manipulations such as smoothing of discontinuities at unit boundaries, voice transformation, and loudness equalization.
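The mixed-excitation idea can be sketched, very roughly, as a source built from a voicing-weighted mix of a pulse train and noise. The sampling rate, voicing weight, and noise level below are illustrative, not the paper's model.

```python
import numpy as np

def mixed_excitation(n_samples, f0, sr, voicing):
    """Source signal as a voicing-weighted mix of a periodic pulse train
    and white noise (illustrative parameters, not the paper's model)."""
    period = int(round(sr / f0))          # pitch period in samples
    pulses = np.zeros(n_samples)
    pulses[::period] = 1.0                # impulse train at the pitch rate
    noise = np.random.default_rng(0).standard_normal(n_samples) * 0.1
    # voicing in [0, 1] trades the periodic part against the noise part
    return voicing * pulses + (1.0 - voicing) * noise

src = mixed_excitation(1000, f0=100.0, sr=8000, voicing=0.8)
```

In a full source-filter synthesizer this source would then be shaped by the spectral envelope filter; here it only illustrates how voicing blends the two excitation components.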

SL980072.PDF (From Author) SL980072.PDF (Rasterized)



Analytic Generation of Synthesis Units by Closed Loop Training for Totally Speaker Driven Text to Speech System (TOS Drive TTS)

Authors:

Masami Akamine, Toshiba (Japan)
Takehiko Kagoshima, Toshiba (Japan)

Page (NA) Paper number 139

Abstract:

This paper presents a new method for automatically generating speech synthesis units. The algorithm, called Closed-Loop Training (CLT), is based on evaluating and reducing the distortion in synthesized speech. It analytically minimizes the distortion caused by the synthesis process, such as prosodic modification. The distortion is measured by calculating the error between synthesized speech units and natural speech units in a large speech database (corpus). CLT thus generates the synthesis units that most closely resemble natural speech after the synthesis process. In this paper, CLT is applied to a waveform-concatenation synthesizer whose basic unit is the diphone. Using CLT, the synthesizer generates clear and smooth synthetic speech even with a relatively small inventory of synthesis units.
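The closed-loop criterion (evaluate distortion after the synthesis process, then choose units accordingly) might be sketched as follows. The candidate/target interfaces and the toy "modification" function are invented for illustration; they are not the authors' algorithm.

```python
import numpy as np

def closed_loop_select(candidates, modify, targets):
    """Pick the candidate unit whose *modified* version is closest (in
    total squared error) to the natural targets: the closed-loop idea
    of measuring distortion after the synthesis process."""
    best, best_err = None, float("inf")
    for unit in candidates:
        err = sum(np.sum((modify(unit, t) - t) ** 2) for t in targets)
        if err < best_err:
            best, best_err = unit, err
    return best, best_err

# Toy example: the "modification" scales a unit to each target's mean level.
cands = [np.array([1.0, 2.0, 1.0]), np.array([0.5, 1.0, 0.5])]
tgts = [np.array([1.1, 2.1, 1.0])]
modify = lambda u, t: u * (t.mean() / u.mean())
unit, err = closed_loop_select(cands, modify, tgts)
```

The point of the sketch is the ordering of operations: distortion is measured on the output of the modification step, not on the raw stored units.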

SL980139.PDF (From Author) SL980139.PDF (Scanned)



Modeling the Microprosody of Pitch and Loudness for Speech Synthesis with Neural Networks

Authors:

Martti Vainio, Department of Phonetics, University of Helsinki (Finland)
Toomas Altosaar, Acoustics Laboratory, Helsinki University of Technology (Finland)

Page (NA) Paper number 886

Abstract:

In this study of Finnish microprosody, two prosodic parameters, pitch and loudness, were modeled with artificial neural networks. The networks are of the general feed-forward type, trained with backpropagation. For each phoneme, the network predicts a series of either pitch or loudness values on the basis of the phoneme's phonologically motivated features and its phonetic environment. The tests we have run so far indicate that the neural networks are highly successful and accurate in modeling the micro-level behavior of both pitch and loudness. The tests were conducted on isolated-word material, but some preliminary results obtained from sentence material are also discussed.
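A minimal sketch of such a predictor's forward pass, with made-up layer sizes and random weights (the paper's actual features, network dimensions, and training are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer feed-forward net: phoneme/context features in,
    a short series of pitch values out. All sizes are illustrative."""
    h = np.tanh(x @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # linear output, e.g. five F0 samples

n_feat, n_hid, n_out = 12, 8, 5          # hypothetical dimensions
W1 = rng.standard_normal((n_feat, n_hid)) * 0.1
b1 = np.zeros(n_hid)
W2 = rng.standard_normal((n_hid, n_out)) * 0.1
b2 = np.full(n_out, 120.0)               # output bias near a typical F0 in Hz
f0_series = forward(rng.standard_normal(n_feat), W1, b1, W2, b2)
```

Training with backpropagation would fit the weights so that the output series tracks measured per-phoneme pitch (or loudness) values.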

SL980886.PDF (From Author) SL980886.PDF (Rasterized)



Spectral Smoothing for Concatenative Speech Synthesis

Authors:

David T. Chappell, Duke University (USA)
John H.L. Hansen, Duke University (USA)

Page (NA) Paper number 849

Abstract:

This paper addresses the topic of performing effective concatenative speech synthesis with a limited database by proposing methods to smooth the transitions between speech segments. The objective is to produce natural-sounding speech via segment concatenation when formants and other spectral features do not align properly. We propose several methods for adjusting the spectra between waveform segments selected for concatenation. Techniques examined include optimal coupling, waveform interpolation, linear predictive pole shifting, and psychoacoustic closure. Several of these algorithms have been previously developed for either coding or synthesis, but our application of closure for segment processing is novel. After spectral smoothing, the final synthesized speech can better approximate the desired speech characteristics and is continuous in both the time domain and spectral structure.
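Optimal coupling, one of the techniques listed above, can be sketched roughly as a search for the least-mismatched join point between two units. The frame features (e.g. cepstra) and the search width here are illustrative, not the authors' implementation.

```python
import numpy as np

def optimal_coupling(frames_a, frames_b, search=3):
    """Instead of joining two units at their nominal edges, search
    nearby frame pairs for the join with the smallest spectral
    mismatch (Euclidean distance between frame feature vectors)."""
    best = (len(frames_a) - 1, 0)
    best_d = float("inf")
    for i in range(max(0, len(frames_a) - search), len(frames_a)):
        for j in range(0, min(search, len(frames_b))):
            d = np.linalg.norm(frames_a[i] - frames_b[j])
            if d < best_d:
                best, best_d = (i, j), d
    return best, best_d

# Toy frames: the end of unit a matches the start of unit b closely.
a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
b = np.array([[2.1, 2.0], [3.0, 3.0]])
(join_a, join_b), dist = optimal_coupling(a, b)
```

The other smoothing methods in the paper (interpolation, pole shifting, psychoacoustic closure) operate after a join point like this has been chosen.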

SL980849.PDF (From Author) SL980849.PDF (Rasterized)



MIMIC: A Voice-Adaptive Phonetic-Tree Speech Synthesiser

Authors:

Aimin Chen, The Queen's University of Belfast (U.K.)
Saeed Vaseghi, The Queen's University of Belfast (U.K.)
Charles Ho, The Queen's University of Belfast (U.K.)

Page (NA) Paper number 204

Abstract:

This paper presents Mimic: a decision-tree-based, concatenative, voice-adaptive text-to-speech synthesiser. Mimic integrates text-to-speech synthesis (TTS) with speech recognition and speaker adaptation. Speech is synthesised by concatenating triphone synthesis units. The triphone units are obtained from clusters of training examples modelled, labelled and segmented using clustered HMMs and Viterbi segmentation. The prosodic structure of the pitch, duration and energy contours is captured using statistical training methods. The concept of a decision-tree-based statistical micro-prosody model is introduced as a hierarchical method of modelling prosodic parameters. The voice-adaptation component adapts the spectral parameters as well as pitch, duration, and energy.

SL980204.PDF (From Author) SL980204.PDF (Rasterized)



Automatic Generation Of Korean Pronunciation Variants By Multistage Applications Of Phonological Rules

Authors:

Jehun Jeon, Sogang University (Korea)
Sunhwa Cha, Sogang University (Korea)
Minhwa Chung, Sogang University (Korea)
Jun Park, Electronics and Telecommunications Research Institute (Korea)
Kyuwoong Hwang, Electronics and Telecommunications Research Institute (Korea)

Page (NA) Paper number 675

Abstract:

Phonetic transcriptions are often encoded manually in a pronunciation lexicon. This process is time-consuming and requires linguistic expertise; moreover, consistency is very difficult to maintain. To address these problems, we present a model that produces Korean pronunciation variants based on morphophonological analysis. By analyzing phonological variations frequently found in spoken Korean, we have derived about 800 phonemic contexts that trigger the application of the corresponding phonemic and allophonic rules. In generating pronunciation variants, morphological analysis is performed first to handle variations of phonological words. According to the morphological category, a set of finite-state automata tables reflecting phonemic context is looked up to generate pronunciation variants. Our experiments show that the proposed model produces mostly correct pronunciation variants of phonological words consisting of several morphemes.
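Context-triggered rule application of this general kind can be sketched as a single left-to-right pass in which each rule maps a (left, focus, right) phone context to a surface phone. The symbols and the single nasalisation-style rule below are invented for illustration; the paper's system uses finite-state automata tables and multistage application.

```python
def apply_rules(phonemes, rules):
    """Apply context-dependent phonological rules in one left-to-right
    pass. Each rule maps a (left, focus, right) triple to a
    replacement phone; unmatched phones pass through unchanged."""
    padded = ["#"] + list(phonemes) + ["#"]   # word-boundary markers
    out = []
    for i in range(1, len(padded) - 1):
        ctx = (padded[i - 1], padded[i], padded[i + 1])
        out.append(rules.get(ctx, padded[i]))
    return out

# Toy nasalisation-style rule: /p/ between /a/ and /n/ surfaces as [m].
rules = {("a", "p", "n"): "m"}
variant = apply_rules(["a", "p", "n", "i"], rules)
```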

SL980675.PDF (From Author) SL980675.PDF (Rasterized)



Techniques for Accurate Automatic Annotation of Speech Waveforms

Authors:

Stephen Cox, University of East Anglia, Norwich. (U.K.)
Richard Brady, University of East Anglia, Norwich. (U.K.)
Peter Jackson, British Telecom Laboratories (U.K.)

Page (NA) Paper number 466

Abstract:

We describe techniques used in the development of a high-accuracy automatic annotation system designed to provide new voices for a concatenative speech synthesiser. We used standard HMM-based "forced alignment" techniques, concentrating on refining both acoustic and pronunciation modelling to achieve greater alignment accuracy. Acoustic models were improved by Bayesian speaker adaptation and by using confidence measures from N-best decodings to produce speaker-dependent HMMs. Pronunciation-modelling improvements included a large pronunciation dictionary containing multiple pronunciations for many words, pronunciation probabilities, accommodation of inter-word silences, and information derived from existing manual annotations to guide the recogniser during decoding. The system produces time-aligned phonetic annotations for UK accents in which the automatic and manual alignments agree on the segmental labelling 93% of the time, and in which the boundaries have an r.m.s. error of 14.5 ms from the manual boundary.

SL980466.PDF (From Author) SL980466.PDF (Rasterized)



Optimized Stopping Criteria for Tree-Based Unit Selection in Concatenative Synthesis

Authors:

Andrew Cronk, Oregon Graduate Institute (USA)
Michael W. Macon, Oregon Graduate Institute (USA)

Page (NA) Paper number 680

Abstract:

The lack of naturalness hampers the widespread application of speech synthesis. Increasing the size of the unit database in a concatenative speech synthesizer has been proposed as a way to increase the variety of units, thereby improving naturalness. However, expanding the unit database increases the computational cost of selecting the most appropriate unit and compounds the risk that a perceptually suboptimal unit is chosen. Clustering the unit database prior to synthesis is an effective method for reducing this cost and risk. In this study, a unit-selection method based on tree-structured clustering of the data is implemented and evaluated. The approach to tree construction differs from similar approaches used in both synthesis and recognition in that a "right-sized" tree is found automatically rather than by hand-tuned stopping criteria: the tree is grown to its maximum size, and its leaves are systematically recombined to determine the most suitable subtree. Trees grown with the automatic stopping method are compared with trees grown using thresholds. Cross-validation shows that trees grown to their maximum size and systematically recombined produce fuller clusters with lower objective distortion measures than trees whose growth is arrested by a threshold. The study concludes with a discussion of how these results may affect the perceptual quality of a speech synthesizer.

SL980680.PDF (From Author) SL980680.PDF (Rasterized)



Automatic Transcription of Intonation Using an Identified Prosodic Alphabet

Authors:

Stephanie de Tournemire, France Telecom, CNET. (France)

Page (NA) Paper number 1035

Abstract:

A solution is proposed for rapidly adapting prosodic models to a new voice or a new application. First, a prosodic alphabet supported by linguistic knowledge is identified at the acoustic level. Observing how prosodic events are realised in the acoustic corpus allows classes of breaks, F0 shapes and accents to be constructed, and automatic transcription rules to be written. The transcribed corpus is then used to estimate the parameters of a prosodic model for French. The quality of the F0 contours and durations generated with the prosodic model confirms the adequacy of the identified alphabet for describing prosodic phenomena. Finally, the prosodic model is integrated into the CNET standard French text-to-speech synthesis system. Naive listeners judged the quality of the generated prosody equivalent to that of the handcrafted system, which further confirms the appropriateness of the alphabet as a set of prosodic descriptors.

SL981035.PDF (From Author) SL981035.PDF (Rasterized)



Frequency Analysis of Phonetic Units for Concatenative Synthesis in Catalan

Authors:

Ignasi Esquerra, Universitat Politècnica de Catalunya (Spain)
Albert Febrer, Universitat Politècnica de Catalunya (Spain)
Climent Nadeu, Universitat Politècnica de Catalunya (Spain)

Page (NA) Paper number 817

Abstract:

Knowledge of phonetic-unit frequencies is essential for developing databases for both concatenative synthesis and continuous speech recognition. In the present work, a large text corpus was processed and phonetically transcribed to obtain allophone and diphone frequencies for the Catalan language. The corpus was drawn from newspaper articles, which contain many foreign words that posed a problem for the normalisation process. After automatic transcription, units were counted to obtain their relative frequencies, and the results were compared with other analyses. Finally, the diphones found in the corpus were compared with the units of a synthesis database to validate both the normalisation and transcription modules and the synthesis unit database.
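The unit-frequency tally described above amounts to counting adjacent phone pairs over a set of transcriptions. A minimal sketch, with invented phone symbols:

```python
from collections import Counter

def diphone_counts(transcriptions):
    """Count diphone (adjacent phone pair) frequencies over a list of
    phonetically transcribed words: the kind of tally used to assess
    coverage of a synthesis unit database."""
    counts = Counter()
    for phones in transcriptions:
        counts.update(zip(phones, phones[1:]))   # each adjacent pair
    return counts

corpus = [["k", "a", "z", "a"], ["k", "a", "s", "a"]]
freqs = diphone_counts(corpus)
# Relative frequency of each diphone over all diphone tokens.
rel = {d: n / sum(freqs.values()) for d, n in freqs.items()}
```

Comparing the keys of such a tally against the diphone inventory of a synthesis database directly exposes missing units.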

SL980817.PDF (From Author) SL980817.PDF (Rasterized)



Investigating the Syntactic Characteristics of English Tone Units

Authors:

Alex Chengyu Fang, Department of Phonetics and Linguistics, University College London (U.K.)
Jill House, Dept of Linguistics and Phonetics, University College London (U.K.)
Mark Huckvale, Dept of Phonetics and Linguistics, University College London (U.K.)

Page (NA) Paper number 273

Abstract:

This paper describes an investigation into the correspondence between grammatical units and English tone units. Our first aim is to provide some statistics based on scripted read speech since past studies mainly dealt with spontaneous speech. The second aim is to investigate whether the clause structure is a reliable indication of the tone unit. We start with a description of the annotation of transcribed speech data selected from the Spoken English Corpus (SEC), which is tagged for detailed wordclass information with AUTASYS and then parsed for rich syntactic description with the Survey Parser. Prosodic annotations in SEC, including both major and minor tone unit boundaries, were then mapped onto the parse trees. We then present our observations of tone units in the light of the clause structure. The paper will demonstrate that there is an overall correspondence between the clause structure and the tone unit in the sense that tone units generally co-start with the clause and that they seldom occur at major clause element junctures.

SL980273.PDF (From Author) SL980273.PDF (Rasterized)



The UPC Text-to-Speech System for Spanish and Catalan

Authors:

Antonio Bonafonte, Universitat Politecnica de Catalunya (Spain)
Ignasi Esquerra, Universitat Politecnica de Catalunya (Spain)
Albert Febrer, Universitat Politecnica de Catalunya (Spain)
José A.R. Fonollosa, Universitat Politecnica de Catalunya (Spain)
Francesc Vallverdú, Universitat Politecnica de Catalunya (Spain)

Page (NA) Paper number 1146

Abstract:

This paper summarizes the text-to-speech system developed by the Speech Group of the Universitat Politècnica de Catalunya (UPC). The system is composed of a core and several interfaces, making it suitable for research, for telephone applications (either CTI boards or standard ISDN PC cards supporting CAPI), and for Windows applications developed with Microsoft SAPI. The paper reviews the system with emphasis on the language-dependent components, which allow the reading of bilingual (Spanish and Catalan) text. It also presents new approaches to prosodic modeling (segmental duration modeling) and to generating the database of speech segments, both introduced in the past year.

SL981146.PDF (From Author) SL981146.PDF (Rasterized)



The New Version of the ROMVOX Text-to-Speech Synthesis System Based on a Hybrid Time Domain-LPC Synthesis Technique

Authors:

Attila Ferencz, Software ITC (Romania)
István Nagy, Software ITC (Romania)
Tünde-Csilla Kovács, Software ITC (Romania)
Maria Ferencz, Software ITC (Romania)
Teodora Ratiu, Software ITC (Romania)

Page (NA) Paper number 144

Abstract:

Over the years we have developed several TTS systems for the Romanian language, each with its own advantages and disadvantages [2]. Given that waveform-coding (time-domain) methods ensure maximum intelligibility and naturalness of the synthesized speech, while superimposing prosodic effects requires altering the pitch (frequency domain), we developed a hybrid time-domain/LPC method that yields better synthesized voice quality than our former systems. This paper presents theoretical considerations, signal processing, and implementation aspects of this new synthesis method developed for the ROMVOX TTS system.

SL980144.PDF (From Author) SL980144.PDF (Rasterized)



An F0 Contour Control Model for Totally Speaker Driven Text to Speech System

Authors:

Takehiko Kagoshima, Toshiba corporation, kansai research laboratories (Japan)
Masahiro Morita, Toshiba corporation, kansai research laboratories (Japan)
Shigenobu Seto, Toshiba corporation, kansai research laboratories (Japan)
Masami Akamine, Toshiba corporation, kansai research laboratories (Japan)

Page (NA) Paper number 214

Abstract:

The Totally Speaker Driven Text-to-Speech System produces high-quality, natural speech resembling the acoustic and prosodic characteristics of the original speech corpus. In its F0 contour control, the F0 contour of a whole sentence is produced by concatenating segmental F0 contours, which are generated by modifying representative vectors of typical F0 contours. The representative vectors are selected from an F0 contour codebook, designed to minimize the approximation error between F0 contours generated by the proposed model and real F0 contours extracted from a speech corpus. Experiments with a Japanese speech corpus confirmed that F0 contours can be modeled with small approximation errors using only 48 representative vectors, and that the synthetic speech sounded very natural and resembled the prosodic characteristics of the original speaker.
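Codebook design of this general kind can be sketched with a plain k-means style loop over contour vectors; the paper's actual design criterion, which also accounts for the contour-modification step, is not reproduced here, and the contour values are made up.

```python
import numpy as np

def design_codebook(contours, k, iters=10, seed=0):
    """Generic k-means style codebook design over segmental F0 contour
    vectors: representatives are iteratively moved to minimise squared
    approximation error over the corpus."""
    rng = np.random.default_rng(seed)
    contours = np.asarray(contours, float)
    # Initialise the codebook with k distinct contours from the corpus.
    codebook = contours[rng.choice(len(contours), k, replace=False)]
    for _ in range(iters):
        # Assign each contour to its nearest representative.
        d = np.linalg.norm(contours[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each representative to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = contours[labels == j].mean(axis=0)
    return codebook

cb = design_codebook([[100, 110], [102, 112], [180, 150], [178, 148]], k=2)
```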

SL980214.PDF (From Author) SL980214.PDF (Rasterized)



On the Relationship of Speech Rates with Prosodic Units in Dialogue Speech

Authors:

Keikichi Hirose, Dept. of Inf. and Commu. Engineering, School of Engineering (Japan)
Hiromichi Kawanami, Dept. of Inf. and Commu. Engineering, School of Engineering (Japan)

Page (NA) Paper number 730

Abstract:

For the purpose of constructing prosodic rules for dialogue speech synthesis, a comparative study was conducted on speech rates in dialogue speech and read speech. Based on a generation model of the F0 contour, we define four prosodic units, such as the prosodic sentence and the prosodic phrase. Speech rate was analyzed with respect to these units. Within a prosodic sentence, dialogue speech starts at a rate similar to that of read speech; the rate then gradually increases and, after passing the middle of the unit, decreases towards the end. Similar tendencies were observed in lower-level units, but the degree of rate change within a unit was smaller. Based on these results, preliminary rules for speech-rate control were developed for dialogue speech synthesis. A listening test showed that the developed rules make synthetic speech sound more dialogue-like.

SL980730.PDF (From Author) SL980730.PDF (Rasterized)



On the Reduction of Concatenation Artefacts in Diphone Synthesis

Authors:

Esther Klabbers, IPO, Center for Research on User-System Interaction (The Netherlands)
Raymond Veldhuis, IPO, Center for Research on User-System Interaction (The Netherlands)

Page (NA) Paper number 115

Abstract:

One well-known problem with diphone concatenation is the occurrence of audible discontinuities at diphone boundaries, which are most prominent in vowels and semi-vowels. Significant formant jumps at certain boundaries suggest that the problem is of a spectral nature. We have examined this hypothesis by correlating the results of a listening experiment with spectral distances measured across diphone boundaries. The aim is to find a spectral distance measure that best predicts when discontinuities are audible in order to find out how the diphone database can best be extended with context-sensitive diphones. The results show that the Kullback-Leibler measure is the best predictor.
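One plausible form of a Kullback-Leibler measure over spectra, sketched on toy data (the paper's exact definition may differ): treat each power spectrum as a distribution and symmetrise the divergence.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised Kullback-Leibler divergence between two power
    spectra, each normalised to sum to one. One plausible form of a
    K-L spectral distance; illustrative only."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return kl(p, q) + kl(q, p)

d_same = symmetric_kl([1.0, 2.0, 1.0], [1.0, 2.0, 1.0])   # identical spectra
d_diff = symmetric_kl([1.0, 2.0, 1.0], [2.0, 1.0, 2.0])   # shifted energy
```

A distance like this, computed across a diphone boundary, is the kind of predictor the paper correlates against listeners' audibility judgements.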

SL980115.PDF (From Author) SL980115.PDF (Rasterized)



Error Analysis and Confidence Measure of Chinese Word Segmentation

Authors:

Chih-Chung Kuo, Industrial Technology Research Institute (Taiwan)
Kun-Yuan Ma, Industrial Technology Research Institute (Taiwan)

Page (NA) Paper number 1078

Abstract:

Word segmentation of a Chinese sentence is essential for many applications in language and speech processing, yet no method achieves segmentation without errors. We propose a confidence measure for the segmentation result to cope with the problems these errors cause. The measure depends mainly on an error analysis of the word segmentation. With the confidence measure, suspected errors can be identified, so the manual inspection load can be greatly reduced for non-real-time applications. A soft-decision method and a composite-word approach to prosody generation, both exploiting the confidence measure, are also designed for text-to-speech systems, so that wrong prosody caused by wrong word boundaries can be alleviated.

SL981078.PDF (From Author) SL981078.PDF (Rasterized)

1078_01.WAV
Speech synthesized with poor prosody due to wrong word segmentation.
File type: Sound File (WAV)
Tech. description: 11025 samples per second, 8 bits per sample, mono, u-law encoded
Creating application: Unknown
Creating OS: Windows 95/NT
1078_02.WAV
Speech synthesized with the composite-word approach, which produces noticeably more correct and natural prosody.
File type: Sound File (WAV)
Tech. description: 11025 samples per second, 8 bits per sample, mono, u-law encoded
Creating application: Unknown
Creating OS: Windows 95/NT



Energy Contour Generation for a Sentence Using a Neural Network Learning Method

Authors:

Jungchul Lee, ETRI (Korea)
Donggyu Kang, ETRI (Korea)
Sanghoon Kim, ETRI (Korea)
Koengmo Sung, Department of Electronic Engineering, Seoul National University (Korea)

Page (NA) Paper number 404

Abstract:

The energy contour of a sentence is one of the major factors affecting the naturalness of synthetic speech. In this paper, we propose a method to control the energy contour so as to enhance the naturalness of Korean synthetic speech. Our algorithm adopts the syllable as its basic unit and predicts the peak amplitude of each syllable in a word using a neural network (NN). We use indirect linguistic features as well as acoustic features of phonemes as NN inputs, to accommodate the grammatical effects of words in a sentence. Simulation results show that the prediction error is less than 10%, and that the algorithm is very effective for analysis/synthesis of a sentence's energy contour and generates a fairly good declarative contour for TTS.

SL980404.PDF (Scanned)



A Computational Algorithm For F0 Contour Generation In Korean Developed With Prosodically Labeled Databases Using K-ToBI System

Authors:

Yong-Ju Lee, (Korea)
Sook-hyang Lee, (Korea)
Jong-Jin Kim, (Korea)
Hyun-Ju Ko, (Korea)
Young-Il Kim, (Korea)
Sang-Hun Kim, Spoken Language Processing Section, Electronics and Telecommunication Research Institute (Korea)
Jung-Cheol Lee, Spoken Language Processing Section, Electronics and Telecommunication Research Institute (Korea)

Page (NA) Paper number 704

Abstract:

This study describes an algorithm for F0 contour generation for Korean sentences, together with its evaluation results. We used 400 K-ToBI-labeled utterances read by one male and one female announcer. The F0 contour generation system uses two classification trees to predict K-ToBI labels for the input text and 11 regression trees to predict F0 values for those labels. Evaluation showed 77.2% prediction accuracy for IP boundaries and 72.0% for AP boundaries. Voicing and segment-duration information was left unchanged for F0 contour generation and its evaluation. The F0 generation experiment, using labelling information from the original speech data, showed an RMS error of 23.5 Hz and a correlation coefficient of 0.55.
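The two reported evaluation figures, RMS error and the correlation coefficient, can be computed as follows; the F0 values below are made up for illustration.

```python
import numpy as np

def f0_eval(pred, ref):
    """RMS error (in Hz) and Pearson correlation between predicted and
    reference F0 values: the two figures quoted in the abstract."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    rmse = np.sqrt(np.mean((pred - ref) ** 2))     # root mean squared error
    corr = np.corrcoef(pred, ref)[0, 1]            # Pearson correlation
    return rmse, corr

rmse, corr = f0_eval([110, 120, 130, 125], [112, 118, 131, 120])
```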

SL980704.PDF (From Author) SL980704.PDF (Rasterized)



Rapid-Deployment Text-to-Speech in the DIPLOMAT System

Authors:

Kevin Lenzo, Carnegie Mellon University (USA)
Christopher Hogan, Carnegie Mellon University (USA)
Jeffrey Allen, Carnegie Mellon University (USA)

Page (NA) Paper number 868

Abstract:

The DIPLOMAT project at Carnegie Mellon University instantiates a program of rapid-deployment speech-to-speech machine translation; we have developed techniques for quickly producing text-to-speech (TTS) systems for new target languages to support this work. While the resulting systems are not immediately comparable in quality to commercial systems on unrestricted tasks in well-developed languages, they are more than adequate for limited-domain scenarios and rapid prototyping: they generalize to unseen data with some degradation, while in-domain quality can be quite good. Voices and engines for synthesizing new target languages can be developed in as little as two weeks after text corpus collection. We have successfully used these techniques to build TTS modules for English, Croatian, Spanish, Haitian Creole and Korean.

SL980868.PDF (From Author) SL980868.PDF (Rasterized)



Formant Diphone Parameter Extraction Utilising a Labelled Single-Speaker Database

Authors:

Robert H. Mannell, SHLRC, Macquarie University (Australia)

Page (NA) Paper number 627

Abstract:

This paper examines a method for formant parameter extraction from a labelled single-speaker database for use in a formant-parameter diphone-concatenation speech synthesis system. The procedure begins with an initial formant analysis of the labelled database, which is used to obtain formant (F1-F5) probability spaces for each phoneme. These probability spaces then guide a more careful speaker-specific extraction of formant frequencies. An analysis-by-synthesis procedure provides best-matching formant intensity and bandwidth parameters. The great majority of the parameters so extracted produce speech that is highly intelligible and has a voice quality close to the original speaker's.

SL980627.PDF (From Author) SL980627.PDF (Rasterized)

0627_01.WAV
Included with postscript file in 0627.zip
File type: Sound File (WAV)
Tech. description: Sampling rate 10 kHz, mono, little-endian, not encoded
Creating application: Unknown
Creating OS: Windows NT
0627_02.WAV
Included with postscript file in 0627.zip
File type: Sound File (WAV)
Tech. description: Sampling rate 10 kHz, mono, little-endian, not encoded
Creating application: Unknown
Creating OS: Windows NT 4.0



A New Synthetic Speech/Sound Control Language

Authors:

Osamu Mizuno, NTT Human Interface Labs. (Japan)
Shin'ya Nakajima, NTT Human Interface Labs. (Japan)

Page (NA) Paper number 1015

Abstract:

The Multi-layered Speech/Sound Synthesis Control Language (MSCL) proposed here facilitates synthesizing several speech modes, such as nuance, mental state and emotion, and allows speech to be synchronized easily with other media. MSCL is a multi-layered linguistic system encompassing three layers: the semantic level layer (S-layer), the interpretation level layer (I-layer), and the parameter level layer (P-layer). The S-layer is the description level for semantics such as emotional and emphasized speech. The I-layer is the description level for prosodic-feature controls and interprets S-layer scripts into I-layer controls. The P-layer represents the prosodic parameters for speech synthesis. This multi-level description system is convenient for both lay and professional users. MSCL also provides many effective prosodic-feature control functions, such as time-varying pattern description, absolute and relative control forms, and a Speaker Dependent Scale (SDS). MSCL enables more emotional and expressive synthetic speech than conventional TTS systems. This paper describes these functions and the effective prosodic-feature controls possible with MSCL.

SL981015.PDF (From Author) SL981015.PDF (Rasterized)



A Study on the Natural-Sounding Japanese Phonetic Word Synthesis by Using the VCV-Balanced Word Database That Consists of the Words Uttered Forcibly in Two Types of Pitch Accent

Authors:

Ryo Mochizuki, Matsushita Communication Ind. Co., Ltd. (Japan)
Yasuhiko Arai, Matsushita Communication Ind. Co., Ltd. (Japan)
Takashi Honda, School of Science and Tech., Meiji Univ. (Japan)

Page (NA) Paper number 247

Abstract:

In order to synthesize natural-sounding Japanese phonetic words, a novel VCV-concatenation synthesis method with an advanced word database is proposed. The word database consists of VCV-balanced phonetic words uttered forcibly in both type-0 and type-1 pitch accents. The advantage of the advanced word database is that a variety of VCV segments with the same phonetic chains but different pitch patterns can be collected efficiently at the same time. The following pitch-modification techniques are used to achieve high sound quality: (1) the optimal VCV-segment set that minimizes the pitch modification rate is selected; (2) pitch waveforms are extracted by reference to excitation points; (3) the wavelengths of pitch waveforms are adjusted depending on the pitch modification rates; (4) the natural prosody of the VCV segments in the database is used effectively. The superiority of the proposed database is confirmed by pitch-pattern matching measurements and subjective quality evaluation.

SL980247.PDF (From Author) SL980247.PDF (Rasterized)


Letter to Sound Rules for Accented Lexicon Compression

Authors:

Vincent Pagel, Faculté Polytechnique de Mons (Belgium)
Kevin Lenzo, Carnegie Mellon University (USA)
Alan W. Black, CSTR, University of Edinburgh (U.K.)

Page (NA) Paper number 561

Abstract:

This paper presents trainable methods for generating letter-to-sound rules from a given lexicon, for use in pronouncing out-of-vocabulary words and as a method of lexicon compression. Since the relationship between a string of letters and the string of phonemes representing its pronunciation is not trivial in many languages, we discuss two alignment procedures, one fully automatic and one hand-seeded, which produce reasonable alignments of letters to phones. Top-Down Induction Tree models are trained on the aligned entries. We show that combined phoneme/stress prediction is better than separate prediction processes, and better still when the model also includes the last phonemes transcribed and part-of-speech information. For the lexicons we have tested, our models have a word accuracy (including stress) of 78% for OALD, 62% for CMU and 94% for BRULEX. The extremely high scores on the training sets allow substantial size reductions (more than 1/20). WWW site: http://tcts.fpms.ac.be/synthesis/mbrdico
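
An alignment of this kind assigns one phone (or an epsilon) to each letter; the tree models are then trained on letter windows plus the previously predicted phone. A minimal sketch of that feature extraction, assuming a hypothetical one-to-one alignment of "checked" with epsilon written as "_" (the window size and padding symbol are illustrative, not the paper's exact setup):

```python
def letter_windows(word, phones, context=3):
    """Build one training example per letter: its left/right letter context
    and the previously assigned phone, paired with the phone to predict.
    Assumes len(word) == len(phones) under a 1:1 letter-phone alignment."""
    pad = "#" * context
    padded = pad + word + pad
    examples = []
    prev_phone = "#"
    for i, phone in enumerate(phones):
        left = padded[i:i + context]              # 3 letters to the left
        letter = padded[i + context]              # the letter being transcribed
        right = padded[i + context + 1:i + 2 * context + 1]  # 3 letters to the right
        examples.append(((left, letter, right, prev_phone), phone))
        prev_phone = phone
    return examples

# Hypothetical alignment: "checked" -> ch eh _ k _ t _
examples = letter_windows("checked", ["ch", "eh", "_", "k", "_", "t", "_"])
```

Each `(features, phone)` pair becomes one training instance for the decision-tree learner.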

SL980561.PDF (From Author) SL980561.PDF (Rasterized)


A Name Announcement Algorithm with Memory Size and Computational Power Constraints

Authors:

Ze'ev Roth, DSP Group (Israel)
Judith Rosenhouse, Technion (Israel)

Page (NA) Paper number 280

Abstract:

This paper describes an algorithm for name (surname and personal name) announcement in American English, implemented on DSP Group's SmartCores (registered trademark) digital signal processor (DSP) core. The name announcement module is targeted at low-cost applications; therefore the amount of memory that can be allocated to dictionaries, program code, and runtime parameters is limited. The required response time of 0.5 seconds limits the computation performed in the linguistic analysis phase for each name. The synthesis scheme is limited by the real-time capacity of the processor, since this task may be performed in parallel with other real-time tasks.

SL980280.PDF (From Author) SL980280.PDF (Rasterized)


How a French TTS System can Describe Loanwords

Authors:

Frédérique Sannier, Institut de la Communication Parlée (France)
Rabia Belrhali, Institut de la Communication Parlée (France)
Véronique Aubergé, Institut de la Communication Parlée (France)

Page (NA) Paper number 497

Abstract:

We survey the phonographic behaviour of loanwords introduced into the French lexicon, through observation of the systematic functioning of the French letter-to-phone system TOPH. On this basis we define sub-systems, isolated into separate lexicons. Processing loanwords through the TOPH TTS makes it possible to give clues about the relative importance of the various source languages in the French lexicon. Observing the grapho-phonemic functioning of the entries likewise makes it possible to delimit classes. The second part of this study deals more specifically with the inflexion paradigms of loanwords, for which different functionings are also drawn.

SL980497.PDF (From Author) SL980497.PDF (Rasterized)


Improvements in Slovene Text-to-Speech Synthesis

Authors:

Tomaz Sef, Jozef Stefan Institute (Slovenia)
Ales Dobnikar, Jozef Stefan Institute (Slovenia)
Matjaz Gams, Jozef Stefan Institute (Slovenia)

Page (NA) Paper number 128

Abstract:

This paper presents a new text-to-speech (TTS) system capable of synthesising continuous Slovenian speech. Input text is processed by a series of independent modules: text normalisation, grapheme-to-phoneme conversion, prosody generation and segmental concatenation. This modularity allows the separate parts of the system to be improved easily. To generate rules for our synthesis scheme, data was collected by analysing the readings of ten speakers, five male and five female. Our system is used in several applications: it is built into an employment agent, EMA, that provides employment information over the Internet, and we are currently developing a system that will enable blind and partially sighted people to work in the Windows environment.

SL980128.PDF (From Author) SL980128.PDF (Rasterized)


Automatic Rule Generation for Linguistic Features Analysis Using Inductive Learning Technique: Linguistic Features Analysis in TOS Drive TTS System

Authors:

Shigenobu Seto, Toshiba Corporation (Japan)
Masahiro Morita, Toshiba Corporation (Japan)
Takehiko Kagoshima, Toshiba Corporation (Japan)
Masami Akamine, Toshiba Corporation (Japan)

Page (NA) Paper number 1059

Abstract:

Linguistic feature analysis of the input text plays an important role in achieving natural prosodic control in text-to-speech (TTS) systems. In the conventional scheme, when input texts are analyzed incorrectly, experts manually refine the suspicious if-then rules and change the tree structure to obtain correct analysis results. However, altering the tree structure drastically is difficult, since attention is often paid only to the suspicious if-then rules; if the initial rule-tree structure is inappropriate, any attempt to improve performance may be limited by the stiffness of that structure. To cope with these problems, the new development scheme generates analysis rules using C4.5 [1], with the if-then rule-tree structure generated by off-line training. The scheme has the advantage that, since the generated rule-tree structure is simple, the rules are easier to maintain. The scheme is applied to generating four types of analysis rule-trees: rules for forming accent phrases, rules for determining accent position, rules for analyzing syntactic structure, and rules for pause insertion. An experimental evaluation was performed on these four rule sets. Despite the small amount of training data, the accuracy was 96.5 percent for accent phrase formation, 95.5 percent for accent positioning, 87.0 percent for pause insertion, and 88.3 percent for syntactic analysis. These results indicate the validity of the scheme, which is used for developing the linguistic feature analysis rules in a Japanese TTS system, TOS Drive TTS [3].
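
C4.5-style induction chooses, at each node of the rule tree, the linguistic feature whose test yields the highest information gain. A minimal sketch of that gain computation, on hypothetical pause-insertion examples (the feature names and data are illustrative, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Entropy reduction from splitting (features, label) examples on one feature."""
    base = entropy([label for _, label in examples])
    split = {}
    for features, label in examples:
        split.setdefault(features[attribute], []).append(label)
    remainder = sum(len(part) / len(examples) * entropy(part) for part in split.values())
    return base - remainder

# Hypothetical training data: (features of a phrase boundary, insert-pause?)
examples = [
    ({"pos": "noun", "phrase_len": "long"}, True),
    ({"pos": "particle", "phrase_len": "long"}, False),
    ({"pos": "noun", "phrase_len": "short"}, True),
    ({"pos": "particle", "phrase_len": "short"}, False),
]
```

In this toy data, splitting on `pos` separates the classes perfectly (gain 1 bit) while `phrase_len` is uninformative (gain 0), so `pos` would become the root test.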

SL981059.PDF (From Author) SL981059.PDF (Rasterized)


Segmental Duration Control Based on an Articulatory Model

Authors:

Yoshinori Shiga, Multimedia Engineering Laboratory, TOSHIBA Corporation (Japan)
Hiroshi Matsuura, Multimedia Engineering Laboratory, TOSHIBA Corporation (Japan)
Tsuneo Nitta, Multimedia Engineering Laboratory, TOSHIBA Corporation (Japan)

Page (NA) Paper number 518

Abstract:

This paper proposes a new method that determines segmental duration for text-to-speech conversion based on the movement of the articulatory organs that compose an articulatory model. The model comprises four time-variable articulatory parameters representing the conditions of the articulatory organs whose physical restrictions appear to significantly influence segmental duration. The parameters are controlled according to an input sequence of phonetic symbols, after which segmental duration is determined from the variation of the articulatory parameters. The proposed method is evaluated in an experiment using a Japanese speech database of 150 phonetically balanced sentences. The results indicate that the mean square error of predicted segmental duration is approximately 15 ms for the closed set and 15-17 ms for the open set. The error is within 20 ms, the level at which distortion of segmental duration remains acceptable without loss of naturalness; hence the method is shown to predict segmental duration effectively.
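
The underlying idea, that a segment cannot be shorter than the time its slowest articulatory parameter needs to reach its target, can be sketched as follows. The rate limits, parameter values, and minimum duration below are hypothetical illustrations, not the paper's fitted model.

```python
def transition_time(start, target, max_rate):
    """Time (ms) for one articulatory parameter to move between two target
    values, given its maximum rate of change (units per ms)."""
    return abs(target - start) / max_rate

def segment_duration(param_moves, min_duration=30.0):
    """A segment lasts at least as long as its slowest parameter transition,
    floored at a minimum duration. param_moves holds one (start, target,
    max_rate) tuple per articulatory parameter; all values are hypothetical."""
    return max(min_duration, *(transition_time(s, t, r) for s, t, r in param_moves))

# Two hypothetical parameters, e.g. jaw opening and tongue-tip position:
moves = [(0.0, 1.0, 0.02), (0.2, 0.5, 0.01)]
duration = segment_duration(moves)  # limited by the slower transition
```

The physically restricted (slower) organ determines the duration, which is the paper's motivating intuition.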

SL980518.PDF (From Author) SL980518.PDF (Rasterized)


Text Analysis for the Bell Labs French Text-to-Speech System

Authors:

Evelyne Tzoukermann, Bell Labs - Lucent Technologies (USA)

Page (NA) Paper number 75

Abstract:

The Bell Labs text-to-speech synthesis system for French is part of a multilingual effort for text-to-speech generation. The text analysis component consists of four main parts: the morphological analysis module, the language models, the grapheme-to-phoneme conversion rules, and the prosodic module. The system is built in a pipeline architecture, the output of which feeds the subsequent synthesis modules. The originality of this work lies in the fact that we use weighted finite-state transducer technology for the entire analysis in the French system. Moreover, the implementation not only accounts for most orthographic representations, such as numerals, abbreviations, dates, currencies, etc., but also solves the hard problems of French liaison, mute e, and aspirated h, using refined intermediate representations in the form of either traces or archigraphemes.

SL980075.PDF (From Author) SL980075.PDF (Rasterized)


Modeling Vowel Duration for Japanese Text-to-Speech Synthesis

Authors:

Jennifer J. Venditti, Bell Labs and Ohio State Univ (USA)
Jan P.H. van Santen, Bell Labs (USA)

Page (NA) Paper number 786

Abstract:

Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of vowel duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing, and effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both long and short vowels. A Sum-of-Products approach is used to model key interactions observed in the data, and to predict values of factor combinations not found in the speech database. We report root mean squared deviations between observed and predicted durations ranging from 8 to 15 ms, and an overall correlation of 0.89.
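
A Sum-of-Products model expresses a duration as a sum of terms, each term a product of per-factor scale parameters, which lets the model capture interactions between factors while still generalizing to factor combinations absent from the data. A minimal sketch under hypothetical parameters (the factor names, weights, and scales are illustrative, not those fitted in the paper):

```python
def sop_duration(factors, params):
    """Sum-of-Products duration (ms): a sum of product terms, each term
    multiplying a base weight by one scale per factor the term involves."""
    total = 0.0
    for term in params:
        value = term["weight"]
        for factor_name in term["factors"]:
            value *= term["scales"][factor_name][factors[factor_name]]
        total += value
    return total

# Hypothetical two-term model for short vowels:
params = [
    {"weight": 60.0, "factors": ["identity", "accent"],
     "scales": {"identity": {"a": 1.1, "i": 0.8},
                "accent": {"accented": 1.2, "unaccented": 1.0}}},
    {"weight": 10.0, "factors": ["phrase_final"],
     "scales": {"phrase_final": {"yes": 2.0, "no": 1.0}}},
]
d = sop_duration({"identity": "a", "accent": "accented", "phrase_final": "yes"}, params)
# 60 * 1.1 * 1.2 + 10 * 2.0 = 99.2 ms
```

Because each factor level contributes a reusable scale value, a combination never seen in training (say, an accented /i/ in phrase-final position) still receives a prediction.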

SL980786.PDF (From Author) SL980786.PDF (Rasterized)


Towards A Chinese Text-To-Speech System With Higher Naturalness

Authors:

Ren-Hua Wang, University of Science & Technology of China (China)
Qinfeng Liu, University of Science & Technology of China (China)
Yongsheng Teng, University of Science & Technology of China (China)
Deyu Xia, University of Science & Technology of China (China)

Page (NA) Paper number 172

Abstract:

This paper presents our research towards higher naturalness in Chinese text-to-speech. The main results can be summarized as follows. (1) In the proposed TTS system, syllable-sized units are cut out of real recorded speech, and the synthetic speech is generated by concatenating these units back together. (2) The integration of rule-synthesized units with natural units was tested; an LMA-filter-based synthesizer was successfully developed to generate those units that are difficult to collect from a speech corpus. (3) A new, efficient Chinese character coding scheme, the "Yin Xu Code" (YX Code), has been developed to complement the GB Code. With the YX Code, a new lexicon structure was designed; the new dictionary system not only supplies pronunciation information but is also very helpful for word segmentation. Based on these results, a Chinese text-to-speech system named "KD-863" has been developed, which converts any written Chinese text to speech in real time with high naturalness. In the national assessment of Chinese TTS systems held at the end of March 1998 in Beijing, the system achieved first place in the naturalness MOS (Mean Opinion Score).

SL980172.PDF (From Author) SL980172.PDF (Rasterized)