Session TMb Speech Synthesis Techniques

Chairperson: Rolf Carlson, KTH, Sweden



OPTIMISING UNIT SELECTION WITH VOICE SOURCE AND FORMANTS IN THE CHATR SPEECH SYNTHESIS SYSTEM

Authors: Wen Ding and Nick Campbell

ATR Interpreting Telecommunications Research Labs. 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan ding@itl.atr.co.jp

Volume 2 pages 537 - 540

ABSTRACT

High-quality corpus-based synthetic speech requires minimization of the prosodic and acoustic distortions between an ideal phoneme sequence and the actual waveform segments used to reproduce it. Our synthesis system concatenates phoneme-sized waveform segments, without signal processing, selected from a large-scale speech database according to both prosodic and phonetic contextual suitability. This paper describes an approach to optimising such unit selection by using voice source parameters and formant information instead of cepstral features. We present results showing that formants and voice source parameters are more effective as acoustic features for unit selection. These features can be estimated automatically from speech waveforms using the ARX joint estimation method. Results are compared with the mel-frequency cepstrum coefficients (MFCC) previously used for unit selection; both objective and subjective experiments showed that the new features outperformed the previous ones and confirmed that the synthesized speech sounded much more natural.
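As an illustrative sketch (not the paper's implementation), selection by acoustic distance amounts to minimising a weighted distance between each candidate unit's feature vector and the target's. The feature layout and weights below are hypothetical:

```python
import math

def target_cost(candidate, target, weights):
    """Weighted Euclidean distance between two acoustic feature vectors."""
    return math.sqrt(sum(w * (c - t) ** 2
                         for c, t, w in zip(candidate, target, weights)))

def select_unit(candidates, target, weights):
    """Return the candidate unit whose features best match the target."""
    return min(candidates, key=lambda c: target_cost(c, target, weights))

# Hypothetical feature vectors: [F1, F2, F3, open quotient]
target = [500.0, 1500.0, 2500.0, 0.60]
candidates = [[520.0, 1480.0, 2510.0, 0.58],
              [700.0, 1200.0, 2600.0, 0.70]]
weights = [1.0, 1.0, 1.0, 100.0]   # voice-source term up-weighted (arbitrary choice)
best = select_unit(candidates, target, weights)
```

Swapping MFCC vectors for formant/source vectors in such a scheme changes only the features and weights, which is what makes the comparison in the paper possible.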

A0077.pdf



A NEW FRAMEWORK TO PROVIDE HIGH-CONTROLLABILITY SPEECH SIGNAL AND THE DEVELOPMENT OF A WORKBENCH FOR IT

Authors: Masanobu ABE, Hideyuki MIZUNO, Satoshi TAKAHASHI and Shin'ya NAKAJIMA

NTT Human Interface Labs. 1-1 Hikarinooka Yokosuka-Shi Kanagawa 239 Japan Tel: +81 468 59 2547, Fax: +81 468 55 1054, E-mail: ave@nttspch.hil.ntt.co.jp

Volume 2 pages 541 - 544

ABSTRACT

This paper proposes a new framework to enhance the access to and control of speech signals. To enhance accessibility, the proposed framework assigns multi-layered tags such as orthographic and phonetic transcriptions. The tags also make it possible to precisely synchronize a speech signal with animation. In terms of control, the proposed framework provides hybrid speech, combining both human speech and speech synthesis-by-rule. Its quality ranges from simple TTS (the worst case) to encoded natural speech (the best case) depending on the resources available: texts, fundamental frequency (F0) contour, power contour, phoneme duration, and so on. To create speech messages based on the proposed framework, we developed a workbench employing speech synthesis and recognition techniques. Important features of the workbench are a powerful GUI (graphical user interface) with which to manipulate prosodic information and a function to synthesize speech in a trial-and-error manner. An evaluation based on creating speech messages demonstrates the good performance of the workbench.

A0151.pdf



SHAPE-INVARIANT PROSODIC MODIFICATION ALGORITHM FOR CONCATENATIVE TEXT-TO-SPEECH SYNTHESIS

Authors: Eduardo R. Banga, Carmen García-Mateo and Xavier Fernández-Salgado

Dpto. Tecnologías de las Comunicaciones. ETSI Telecomunicación. Universidad de Vigo. E-36200. Vigo. SPAIN e-mail: erbanga@tsc.uvigo.es carmen@tsc.uvigo.es xsalgado@tsc.uvigo.es

Volume 2 pages 545 - 548

ABSTRACT

Concatenative text-to-speech systems require an algorithm that allows prosodic modifications of the speech units during the concatenation process. Nowadays, sinusoidal modeling seems to be a promising technique for achieving very flexible algorithms that provide high-quality synthetic speech. The main difficulty with this type of algorithm is the treatment of the phase information, since inadequate processing of it gives rise to reverberation and audible artefacts. In this contribution we discuss the application of a shape-invariant sinusoidal model [1] to a text-to-speech system based on concatenation of speech units.

A0397.pdf



AN RNN-BASED SPECTRAL INFORMATION GENERATION FOR MANDARIN TEXT-TO-SPEECH

Authors: Shaw-Hwa Hwang*, Sin-Horng Chen@, and Saga Chang*

*E000/CCL, Industrial Technology Research Institute, Chutung, Hsinchu, Taiwan, R.O.C @Department of Communication Engineering, National Chiao Tung University, Taiwan, R.O.C email:hsf@porsche.ccl.itri.org.tw Tel:+886-3-5917255, Fax:+886-3-5820098

Volume 2 pages 549 - 552

ABSTRACT

In this paper, an RNN-based spectral model is proposed to generate spectral parameters for Mandarin text-to-speech (TTS). The RNN is employed to learn the relations between linguistic features and spectral parameters. The phoneme-to-spectral-parameter rules and the coarticulation rules between adjacent phones are automatically learned and memorized in the weights of the RNN, so the synthesized speech sounds more fluent and smooth. The RNN is divided into two parts. The first part is synchronized with the syllable and is expected to simulate the phoneme-to-spectral-parameter rules. The second part is synchronized with the frame and is expected to simulate the coarticulation rules between adjacent phones. The line spectrum pair (LSP) parameters and the normalized energy contour are taken as target values. After training on a large database, the synthetic LSP and energy contours match the original ones quite well. Moreover, an RNN-based prosodic model proposed in our previous study was combined with the spectral model to efficiently simulate both spectral and prosodic information generation. Lastly, an LPC-based Mandarin TTS was implemented to examine the performance of our spectral model. The synthetic speech sounds fluent and natural; the coarticulation effect between adjacent phones, which makes synthesized speech sound unfluent and echo-like, was reduced. However, given the simple structure of the LPC-based synthesizer, the clarity of the synthetic speech could be improved by using other spectral parameters as target values: for example, the modified mel-cepstrum parameters [5, 6, 7] or FFT-based spectral parameters could also be learned by the RNN to synthesize clearer speech. This is initial work on an RNN-based spectral model for text-to-speech. The model offers several advantages. First, the large memory space required for synthesis units in traditional TTS is replaced by the small memory space of the RNN's weights. Second, the coarticulation effect is alleviated, producing more fluent speech. Third, the RNN-based prosodic and spectral information generators [8, 9] can easily be combined to form a more compact RNN-based TTS system.

A0441.pdf



METHODS FOR OPTIMAL TEXT SELECTION

Authors: Jan P. H. van Santen 1 Adam L. Buchsbaum 2

1 Lucent Technologies – Bell Labs, 600 Mountain Ave., Murray Hill, NJ 07974, U.S.A., jphvs@research.bell-labs.com 2 AT&T Labs, 180 Park Ave., P.O. Box 971, Florham Park, NJ 07932-0971, U.S.A., alb@research.att.com

Volume 2 pages 553 - 556

ABSTRACT

Construction of both text-to-speech synthesis (TTS) and automatic speech recognition (ASR) systems involves the use of speech databases. These databases usually consist of read text, which means that one has significant control over the content of the database. Here we address how one can take advantage of this control by discussing a number of variants of "greedy" text selection methods and showing their application in a variety of examples.
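The greedy selection loop the abstract refers to can be sketched as follows. This is a toy illustration, not the authors' code; `letter_pairs` is a stand-in for a real unit extractor such as a diphone transcriber:

```python
def greedy_select(sentences, units_of):
    """Greedy text selection: repeatedly add the sentence covering the
    most units (e.g. diphones) not yet covered, until nothing new is gained."""
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining:
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:          # no remaining sentence adds new units
            break
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen, covered

def letter_pairs(s):
    """Toy unit extractor: adjacent letter pairs stand in for diphones."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

chosen, covered = greedy_select(["abcd", "abc", "cdef", "xy"], letter_pairs)
```

Note that "abc" is never chosen: once "abcd" is in, it contributes no new units, which is exactly the redundancy a greedy selector prunes.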

A0488.pdf



HIGH RESOLUTION PROSODY MODIFICATION FOR SPEECH SYNTHESIS

Authors: Francisco M. Gimenez de los Galanes and David Talkin

Entropic Research Laboratory, Inc. 600 Pennsylvania Ave. SE, Suite 202. Washington, DC. 20003 Tel. +1 202 547 1420, FAX: +1 202 546 6648, E-mail: galanes@entropic.com

Volume 2 pages 557 - 560

ABSTRACT

In this paper we introduce RTIPS, a system for arbitrary high-resolution modification of the prosodic variables of speech: fundamental frequency, rhythm (segmental duration) and intensity. It is based on the Resample and Overlap-Add (R-OLA) algorithm for fundamental frequency and duration modification of speech. The algorithm works pitch-synchronously in order to accurately modify the pitch contour, and it uses estimates of the glottal closure instants (epochs) as the synchronism marks. This technique is very similar to other OLA-based methods for time or pitch modification, but because of the introduction of the resampling step, voice quality (especially for high-pitched voices) is much more natural after resynthesis, at any given output sampling frequency. The reliability of the R-OLA algorithm is highly dependent on the accuracy of the method used for epoch detection, so this preprocessing step has to be carefully designed.
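The resampling step at the heart of R-OLA can be illustrated with a minimal sketch. Linear interpolation is assumed here for simplicity; the real system's anti-aliasing filtering and the overlap-add stage are omitted:

```python
def resample_period(cycle, new_len):
    """Linearly resample one pitch period to a new length.  Stretching or
    compressing a period changes the local F0; a subsequent overlap-add
    step (not shown) re-spaces the periods to restore segmental timing."""
    n = len(cycle)
    out = []
    for j in range(new_len):
        x = j * (n - 1) / (new_len - 1)   # position in the original period
        i = min(int(x), n - 2)
        frac = x - i
        out.append((1 - frac) * cycle[i] + frac * cycle[i + 1])
    return out

# Stretch a 4-sample period to 7 samples, lowering the local F0
stretched = resample_period([0.0, 1.0, 2.0, 3.0], 7)
```

Because the waveform inside each period is resampled rather than simply repeated or truncated, the spectral envelope scales consistently, which is the property the abstract credits for the improved quality on high-pitched voices.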

A0564.pdf



TEXT-TO-SPEECH CONVERSION WITH NEURAL NETWORKS: A RECURRENT TDNN APPROACH

Authors: O. Karaali, G. Corrigan, I. Gerson, and N. Massey

Speech Processing Laboratory Motorola, Inc. 1301 E. Algonquin Rd., Schaumburg, IL 60196, U.S.A. Tel. (847)576-2764, FAX: (847)576-8378, E-mail: karaali@mot.com

Volume 2 pages 561 - 564

ABSTRACT

This paper describes the design of a neural network that performs the phonetic-to-acoustic mapping in a speech synthesis system. The use of a time-domain neural network architecture limits discontinuities that occur at phone boundaries. Recurrent data input also helps smooth the output parameter tracks. Independent testing has demonstrated that the voice quality produced by this system compares favorably with speech from existing commercial text-to-speech systems.

A0573.pdf



DATA DRIVEN FORMANT SYNTHESIS

Authors: Jesper Hogberg

Department of Speech, Music and Hearing, KTH, S-10044 Stockholm, Sweden Tel. +46 8 790 78 94, FAX: +46 8 790 78 54, E-mail: Jesper.Hogberg@speech.kth.se

Volume 2 pages 565 - 568

ABSTRACT

In this study we introduce combined data driven and rule based methods to synthesise speech. The aim is to improve on the coarticulatory modelling by adapting the KTH TTS system to data from one speaker. Regression trees are trained on a manually corrected speech database to provide predictions for vowel formant frequencies. At runtime, the TTS system produces formant frequency trajectories that are derived from weighted contributions from both the rules and the regression trees. The weighting strategy allows flexible adjustment of the synthesis parameters and thus of the quality of the output speech. An informal perceptual test was conducted to compare the performance of the hybrid approach to that of the traditional rule based system. A great majority of the test subjects judged the speech output of the hybrid system to be more natural than the competing rule derived speech. The speech produced by the hybrid system was also generally preferred.

A0588.pdf



SPEECH SYNTHESIS USING NON-UNIFORM UNITS IN THE VERBMOBIL PROJECT

Authors: Simon King (1), Thomas Portele, Florian Hofer

Institut fur Kommunikationsforschung und Phonetik (IKP), Universitat Bonn Poppelsdorfer Allee 47, D-53115 Bonn, Germany http://www.ikp.uni-bonn.de (1) now at the Centre for Speech Technology Research, University of Edinburgh, 80, South Bridge, Edinburgh EH1 1HN, GB http://www.cstr.ed.ac.uk email: Simon.King@ed.ac.uk

Volume 2 pages 569 - 572

ABSTRACT

We describe a concatenative speech synthesiser for British English which uses the HADIFIX [8] inventory structure originally developed for German by Portele. An inventory of non-uniform units was investigated with the aim of improving segmental quality compared to diphones. A combination of soft (diphone) and hard concatenation was used, which allowed a dramatic reduction in inventory size. We also present a unit selection algorithm which selects an optimum sequence of units from this inventory for a given phoneme sequence. The work described is part of the concept-to-speech synthesiser for the language and speech project Verbmobil [12] which is funded by the German Ministry of Science (BMBF).

A0629.pdf



ON THE PRONUNCIATION MODE OF ACRONYMS IN SEVERAL EUROPEAN LANGUAGES

Authors: I. Trancoso and M. C. Viana

(1) INESC/IST, (2) CLUL INESC, R. Alves Redol, 9, 1000 Lisbon, Portugal. Tel. +351 1 3100268, FAX: +351 1 3145843, E-mail: Isabel.Trancoso@inesc.pt

Volume 2 pages 573 - 576

ABSTRACT

The paper describes our research work concerning the pronunciation mode of acronyms in German, French, and Portuguese. Most of the rules are related to the well-formedness of the constituents and the minimum and maximum weight thresholds required for reading and spelling an acronym. The results of the tests for the three languages were considered very promising, reaching decision errors below 4%. The rule set was also applied to a very small English corpus, with relative success. We believe that further optimisation is still possible if language-specific parametrisation is taken into account, in particular for the languages where only a limited corpus of acronyms was available.

A0658.pdf



EVALUATION OF SPEECH SYNTHESIS SYSTEMS FOR DUTCH IN TELECOMMUNICATION APPLICATIONS IN GSM AND PSTN NETWORKS

Authors: T. Rietveld (1), J. Kerkhoff (1), M.J.W.M. Emons (2), E.J. Meijer (2), A.A. Sanderman (2), A.M.C. Sluijter (2)

(1) University of Nijmegen, the Netherlands, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands, Tel. +31 24 3612905, E-mail: a.rietveld@let.kun.nl (2) KPN Research, Leidschendam, the Netherlands

Volume 2 pages 577 - 580

ABSTRACT

In this contribution the subjective evaluation of three text-to-speech systems (two diphone systems and one allophone system) is reported in two transmission conditions: standard telephone (PSTN) and GSM. The three TTS systems rendered three different types of text: travel information, stock-exchange reports and e-mail messages. The subjects had to carry out three tasks: a) give preference judgements on the three TTS systems, b) rate the readings on 16 five-point scales, and c) a transliteration task. The rank order on the scale of general quality was: public transport > stock exchange > e-mail reading, in both transmission conditions. The GSM transmission tends to decrease the perceptual scores on a number of subjective scales. In the transliteration task significantly more errors were made in the GSM condition than in the PSTN condition. In both conditions fewer errors were made with the diphone TTS systems than with the allophone system.

A0659.pdf



AUTOMATIC DIPHONE EXTRACTION FOR AN ITALIAN TEXT-TO-SPEECH SYNTHESIS SYSTEM

Authors: Bianca Angelini (*) , Claudia Barolo (**) , Daniele Falavigna (*) , Maurizio Omologo (*) and Stefano Sandri (***)

(*) IRST - Istituto per la Ricerca Scientifica e Tecnologica, 38050 Povo di Trento, Italy (**) Eikon Informatica, Via Sostegno 65/bis, 10146 Torino, Italy (***) CSELT - Centro Studi e Laboratori Telecomunicazioni S.p.A., Via G. Reiss Romoli 274, 10148 Torino, Italy

Volume 2 pages 581 - 584

ABSTRACT

This paper describes a system for the automatic extraction of diphone units from given speech utterances. The method is based on an automatic phonetic segmentation and on a subsequent rule-driven diphone boundary detection. The phonetic segmenter, developed at IRST, was trained and tested in both speaker-independent and speaker-dependent modes. A rule formalism, involving acoustic parameters and arithmetical and logical operators, was defined to express the acoustic/phonetic knowledge acquired during previous experience with manual diphone segmentation. A specialized tool for rule parsing was designed that processes a given sequence of automatically derived phone boundaries using a corresponding sequence of predefined acoustic parameters. Several sets of rules were developed that include both general principles and specific details concerning the content of the diphone database of "Eloquens"®, the CSELT text-to-speech synthesis system for the Italian language. The accuracy was evaluated by comparing the manual and the automatic segmentations of the speech utterances of a female speaker, resulting in nearly 95% correct boundary positions, given a tolerance of 20 ms.

A0688.pdf



SIMPLIFICATION OF TTS ARCHITECTURE VS. OPERATIONAL QUALITY

Authors: Eric Keller

Laboratoire d'analyse informatique de la parole (LAIP) Faculte des Lettres, Universite de Lausanne, Switzerland eric.keller@imm.unil.ch

Volume 2 pages 585 - 588

ABSTRACT

Many applications in mobile telephony and portable computing require high-quality speech synthesis systems with a very modest computational footprint. Our text-to-speech system for French gives satisfactory performance in phonetisation and prosody with considerably reduced computational resources. Using the Mons (Belgium) diphone database, the program's current version runs in real time on Pentium-type PCs or Mac PPCs. The code requires 442 k; the minimum RAM requirement is 4700 k and the minimum disk requirement is 5560 k. The phonetisation and prosody processing has been brought to a first level of optimal compromise between quality and computational footprint. Major further reductions in space requirements would probably necessitate a re-evaluation of the sound generation procedures.

A0735.pdf

Recordings



FELIX - A TTS SYSTEM WITH IMPROVED PRE-PROCESSING AND SOURCE SIGNAL GENERATION

Authors: Georg Fries and Antje Wirth

Deutsche Telekom Berkom GmbH Forschungsgruppe Sprachverarbeitung Am Kavalleriesand 3, D-64295 Darmstadt, Germany E-mail: {friesg, wirth}@tzd.telekom.de

Volume 2 pages 589 - 592

ABSTRACT

Felix is our recent PC-based TTS research system for testing, analyzing, and evaluating TTS algorithms. The object-oriented interface allows efficient algorithm improvement and overall system prototyping by combining different modules. The results of each TTS processing step can be monitored, and all kinds of data may be reviewed and modified. The paper outlines the algorithms currently implemented in the Felix system, focusing on lexical analysis, duration modeling, and source signal generation, where we suggest ways to improve the intelligibility and naturalness of synthetic speech.

A0741.pdf



INVESTIGATING THE LIMITATIONS OF CONCATENATIVE SYNTHESIS

Authors: M. Edgington

Speech Technology Unit Applied Research and Technology BT Laboratories, IPSWICH IP5 3RE, UK E-mail: mike.edgington@bt-sys.bt.co.uk

Volume 2 pages 593 - 596

ABSTRACT

Concatenative text-to-speech (TTS) systems are now quite widespread through the availability of simple time-domain speech modification algorithms. Many of these systems produce intelligible speech with a higher degree of naturalness than that achieved by the previous generation of formant synthesis systems. This perceived improvement in quality has led to the view in some circles that TTS is a solved problem, at least for many practical applications. Three experiments are reported in this paper, all performed with a concatenative TTS system. These experiments investigated aspects of the concatenative model by respectively addressing copy synthesis of emotional speech, modelling glottalisation, and the effect of speech database design on the quality of synthesised speech. This paper suggests that the lack of an explicit speech model in most concatenative synthesis strategies fundamentally limits the usefulness of many current systems to the relatively restricted task of 'neutral' spoken renderings of text, where deficiencies in other system components usually mask the limitations of the synthesis strategy itself.

A0743.pdf

Recordings



SPEECH CODING AND SYNTHESIS USING PARAMETRIC CURVES

Authors: Luis Miguel Teixeira de Jesus and Gavin C. Cawley

School of Information Systems, University of East Anglia, Norwich, U.K. E-mail: {lmj,gcc}@sys.uea.ac.uk

Volume 2 pages 597 - 600

ABSTRACT

Accurate modeling of co-articulation, the context-sensitive merging of the boundaries between allophones in continuous speech, is vital for natural-sounding speech synthesis. This paper describes initial research investigating the use of Bezier curves to form models of co-articulation in human speech. A 12th-order, pitch-synchronous line spectral pair (LSP) [1] analysis is performed on a corpus of 239 phonetically balanced sentences of English speech. The resulting data are divided to form an inventory of the diphones occurring in the speech database. The trajectory of each line spectral pair parameter through each diphone can then be represented by a single cubic Bezier curve segment, found using the Levenberg-Marquardt curve fitting method [2, 3]. Results are presented showing the accuracy of Bezier models of the coarticulation between different types of speech sounds.
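The fitting idea can be sketched as follows. Note the paper uses Levenberg-Marquardt; for a single scalar trajectory with pinned endpoints the problem is actually linear in the two inner control values, so ordinary least squares suffices for illustration:

```python
def bezier(p, t):
    """Evaluate a cubic Bezier segment with control values p at t in [0, 1]."""
    p0, p1, p2, p3 = p
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def fit_cubic_bezier(ys):
    """Fit one cubic Bezier segment to a 1-D trajectory (e.g. one LSP
    parameter through a diphone).  Endpoints are pinned to the first and
    last samples; the two inner control values are solved by least squares."""
    n = len(ys)
    ts = [i / (n - 1) for i in range(n)]
    p0, p3 = ys[0], ys[-1]
    b1 = [3 * (1 - t) ** 2 * t for t in ts]        # basis weight for P1
    b2 = [3 * (1 - t) * t ** 2 for t in ts]        # basis weight for P2
    r = [y - ((1 - t) ** 3 * p0 + t ** 3 * p3)     # residual after endpoint terms
         for y, t in zip(ys, ts)]
    # 2x2 normal equations for (P1, P2)
    a11 = sum(x * x for x in b1)
    a12 = sum(x * y for x, y in zip(b1, b2))
    a22 = sum(x * x for x in b2)
    c1 = sum(x * y for x, y in zip(b1, r))
    c2 = sum(x * y for x, y in zip(b2, r))
    det = a11 * a22 - a12 * a12
    return (p0, (c1 * a22 - c2 * a12) / det, (c2 * a11 - c1 * a12) / det, p3)

# Recover known control values from samples of the curve itself
true_p = (0.0, 2.0, 1.0, 3.0)
samples = [bezier(true_p, i / 9) for i in range(10)]
fitted = fit_cubic_bezier(samples)
```

Storing four control values per parameter per diphone, instead of every frame, is what makes the representation attractive for coding as well as synthesis.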

A0753.pdf



AUTOMATICALLY CLUSTERING SIMILAR UNITS FOR UNIT SELECTION IN SPEECH SYNTHESIS

Authors: Alan W Black and Paul Taylor

Centre for Speech Technology Research, University of Edinburgh, 80, South Bridge, Edinburgh, U.K. EH1 1HN http://www.cstr.ed.ac.uk email: awb@cstr.ed.ac.uk, Paul.Taylor@ed.ac.uk

Volume 2 pages 601 - 604

ABSTRACT

This paper describes a new method for synthesizing speech by concatenating sub-word units from a database of labelled speech. A large unit inventory is created by automatically clustering units of the same phone class based on their phonetic and prosodic context. For each target unit, the appropriate cluster is then selected, offering a small set of candidate units. An optimal path is found through the candidate units based on their distance from the cluster center and an acoustically based join cost. Details of the method and its justification are presented. The results of experiments using two different databases are given, optimising various parameters within the system. A comparison with other existing selection-based synthesis techniques is also given, showing the advantages of this method. The method is implemented within a full text-to-speech system, offering efficient, natural-sounding speech synthesis.
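The optimal-path search described above is a standard Viterbi search over the candidate lists. A minimal sketch follows (hypothetical scalar "units" and cost functions, not the authors' implementation):

```python
def select_path(candidates, target_cost, join_cost):
    """Viterbi search over per-target candidate lists: minimise the sum of
    target costs plus join costs between consecutive units."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at slot i
    best = [[(target_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for c in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1]))
            row.append((cost + target_cost(i, c), back))
        best.append(row)
    # Trace back from the cheapest final state
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy example: two target slots, scalar units, distance-based costs
targets = [1.0, 5.0]
path = select_path([[1.0, 2.0], [5.0, 6.0]],
                   lambda i, c: abs(c - targets[i]),
                   lambda a, b: 0.1 * abs(a - b))
```

In the clustering scheme of the paper, the target cost would be the candidate's distance from its cluster center and the join cost an acoustic mismatch measure at the concatenation point.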

A0790.pdf



IMPROVEMENTS ON A TRAINABLE LETTER-TO-SOUND CONVERTER

Authors: Li Jiang, Hsiao-Wuen Hon and Xuedong Huang

Microsoft Research One Microsoft Way Redmond, Washington 98052, USA

Volume 2 pages 605 - 608

ABSTRACT

Letter-to-sound (LTS) conversion is important for both text-to-speech (TTS) and automatic speech recognition (ASR). In this paper we discuss some improvements we have made to our trainable LTS converter. We use a classification and regression tree (CART) to automatically configure the most salient phonological rules needed for LTS conversion. We address problems in growing multiple trees and in the use of phonotactic information for better generalization. The experiments were carried out on both the NETTALK database and the CMU dictionary. With the improved techniques, the conversion error rates at the phoneme level and word level were reduced by 15% and 20% respectively. For both tasks, the phoneme conversion error rate was reduced to about 8%.

A0916.pdf




ON A CEPSTRAL PITCH ALTERATION TECHNIQUE FOR PROSODY CONTROL IN THE SPEECH SYNTHESIS SYSTEM WITH HIGH QUALITY

Authors: MyungJin BAE, KyuHong KIM and WonCheol LEE

Dept. of Telecommunication Engr, Soongsil University, Seoul 156-743, Korea mjbae@saint.soongsil.ac.kr

Volume 2 pages 609 - 612

ABSTRACT

Among speech synthesis techniques, waveform coding methods maintain the intelligibility and naturalness of synthetic speech. In order to apply waveform coding techniques to synthesis-by-rule, we must be able to alter the pitch of the synthetic speech. In this paper, we propose a new pitch-alteration method that compensates for the phase distortion of the cepstral pitch-alteration method with a time-scaling method in the time domain. This method removes some of the spectrum distortion that occurs at the conjunction points between waveforms. Spectrum distortion below 1.18% is obtained for pitch alteration of up to 200%.

A0963.pdf



Diphone Concatenation using a Harmonic plus Noise Model of Speech

Authors: Yannis Stylianou, Thierry Dutoit, and Juergen Schroeter

AT&T Labs-Research 180 Park Ave, PO Box 971 Florham Park, NJ 07932-0971 email: [styliano, dutoit, jsh]@research.att.com

Volume 2 pages 613 - 616

ABSTRACT

In this paper we present a high-quality text-to-speech system using diphones. The system is based on a Harmonic plus Noise Model (HNM) representation of the speech signal. HNM is a pitch-synchronous analysis-synthesis system, but it does not require the pitch marks that must be determined in PSOLA-based methods. HNM assumes the speech signal to be composed of a periodic part and a stochastic part. As a result, different prosody and spectral-envelope modification methods can be applied to each part, yielding more natural-sounding synthetic speech. The fully parametric representation of speech using HNM also provides a straightforward way of smoothing diphone boundaries. Informal listening tests, using natural prosody, have shown that the synthetic speech quality is close to that of the original sentences, without smoothing problems and without the buzziness or other oddities observed with other speech representations used for TTS.
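The periodic/stochastic decomposition can be illustrated at synthesis time with a minimal sketch. White noise stands in for the stochastic part here; real HNM uses a time-modulated, spectrally shaped noise component:

```python
import math, random

def hnm_synthesize(f0, harmonic_amps, n_samples, sample_rate,
                   noise_gain=0.0, seed=0):
    """Minimal harmonic-plus-noise synthesis: a sum of harmonics of f0
    (the periodic part) plus white noise scaled by noise_gain (the
    stand-in stochastic part)."""
    rng = random.Random(seed)
    out = []
    for i in range(n_samples):
        t = i / sample_rate
        periodic = sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t)
                       for k, a in enumerate(harmonic_amps))
        out.append(periodic + noise_gain * rng.uniform(-1.0, 1.0))
    return out

# One harmonic at 100 Hz sampled at 400 Hz: a quarter-wave pattern 0, 1, 0, -1
frame = hnm_synthesize(100.0, [1.0], 4, 400)
```

Because the two parts are generated separately, pitch or spectral-envelope modification can be applied to the harmonic amplitudes alone without disturbing the noise component, which is the flexibility the abstract highlights.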

A1281.pdf
