ICSLP'98 Proceedings
A Mixed-Excitation Frequency Domain Model for Time-Scale Pitch-Scale Modification of Speech
Alex Acero, Microsoft Research (USA)
Paper number 72
This paper presents a time-scale pitch-scale modification technique for concatenative speech synthesis. The method is based on a frequency-domain source-filter model in which the source is modeled as a mixed excitation. The model is tightly coupled with a compression scheme that results in compact acoustic inventories. Compared to the approach in the Whistler system, which uses no mixed excitation, the new method shows improvements in voiced fricatives and over-stretched voiced sounds. In addition, it allows spectral manipulations such as smoothing of discontinuities at unit boundaries, voice transformation, and loudness equalization.
1078_01.WAV | Speech synthesized with poor prosody due to wrong word segmentation. File type: Sound; Format: WAV. Tech. description: 11025 samples per second, 8 bits per sample, mono, u-law encoded. Creating application: Unknown. Creating OS: Windows 95/NT |
1078_02.WAV | Speech synthesized using the composite-word approach, which produces noticeably more correct and natural prosody. File type: Sound; Format: WAV. Tech. description: 11025 samples per second, 8 bits per sample, mono, u-law encoded. Creating application: Unknown. Creating OS: Windows 95/NT |
Jungchul Lee, ETRI (Korea)
Donggyu Kang, ETRI (Korea)
Sanghoon Kim, ETRI (Korea)
Koengmo Sung, Department of Electronic Engineering, Seoul National University (Korea)
The energy contour of a sentence is one of the major factors affecting the naturalness of synthetic speech. In this paper, we propose a method to control the energy contour in order to enhance the naturalness of Korean synthetic speech. Our algorithm adopts the syllable as its basic unit and predicts the peak amplitude of each syllable in a word using a neural network (NN). We use indirect linguistic features as well as acoustic features of phonemes as input to the NN, to accommodate the grammatical effects of words in a sentence. Simulation results show that the prediction error is less than 10%, that the algorithm is very effective for analysis/synthesis of the energy contour of a sentence, and that it generates a fairly good declarative contour for TTS.
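The abstract above does not specify the network architecture, so the following is only an illustrative sketch of the general idea: a small one-hidden-layer neural network trained by gradient descent to map per-syllable features to a peak amplitude. The feature names and the toy data are assumptions, not the authors' actual inputs.

```python
import math, random

def train_peak_predictor(samples, hidden=4, epochs=3000, lr=0.5, seed=0):
    """samples: list of (feature_vector, peak_amplitude scaled to [0, 1])."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    for _ in range(epochs):
        for x, y in samples:
            h = [sig(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(w1, b1)]
            out = sig(sum(v * hi for v, hi in zip(w2, h)) + b2)
            d_out = (out - y) * out * (1.0 - out)  # squared-error gradient
            for j in range(hidden):
                d_h = d_out * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * d_h * x[i]
                b1[j] -= lr * d_h
            b2 -= lr * d_out
    def predict(x):
        h = [sig(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        return sig(sum(v * hi for v, hi in zip(w2, h)) + b2)
    return predict

# Toy features: [syllable position in word, phoneme sonority, stress flag]
data = [([0.0, 0.8, 1.0], 0.9), ([0.5, 0.6, 0.0], 0.5),
        ([1.0, 0.4, 0.0], 0.3), ([0.0, 0.7, 0.0], 0.6)]
predict_peak = train_peak_predictor(data)
```

In practice a trained predictor like this would be applied syllable by syllable across a word to produce the energy contour.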
Yong-Ju Lee, (Korea)
Sook-hyang Lee, (Korea)
Jong-Jin Kim, (Korea)
Hyun-Ju Ko, (Korea)
Young-Il Kim, (Korea)
Sang-Hun Kim, Spoken Language Processing Section, Electronics and Telecommunication Research Institute (Korea)
Jung-Cheol Lee, Spoken Language Processing Section, Electronics and Telecommunication Research Institute (Korea)
This study describes an algorithm for generating F0 contours for Korean sentences and its evaluation results. 400 K-ToBI-labeled utterances were used, read by one male and one female announcer. The F0 contour generation system uses two classification trees to predict K-ToBI labels for the input text and 11 regression trees to predict F0 values for those labels. The system achieved 77.2% prediction accuracy for IP boundaries and 72.0% for AP boundaries. Voicing and duration information for the segments was left unchanged for F0 contour generation and its evaluation. In an F0 generation experiment using labelling information from the original speech data, the system showed an RMS error of 23.5 Hz and a correlation coefficient of 0.55.
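The regression-tree stage of such a system can be reduced, for illustration, to a single-split stump: choose the threshold on one feature that minimizes squared error and predict the mean F0 on each side. The feature (normalized phrase position) and the toy data below are assumptions, not the paper's trees.

```python
def best_stump(samples):
    """samples: list of (scalar feature, F0 target).
    Returns a predictor using the single best threshold split."""
    best = None
    xs = sorted(set(x for x, _ in samples))
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2
        left = [y for x, y in samples if x <= t]
        right = [y for x, y in samples if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for x, y in samples if x <= t) + \
              sum((y - rm) ** 2 for x, y in samples if x > t)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

# Toy F0 data for one K-ToBI label: feature = normalized position in phrase,
# target = F0 in Hz (declination: F0 falls toward the end of the phrase).
f0_tree = best_stump([(0.1, 220.0), (0.3, 210.0), (0.7, 180.0), (0.9, 170.0)])
```

A full regression tree repeats this split recursively on each side; the paper's system uses one such tree per predicted label.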
Kevin Lenzo, Carnegie Mellon University (USA)
Christopher Hogan, Carnegie Mellon University (USA)
Jeffrey Allen, Carnegie Mellon University (USA)
The DIPLOMAT project at Carnegie Mellon University instantiates a program of rapid-deployment speech-to-speech machine translation; we have developed techniques for quickly producing text-to-speech (TTS) systems for new target languages to support this work. While the resulting systems are not immediately comparable in quality to commercial systems on unrestricted tasks in well-developed languages, they are more than adequate for limited-domain scenarios and rapid prototyping: they generalize to unseen data with some degradation, while in-domain quality can be quite good. Voices and engines for synthesizing new target languages can be developed in as little as two weeks after text corpus collection. We have successfully used these techniques to build TTS modules for English, Croatian, Spanish, Haitian Creole, and Korean.
Robert H. Mannell, SHLRC, Macquarie University (Australia)
This paper examines a method for formant parameter extraction from a labelled single-speaker database, for use in a formant-parameter diphone-concatenation speech synthesis system. The procedure commences with an initial formant analysis of the labelled database, which is used to obtain formant (F1-F5) probability spaces for each phoneme. These probability spaces then guide a more careful speaker-specific extraction of formant frequencies. An analysis-by-synthesis procedure is then used to provide best-matching formant intensity and bandwidth parameters. The great majority of the parameters so extracted produce speech which is highly intelligible and which has a voice quality close to that of the original speaker.
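The core guidance idea can be sketched as a constrained peak-picking step: from raw spectral peak candidates, keep the candidate that falls inside each formant's expected band for the labeled phoneme. The frequency ranges below are illustrative assumptions, not values measured from the paper's database.

```python
# Per-phoneme (low, high) Hz bands for F1 and F2 -- illustrative only.
FORMANT_RANGES = {"a": [(600, 1000), (1000, 1500)],
                  "i": [(200, 400), (1900, 2500)]}

def pick_formants(phoneme, peak_candidates):
    """For each formant band of the phoneme, keep the candidate peak
    closest to the band centre, or None if no candidate falls inside."""
    picked = []
    for low, high in FORMANT_RANGES[phoneme]:
        in_band = [p for p in peak_candidates if low <= p <= high]
        centre = (low + high) / 2
        picked.append(min(in_band, key=lambda p: abs(p - centre)) if in_band else None)
    return picked
```

In the paper's method the bands are derived from per-phoneme probability spaces estimated from the database itself, rather than fixed by hand as here.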
0627_01.WAV | Included with postscript file in 0627.zip. File type: Sound; Format: WAV. Tech. description: Sampling rate 10 kHz, mono, little-endian, not encoded. Creating application: Unknown. Creating OS: Windows NT |
0627_02.WAV | Included with postscript file in 0627.zip. File type: Sound; Format: WAV. Tech. description: Sampling rate 10 kHz, mono, little-endian, not encoded. Creating application: Unknown. Creating OS: Windows NT 4.0 |
Osamu Mizuno, NTT Human Interface Labs. (Japan)
Shin'ya Nakajima, NTT Human Interface Labs. (Japan)
The Multi-layered Speech/Sound Synthesis Control Language (MSCL) proposed herein facilitates synthesizing several speech modes, such as nuance, mental state, and emotion, and allows speech to be synchronized with other media easily. MSCL is a multi-layered linguistic system encompassing three layers: a semantic level layer (the S-layer), an interpretation level layer (the I-layer), and a parameter level layer (the P-layer). The S-layer describes semantics such as emotional and emphasized speech. The I-layer describes prosodic feature controls and interprets S-layer scripts into I-layer controls. The P-layer represents the prosodic parameters for speech synthesis. This multi-level description system is convenient for both lay and professional users. MSCL also provides many effective prosodic feature control functions, such as a time-varying pattern description function, absolute and relative control forms, and an SDS (Speaker Dependent Scale). MSCL enables more emotional and expressive synthetic speech than conventional TTS systems. This paper describes these functions and the effective prosodic feature controls possible with MSCL.
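The three-layer flow can be sketched as follows; the tag names, scale factors, and base values are invented for illustration and are not MSCL's actual syntax or parameters.

```python
# S-layer tag -> I-layer prosodic-control commands (illustrative values).
S_TO_I = {
    "emphasis": {"f0_scale": 1.3, "duration_scale": 1.2},
    "calm":     {"f0_scale": 0.9, "duration_scale": 1.1},
}

def interpret(tag, base_f0=120.0, base_dur_ms=80.0):
    """Interpretation step: expand an S-layer tag through I-layer
    controls into concrete P-layer parameter values."""
    ctrl = S_TO_I[tag]
    return {"f0_hz": base_f0 * ctrl["f0_scale"],
            "duration_ms": base_dur_ms * ctrl["duration_scale"]}
```

The point of the layering is that a lay user only writes the S-layer tag, while a professional user can override the I-layer controls or the P-layer values directly.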
Ryo Mochizuki, Matsushita Communication Ind. Co., Ltd. (Japan)
Yasuhiko Arai, Matsushita Communication Ind. Co., Ltd. (Japan)
Takashi Honda, School of Science and Tech., Meiji Univ. (Japan)
In order to synthesize natural-sounding Japanese phonetic words, a novel VCV-concatenation synthesis with an advanced word database is proposed. The word database consists of VCV-balanced phonetic words deliberately uttered with type-0 and type-1 pitch accents. The advantage of the advanced word database is that a variety of VCV segments with the same phonetic chains but different pitch patterns can be collected efficiently at the same time. The following pitch modification techniques are used to achieve high sound quality: (1) the optimal VCV-segment set minimizing the pitch modification rate is selected; (2) pitch waveforms are extracted by referring to excitation points; (3) wavelengths of pitch waveforms are adjusted depending on the pitch modification rates; (4) the natural prosody of the VCV segments in the database is used effectively. The superiority of the proposed database is confirmed by pitch pattern matching measurement and subjective quality evaluation.
Vincent Pagel, Faculté Polytechnique de Mons (Belgium)
Kevin Lenzo, Carnegie Mellon University (USA)
Alan W. Black, CSTR, University of Edinburgh (U.K.)
This paper presents trainable methods for generating letter-to-sound rules from a given lexicon, for use in pronouncing out-of-vocabulary words and as a method for lexicon compression. As the relationship between a string of letters and the string of phonemes representing its pronunciation is not trivial for many languages, we discuss two alignment procedures, one fully automatic and one hand-seeded, which produce reasonable alignments of letters to phones. Top-Down Induction Tree models are trained on the aligned entries. We show that combined phoneme/stress prediction is better than separate prediction processes, and better still when the model includes the last phonemes transcribed and part-of-speech information. For the lexicons we have tested, our models have a word accuracy (including stress) of 78% for OALD, 62% for CMU, and 94% for BRULEX. The extremely high scores on the training sets allow substantial size reductions (more than 1/20). WWW site: http://tcts.fpms.ac.be/synthesis/mbrdico
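The alignment-then-prediction setup can be sketched with a much simpler model than the paper's decision trees: each letter is aligned to one phone (or an epsilon "_"), and the predictor picks the most frequent phone seen for a letter in its one-letter context, backing off to the letter alone. The toy alignments are invented, not entries from OALD, CMU, or BRULEX.

```python
from collections import Counter, defaultdict

def train_lts(aligned):
    """aligned: list of (word, phones) where len(word) == len(phones)
    and '_' marks a letter aligned to no phone (epsilon)."""
    ctx, uni = defaultdict(Counter), defaultdict(Counter)
    for word, phones in aligned:
        padded = "#" + word + "#"
        for i, ph in enumerate(phones):
            ctx[(padded[i], word[i], padded[i + 2])][ph] += 1  # with context
            uni[word[i]][ph] += 1                              # backoff
    def predict(word):
        padded = "#" + word + "#"
        out = []
        for i, ch in enumerate(word):
            key = (padded[i], ch, padded[i + 2])
            table = ctx.get(key) or uni.get(ch) or Counter({"?": 1})
            out.append(table.most_common(1)[0][0])
        return [p for p in out if p != "_"]
    return predict

lts = train_lts([("cat", ["k", "ae", "t"]), ("cab", ["k", "ae", "b"]),
                 ("back", ["b", "ae", "k", "_"])])
```

The paper's tree models generalize far better than this frequency table, but the input representation (aligned letter/phone pairs with letter context) is the same kind of object.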
Ze'ev Roth, DSP Group (Israel)
Judith Rosenhouse, Technion (Israel)
This paper describes an algorithm for name (surname and personal name) announcement in American English, implemented on DSP Group's SmartCores (registered trademark) digital signal processor (DSP) core. The name announcement module is targeted at low-cost applications; the amount of memory that can be allocated for dictionaries, program code, and runtime parameters is therefore limited. The required response time of 0.5 seconds limits the computation performed in the linguistic analysis phase for each name. The synthesis scheme is limited by the real-time capacity of the processor, since this task may be performed in parallel with other real-time tasks.
Frédérique Sannier, Institut de la Communication Parlée (France)
Rabia Belrhali, Institut de la Communication Parlée (France)
Véronique Aubergé, Institut de la Communication Parlée (France)
We survey the phonographical behaviour of loanwords introduced into the French lexicon, through observation of the systematic functioning of the French letter-to-phone TOPH system. We thus define sub-systems, isolated into lexicons. Processing loanwords through the TOPH TTS system made it possible to give clues about the importance of one source language or another in the French lexicon, and observing the graphonical functioning of the utterances made it possible to delimit classes. The second part of this study deals more specifically with the inflexion paradigms of loanwords, for which different functionings are likewise described.
Tomaz Sef, Jozef Stefan Institute (Slovenia)
Ales Dobnikar, Jozef Stefan Institute (Slovenia)
Matjaz Gams, Jozef Stefan Institute (Slovenia)
This paper presents a new text-to-speech (TTS) system capable of synthesising continuous Slovenian speech. Input text is processed by a series of independent modules: text normalisation, grapheme-to-phoneme conversion, prosody generation, and segmental concatenation. This modularity makes it easy to improve separate parts of the system. In order to generate rules for our synthesis scheme, data was collected by analysing the readings of ten speakers, five male and five female. Our system is used in several applications. It is built into an employment agent, EMA, that provides employment information through the Internet. We are currently developing a system that will enable blind and partially sighted people to work in the Windows environment.
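The modular architecture described above can be sketched as a chain of independent stage functions, any one of which can be swapped out without touching the others. The stage internals here are placeholders, not the actual Slovenian normalisation or conversion rules.

```python
def normalize(text):
    """Text normalisation stage (placeholder rule: expand one abbreviation)."""
    return text.lower().replace("dr.", "doktor")

def graphemes_to_phonemes(text):
    """Grapheme-to-phoneme stage (placeholder: letters stand in for phones)."""
    return [ch for ch in text if ch.isalpha() or ch == " "]

def add_prosody(phones):
    """Prosody stage (placeholder: flat 100 ms duration per phone)."""
    return [(p, 100) for p in phones]

def run_pipeline(text, stages):
    """Feed each stage's output into the next, as in the paper's design."""
    for stage in stages:
        text = stage(text)
    return text

result = run_pipeline("Dr. Novak", [normalize, graphemes_to_phonemes, add_prosody])
```

The design benefit claimed in the abstract follows directly: replacing, say, `add_prosody` with a better model changes one function and leaves the rest of the chain intact.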
Shigenobu Seto, Toshiba Corporation (Japan)
Masahiro Morita, Toshiba Corporation (Japan)
Takehiko Kagoshima, Toshiba Corporation (Japan)
Masami Akamine, Toshiba Corporation (Japan)
Linguistic feature analysis of input text plays an important role in achieving natural prosodic control in text-to-speech (TTS) systems. In a conventional scheme, when input texts have been analyzed incorrectly, experts refine suspect if-then rules and change the tree structure manually to obtain correct analysis results. However, altering the tree structure drastically is difficult, since attention is often paid only to the suspect if-then rules, and if the initial rule-tree structure is inappropriate, any attempt to improve performance may be limited by the stiffness of the structure. To cope with these problems, the new development scheme generates analysis rules using C4.5 [1], in which an if-then rule-tree structure is generated by off-line training. The scheme has the advantage that, since the generated rule-tree structure is simple, the rules are easier to maintain. The scheme is applied to generating four types of analysis rule-trees: rules for forming accent phrases, rules for determining accent position, rules for analyzing syntactic structure, and rules for pause insertion. An experimental evaluation was performed on these four rule sets. Despite the small amount of training data, accuracy was 96.5 percent for accent phrase formation, 95.5 percent for accent positioning, 87.0 percent for pause insertion, and 88.3 percent for syntactic analysis. These results indicate the validity of the scheme. The new scheme is used for developing linguistic feature analysis rules in a Japanese TTS system, TOS Drive TTS [3].
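Inducing an if-then rule tree from labeled examples, in the spirit of C4.5, can be sketched as follows (information-gain splits on nominal features; no gain-ratio correction or pruning, which real C4.5 adds). The features and the toy accent-phrase data are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def induce(rows, features):
    """rows: list of (feature_dict, label). Returns a nested-dict tree;
    leaves are label strings."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    def gain(f):
        groups = {}
        for x, y in rows:
            groups.setdefault(x[f], []).append(y)
        return entropy(labels) - sum(
            len(g) / len(rows) * entropy(g) for g in groups.values())
    best = max(features, key=gain)
    rest = [f for f in features if f != best]
    tree = {"feature": best, "branches": {}}
    for value in {x[best] for x, _ in rows}:
        sub = [(x, y) for x, y in rows if x[best] == value]
        tree["branches"][value] = induce(sub, rest)
    return tree

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["feature"]]]
    return tree

# Toy task: does a word start a new accent phrase?
rows = [({"pos": "noun", "after_pause": True}, "boundary"),
        ({"pos": "particle", "after_pause": False}, "no_boundary"),
        ({"pos": "noun", "after_pause": False}, "boundary"),
        ({"pos": "verb", "after_pause": False}, "no_boundary")]
tree = induce(rows, ["pos", "after_pause"])
```

The maintainability advantage the abstract cites comes from the tree being regenerated wholesale from training data, rather than patched rule by rule.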
Yoshinori Shiga, Multimedia Engineering Laboratory, TOSHIBA Corporation (Japan)
Hiroshi Matsuura, Multimedia Engineering Laboratory, TOSHIBA Corporation (Japan)
Tsuneo Nitta, Multimedia Engineering Laboratory, TOSHIBA Corporation (Japan)
This paper proposes a new method that determines segmental duration for text-to-speech conversion based on the movement of the articulatory organs composing an articulatory model. The model comprises four time-variable articulatory parameters representing the conditions of articulatory organs whose physical restrictions seem to significantly influence segmental duration. The parameters are controlled according to an input sequence of phonetic symbols, and segmental duration is then determined from the variation of the articulatory parameters. The proposed method is evaluated through an experiment using a Japanese speech database consisting of 150 phonetically balanced sentences. The results indicate that the mean square error of predicted segmental duration is approximately 15 ms for the closed set and 15-17 ms for the open set. The error is within 20 ms, the level at which distortion of segmental duration is acceptable without loss of naturalness, and hence the method is shown to predict segmental duration effectively.
Evelyne Tzoukermann, Bell Labs - Lucent Technologies (USA)
The Bell Labs text-to-speech synthesis system for French is part of a multilingual effort for text-to-speech generation. The text analysis component consists of four main parts: the morphological analysis module, the language models, the grapheme-to-phoneme conversion rules, and the prosodic module. The system is built in a pipeline architecture, the output of which feeds the subsequent synthesis modules. The originality of this work lies in the fact that we use weighted finite-state transducer technology for the entire analysis in the French system. Moreover, the implementation not only accounts for most orthographic representations, such as numerals, abbreviations, dates, currencies, etc., but also solves the hard questions of French liaison, mute e, and aspirated h, using refined intermediate representations in the form of either traces or archigraphemes.
Jennifer J. Venditti, Bell Labs and Ohio State Univ (USA)
Jan P.H. van Santen, Bell Labs (USA)
Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of vowel duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing, and effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both long and short vowels. A Sum-of-Products approach is used to model key interactions observed in the data, and to predict values of factor combinations not found in the speech database. We report root mean squared deviations between observed and predicted durations ranging from 8 to 15 ms, and an overall correlation of 0.89.
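A sum-of-products duration model of the general kind described above predicts duration as a sum of product terms, each term multiplying a base value by per-factor scales for the factor levels it uses. The factor names and the parameter table below are illustrative assumptions, not the Bell Labs model's actual parameters.

```python
def sop_duration(factors, terms):
    """factors: dict mapping factor name -> level for one vowel token.
    terms: list of (base_ms, {factor: {level: scale}}). Each term
    contributes base * product of the scales for its factors' levels;
    the prediction is the sum over all terms."""
    total = 0.0
    for base, scales in terms:
        prod = base
        for factor, table in scales.items():
            prod *= table[factors[factor]]
        total += prod
    return total

# Illustrative parameter table: one term for intrinsic/accentual effects,
# one additive term for phrase-final lengthening.
TERMS = [
    (60.0, {"length": {"short": 1.0, "long": 1.8},
            "accented": {True: 1.15, False: 1.0}}),
    (10.0, {"phrase_final": {True: 2.0, False: 1.0}}),
]

d = sop_duration({"length": "short", "accented": False, "phrase_final": False}, TERMS)
```

Because each factor's effect is parameterized separately within a term, the model can predict durations for factor combinations never observed together in the database, which is the property the abstract highlights.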
Ren-Hua Wang, University of Science & Technology of China (China)
Qinfeng Liu, University of Science & Technology of China (China)
Yongsheng Teng, University of Science & Technology of China (China)
Deyu Xia, University of Science & Technology of China (China)
This paper presents our research efforts towards higher naturalness in Chinese text-to-speech. The main results can be summarized as follows. 1. In the proposed TTS system, syllable-sized units were cut out of real recorded speech, and synthetic speech was generated by concatenating these units back together. 2. The integration of rule-synthesized units with natural units was tested; an LMA-filter-based synthesizer was successfully developed to generate those units that are difficult to collect from the speech corpus. 3. A new, efficient Chinese character coding scheme, the "Yin Xu Code" (YX Code), has been developed to assist the GB Code. With the YX Code a new lexicon structure was designed; the new dictionary system not only supplies pronunciation information but is also very helpful for word segmentation. Based on the above results, a Chinese text-to-speech system named "KD-863" has been developed. The system converts any written Chinese text to speech in real time with high naturalness. In the national assessment of Chinese TTS systems held at the end of March 1998 in Beijing, the system achieved first place in naturalness MOS (Mean Opinion Score).