ABSTRACT
High-quality corpus-based synthetic speech requires minimization of the prosodic and acoustic distortions between an ideal phoneme sequence and the actual waveform segments used to reproduce it. Our synthesis system concatenates phoneme-sized waveform segments, without signal processing, selected from a large-scale speech database according to both prosodic and phonetic contextual suitability. This paper describes an approach to optimising such unit selection in speech synthesis by using voice source parameters and formant information instead of selection based on cepstral features. We present results showing that formants and voice source parameters are more effective as acoustic features in unit selection. These features can be estimated automatically from speech waveforms using the ARX joint estimation method. Results are compared with the mel-frequency cepstrum coefficients (MFCC) previously used for unit selection; both objective and subjective experiments showed that the new features outperformed the previous ones and confirmed that the synthesized speech sounded much more natural.
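As a sketch of how contextual suitability and acoustic continuity are typically combined in such a search (the cost functions, weights, and data layout here are illustrative assumptions, not this paper's implementation), a dynamic-programming unit selection can be written as:

```python
import numpy as np

def select_units(candidates, target_cost, join_cost, w_t=1.0, w_j=1.0):
    """Viterbi search over per-phoneme candidate lists.

    candidates[i] is the list of database units for target phoneme i;
    target_cost(unit, i) scores prosodic/phonetic mismatch, and
    join_cost(prev_unit, unit) scores the acoustic discontinuity
    (e.g. formant and voice-source distance) at the concatenation point.
    """
    n = len(candidates)
    cost = [np.array([w_t * target_cost(u, 0) for u in candidates[0]])]
    back = [None]
    for i in range(1, n):
        step, ptr = [], []
        for u in candidates[i]:
            trans = cost[i - 1] + np.array(
                [w_j * join_cost(p, u) for p in candidates[i - 1]])
            k = int(np.argmin(trans))
            step.append(trans[k] + w_t * target_cost(u, i))
            ptr.append(k)
        cost.append(np.array(step))
        back.append(ptr)
    # Trace back the cheapest path through the candidate lattice.
    path = [int(np.argmin(cost[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

The paper's contribution then amounts to the choice of feature space inside join_cost: formants and voice source parameters rather than MFCCs.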
ABSTRACT
This paper proposes a new framework to enhance the access to and control of speech signals. To enhance accessibility, the proposed framework assigns multi-layered tags such as orthographic transcriptions and phonetic transcriptions. The tags also make it possible to precisely synchronize a speech signal with animation. In terms of control, the proposed framework provides hybrid speech, combining both human speech and speech synthesis by rule. Its quality ranges from simple TTS (the worst case) to encoded natural speech (the best case) depending on the resources available: text, fundamental frequency (F0) contour, power contour, phoneme duration, and so on. To create speech messages based on the proposed framework, we developed a workbench employing speech synthesis and recognition techniques. Important features of the workbench are a powerful GUI (Graphical User Interface) with which to manipulate prosodic information and a function to synthesize speech in a trial-and-error manner. An evaluation in which speech messages were created shows the good performance of the workbench.
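A minimal illustration of the multi-layered tag idea (the field names and layer labels are assumptions for this sketch, not the framework's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Tag:
    start: float   # seconds into the waveform
    end: float
    layer: str     # e.g. "orthographic", "phonetic", "F0", "power"
    value: str

@dataclass
class TaggedSpeech:
    waveform: bytes
    tags: list[Tag] = field(default_factory=list)

    def layer(self, name):
        """Return one annotation layer, time-ordered, e.g. for
        synchronizing animation with phoneme boundaries."""
        return sorted((t for t in self.tags if t.layer == name),
                      key=lambda t: t.start)
```

Because every layer is time-aligned to the same waveform, a renderer can drive mouth animation from the phonetic layer while displaying the orthographic layer as captions.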
ABSTRACT
Concatenative text-to-speech systems require an algorithm that allows prosodic modification of the speech units during the concatenation process. Nowadays, sinusoidal modeling seems to be a promising technique for achieving very flexible algorithms that provide high-quality synthetic speech. The main difficulty of this type of algorithm is the treatment of the phase information, since inadequate processing of this information gives rise to reverberation and audible artefacts. In this contribution we discuss the application of a shape-invariant sinusoidal model [1] to a text-to-speech system based on concatenation of speech units.
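For reference, the standard harmonic form of a sinusoidal model (the general McAulay-Quatieri-style formulation, not necessarily the exact equations of [1]) is

```latex
s(t) = \sum_{k=1}^{K} A_k(t)\,\cos\theta_k(t),
\qquad
\theta_k(t) = \int_{0}^{t} k\,\omega_0(\tau)\,d\tau + \phi_k .
```

Shape invariance means that under time- or pitch-scale modification the phases are re-anchored at pitch-pulse onset times so that the relative phase offsets, and hence the waveform shape within each period, are preserved; failing to maintain this phase coherence is precisely what produces the reverberant quality mentioned above.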
ABSTRACT
In this paper, an RNN-based spectral model is proposed to generate spectral parameters for Mandarin text-to-speech (TTS). The RNN is employed to learn the relations between linguistic features and spectral parameters. The phoneme-to-spectral-parameter rules and the coarticulation rules between each two adjacent phones are automatically learned and memorized in the weights of the RNN, so that the synthesized speech sounds more fluent and smooth. The RNN is divided into two parts. The first part is synchronized with the syllable and is expected to simulate the phoneme-to-spectral-parameter rules. The second part is synchronized with the frame and is expected to simulate the coarticulation rules between each two adjacent phones. The line spectrum pair (LSP) parameters and the normalized energy contour are taken as target values. After training on a large database, the synthetic LSP and energy contours match the original contours quite well. Moreover, the RNN-based prosodic model proposed in our previous study was combined with the spectral model to efficiently simulate the generation of spectral and prosodic information. Lastly, an LPC-based Mandarin TTS was implemented to examine the performance of our spectral model. The synthetic speech sounds fluent and natural, and the coarticulation effect between adjacent phones, which makes synthesized speech sound disfluent and echo-like, was reduced. However, given the simple structure of the LPC-based synthesizer, the clarity of the synthetic speech could be further improved by using other spectral parameters as target values; for example, the modified mel-cepstrum parameters [5, 6, 7] or FFT-based spectral parameters could also be learned by the RNN to synthesize clearer speech. This is initial work on an RNN-based spectral model for text-to-speech. Our spectral model has several advantages. First, the large memory space needed for synthesis units in traditional TTS is replaced by the small memory space of the RNN's weights. Second, the coarticulation effect is alleviated, producing more fluent speech. Third, the RNN-based prosodic and spectral information generators [8, 9] can easily be combined to form a more compact RNN-based TTS system.
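A compact sketch of the two-rate structure described above (layer sizes, the simple Elman cells, and the repeat-based syllable-to-frame synchronization are assumptions of this sketch, not the paper's exact network):

```python
import torch
import torch.nn as nn

class TwoRateSpectralRNN(nn.Module):
    """A syllable-rate RNN maps linguistic features to a per-syllable
    code; a frame-rate RNN expands that code into LSP + energy
    trajectories, smoothing across phone boundaries."""

    def __init__(self, n_ling, n_lsp=10, syl_hidden=64, frame_hidden=64):
        super().__init__()
        self.syl_rnn = nn.RNN(n_ling, syl_hidden, batch_first=True)
        self.frame_rnn = nn.RNN(syl_hidden, frame_hidden, batch_first=True)
        self.out = nn.Linear(frame_hidden, n_lsp + 1)   # LSPs + energy

    def forward(self, ling_feats, frames_per_syl):
        # ling_feats: (B, n_syl, n_ling); frames_per_syl: LongTensor (n_syl,)
        syl_code, _ = self.syl_rnn(ling_feats)           # (B, n_syl, H)
        # Repeat each syllable code for its frames (frame synchronization).
        frame_in = syl_code.repeat_interleave(frames_per_syl, dim=1)
        h, _ = self.frame_rnn(frame_in)
        return self.out(h)                               # (B, n_frames, n_lsp+1)
```

Training against LSP and normalized-energy targets then drives both the phoneme-to-parameter mapping (syllable part) and the coarticulation smoothing (frame part).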
ABSTRACT
Construction of both text-to-speech synthesis (TTS) and automatic speech recognition (ASR) systems involves the use of speech databases. These databases usually consist of read text, which means that one has significant control over the content of the database. Here we address how one can take advantage of this control by discussing a number of variants of "greedy" text selection methods and showing their application in a variety of examples.
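A minimal version of one common greedy variant, which repeatedly picks the sentence covering the most not-yet-covered units (the unit type, e.g. diphones, and the stopping rule are choices left open here):

```python
def greedy_select(sentences, units_of, target_units):
    """Greedy text selection: repeatedly add the sentence that covers
    the most not-yet-covered units.  units_of(s) returns the set of
    units in sentence s; target_units is the coverage goal."""
    selected, covered = [], set()
    pool = set(range(len(sentences)))
    while pool and not target_units <= covered:
        best = max(pool, key=lambda i: len(units_of(sentences[i]) - covered))
        gain = units_of(sentences[best]) - covered
        if not gain:            # no remaining sentence adds anything new
            break
        selected.append(sentences[best])
        covered |= gain
        pool.remove(best)
    return selected, covered
```

Variants differ mainly in the scoring (e.g. frequency-weighted units, or penalizing sentence length) while keeping this same add-the-best-sentence loop.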
ABSTRACT
In this paper we introduce RTIPS, a system for arbitrary high-resolution modification of the prosodic variables of speech: fundamental frequency, rhythm (segmental duration) and intensity. It is based on the Resample and Overlap-Add (R-OLA) algorithm for fundamental frequency and duration modification of speech. The algorithm works pitch-synchronously in order to accurately modify the pitch contour, and it uses estimates of the glottal closure instants (epochs) as the synchronization marks. This technique is very similar to other OLA-based methods for time or pitch modification, but because of the introduction of the resampling step, voice quality (especially for high-pitched voices) is much more natural after resynthesis, at any given output sampling frequency. The reliability of the R-OLA algorithm is highly dependent on the accuracy of the method used for epoch detection, so this preprocessing step has to be carefully designed.
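A stripped-down sketch of the resample-and-overlap-add idea (this omits the windowing, unvoiced-frame handling and duration control of the real algorithm, so it is an illustration of the principle only):

```python
import numpy as np
from scipy.signal import resample

def rola_pitch_scale(x, epochs, factor):
    """Resample each pitch period (delimited by glottal-closure epochs)
    to the new period length, then rejoin the periods.  factor > 1
    raises F0 by shortening each period; the resampling step is what
    preserves voice quality at the output sampling frequency."""
    out = []
    for a, b in zip(epochs[:-1], epochs[1:]):
        period = x[a:b]
        new_len = max(2, int(round(len(period) / factor)))
        out.append(resample(period, new_len))   # Fourier-domain resampling
    return np.concatenate(out)
```

Because every segment boundary sits on an epoch estimate, an error in epoch detection shifts the resampling grid for the whole period, which is why the abstract stresses the epoch-detection preprocessing step.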
ABSTRACT
This paper describes the design of a neural network that performs the phonetic-to-acoustic mapping in a speech synthesis system. The use of a time-domain neural network architecture limits discontinuities that occur at phone boundaries. Recurrent data input also helps smooth the output parameter tracks. Independent testing has demonstrated that the voice quality produced by this system compares favorably with speech from existing commercial text-to-speech systems.
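One plausible reading of this architecture, sketched with assumed layer sizes (the feedback of the previous output frame into the input is the recurrent element that smooths the parameter tracks; this is an interpretation, not the paper's published network):

```python
import torch
import torch.nn as nn

class PhoneticToAcoustic(nn.Module):
    """Each acoustic frame is predicted from the current phonetic
    features plus the previously generated frame, so successive
    frames change gradually across phone boundaries."""

    def __init__(self, n_phonetic, n_acoustic, hidden=128):
        super().__init__()
        self.n_acoustic = n_acoustic
        self.net = nn.Sequential(
            nn.Linear(n_phonetic + n_acoustic, hidden), nn.Tanh(),
            nn.Linear(hidden, n_acoustic))

    def forward(self, phonetic_seq):            # (B, T, n_phonetic)
        prev = torch.zeros(phonetic_seq.size(0), self.n_acoustic)
        frames = []
        for t in range(phonetic_seq.size(1)):
            prev = self.net(torch.cat([phonetic_seq[:, t], prev], dim=-1))
            frames.append(prev)
        return torch.stack(frames, dim=1)       # (B, T, n_acoustic)
```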
ABSTRACT
In this study we introduce combined data-driven and rule-based methods to synthesise speech. The aim is to improve coarticulatory modelling by adapting the KTH TTS system to data from one speaker. Regression trees are trained on a manually corrected speech database to provide predictions of vowel formant frequencies. At runtime, the TTS system produces formant frequency trajectories that are derived from weighted contributions from both the rules and the regression trees. The weighting strategy allows flexible adjustment of the synthesis parameters and thus of the quality of the output speech. An informal perceptual test was conducted to compare the performance of the hybrid approach with that of the traditional rule-based system. A great majority of the test subjects judged the speech output of the hybrid system to be more natural than the competing rule-derived speech, and the speech produced by the hybrid system was also generally preferred.
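The weighting strategy reduces, per frame, to a convex combination of the two predictions; a minimal sketch (the single scalar weight is a simplification; the actual system may weight per parameter or per context):

```python
def blend_formants(rule_track, tree_track, w):
    """Blend rule-generated and regression-tree formant frequencies.
    w = 0.0 gives pure rule synthesis, w = 1.0 pure data-driven
    prediction; intermediate values trade off the two sources."""
    return [(1.0 - w) * r + w * t for r, t in zip(rule_track, tree_track)]
```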
ABSTRACT
We describe a concatenative speech synthesiser for British English which uses the HADIFIX [8] inventory structure originally developed for German by Portele. An inventory of non-uniform units was investigated with the aim of improving segmental quality compared to diphones. A combination of soft (diphone) and hard concatenation was used, which allowed a dramatic reduction in inventory size. We also present a unit selection algorithm which selects an optimum sequence of units from this inventory for a given phoneme sequence. The work described is part of the concept-to-speech synthesiser for the language and speech project Verbmobil [12] which is funded by the German Ministry of Science (BMBF).
ABSTRACT
The paper describes our research work concerning the pronunciation mode of acronyms in German, French, and Portuguese. Most of the rules are related to the well-formedness of the constituents and the minimum and maximum weight thresholds required for reading or spelling an acronym. The results of the tests for the three languages were considered very promising, reaching decision error rates below 4%. The rule set was also applied to a very small English corpus, with relative success. We believe that further optimisation is still possible if language-specific parametrisation is taken into account, in particular for the languages where only a limited corpus of acronyms was available.
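To make the read-versus-spell decision concrete, here is a toy rule of the kind the abstract describes (the thresholds and the well-formedness test are invented for illustration, not the paper's actual rules):

```python
VOWELS = set("aeiou")

def acronym_mode(acr, min_len=2, max_len=8):
    """Read an acronym as a word only if it is plausibly pronounceable:
    its length lies within the weight thresholds, it contains a vowel,
    and no consonant cluster is too heavy; otherwise spell it out."""
    s = acr.lower()
    if not (min_len <= len(s) <= max_len) or not any(c in VOWELS for c in s):
        return "spell"
    run = 0
    for c in s:                 # crude well-formedness: no 3+ consonant run
        run = 0 if c in VOWELS else run + 1
        if run >= 3:
            return "spell"
    return "read"
```

Language-specific parametrisation would then amount to tuning the thresholds and the cluster constraints per language.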
ABSTRACT
In this contribution the subjective evaluation of three text-to-speech systems (two diphone systems and one allophone system) is reported in two transmission conditions: standard telephone (PSTN) and GSM. The three TTS systems read three different text types: travel information, stock exchange reports and e-mail messages. The subjects had to carry out three tasks: a) giving preference judgements on the three TTS systems, b) rating the readings on 16 five-point scales, and c) a transliteration task. The rank order on the scale of general quality was: public transport > stock exchange > e-mail reading, in both transmission conditions. The GSM transmission tends to decrease the perceptual scores on a number of subjective scales. In the transliteration task significantly more errors were made in the GSM condition than in the PSTN condition. In both conditions fewer errors were made with the diphone TTS systems than with the allophone system.
ABSTRACT
This paper describes a system for the automatic extraction of diphone units from given speech utterances. The method is based on an automatic phonetic segmentation and on a subsequent rule-driven diphone boundary detection. The phonetic segmenter, developed at IRST, was trained and tested in both speaker-independent and speaker-dependent mode. A rule formalism involving acoustic parameters and arithmetical and logical operators was defined to express the acoustic/phonetic knowledge acquired during previous experience with manual diphone segmentation. A specialized tool for rule parsing was designed that processes a given sequence of automatically derived phone boundaries using a corresponding sequence of predefined acoustic parameters. Several sets of rules were developed that include both general principles and specific details concerning the content of the diphone database of "Eloquens"®, the CSELT text-to-speech synthesis system for the Italian language. The accuracy was evaluated by comparing the manual and automatic segmentations of the speech utterances of a female speaker, resulting in nearly 95% correct boundary positions, given a tolerance of 20 ms.
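As a flavour of what one such boundary rule over acoustic parameters might look like (the parameter name, window size and criterion are hypothetical; the paper's formalism is richer, combining arithmetical and logical operators):

```python
def refine_boundary(t0, params, window=0.02):
    """Move a phone boundary t0 to the local energy minimum within
    +/-20 ms, a typical kind of acoustic refinement rule.
    params["energy"] is a list of (time, energy) frame pairs."""
    lo, hi = t0 - window, t0 + window
    frames = [(t, e) for t, e in params["energy"] if lo <= t <= hi]
    return min(frames, key=lambda te: te[1])[0] if frames else t0
```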
ABSTRACT
Many applications in mobile telephony and portable computing require high-quality speech synthesis systems with a very modest computational footprint. Our text-to-speech system for French gives satisfactory performance in phonetisation and prosody with considerably reduced computational resources. Using the Mons (Belgium) diphone database, the program's current version runs in real time on Pentium-type PCs or Mac PPCs. The code requires 442 KB, the minimum RAM requirement is 4700 KB, and the minimum disk requirement is 5560 KB. The phonetisation and prosody processing has been brought to a first level of optimal compromise between quality and computational footprint. Major further reductions in space requirements would probably necessitate a re-evaluation of the sound generation procedures.
ABSTRACT
Felix is our recent PC-based TTS research system for testing, analyzing, and evaluating TTS algorithms. The object-oriented interface allows efficient algorithm improvement and overall system prototyping by combining different modules. The results of each TTS processing step can be monitored, and all kinds of data may be reviewed and modified. The paper outlines the algorithms currently implemented in the Felix system, focusing on lexical analysis, duration modeling, and source signal generation, where we suggest ways to improve the intelligibility and naturalness of synthetic speech.
ABSTRACT
Concatenative text-to-speech (TTS) systems are now quite widespread thanks to the availability of simple time-domain speech modification algorithms. Many of these systems produce intelligible speech with a higher degree of naturalness than that achieved by the previous generation of formant synthesis systems. This perceived improvement in quality has led to the view in some circles that TTS is a solved problem, at least for many practical applications. Three experiments are reported in this paper, all performed with a concatenative TTS system. These experiments investigated aspects of the concatenative model by respectively addressing copy synthesis of emotional speech, modelling glottalisation, and the effect of speech database design on the quality of synthesised speech. This paper suggests that the lack of an explicit speech model in most concatenative synthesis strategies fundamentally limits the usefulness of many current systems to the relatively restricted task of 'neutral' spoken renderings of text, where deficiencies in other system components usually mask the limitations of the synthesis strategy itself.
ABSTRACT
Accurate modeling of coarticulation, the context-sensitive merging of the boundaries between allophones in continuous speech, is vital for natural-sounding speech synthesis. This paper describes initial research investigating the use of Bezier curves to model coarticulation in human speech. A 12th-order, pitch-synchronous line spectral pair (LSP) [1] analysis is performed on a corpus of 239 phonetically balanced sentences of English speech. The resulting data are divided to form an inventory of the diphones occurring in the speech database. The trajectory of each line spectral pair parameter through each diphone can then be represented by a single cubic Bezier curve segment, found using the Levenberg-Marquardt curve-fitting method [2, 3]. Results are presented showing the accuracy of Bezier models of the coarticulation between different types of speech sounds.
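A small sketch of the fitting step for one LSP trajectory (pinning the endpoints to the data and fitting only the two inner control points is a simplifying assumption of this sketch; note the fit is actually linear in the control points, but Levenberg-Marquardt mirrors the method cited in the abstract):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_bezier(y):
    """Fit one cubic Bezier segment B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1
    + 3(1-t) t^2 p2 + t^3 p3 to a trajectory y sampled at equal steps.
    Returns the four control points (p0, p1, p2, p3)."""
    t = np.linspace(0.0, 1.0, len(y))
    p0, p3 = float(y[0]), float(y[-1])

    def bezier(p):
        p1, p2 = p
        return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
                + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

    res = least_squares(lambda p: bezier(p) - y, x0=[p0, p3], method='lm')
    return (p0, *res.x, p3)
```

Each diphone is then summarized by 4 control points per LSP parameter, giving a compact, smooth coarticulation model.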
ABSTRACT
This paper describes a new method for synthesizing speech by concatenating sub-word units from a database of labelled speech. A large unit inventory is created by automatically clustering units of the same phone class based on their phonetic and prosodic context. The appropriate cluster is then selected for a target unit, offering a small set of candidate units. An optimal path is found through the candidate units based on their distance from the cluster center and an acoustically based join cost. Details of the method and its justification are presented. The results of experiments using two different databases are given, optimising various parameters within the system. A comparison with other existing selection-based synthesis techniques is also given, showing the advantages this method has over existing ones. The method is implemented within a full text-to-speech system, offering efficient natural-sounding speech synthesis.
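The per-candidate cost that drives the path search can be sketched as follows (the feature layout, distance measures and weight are assumptions of this sketch; the optimal path itself is found with the same kind of dynamic programming shown in the first unit-selection sketch above):

```python
import numpy as np

def candidate_cost(unit, cluster_center, prev_unit, w_join=1.0):
    """Cost of one candidate: distance from the unit's mean acoustic
    features to its cluster center (how typical the unit is of the
    selected context cluster), plus a join cost between the last frame
    of the previous unit and the first frame of this one.
    unit["feats"] is a (frames x dims) array."""
    target = np.linalg.norm(unit["feats"].mean(axis=0) - cluster_center)
    join = 0.0
    if prev_unit is not None:
        join = np.linalg.norm(prev_unit["feats"][-1] - unit["feats"][0])
    return target + w_join * join
```

Clustering thus replaces explicit target-cost feature weighting: contextual suitability is encoded in which cluster the target maps to, leaving only typicality and joins to score.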
ABSTRACT
Letter-to-sound (LTS) conversion is important for both text-to-speech (TTS) and automatic speech recognition (ASR). In this paper we discuss some improvements we have made to our trainable LTS converter. We use a classification and regression tree (CART) to automatically configure the most salient phonological rules needed for LTS conversion. We address problems in growing multiple trees and in the use of phonotactic information for better generalization. The experiments were carried out on both the NETTALK database and the CMU dictionary. With the improved techniques, the conversion error rates at the phoneme level and word level were reduced by 15% and 20%, respectively. For both tasks, the phoneme conversion error rate was reduced to about 8%.
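A minimal CART-style LTS sketch: predict each letter's phoneme from a window of surrounding letters. The letter coding and window width are assumptions, the letter-to-phoneme alignment is taken as given, and real systems add phonotactic features and grow one tree per letter:

```python
from sklearn.tree import DecisionTreeClassifier

def train_lts(aligned_pairs, window=3):
    """aligned_pairs: iterable of (letters, phones) with equal-length
    sequences (epsilon phones for silent letters).  Builds one tree
    predicting a phoneme from a 2*window+1 letter context."""
    X, y = [], []
    for letters, phones in aligned_pairs:
        padded = ["#"] * window + list(letters) + ["#"] * window
        for i, ph in enumerate(phones):
            ctx = padded[i:i + 2 * window + 1]
            X.append([ord(c) for c in ctx])   # crude integer letter coding
            y.append(ph)
    return DecisionTreeClassifier().fit(X, y)
```

The tree's learned splits on context letters play the role of the "most salient phonological rules" the abstract refers to.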
ABSTRACT
In the area of speech synthesis techniques, waveform coding methods maintain the intelligibility and naturalness of synthetic speech. In order to apply waveform coding techniques to synthesis by rule, we must be able to alter the pitch of the synthetic speech. In this paper, we propose a new pitch alteration method that compensates for the phase distortion of the cepstral pitch alteration method with a time-scaling method in the time domain. This method removes some of the spectral distortion that occurs at the junction points between waveforms. We obtain a spectral distortion below 1.18% for a pitch alteration of 200%.
ABSTRACT
In this paper we present a high-quality text-to-speech system using diphones. The system is based on a Harmonic plus Noise Model (HNM) representation of the speech signal. HNM is a pitch-synchronous analysis-synthesis system, but, unlike PSOLA-based methods, it does not require pitch marks to be determined. HNM assumes the speech signal to be composed of a periodic part and a stochastic part. As a result, different prosody and spectral envelope modification methods can be applied to each part, yielding more natural-sounding synthetic speech. The fully parametric representation of speech using HNM also provides a straightforward way of smoothing diphone boundaries. Informal listening tests, using natural prosody, have shown that the synthetic speech quality is close to the quality of the original sentences, without smoothing problems and without the buzziness or other oddities observed with other speech representations used for TTS.
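A sketch of the harmonic (periodic) half of an HNM synthesizer (frame rate, parameter layout and the absence of parameter interpolation are simplifications; a full HNM adds the stochastic part above the maximum voiced frequency and interpolates amplitudes and F0 between frames):

```python
import numpy as np

def hnm_harmonic(f0, amps, fs=16000, frame=0.01):
    """Synthesize the periodic part from per-frame F0 values (Hz) and
    harmonic amplitudes amps[(n_frames, n_harm)], keeping each
    harmonic's phase continuous across frame boundaries.  Harmonics
    above the Nyquist frequency are assumed to have zero amplitude."""
    hop = int(fs * frame)
    n_harm = amps.shape[1]
    out = np.zeros(len(f0) * hop)
    phase = np.zeros(n_harm)
    for i in range(len(f0)):
        t = np.arange(hop) / fs
        for k in range(n_harm):
            w = 2 * np.pi * (k + 1) * f0[i]
            out[i * hop:(i + 1) * hop] += amps[i, k] * np.cos(w * t + phase[k])
            phase[k] = (phase[k] + w * hop / fs) % (2 * np.pi)
    return out
```

Because the signal is regenerated from parameters rather than overlap-added from waveform slices, prosody changes and diphone-boundary smoothing reduce to interpolating the parameter tracks, which is the flexibility the abstract claims.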