ABSTRACT
The application of "Multivariate Adaptive Regression Splines" (MARS) to the problem of modeling the duration of a set of segments used in a text-to-speech system for German is presented. MARS is a technique for estimating general functions of high-dimensional arguments from sparse data. It automatically selects the parameters and the structure of the model based on the available data. The result is a model with a correlation coefficient between observed and predicted durations of a test set of . Besides predicting durations with high accuracy, a MARS model also allows interpretation of its structure, demonstrated in this study by analyses of factor importance and interactions of the MARS model.
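MARS represents the regression function as a sum of products of hinge basis functions max(0, x - t). A minimal one-dimensional sketch of a single forward-selection step follows; this is a toy illustration of the technique, not the duration model above, and all names are hypothetical:

```python
import numpy as np

def hinge(x, t):
    """MARS hinge basis function max(0, x - t)."""
    return np.maximum(0.0, x - t)

def forward_step(x, y, knots):
    """Pick the knot whose mirrored hinge pair best reduces squared
    error (one step of MARS-style forward selection)."""
    best = None
    for t in knots:
        # Design matrix: intercept + mirrored hinge pair at knot t.
        X = np.column_stack([np.ones_like(x), hinge(x, t), hinge(t, x)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ coef) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, coef)
    return best

x = np.linspace(0, 1, 50)
y = np.where(x < 0.5, 0.2, 0.8)            # step-like toy response
sse, t, coef = forward_step(x, y, knots=x[1:-1])
```

The selected knot lands near the step in the toy data; the full MARS procedure repeats such steps, also forming products of hinges across dimensions, and then prunes.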
ABSTRACT
This paper presents an original technique for solving the phonetic segmentation problem. It is based on the use of a speech synthesizer for the alignment of a text with its corresponding speech signal. A high-quality digital speech synthesizer is used to create a synthetic reference speech pattern used in the alignment process. This approach has the great advantage over other approaches that no training stage (hence no labeled database) is needed. The system has been evaluated mainly on French read utterances. Further evaluations have been carried out on other languages such as English, German, Romanian and Spanish. Following these experiments, the system appears to be a powerful tool for the automatic construction of large phonetically and prosodically labeled speech databases. The availability of such corpora will be a key point for the development of improved speech synthesis and recognition systems.
ABSTRACT
In order to eliminate trial-and-error in the process of selecting a good speech database as a voice source for concatenative speech synthesis, and to determine the acoustic and prosodic characteristics that best predict `appeal' or perceived `quality' in the synthesised speech, we performed tests to evaluate listener preferences over a range of different synthesised voices. We found that variation of fundamental frequency in the source database, and open-quotient of the glottis as measured by joint-estimation (ARX) were the best correlates. To our surprise, there was very little correlation between the scores for `intelligibility' and those for `naturalness' in the test data, but the former showed a close correlation with durational characteristics, and the latter with pitch and loudness.
ABSTRACT
The presented work deals with the experimental identification of the parts of a tree-based decoder lexicon that are more important for decoding efficiency than other lexicon parts. Three different methods for constructing only the most important nodes in a set of tree-lexicon copies are presented: building large trees; tree cutting; lexicon-node removal. This leads to a dramatic reduction in memory requirements while retaining the original recognition performance. In addition, a reduction of the active decoding search space can be observed that leads to improved recognition speed. Although the presented methods can in general be applied to any HMM speech recognizer, experiments are performed in the hybrid MMI-connectionist/HMM system framework on the speaker-independent 5k WSJ database.
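The shared-prefix structure that tree cutting operates on can be sketched with a toy trie, using letters in place of phones; this is an illustration of the general idea, not the paper's implementation, and the depth-based cut is a simplified stand-in for keeping only the most important nodes:

```python
def build_tree(words):
    """Build a prefix-tree (trie) lexicon: words with a common
    prefix share the nodes of that prefix."""
    root = {}
    for w in words:
        node = root
        for ph in w:                      # letters stand in for phones
            node = node.setdefault(ph, {})
    return root

def count_nodes(tree):
    """Total number of nodes below the root."""
    return sum(1 + count_nodes(child) for child in tree.values())

def cut_tree(tree, depth):
    """Tree cutting: drop all nodes deeper than `depth` (a toy
    proxy for removing the less important lexicon parts)."""
    if depth == 0:
        return {}
    return {ph: cut_tree(child, depth - 1) for ph, child in tree.items()}

tree = build_tree(["cat", "car", "cart"])
small = cut_tree(tree, 2)
```

Cutting shrinks the structure that has to be replicated in each tree-lexicon copy, which is where the memory savings come from.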
ABSTRACT
This paper presents a stochastic model of French intonation contours for use in text-to-speech synthesis. The model has two modules, a linguistic module that generates abstract prosodic labels from text, and a phonetic module that generates an F0 curve from the abstract prosodic labels. This model differs from previous work in the abstract prosodic labels used, which can be automatically derived from the training corpus. This feature makes it possible to use large corpora or several corpora of different speech styles, in addition to making it easy to adapt to new languages. The present paper focuses on the linguistic module, which does not require full syntactic analysis of the text but simply relies on a part-of-speech tagging technique. The results were validated by means of a perception test which showed that listeners did not perceive a significant difference in quality between the sentences synthesized with the original F0 curve (from a recording), and those synthesized with the model-generated curve. The proposed model thus appears to capture a large part of the grammatical information needed to generate F0.
ABSTRACT
In our previous research we investigated the demarcative function of prosody at the sentence level and the importance of this information for listeners in terms of perception, acceptability and ease of comprehension. In this paper we investigate whether the results obtained for utterances spoken in isolation can be generalised to utterances spoken in context.
ABSTRACT
Controlling timing in text-to-speech synthesis systems is complicated, because there are many contextual factors that affect timing; moreover, which factors matter and what their precise effects are vary among languages. We describe here a language-independent approach for duration control. At run time, a language-independent timing module accesses language-specific tables. These tables specify which sub-classes of the feature space (i.e., all combinations of context and phone identity) are homogeneous in the specific sense that the same factors have similar effects on the cases in a sub-class. Within a sub-class, durations are modeled by simple arithmetic models such as multiplicative, additive, or – more generally – sums-of-products models. Exploratory statistical methods (supervised) and parameter estimation techniques (unsupervised) are used for table construction.
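A sums-of-products duration model of the kind named above can be sketched as follows: the predicted duration is a sum of terms, each a product of one parameter per contextual factor. The factor names and parameter values here are invented for illustration, not taken from the system described:

```python
# Each term multiplies one parameter per factor; the predicted
# duration is the sum of all terms.  Values are illustrative only.
params = {
    # term 0: intrinsic phone duration, scaled by stress
    0: {"phone": {"a": 80.0, "t": 60.0}, "stress": {"yes": 1.3, "no": 1.0}},
    # term 1: additive phrase-final lengthening
    1: {"final": {"yes": 25.0, "no": 0.0}},
}

def predict_duration(features, params):
    """Duration (ms) = sum over terms of the product of the
    parameters selected by the feature values."""
    total = 0.0
    for factors in params.values():
        term = 1.0
        for factor, table in factors.items():
            term *= table[features[factor]]
        total += term
    return total

d = predict_duration({"phone": "a", "stress": "yes", "final": "yes"}, params)
```

A purely multiplicative model is the special case of one term; a purely additive model is the case of one factor per term.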
ABSTRACT
This work presents and evaluates a model of segmental duration generation for Brazilian Portuguese in which the notion of a macrorhythmic unit is the starting point, drastically simplifying duration assignment and allowing pause insertion as an integrated procedure of generation. This model is preferred to random assignment with the same error distribution. Some aspects of the phonetics and phonology of rhythm are also discussed, constituting a first step toward understanding the prosodic component of the language under study.
ABSTRACT
Our work addresses the integration of speech recognition and natural language processing for spoken dialogue systems. To deal with recognition errors, we investigate two repairing strategies integrated into a parser based on a Lexicalized Tree Grammar. The first strategy is rooted in the recognition hypotheses, the other in the linguistic expectations. We present a formal framework to express the grammar, to describe the repairing strategies, and to anticipate further strategies.
ABSTRACT
During the last few years, word graphs have been gaining increasing interest within the speech community as the primary interface between speech recognizers and language processing modules. Both development and evaluation of graph-producing speech decoders require generally accepted measures of word graph quality. While the notion of recognition accuracy can easily be extended to word graphs, a meaningful measure of word graph size has not yet surfaced. We argue that the number of derivation steps a theoretical parser would need to process all unique subpaths in a graph could provide a measure that is both application-oriented enough to be meaningful and general enough to allow a useful comparison of word recognizers across different applications.
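One simple proxy for such a measure, sketched here under the assumption that each subpath costs one derivation step, counts all distinct non-empty paths in the word-graph DAG by dynamic programming (a toy illustration, not the measure as the authors define it):

```python
from collections import defaultdict

def count_subpaths(edges, order):
    """Count all non-empty paths in a word-graph DAG:
    paths_ending_at[v] = sum over incoming edges (u, v) of
    1 + paths_ending_at[u].  `order` must be topological."""
    incoming = defaultdict(list)
    for u, v in edges:
        incoming[v].append(u)
    ending = {v: 0 for v in order}
    for v in order:
        for u in incoming[v]:
            ending[v] += 1 + ending[u]
    return sum(ending.values())

# Tiny graph: two word hypotheses sharing start and end nodes.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
n = count_subpaths(edges, order=[0, 1, 2, 3])
```

For the toy graph the six subpaths are 0-1, 0-2, 1-3, 2-3, 0-1-3 and 0-2-3; the count grows with both graph density and depth, which is what makes it a candidate size measure.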
ABSTRACT
In this paper, we describe a fast search algorithm for statistical translation based on dynamic programming (DP) and present experimental results. The approach is based on the assumption that the word alignment is monotone with respect to the word order in both languages. To reduce the search effort for this approach, we introduce two methods: an acceleration technique to efficiently compute the dynamic programming recursion equation and a beam search strategy as used in speech recognition. The experimental tests carried out on the Verbmobil corpus showed that the search space, measured by the number of translation hypotheses, is reduced by a factor of about 230 without affecting the translation performance.
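The beam-search idea borrowed from speech recognition can be illustrated with a heavily simplified monotone search: one target word per source word, scored by an invented toy lexicon of log-probabilities. This is a sketch of the pruning principle only, not the paper's actual DP recursion:

```python
def monotone_search(src, lex, vocab, beam=3):
    """Extend hypotheses left-to-right through the source sentence;
    after each source word, keep only the `beam` best partial
    translations (beam pruning).  `lex[(f, e)]` is an assumed
    log-probability of translating source word f as target word e."""
    hyps = [(0.0, [])]                        # (score, target words)
    for f in src:
        expanded = [(s + lex.get((f, e), -10.0), t + [e])
                    for s, t in hyps for e in vocab]
        expanded.sort(key=lambda h: h[0], reverse=True)
        hyps = expanded[:beam]                # beam pruning
    return hyps[0]

lex = {("la", "the"): -0.1, ("maison", "house"): -0.2}
score, trans = monotone_search(["la", "maison"], lex,
                               vocab=["the", "house"])
```

With pruning, the number of hypotheses per step is capped at `beam * len(vocab)` instead of growing exponentially, which is the source of the reported search-space reduction.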
ABSTRACT
A corpus-based concatenative speech synthesis system using no signal processing can produce intelligible synthetic speech maintaining original voice characteristics, but it can sometimes be difficult to realize natural prosody. In such a concatenative system, it is very important to select appropriate waveform segments that are naturally close to the target prosody. This paper describes some approaches to unit selection for improving the prosody, especially the intonation, of such synthetic speech. If the unit selection measures for the fundamental frequency (F0) are insufficient, the concatenative system may produce speech having a discontinuous F0 pattern. Our proposed solution to this problem is to add extra measures for selecting units that form a smoother, more continuous F0 contour. Through subjective experiments, we confirmed that each of these measures effectively improved intonation naturalness.
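An F0-continuity measure of this kind can be sketched as an additional join cost in a Viterbi-style unit-selection search. The candidate F0 values and the weight below are invented for illustration and the cost functions are simplistic stand-ins for the paper's measures:

```python
def select_units(candidates, target_f0, w_join=1.0):
    """Viterbi unit selection with an F0-continuity join cost:
    total cost = |unit F0 - target F0| summed over positions,
    plus w_join * |F0 jump| at every concatenation point."""
    # best maps a candidate's F0 -> (cost so far, chosen path)
    best = {f0: (abs(f0 - target_f0[0]), [f0]) for f0 in candidates[0]}
    for i in range(1, len(target_f0)):
        nxt = {}
        for f0 in candidates[i]:
            tgt = abs(f0 - target_f0[i])
            cost, path = min(
                ((c + tgt + w_join * abs(f0 - p[-1]), p + [f0])
                 for c, p in best.values()),
                key=lambda cp: cp[0])
            nxt[f0] = (cost, path)
        best = nxt
    return min(best.values(), key=lambda cp: cp[0])

cands = [[100, 120], [118, 140], [120, 150]]
cost, path = select_units(cands, target_f0=[110, 120, 125])
```

Without the join term, each position would pick its locally closest unit, even when that produces a large F0 jump at the concatenation point; the join cost trades a little target accuracy for a smoother contour.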
ABSTRACT
A combination of a neural network with rule-firing information from a rule-based system is used to generate segment durations for a text-to-speech system. The system shows a slight improvement in performance over a neural network system without the rule-firing information. Synthesized speech using the generated segment durations was accepted by listeners as having about the same quality as speech generated using segment durations extracted from natural speech.
ABSTRACT
In this paper, we describe a model of fundamental frequency control. In general, a two-stage model consisting of a global model and a local model is used as an F0 control method for Japanese text-to-speech systems. We propose a model represented by a transition network as the global model, which generates the parameters of a local pitch model from the linguistic parameters of a sentence. In the proposed model, syntactic analysis and generation of F0 parameters are integrated, and the nodes of the network represent the linguistic and prosodic state of a sentence. The parameters of the local model are generated when a transition is taken. We also propose a training method for the network. The prediction results showed that our model can predict the phrasal accent parameters with satisfactorily high accuracy. We also describe how the model can be applied to the prediction of pause positions.
ABSTRACT
In this paper, we present a flexible architecture to implement a robust information retrieval system based on domain and linguistic modelling by means of a set of conceptual probabilistic and non-probabilistic grammars. It allows for a certain complexity in the functionality of the application, such as applying non-SQL functions to the results of SQL queries in order to retrieve information not explicitly included in the database, or translating certain natural spoken sentences that would otherwise produce difficult embedded queries.
ABSTRACT
Previous work on speech-to-speech translation has suffered from problems of brittleness and low quality (rule-based approaches), or from excessive data requirements and linguistic inefficiency (analogical or example-based approaches). In this paper, we present a probabilistic approach to analogical speech translation, and describe its integration with linguistic processing. The evaluation results show that this approach results in high-accuracy translations in limited domains.
ABSTRACT
For very large vocabulary vocal dictation systems, we present a decoding strategy that reduces the lexical decoding cost. For each test utterance, a sub-lexicon is selected from a very large recognition vocabulary. Such a recognition sub-lexicon is called a Dynamic Lexicon (DL). Various algorithms for DL selection are developed and tested in terms of the coverage rate of a textual corpus. From these experiments, we describe the DL construction we chose to use in D-DAL, our HMM-based recognizer competing in the first campaign of French vocal dictation supported by AUPELF. The contribution made by this original DL is confirmed a posteriori on the AUPELF-B1 test dictation.
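One possible DL-selection strategy, evaluated by coverage rate, can be sketched as frequency-based selection of word types from a training text. This is an assumption for illustration only; the abstract reports several selection algorithms without specifying them:

```python
from collections import Counter

def select_dynamic_lexicon(train_text, size):
    """Select the `size` most frequent word types as the sub-lexicon
    (one simple, hypothetical DL-selection strategy)."""
    counts = Counter(train_text.split())
    return {w for w, _ in counts.most_common(size)}

def coverage_rate(lexicon, test_text):
    """Fraction of test tokens covered by the sub-lexicon."""
    tokens = test_text.split()
    return sum(t in lexicon for t in tokens) / len(tokens)

dl = select_dynamic_lexicon("a a a b b c", size=2)
rate = coverage_rate(dl, "a b c d")
```

Coverage rate is the natural figure of merit here: a token missing from the DL can never be recognized, so the DL must stay small enough to cut decoding cost while covering nearly all test tokens.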