Session ThAB F0 and Duration Modelling, Spoken language processing

Chairperson: Richard Schwartz, BBN Systems and Technologies, USA


MODELING SEGMENTAL DURATION WITH MULTIVARIATE ADAPTIVE REGRESSION SPLINES

Authors: Marcel Riedi

Institut TIK, ETH Zurich 8092 Zurich, Switzerland riedi@tik.ee.ethz.ch

Volume 5 pages 2627 - 2630

ABSTRACT

The application of "Multivariate Adaptive Regression Splines" (MARS) to the problem of modeling the duration of a set of segments used in a text-to-speech system for German is presented. MARS is a technique for estimating general functions of high-dimensional arguments given sparse data. It automatically selects the parameters and the structure of the model based on the available data. The result is a model with a correlation coefficient between observed and predicted durations of a test set of . Besides predicting durations with high accuracy, a MARS model also allows interpretation of its structure, demonstrated in this study by analyses of factor importance and factor interactions.
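As a rough illustration of the kind of function a MARS model represents, the sketch below evaluates a sum of products of truncated linear ("hinge") basis functions. It is not taken from the paper: the feature indices, knots, and coefficients are invented for the example, and the automatic term selection that MARS performs during fitting is not shown.

```python
from typing import List, Sequence, Tuple

# One hinge factor: (feature index, knot position, sign of the slope).
Factor = Tuple[int, float, int]
# One model term: a coefficient times a product of hinge factors.
Term = Tuple[float, List[Factor]]


def hinge(x: float, knot: float, sign: int) -> float:
    """Truncated linear basis function: max(0, sign * (x - knot))."""
    return max(0.0, sign * (x - knot))


def mars_predict(intercept: float, terms: Sequence[Term], x: Sequence[float]) -> float:
    """Evaluate intercept + sum over terms of coef * product of hinges."""
    total = intercept
    for coef, factors in terms:
        product = 1.0
        for idx, knot, sign in factors:
            product *= hinge(x[idx], knot, sign)
        total += coef * product
    return total


# Hypothetical model: all values are illustrative assumptions.
example_terms: List[Term] = [
    (10.0, [(0, 1.0, +1)]),                  # main effect of feature 0
    (-5.0, [(0, 1.0, +1), (1, 2.0, -1)]),    # interaction of features 0 and 1
]
prediction = mars_predict(80.0, example_terms, [3.0, 0.5])
```

A term with more than one factor expresses an interaction between features, which is what makes the fitted structure open to the kind of interpretation the abstract mentions.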

A0006.pdf


High-Quality Speech Synthesis for Phonetic Speech Segmentation

Authors: F. Malfrère and T. Dutoit

Circuits Theory and Signal Processing Lab Faculté Polytechnique de Mons 31, Boulevard Dolez, 7000 - Mons (BELGIUM) Tel. +32.65.37.41.33, FAX: +32.65.37.41.29, Email: {malfrere, dutoit}@tcts.fpms.ac.be

Volume 5 pages 2631 - 2634

ABSTRACT

This paper presents an original technique for solving the phonetic segmentation problem. It is based on the use of a speech synthesizer for the alignment of a text with its corresponding speech signal. A high-quality digital speech synthesizer is used to create a synthetic reference speech pattern used in the alignment process. This approach has the great advantage over other approaches that no training stage (hence no labeled database) is needed. The system has been mainly evaluated on French read utterances. Other evaluations have been made on other languages such as English, German, Romanian and Spanish. Following these experiments, the system seems to be a powerful tool for the automatic constitution of large phonetically and prosodically labeled speech databases. The availability of such corpora will be a key point for the development of improved speech synthesis and recognition systems.

A0058.pdf


FACTORS AFFECTING PERCEIVED QUALITY AND INTELLIGIBILITY IN THE CHATR CONCATENATIVE SPEECH SYNTHESISER

Authors: Nick Campbell, Yoshiharu Itoh, Wen Ding, and Norio Higuchi

ATR Interpreting Telecommunications Research Laboratories e-mail: nick@itl.atr.co.jp, http://www.itl.atr.co.jp/chatr

Volume 5 pages 2635 - 2638

ABSTRACT

In order to eliminate trial-and-error in the process of selecting a good speech database as a voice source for concatenative speech synthesis, and to determine the acoustic and prosodic characteristics that best predict 'appeal' or perceived 'quality' in the synthesised speech, we performed tests to evaluate listener preferences over a range of different synthesised voices. We found that variation of fundamental frequency in the source database, and open-quotient of the glottis as measured by joint-estimation (ARX), were the best correlates. To our surprise, there was very little correlation between the scores for 'intelligibility' and those for 'naturalness' in the test data, but the former showed a close correlation with durational characteristics, and the latter with pitch and loudness.

A0193.pdf

Recordings


REDUCED LEXICON TREES FOR DECODING IN A MMI-CONNECTIONIST/HMM SPEECH RECOGNITION SYSTEM

Authors: Christoph Neukirchen, Daniel Willett, Gerhard Rigoll

Department of Computer Science Faculty of Electrical Engineering Gerhard-Mercator-University Duisburg, Germany e-mail: {chn,willett,rigoll}@fb9-ti.uni-duisburg.de www: www.fb9-ti.uni-duisburg.de

Volume 5 pages 2639 - 2642

ABSTRACT

The presented work deals with the experimental identification of the parts of a tree-based decoder lexicon that matter most for decoding efficiency. Three different methods for constructing only the most important nodes in a set of tree lexicon copies are presented: building large trees, tree cutting, and lexicon node removal. This leads to a dramatic reduction of memory requirements while retaining the original recognition performance. In addition, a reduction of the active decoding search space can be observed, which leads to improved recognition speed. Although the presented methods can be generally applied to any HMM speech recognizer, experiments are performed in the hybrid MMI-connectionist/HMM system framework on the speaker-independent 5k WSJ database.

A0349.pdf


A STOCHASTIC MODEL OF INTONATION FOR FRENCH TEXT-TO-SPEECH SYNTHESIS

Authors: Jean Véronis, Philippe Di Cristo, Fabienne Courtois, Benoît Lagrue

Laboratoire Parole et Langage, Université de Provence & CNRS 29 Av. Robert Schuman, 13621 Aix-en-Provence Cedex 1, France Tel. +33 4 42 95 36 33, FAX +33 4 42 59 50 96, E-mail: Jean.Veronis@lpl.univ-aix.fr

Volume 5 pages 2643 - 2646

ABSTRACT

This paper presents a stochastic model of French intonation contours for use in text-to-speech synthesis. The model has two modules: a linguistic module that generates abstract prosodic labels from text, and a phonetic module that generates an F0 curve from the abstract prosodic labels. This model differs from previous work in the abstract prosodic labels used, which can be automatically derived from the training corpus. This feature makes it possible to use large corpora or several corpora of different speech styles, in addition to making it easy to adapt to new languages. The present paper focuses on the linguistic module, which does not require full syntactic analysis of the text but simply relies on a part-of-speech tagging technique. The results were validated by means of a perception test which showed that listeners did not perceive a significant difference in quality between the sentences synthesized with the original F0 curve (from a recording) and those synthesized with the model-generated curve. The proposed model thus appears to capture a large part of the grammatical information needed to generate F0.

A0429.pdf

Recordings


PHONETIC RULES FOR A PHONETIC-TO-SPEECH SYSTEM

Authors: A.A. Sanderman* and R. Collier**

*KPN Research P.O. Box 421, 2260 AK Leidschendam, The Netherlands E-mail: A.A.Sanderman@research.kpn.com **Institute for Perception Research, P.O. Box 513, 5600 MB, Eindhoven, The Netherlands E-mail: collier@ipo.tue.nl

Volume 5 pages 2647 - 2650

ABSTRACT

In our previous research we investigated the demarcative function of prosody at the sentence level and the importance of this information for listeners in terms of perception, acceptability and ease of comprehension. In this paper we investigate whether the results obtained for utterances spoken in isolation can be generalised to utterances spoken in context.

A0434.pdf


MULTI-LINGUAL DURATION MODELING

Authors: Jan van Santen, Chilin Shih, Bernd Möbius, Evelyne Tzoukermann, and Michael Tanenblatt

Lucent Technologies – Bell Labs, 600 Mountain Avenue, Murray Hill, NJ 07974, USA http://www.bell-labs.com/project/tts/

Volume 5 pages 2651 - 2654

ABSTRACT

Controlling timing in text-to-speech synthesis systems is complicated, because there are many contextual factors that affect timing; moreover, which factors matter and what their precise effects are varies among languages. We describe here a language-independent approach for duration control. At run time, a language-independent timing module accesses language-specific tables. These tables specify which sub-classes of the feature space (i.e., all combinations of context and phone identity) are homogeneous in the specific sense that the same factors have similar effects on the cases in a sub-class. Within a sub-class, durations are modeled by simple arithmetic models such as multiplicative, additive, or – more generally – sums-of-products models. Exploratory statistical methods (supervised) and parameter estimation techniques (unsupervised) are used for table construction.
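The arithmetic models named in the abstract can be sketched as follows. This is only an illustration of the general sums-of-products form, in which a duration is a sum of terms and each term is a product of per-factor parameters looked up for the current context. The factor names, levels, and parameter values are hypothetical and are not taken from the paper's tables.

```python
from math import prod
from typing import Dict, List

# A term maps each factor it involves to a table of per-level parameters.
Term = Dict[str, Dict[str, float]]


def sums_of_products(terms: List[Term], context: Dict[str, str]) -> float:
    """Duration = sum over terms of the product of that term's factor parameters."""
    return sum(
        prod(table[context[factor]] for factor, table in term.items())
        for term in terms
    )


# Hypothetical parameter tables (values in milliseconds or unitless scales).
multiplicative_model: List[Term] = [
    {   # a single product term: intrinsic duration times a stress scale
        "phone": {"a": 80.0, "t": 50.0},
        "stress": {"stressed": 1.25, "unstressed": 1.0},
    }
]
additive_model: List[Term] = [
    {"phone": {"a": 80.0, "t": 50.0}},                  # one single-factor term per factor
    {"stress": {"stressed": 20.0, "unstressed": 0.0}},
]

context = {"phone": "a", "stress": "stressed"}
duration_mult = sums_of_products(multiplicative_model, context)
duration_add = sums_of_products(additive_model, context)
```

A purely multiplicative model is the special case with one term, and a purely additive model is the special case where every term involves a single factor, so one evaluation routine can serve every sub-class while only the tables differ.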

A0447.pdf


A MODEL OF SEGMENT (AND PAUSE) DURATION GENERATION FOR BRAZILIAN PORTUGUESE TEXT-TO-SPEECH SYNTHESIS

Authors: Plinio A. Barbosa

Laboratório de Fonética Acústica e Psicolinguística Experimental-LAFAPE Instituto de Estudos da Linguagem Universidade Estadual de Campinas CP 6045 - 13081-970 Campinas-SP, Brazil Tel: +55 19 2397784, FAX: +55 19 2391501, E-mail: plinio@iel.unicamp.br

Volume 5 pages 2655 - 2658

ABSTRACT

This work presents and evaluates a model of segmental duration generation for Brazilian Portuguese where the notion of macrorhythmic unit is the starting point to drastically simplify duration assignment and to allow pause insertion as an integrated procedure of generation. This model is preferred to random assignment with the same error distribution. Some aspects of rhythm phonetics and phonology are also discussed that constitute a first step to the understanding of the prosodic component of the language under study.

A0467.pdf

Recordings


PARSING STRATEGY FOR SPOKEN LANGUAGE INTERFACES WITH A LEXICALIZED TREE GRAMMAR

Authors: Ariane Halber and David Roussel

Thomson-CSF Corporate Research Laboratory F-91404 Orsay Cedex, France E-mail: {ariane, roussel}@thomson-lcr.fr

Volume 5 pages 2659 - 2662

ABSTRACT

Our work addresses the integration of speech recognition and natural language processing for spoken dialogue systems. To deal with recognition errors, we investigate two repair strategies integrated into a parser based on a Lexicalized Tree Grammar. The first strategy is rooted in the recognition hypothesis, the other in the linguistic expectations. We present a formal framework to express the grammar, to describe the repair strategies and to anticipate further strategies.

A0577.pdf


WHAT'S IN A WORD GRAPH: EVALUATION AND ENHANCEMENT OF WORD LATTICES

Authors: Jan W. Amtrup, Henrik Heine, Uwe Jost

University of Hamburg, Computer Science Department, Vogt-Kölln-Str. 30, D-22527 Hamburg, Germany email: {amtrup|heine|jost}@informatik.uni-hamburg.de

Volume 5 pages 2663 - 2666

ABSTRACT

During the last few years, word graphs have been gaining increasing interest within the speech community as the primary interface between speech recognizers and language processing modules. Both development and evaluation of graph-producing speech decoders require generally accepted measures of word graph quality. While the notion of recognition accuracy can easily be extended to word graphs, a meaningful measure of word graph size has not yet surfaced. We argue that the number of derivation steps a theoretical parser would need to process all unique subpaths in a graph could provide a measure that is both application-oriented enough to be meaningful and general enough to allow a useful comparison of word recognizers across different applications.

A0622.pdf


ACCELERATED DP BASED SEARCH FOR STATISTICAL TRANSLATION

Authors: C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf

Lehrstuhl für Informatik VI, RWTH Aachen D-52056 Aachen, Germany

Volume 5 pages 2667 - 2670

ABSTRACT

In this paper, we describe a fast search algorithm for statistical translation based on dynamic programming (DP) and present experimental results. The approach is based on the assumption that the word alignment is monotone with respect to the word order in both languages. To reduce the search effort for this approach, we introduce two methods: an acceleration technique to efficiently compute the dynamic programming recursion equation and a beam search strategy as used in speech recognition. The experimental tests carried out on the Verbmobil corpus showed that the search space, measured by the number of translation hypotheses, is reduced by a factor of about 230 without affecting the translation performance.

A0766.pdf


USE OF PITCH PATTERN IMPROVEMENT IN THE CHATR SPEECH SYNTHESIS SYSTEM

Authors: Ken Fujisawa, Toshio Hirai, and Norio Higuchi

e-mail: fujisawa@itl.atr.co.jp ATR Interpreting Telecommunications Research Labs. 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, JAPAN

Volume 5 pages 2671 - 2674

ABSTRACT

A corpus-based concatenative speech synthesis system using no signal processing can produce intelligible synthetic speech that maintains the original voice characteristics, but it can sometimes be difficult to realize natural prosody. In such a concatenative system, it is very important to select appropriate waveform segments that are naturally close to the target prosody. This paper describes some approaches to unit selection for improving the prosody, especially the intonation, of such synthetic speech. If the unit selection measures for the fundamental frequency (F0) are insufficient, the concatenative system may produce speech having a discontinuous F0 pattern. Our proposed solution to this problem is to add extra measures for selecting units that form a smoother, more continuous F0 contour. Through subjective experiments, we confirmed that each of these measures effectively improved intonation naturalness.
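The idea of adding a continuity measure to unit selection can be illustrated with a generic dynamic-programming search. The sketch below is not the CHATR implementation and does not reproduce the measures proposed in the paper: the `Unit` fields, the target cost (distance of a unit's mean F0 from a target value), the join cost (F0 jump at the concatenation point), and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Unit:
    name: str
    f0_start: float  # F0 at the start of the segment, in Hz
    f0_end: float    # F0 at the end of the segment, in Hz


def target_cost(unit: Unit, target_f0: float) -> float:
    """Distance between the unit's mean F0 and the target F0."""
    return abs((unit.f0_start + unit.f0_end) / 2 - target_f0)


def select_units(candidates: List[List[Unit]], targets: List[float],
                 join_weight: float = 1.0) -> List[Unit]:
    """Pick one unit per position, minimising target cost plus weighted F0 jumps."""
    # best[i][k] = (cheapest cost of a path ending in candidates[i][k], backpointer)
    best: List[List[Tuple[float, Optional[int]]]] = [
        [(target_cost(u, targets[0]), None) for u in candidates[0]]
    ]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_weight * abs(p.f0_end - u.f0_start), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i]), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return list(reversed(path))


# Hypothetical candidates for two consecutive segments.
candidates = [
    [Unit("a1", 100.0, 110.0), Unit("a2", 100.0, 140.0)],
    [Unit("b1", 112.0, 120.0), Unit("b2", 150.0, 88.0)],
]
targets = [110.0, 118.0]
without_join = [u.name for u in select_units(candidates, targets, join_weight=0.0)]
with_join = [u.name for u in select_units(candidates, targets, join_weight=1.0)]
```

In this toy data, ignoring the join cost selects `b2`, whose mean F0 is closest to the target but which starts 40 Hz above the end of the preceding unit; weighting the join cost selects `b1` instead, giving a smoother contour.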

A0799.pdf


GENERATING SEGMENT DURATIONS IN A TEXT-TO-SPEECH SYSTEM: A HYBRID RULE-BASED/NEURAL NETWORK APPROACH

Authors: G. Corrigan, N. Massey, and O. Karaali

Speech Processing Laboratory Motorola, Inc. 1301 E. Algonquin Rd., Schaumburg, IL 60196, U.S.A. Tel. (847)576-2764, FAX: (847)576-8378, E-mail: corrigan@mot.com

Volume 5 pages 2675 - 2678

ABSTRACT

A combination of a neural network with rule firing information from a rule-based system is used to generate segment durations for a text-to-speech system. The system shows a slight improvement in performance over a neural network system without the rule firing information. Synthesized speech using the generated segment durations was accepted by listeners as having about the same quality as speech generated using segment durations extracted from natural speech.

A0887.pdf


On the Global F0 Shape Model using a Transition Network for Japanese Text-to-Speech Systems

Authors: Yasushi Ishikawa and Takashi Ebihara

Information Technology R&D Center, MITSUBISHI Electric Corporation 5-1-1 Ofuna, Kamakura, Kanagawa 247, Japan Phone: +81-467-41-2077 FAX: +81-467-41-2136 {yasushi,ebi}@media.isl.melco.co.jp

Volume 5 pages 2679 - 2682

ABSTRACT

In this paper, we describe a model of fundamental frequency control. In general, a two-stage model consisting of a global model and a local model is used as an F0 control method for Japanese text-to-speech systems. As the global model, we propose a model represented by a transition network that generates the parameters of a local pitch model from the linguistic parameters of a sentence. In the proposed model, syntactic analysis and generation of F0 parameters are integrated, and the nodes of the network represent the linguistic and prosodic state of a sentence. The parameters of the local model are generated when a transition is taken. We also propose a training method for the network. The prediction results showed that our model can predict the phrasal accent parameters with satisfactorily high accuracy. We also describe how the model can be applied to the prediction of pause positions.

A0895.pdf


AN ALTERNATIVE AND FLEXIBLE APPROACH IN ROBUST INFORMATION RETRIEVAL SYSTEMS

Authors: José Colás, Juan M. Montero, Javier Ferreiros, José M. Pardo

Grupo Tecnología del Habla - Dpto. Ingeniería Electrónica E.T.S.I. Telecomunicación - Universidad Politécnica de Madrid Ciudad Universitaria, s/n 28040 Madrid Spain

Volume 5 pages 2683 - 2686

ABSTRACT

In this paper, we present a flexible architecture to implement a robust information retrieval system based on domain and linguistic modelling by means of a set of conceptual probabilistic and non-probabilistic grammars. It allows certain complexity in the functionality of the application, such as applying non-SQL functions to the results of SQL queries in order to retrieve information not explicitly included in the database, or translating certain natural spoken sentences that would produce difficult embedded queries.

A1091.pdf


A Probabilistic Approach to Analogical Speech Translation

Authors: Keiko Horiguchi & Alexander Franz

D21 Laboratory, Sony Corporation 6-7-35, Kita-Shinagawa, Shinagawa-ku, Tokyo 141, Japan Email: {keiko,amf}@pdp.crl.sony.co.jp

Volume 5 pages 2687 - 2690

ABSTRACT

Previous work on speech-to-speech translation has suffered from problems of brittleness and low quality (rule-based approaches), or from excessive data requirements and linguistic inefficiency (analogical or example-based approaches). In this paper, we present a probabilistic approach to analogical speech translation, and describe its integration with linguistic processing. The evaluation results show that this approach results in high-accuracy translations in limited domains.

A1094.pdf


DYNAMIC LEXICON FOR A VERY LARGE VOCABULARY VOCAL DICTATION

Authors: Marie-José Caraty, Claude Montacié and Fabrice Lefèvre

LIP6 - Université Pierre et Marie Curie - CNRS 4, place Jussieu - 75252 Paris Cedex 5 - France Tel : (33/0) 1 44 27 26 74, Fax : (33/0) 1 44 27 70 00, E-mail : caraty@laforia.ibp.fr

Volume 5 pages 2691 - 2694

ABSTRACT

For very large vocabulary vocal dictation systems, we present a decoding strategy that reduces the lexical decoding cost. For each test utterance, a sub-lexicon is selected from a very large recognition vocabulary. Such a recognition sub-lexicon is called a Dynamic Lexicon (DL). Various DL selection algorithms are developed and tested in terms of their coverage rate on a textual corpus. From these experiments, we describe the DL constitution we chose to use in D-DAL, our HMM-based recognizer competing in the first campaign of French vocal dictation supported by AUPELF. The contribution made by this original DL is confirmed a posteriori through the AUPELF-B1 test dictation.

A1256.pdf
