Session T4D Speech Synthesis: Linguistic Analysis

Chairperson Bjorn Granstrom KTH, Sweden

PARSERS, PROMINENCE, AND PAUSES

Authors: Nick Campbell, Tony Hebert, and Ezra Black

ATR Interpreting Telecommunications Research Laboratories http://www.itl.atr.co.jp/chatr, e-mail: nick@itl.atr.co.jp

Volume 2 pages 979 - 982

ABSTRACT

We present results of a comparison between two prosody prediction algorithms, showing that incorporating information from a parser significantly improves the performance of our text-to-speech synthesiser. We used a stochastic tree-based parser to generate a tagged and bracketed representation of the input text, and then interpreted this higher-level information to produce a ToBI-type prosodic annotation of the text. From this annotation an intonation contour was predicted for use in synthesising the speech. Results show that prediction of prosodic phrasing and of focal prominence improves by 56% and 62% respectively over previous methods, when compared against a human reading of the same test utterances.

A0192.pdf

Automatic Assignment of Part-Of-Speech to Out-of-Vocabulary Words for Text-To-Speech Processing

Authors: F. Béchet and M. El-Bèze

Laboratoire Informatique d'Avignon - LIA 339, chemin des Meinajaries BP 1228 - 84911 Avignon Cedex 9 - FRANCE E-mail : (frederic.bechet,marc.elbeze)@univ-avignon.fr

Volume 2 pages 983 - 986

ABSTRACT

Working with large corpora of text highlights the need for special treatment of Out-Of-Vocabulary (OOV) words. This paper describes a strategy for processing OOV words within a Text-To-Speech (TTS) framework for the French language. A probabilistic module, called "Devin", guesses a Part-Of-Speech (POS) for each OOV word according to the morphological structure of the word and the context in which it occurs. These POS can be either syntactic or semantic. The semantic labels represent the category of each proper name (family name, town name, etc.) and its linguistic origin, which has a strong influence on its pronunciation. According to these POS, the system chooses the correct set of rules to be employed by the rule-based grapheme-to-phoneme transcriber of the TTS system.

A0418.pdf

TEXT-TO-PROSODY PARSING IN AN ITALIAN SYNTHESIZER. RECENT IMPROVEMENTS

Authors: Barbara Gili Fivela and Silvia Quazza

Scuola Normale Superiore - P.zza dei Cavalieri, 7 - 56126 Pisa, Italy e-mail:gili@alphalinguistica.sns.it CSELT - Via Reiss Romoli, 274 -10148 Torino, Italy e-mail: silvia.quazza@cselt.it

Volume 2 pages 987 - 990

ABSTRACT

This paper describes recent improvements of a Prosodic Analyzer, designed to provide the CSELT Italian text-to-speech system ELOQUENS® with a better handling of prosody. Based on lexical tagging, the Analyzer builds up the prosodic structure of the sentence, inserting proper prosodic markers at word boundaries. The approach belongs to the family of TTS-oriented, 'heuristic' strategies, inferring prosody directly from the building blocks of syntax and exploiting lexical and rhythmical language-dependent information. Latest improvements concern the linguistic knowledge base of the Analyzer, which has been enhanced and optimized. A formal evaluation of the Analyzer's performance is also presented in the paper.

A0552.pdf

TAGGING SYLLABLES

Authors: Brigitte Krenn

Department of Computational Linguistics University of the Saarland, Saarbrucken, Germany krenn@coli.uni-sb.de

Volume 2 pages 991 - 994

ABSTRACT

Syllabification is viewed as a tagging task. Phonemes constituting a syllable are treated like words in a sentence. Each phoneme is annotated with information representing the phoneme itself, and its position within a syllable. Within a number of tagging experiments, the specificity of linguistic information represented in the tag set is varied. The annotation scheme which encodes an onset-nucleus-coda model is shown to lead to the best tagging results.
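
To make the idea of syllabification as tagging concrete, the sketch below shows how syllable boundaries could be read off once every phoneme carries a position tag. It is only an illustration under stated assumptions: the three-letter tag set (O, N, C), the function name syllables_from_tags and the toy transcription are hypothetical and are not taken from the paper, and the tagger that would predict the tags is not shown.

```python
# Hypothetical tag set (an assumption, not the paper's): O = onset, N = nucleus, C = coda.
def syllables_from_tags(phonemes, tags):
    """Group phonemes into syllables, given one O/N/C tag per phoneme."""
    if len(phonemes) != len(tags):
        raise ValueError("need exactly one tag per phoneme")
    syllables = []
    current = []
    previous_tag = None
    for phoneme, tag in zip(phonemes, tags):
        # A new syllable begins when an onset or a nucleus follows
        # a nucleus or a coda of the preceding syllable.
        if previous_tag in ("N", "C") and tag in ("O", "N"):
            syllables.append(current)
            current = []
        current.append(phoneme)
        previous_tag = tag
    if current:
        syllables.append(current)
    return syllables


# Toy transcription of "tagging" (illustrative only).
print(syllables_from_tags(["t", "a", "g", "i", "ng"], ["O", "N", "O", "N", "C"]))
```

With tags of this kind, the syllable structure is fully determined by the tag sequence, which is what allows a standard sequence tagger to be applied to the task.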

A0723.pdf

ASSIGNING PHRASE BREAKS FROM PART-OF-SPEECH SEQUENCES

Authors: Alan W Black and Paul Taylor

Centre for Speech Technology Research, University of Edinburgh, 80, South Bridge, Edinburgh, U.K. EH1 1HN http://www.cstr.ed.ac.uk email: awb@cstr.ed.ac.uk, Paul.Taylor@ed.ac.uk

Volume 2 pages 995 - 998

ABSTRACT

One of the important stages in turning unmarked text into speech is the assignment of appropriate phrase break boundaries. These boundaries matter to later modules, including accent assignment, duration control and pause insertion. A number of algorithms have been proposed for the task, ranging from the simple to the complex. They require different information, such as part-of-speech tags, syntax and even semantic understanding of the text. These requirements come at differing costs, and it is important to trade off the difficulty of obtaining particular input features against the accuracy of the model.

The simplest models are deterministic rules. A model that simply inserts phrase breaks after punctuation is rarely wrong in the breaks it assigns, but it massively underpredicts, allowing overly long phrases when the text contains no punctuation. More complex rule-driven models such as [1] involve much more detailed rules and require the input text to be parsed. Statistically based models, on the other hand, offer the advantage of automatic training, which makes movement to a new domain or language much easier. Simple direct CART models using features such as punctuation, part of speech and accent positions can produce reasonable results [5]. More complex stochastic methods that optimise assignment over whole utterances (e.g. [8]) have also been developed.

An important restriction that is sometimes ignored in these algorithms is that the inputs to the phrase break assignment algorithm have to be available at phrase break assignment time, and must themselves be predictable from raw text. For example, some algorithms require accent assignment information, but we believe accent assignment can only take place after prosodic boundaries are identified. A second example is the requirement of a syntactic parse of the input without providing a syntactic parser to produce it. We have therefore ensured both that our phrase break assignment algorithm is properly placed within a full text-to-speech system and that the prediction of any required inputs is included in our tests.

A second requirement for our algorithm arose from our observation that many phrase break assignment algorithms attempt to estimate the probability of a break at some point based only on local information. However, what may locally appear to be a reasonable position for a break may in fact be less suitable than the position after the next word. That is, assignment should not be locally optimised but globally optimised over the whole utterance. For example, in the sentence "I wanted to go for a drive in the country." a good place for a break may locally appear to be between "drive" and "in", based on part-of-speech information. However, in the sentence "I wanted to go to a drive in." such a position is unsuitable. Another example is a uniform list of nouns. Breaks between nouns are unusual, but given a long list of nouns (e.g. the numbers 1 to 10) it becomes reasonable to insert a phrase break.

Thus we wish our model to have reasonable input requirements, to use predicted values for the inputs as part of the test, and to consider global optimisation of phrase break assignment over the whole utterance.
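
The difference between local and global optimisation can be illustrated with a small sketch. The text here does not specify how the optimisation is carried out, so the code below assumes one common approach, Viterbi decoding over a sequence of juncture labels; the label names (NB for no break, B for break), the function names and every probability are hypothetical values chosen for illustration, not figures from the paper.

```python
import math

JUNCTURES = ("NB", "B")  # hypothetical labels: no break / break


def greedy_breaks(local_probs):
    """Pick the locally most probable label at each juncture, ignoring neighbours."""
    return [max(JUNCTURES, key=probs.get) for probs in local_probs]


def viterbi_breaks(local_probs, trans_probs, start_probs):
    """Return the label sequence with the highest joint score over the utterance.

    local_probs: one dict per juncture, mapping label -> local probability.
    trans_probs: trans_probs[prev][cur], probability of cur following prev.
    start_probs: probability of each label at the first juncture.
    """
    if not local_probs:
        return []
    # Work in log space so that long utterances do not underflow.
    best = {j: math.log(start_probs[j]) + math.log(local_probs[0][j])
            for j in JUNCTURES}
    backpointers = []
    for probs in local_probs[1:]:
        new_best, pointers = {}, {}
        for cur in JUNCTURES:
            prev = max(JUNCTURES,
                       key=lambda p: best[p] + math.log(trans_probs[p][cur]))
            new_best[cur] = (best[prev] + math.log(trans_probs[prev][cur])
                             + math.log(probs[cur]))
            pointers[cur] = prev
        backpointers.append(pointers)
        best = new_best
    # Trace the best path back from the final juncture.
    path = [max(JUNCTURES, key=best.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up numbers: two adjacent junctures both look locally like breaks,
# but two breaks in a row are given a low transition probability.
local = [{"NB": 0.4, "B": 0.6}, {"NB": 0.45, "B": 0.55}, {"NB": 0.9, "B": 0.1}]
trans = {"NB": {"NB": 0.7, "B": 0.3}, "B": {"NB": 0.95, "B": 0.05}}
start = {"NB": 0.5, "B": 0.5}

print(greedy_breaks(local))                  # ['B', 'B', 'NB']
print(viterbi_breaks(local, trans, start))   # ['B', 'NB', 'NB']
```

In this toy case the greedy choice places breaks at two consecutive junctures, while the sequence-level decision keeps only the first, because the score of the whole utterance is taken into account.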

A0792.pdf

PREDICTION OF WORD PROMINENCE

Authors: Christina Widera, Thomas Portele, and Maria Wolters

Institut für Kommunikationsforschung und Phonetik (IKP), University of Bonn, Poppelsdorfer Allee 47, 53115 Bonn, Germany {cwi, tpo, mwo}@ikp.uni-bonn.de

Volume 2 pages 999 - 1002

ABSTRACT

Control of prosody is essential for the synthesis of natural-sounding speech. Text-to-speech systems tend to accent too many words when they take into account only the distinction between open-class and closed-class words. In the prominence-based approach [1], the degree of accentuation of a syllable is described in terms of a gradual prominence parameter. This paper presents the calculation of the prominence level of words based on their word class, the classes of the surrounding words, and their position in a clause. Rules predicting word prominence are derived from statistical analysis of a prosodic database. The hand-crafted rules are compared with the results of several machine learning algorithms on the same material. Furthermore, a perceptual test and an analysis of the resulting speech signals are carried out.

A0864.pdf