ICSLP'98 Proceedings
The Modeling and Realization of Natural Speech Generation System

Authors:
Chen Fang, Institute of Information Science, Northern Jiaotong University (China)
Page (NA) Paper number 1034

Abstract: This paper gives an overall discussion of the problems in Chinese natural speech generation. A Chinese bi-directional grammar is developed to support both Chinese language understanding and generation, and a comprehensive description of the structure of characteristic networks at all ranks of the language has been built up. In natural language generation, text planning is carried out first to extract the concrete content related to the semantics. Through text organization the internal generation structure is formed, and grammar realization transforms this internal structure into natural language. Once the natural-language text has been generated, the next step is to convert it into speech. We have built a speech characteristic database containing the speech of 50 thousand phrases and hundreds of pronunciation rules. After recognizing the structure of the input text and extracting its rhythmic characteristics, the database provides a complete mapping from Chinese characters to speech; every Chinese character in GB2312-80 can be rendered as speech. Based on the research above, a natural speech generation system is established which can automatically plan and organize the output sentences as natural speech. The synthetic speech has good naturalness and intelligibility.
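The abstract describes a three-stage generation pipeline (text planning, text organization, grammar realization) followed by text-to-speech conversion. The sketch below illustrates that staged structure only; all function names and the toy data are assumptions for illustration, not from the paper.

```python
# Illustrative sketch of the staged pipeline described in the abstract:
# text planning -> text organization -> grammar realization.
# All names and data are hypothetical.

def text_planning(semantics):
    """Select the content items relevant to the input semantics."""
    return [item for item in semantics if item.get("relevant", True)]

def text_organization(content):
    """Order selected content into an internal generation structure."""
    return sorted(content, key=lambda item: item.get("order", 0))

def grammar_realization(structure):
    """Map the internal structure to a surface string."""
    return " ".join(item["word"] for item in structure)

def generate(semantics):
    """Run the full planning -> organization -> realization pipeline."""
    return grammar_realization(text_organization(text_planning(semantics)))

print(generate([
    {"word": "weather", "order": 1},
    {"word": "today", "order": 0},
]))  # -> today weather
```

In the described system the realized text would then be passed to the phrase-level speech characteristic database for conversion to speech.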
0804_01.WAV (was 0804_01.WAV.gz) | Example sound file. File type: Sound; Format: NIST/Sphere; Sampling rate: 16 kHz; Bits per sample: 16; Encoding: Linear PCM; Creating application: Unknown; Creating OS: Unix
0804_02.WAV (was 0804_02.WAV.gz) | Example sound file. File type: Sound; Format: NIST/Sphere; Sampling rate: 16 kHz; Bits per sample: 16; Encoding: Linear PCM; Creating application: Unknown; Creating OS: Unix
0804_03.WAV (was 0804_03.WAV.gz) | Example sound file. File type: Sound; Format: NIST/Sphere; Sampling rate: 16 kHz; Bits per sample: 16; Encoding: Linear PCM; Creating application: Unknown; Creating OS: Unix
0804_04.WAV (was 0804_04.WAV.gz) | Example sound file. File type: Sound; Format: NIST/Sphere; Sampling rate: 16 kHz; Bits per sample: 16; Encoding: Linear PCM; Creating application: Unknown; Creating OS: Unix
0804_05.WAV (was 0804_05.WAV.gz) | Example sound file. File type: Sound; Format: NIST/Sphere; Sampling rate: 16 kHz; Bits per sample: 16; Encoding: Linear PCM; Creating application: Unknown; Creating OS: Unix
0804_06.WAV (was 0804_06.WAV.gz) | Example sound file. File type: Sound; Format: NIST/Sphere; Sampling rate: 16 kHz; Bits per sample: 16; Encoding: Linear PCM; Creating application: Unknown; Creating OS: Unix
Ismael García-Varea, Instituto Tecnológico de Informática, Universidad Politécnica de Valencia (Spain)
Francisco Casacuberta, Instituto Tecnológico de Informática, Universidad Politécnica de Valencia (Spain)
Hermann Ney, Lehrstuhl für Informatik VI, RWTH Aachen University of Technology (Germany)
The increasing interest in the statistical approach to Machine Translation is due to the development of effective algorithms for training the probabilistic models proposed so far. However, one of the problems with Statistical Machine Translation is the design of efficient algorithms for translating a given input string. For some interesting models, only (good) approximate solutions can be found. Recently, a Dynamic-Programming-like algorithm was introduced which computes approximate solutions for some models. These solutions can be improved by an iterative algorithm that refines the successive solutions and uses a smoothing technique for some probability distributions of the models, based on an interpolation of different distributions. The technique resulting from this combination has been tested on the "Tourist Task" corpus, which was generated in a semi-automated way. The best results achieved were a translation word-error rate of 9.3% and a sentence-error rate of 44.4%.
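The smoothing mentioned in the abstract interpolates several probability distributions into one. A minimal sketch of such linear interpolation, with illustrative dictionary-based distributions (the paper's actual model distributions and weights are not specified here):

```python
# Linear interpolation of probability distributions: a common smoothing
# technique. Distributions are dicts mapping an event to its probability;
# weights must sum to 1. This is a generic sketch, not the paper's exact model.

def interpolate(dists, weights):
    """Return the weighted mixture of the given distributions."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    events = set().union(*dists)
    return {
        e: sum(w * d.get(e, 0.0) for w, d in zip(weights, dists))
        for e in events
    }

# A sharp distribution smoothed with a flatter one: events unseen in the
# first distribution now receive non-zero probability.
smoothed = interpolate([{"a": 1.0}, {"a": 0.5, "b": 0.5}], [0.5, 0.5])
print(smoothed)  # -> {'a': 0.75, 'b': 0.25} (key order may vary)
```

The mixture remains a proper distribution: the interpolated probabilities still sum to 1.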
Barbara Gawronska, University of Skövde (Sweden)
David House, KTH (Royal Institute of Technology) (Sweden)
This paper describes an experimental dialog system designed to retrieve information and generate summaries of internet news reports related to user queries in Swedish and English. The extraction component is based on parsing and on matching the parsing output against stereotypic event templates. Bilingual text generation is accomplished by filling the templates, after which grammar components generate the final text. The interfaces between the templates and the language-specific text generators are marked for prosodic information, resulting in a text output where deaccentuation, accentuation, levels of focal accentuation, and phrasing are specified. These prosodic markers, which are primarily dependent on the givenness/newness structure of the text, modify the default prosody rules of the text-to-speech system, which then reads the text with subsequent improvement in intonation.
Joris Hulstijn, University of Twente (The Netherlands)
Arjan van Hessen, University of Twente (The Netherlands)
This paper discusses the utterance generation module of a spoken dialogue system for transactions. Transactions are interesting because they involve obligations of both parties: the system should provide all relevant information; the user should feel committed to the transaction once it has been concluded. Utterance generation plays a major role in this. The utterance generation module works with prosodically annotated utterance templates. An appropriate template for a given dialogue act is selected by the following parameters: utterance type, body of the template, given information, wanted and new information. Templates respect rules of accenting and deaccenting.
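The abstract describes selecting a prosodically annotated utterance template by parameters such as utterance type and given/wanted/new information. A toy sketch of such parameter-driven template selection; the field names, slot syntax, and example templates are assumptions for illustration only:

```python
# Hypothetical sketch of template selection by dialogue-act parameters.
# Template fields and slot markers (*SLOT*) are illustrative assumptions.

TEMPLATES = [
    {"type": "confirm", "given": True,  "text": "So you want to travel to *TO_PLACE*?"},
    {"type": "inform",  "given": False, "text": "The train leaves at *TIME*."},
]

def select_template(utt_type, given):
    """Return the text of the first template matching the parameters."""
    for t in TEMPLATES:
        if t["type"] == utt_type and t["given"] == given:
            return t["text"]
    return None  # no matching template

print(select_template("confirm", True))
# -> So you want to travel to *TO_PLACE*?
```

In the described system the selected template additionally carries prosodic annotation, so accenting and deaccenting rules apply when the slots are filled.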
Kai Ishikawa, ATR Interpreting Telecommunications Research Laboratories (Japan)
Eiichiro Sumita, ATR Interpreting Telecommunications Research Laboratories (Japan)
Hitoshi Iida, ATR Interpreting Telecommunications Research Laboratories (Japan)
In speech translation, recognition errors produced by the speech recognition process can cause parsing and translation errors. Because of this, the development of a robust error handling framework is essential to improve the performance of a speech translation system. Previously, a robust translation method was proposed by Wakita, which translates only the reliable parts of utterances. In this method, however, the recall of translated parts over a whole utterance is low, and sometimes no translation is output at all. In this paper, we propose an example-based error recovery method to solve the low-recall problem of Wakita's method. The proposed method recovers an unreliable utterance by repairing its parse-tree based on similar example parse-trees in the treebank. A recovered translation is then generated from the recovered tree.
Emiel Krahmer, IPO, Center for Research on User-System Interaction (The Netherlands)
Mariët Theune, IPO, Center for Research on User-System Interaction (The Netherlands)
Probably the best current algorithm for generating definite descriptions is the Incremental Algorithm due to Dale and Reiter. If we want to use this algorithm in a Concept-to-Speech system, however, we encounter two limitations: (1) the algorithm is insensitive to the linguistic context and thus always produces the same description for an object, (2) the output is a list of properties which uniquely determine one object from a set of objects: how this list is to be expressed in spoken natural language is not addressed. We propose a modification of the Incremental Algorithm based on the idea that a definite description refers to the most salient element in the current context satisfying the descriptive content. We show that the modified algorithm allows for the context-sensitive generation of both distinguishing and anaphoric descriptions, while retaining the attractive properties of Dale and Reiter's original algorithm.
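The abstract contrasts the original Incremental Algorithm with a salience-sensitive variant: a description succeeds once the target is the most salient object still matching it. A simplified sketch of that idea follows; the object representation, the preference order, and the numeric salience scores are assumptions for illustration, not the authors' exact formulation:

```python
# Simplified sketch of Dale & Reiter's Incremental Algorithm with the
# salience modification described above: attributes are tried in a fixed
# preference order, each kept only if it rules out distractors, and the
# search stops as soon as every remaining distractor is less salient than
# the target. Data representation is a hypothetical assumption.

def incremental_describe(target, objects, preferred_attrs):
    description = {}
    distractors = [o for o in objects if o is not target]
    for attr in preferred_attrs:
        value = target["props"].get(attr)
        ruled_out = [o for o in distractors if o["props"].get(attr) != value]
        if ruled_out:
            description[attr] = value
            distractors = [o for o in distractors if o not in ruled_out]
        # salience modification: succeed once the target is the most
        # salient object still satisfying the description
        if all(o["salience"] < target["salience"] for o in distractors):
            return description
    return None  # no distinguishing description found

dog1 = {"props": {"type": "dog", "color": "brown"}, "salience": 10}
dog2 = {"props": {"type": "dog", "color": "black"}, "salience": 5}
cat  = {"props": {"type": "cat", "color": "black"}, "salience": 5}
print(incremental_describe(dog1, [dog1, dog2, cat], ["type", "color"]))
# -> {'type': 'dog'}: "the dog" suffices, since dog1 outranks the other dog
```

With the original, salience-blind algorithm, "the dog" alone would be ambiguous here; the modification licenses the shorter anaphoric description because dog1 is currently the most salient dog.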
Lori Levin, Carnegie Mellon University (USA)
Donna Gates, Carnegie Mellon University (USA)
Alon Lavie, Carnegie Mellon University (USA)
Alex Waibel, Carnegie Mellon University (USA)
This paper describes an interlingua for spoken language translation that is based on domain actions in the travel planning domain. Domain actions are composed of speech acts (e.g., request-information), attributes (e.g., size, price), and objects (e.g., hotel, flight) and can take arguments. Development of the interlingua is guided by a database containing travel dialogues in English, Korean, Japanese, and Italian. There are currently 423 domain actions that cover hotel reservation and transportation. The interlingua will soon be extended to cover tours, tourist attractions, and events. The interlingua is used by the C-STAR speech translation consortium for translating travel planning dialogues in six languages: English, Japanese, German, Korean, Italian, and French. The paper also addresses the role of the interlingua in Carnegie Mellon's JANUS translation system.
Sandra Williams, Microsoft Research Institute (Australia)
This paper describes a concept-to-speech system for generating spoken descriptions of routes between places within the Macquarie University Computing Department. The Natural Language Generation (NLG) component of the system generates a textual route description marked with intonational information. The discourse structure of the route description is closely related to the knowledge representation of the route. The NLG component includes a pitch accenting algorithm which places appropriate pitch accents on elements of the utterance requiring particular emphasis or stress. Our pitch accenting algorithm uses a domain knowledge base and a discourse history. From these it determines whether information selected to form the content of the utterance is shared mutual domain knowledge, given information, or new information. It can then assign an appropriate pitch accent to one word in each prosodic phrase. The text-to-speech component then determines the appropriate syllable to be accented in the word.
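The abstract's core mechanism, accenting one word per prosodic phrase based on given/new status tracked in a discourse history, can be sketched roughly as below. The choice of the last new word as the accent bearer and the set-based discourse history are simplifying assumptions for illustration, not the paper's exact algorithm:

```python
# Hypothetical sketch of given/new-based pitch accent assignment:
# one word per prosodic phrase is accented, preferring the last word
# not yet mentioned in the discourse history.

def assign_accents(phrases, discourse_history):
    """phrases: list of prosodic phrases (lists of words).
    Returns each phrase as (word, accented?) pairs; updates the history."""
    accented = []
    for phrase in phrases:
        new_words = [w for w in phrase if w.lower() not in discourse_history]
        # accent the last new word; fall back to the phrase-final word
        focus = new_words[-1] if new_words else phrase[-1]
        accented.append([(w, w == focus) for w in phrase])
        discourse_history.update(w.lower() for w in phrase)
    return accented

history = set()
print(assign_accents([["turn", "left"], ["then", "turn", "right"]], history))
# second "turn" is given (deaccented); "right" carries the accent
```

A text-to-speech component would then, as the abstract notes, decide which syllable of each accented word receives the accent.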
Tobias Ruland, Siemens AG (Germany)
C. J. Rupp, University of the Saarland (Germany)
Jörg Spilker, University of Erlangen-Nürnberg (Germany)
Hans Weber, University of Erlangen-Nürnberg (Germany)
Karsten L. Worm, University of the Saarland (Germany)
This paper describes ongoing research on robust spoken language understanding in the context of the Verbmobil speech-to-speech machine translation project. We focus on recent developments in the processing steps which map a word lattice to semantic representations. The approach described first applies speech-repair correction to word lattices. Four analysis methods of varying depth are then applied in parallel to the normalized word lattices, producing output for sub-portions of the lattice in the same semantic description language, the VIT format. These fragmentary analyses are stored and combined by a further processing component, which finally selects a sequence of semantic representations as a result.
Jon R.W. Yi, MIT Laboratory for Computer Science (USA)
James R. Glass, MIT Laboratory for Computer Science (USA)
The goal of this work was to develop a speech synthesis system which concatenates variable-length units to create natural-sounding speech. Our initial work showed that by careful design of system responses to ensure consistent intonation contours, natural-sounding speech synthesis was achievable with word- and phrase-level concatenation. In order to extend the flexibility of this framework, we focused on generating novel words from a corpus of sub-word units. The design of the corpus was motivated by perceptual experiments that investigated where speech could be spliced with minimal audible distortion and what contextual constraints were necessary to maintain in order to produce natural-sounding speech. From this sub-word corpus, a Viterbi search selects a sequence of units based on how well they match the input specification and concatenation constraints. This concatenative speech synthesis system, ENVOICE, has been used in a conversational system in two application domains to convert meaning representations into speech waveforms.
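The Viterbi search over candidate units described in the abstract minimizes a combination of how well each unit matches the input specification (target cost) and how well adjacent units join (concatenation cost). A generic dynamic-programming sketch of such unit selection; the cost functions and unit representation are illustrative assumptions, not ENVOICE's actual costs:

```python
# Generic sketch of Viterbi unit selection for concatenative synthesis:
# pick one unit per position minimizing total target + concatenation cost.
# Units are plain strings here; real systems use rich acoustic features.

def viterbi_select(spec, candidates, target_cost, concat_cost):
    """spec[i]: desired unit at position i; candidates[i]: available units.
    Returns the unit sequence with minimal total cost."""
    # best[i][u] = (cheapest cost of a path ending in unit u, backpointer)
    best = [{u: (target_cost(spec[0], u), None) for u in candidates[0]}]
    for i in range(1, len(spec)):
        layer = {}
        for u in candidates[i]:
            prev, cost = min(
                ((p, c + concat_cost(p, u)) for p, (c, _) in best[i - 1].items()),
                key=lambda pc: pc[1])
            layer[u] = (cost + target_cost(spec[i], u), prev)
        best.append(layer)
    # trace back from the cheapest final unit
    u = min(best[-1], key=lambda k: best[-1][k][0])
    path = [u]
    for i in range(len(spec) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return list(reversed(path))

# Toy costs: prefer units whose name starts with the spec symbol, and
# joins that follow unit "a1".
tc = lambda s, u: 0 if u.startswith(s) else 1
cc = lambda p, u: 0 if p == "a1" else 1
print(viterbi_select(["a", "b"], [["a1", "a2"], ["b1"]], tc, cc))
# -> ['a1', 'b1']
```

The search is linear in the number of positions and quadratic in the candidates per position, which keeps selection tractable even for large unit corpora.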
1151_01.WAV | 1st of 3 example waveforms from section 2. File type: Sound; Format: WAV; Sampling rate: 16000 Hz; Bits per sample: 16; Channels: mono; Encoding: PCM; Creating application: Unknown; Creating OS: Unknown
1151_02.WAV | 2nd of 3 example waveforms from section 2. File type: Sound; Format: WAV; Sampling rate: 16000 Hz; Bits per sample: 16; Channels: mono; Encoding: PCM; Creating application: Unknown; Creating OS: Unknown
1151_03.WAV | 3rd of 3 example waveforms from section 2. File type: Sound; Format: WAV; Sampling rate: 16000 Hz; Bits per sample: 16; Channels: mono; Encoding: PCM; Creating application: Unknown; Creating OS: Unknown
1151_04.WAV | 1st of 2 example waveforms from section 6. File type: Sound; Format: WAV; Sampling rate: 16000 Hz; Bits per sample: 16; Channels: mono; Encoding: PCM; Creating application: Unknown; Creating OS: Unknown
1151_05.WAV | 2nd of 2 example waveforms from section 6. File type: Sound; Format: WAV; Sampling rate: 16000 Hz; Bits per sample: 16; Channels: mono; Encoding: PCM; Creating application: Unknown; Creating OS: Unknown