ICSLP'98 Proceedings: Multimedia Files
How To Handle "Foreign" Sounds in Swedish Text-to-Speech Conversion: Approaching the 'Xenophone' Problem

Authors:
Robert Eklund, Telia Research AB (Sweden)
Paper number 514

Abstract: This paper discusses the problem of handling 'foreign' speech sounds in Swedish speech technology systems, in particular speech synthesis. A production study shows that Swedish speakers add foreign speech sounds, here termed 'xenophones', to their phone repertoire when reading Swedish sentences with embedded English names and words. Based on these observations, the phone set of a Swedish concatenative synthesizer is extended, and it is shown by example that this produces more natural-sounding synthetic speech.
0514_01.WAV | Speech synthesis example. Sound file: linear PCM, 16 kHz sampling rate, 16 bits per sample, mono.
0514_02.WAV | Speech synthesis example. Sound file: linear PCM, 16 kHz sampling rate, 16 bits per sample, mono.
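The xenophone idea can be sketched as a lookup over an extended unit inventory: foreign phones that have been added to the synthesizer's repertoire are passed through, while a system without them must fold each foreign sound onto its nearest native phone. All phone symbols and mappings below are invented for illustration; they are not the paper's actual Swedish inventory.

```python
# Hypothetical sketch of a xenophone-extended phone inventory.
# Symbols and the fallback table are illustrative assumptions.

SWEDISH_PHONES = {"p", "t", "k", "b", "d", "g", "s", "f", "v", "m", "n", "l", "r"}

# Xenophones: foreign phones added to the synthesizer's repertoire,
# e.g. English /w z dh th/ absent from native Swedish.
XENOPHONES = {"w", "z", "dh", "th", "dzh"}

# Fallback for a system without xenophone units: nearest native phone.
NATIVE_FALLBACK = {"w": "v", "z": "s", "dh": "d", "th": "t", "dzh": "d"}

def transcribe(phones, xenophones_enabled=True):
    """Map a phone sequence onto the available unit inventory."""
    out = []
    for ph in phones:
        if ph in SWEDISH_PHONES:
            out.append(ph)
        elif xenophones_enabled and ph in XENOPHONES:
            out.append(ph)
        else:
            out.append(NATIVE_FALLBACK.get(ph, ph))
    return out

# An embedded English word inside a Swedish sentence:
print(transcribe(["w", "i", "th"]))                            # ['w', 'i', 'th']
print(transcribe(["w", "i", "th"], xenophones_enabled=False))  # ['v', 'i', 't']
```

With xenophones enabled the foreign sounds survive intact; disabled, they collapse to native approximations, which is the "accent" the extended phone set is meant to avoid.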
Nick Campbell, ATR-ITL (Japan)
This paper describes a method of concatenative speech synthesis that makes use of 3-dimensional labelling of speech, and shows how this can be applied to the synthesis of both mono-lingual and foreign-language speech. The dimensions encode phonetic, prosodic, and voice-quality information in order to fully describe the acoustic characteristics of each speech segment.
0024_01.WAV, 0024_02.WAV | Since CHATR produces speech in the recognisable voice of a known person, it offers the potential to extend that person's apparent abilities into the realm of multi-linguality. By offering this ability to the voice of a young child, we are perhaps meeting Furui's expectations [3]. Sound files: WAV.
0024_03.WAV, 0024_04.WAV, 0024_05.WAV, 0024_06.WAV | By mapping from the phone sequence predicted for synthesis in one language to the phone-set used to label the speech of another, we can produce foreign-language speech using the voice of any speaker. In these examples we use the voice of a small Japanese child to speak in English (0024_03.WAV, 0024_04.WAV: greeting) and Korean (0024_05.WAV, 0024_06.WAV: explaining the technical processing within CHATR). Sound files: WAV.
0024_07.PDF (was: 0024_01.GIF) | Section 5.2: To reduce the 'accent', we adopt the following two-stage process (schematic). Image file: GIF.
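The cross-language mapping described above can be sketched as a simple substitution table: each phone predicted for the target language is replaced by a label that exists in the source speaker's database, with shared phones passing through unchanged. The mapping entries below are invented for illustration; CHATR's actual label sets and mapping rules differ.

```python
# Illustrative English-to-Japanese phone-label mapping (toy entries only).
EN_TO_JA = {
    "l": "r",    # Japanese lacks /l/; nearest labelled unit is the flap
    "v": "b",
    "th": "s",
    "ae": "a",
    "ih": "i",
}

def map_phones(target_phones, mapping):
    """Replace each target-language phone with a label present in the
    source speaker's database, leaving shared phones unchanged."""
    return [mapping.get(ph, ph) for ph in target_phones]

# English "hello" phone sequence rendered with Japanese-labelled units:
print(map_phones(["h", "e", "l", "ou"], EN_TO_JA))  # ['h', 'e', 'r', 'ou']
```

Selection then proceeds as in mono-lingual synthesis, but over the mapped labels, which is why the output keeps the source speaker's voice while approximating the target language's segments.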
Takashi Saito, Tokyo Research Laboratory, IBM Japan Ltd. (Japan)
This paper focuses on a method for automatically dividing speech utterances into phonemic segments, which are used for constructing synthesis unit inventories for speech synthesis. Here, we propose a new segmentation parameter called "F0 dynamics" (DF0). In the fine structure of F0 contours, phonemic events are observed as local dips at phonemic transition regions, especially around voiced consonants. We apply this observation to a speech segmentation method: the DF0 parameter is used in the final stage of the segmentation procedure to refine the phonemic boundaries roughly obtained by DP alignment. We conduct experiments on the proposed automatic segmentation with a speech database prepared for unit inventory construction, and compare the obtained boundaries with those of manual segmentation to show the effectiveness of the method. We also discuss the effects of the boundary refinement on the synthesized speech.
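A minimal sketch of the boundary-refinement idea, assuming frame-based F0 values: the frame-to-frame F0 dynamics are computed, and a rough boundary from DP alignment is snapped to the nearest local dip (where the derivative crosses from negative to positive). The search window and the example contour are invented values, not the paper's settings.

```python
import numpy as np

def refine_boundary(f0, rough_idx, window=5):
    """Move a rough boundary frame index to the nearest local F0 dip."""
    df0 = np.gradient(f0)                      # frame-to-frame F0 dynamics
    lo = max(rough_idx - window, 1)
    hi = min(rough_idx + window, len(f0) - 1)
    # A dip: derivative crosses from negative to non-negative (local minimum).
    candidates = [i for i in range(lo, hi)
                  if df0[i - 1] < 0 <= df0[i]]
    if not candidates:
        return rough_idx                       # no dip found; keep DP result
    return min(candidates, key=lambda i: abs(i - rough_idx))

# Toy voiced-consonant transition: F0 dips at frame 4, DP put the boundary at 5.
f0 = np.array([120, 118, 115, 110, 112, 116, 119, 121, 120, 118], float)
print(refine_boundary(f0, rough_idx=5))        # snaps to the dip at frame 4
```

In the paper this refinement is only the final stage; the rough boundaries themselves come from DP alignment against phonemic models.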
Atsuhiro Sakurai, Dept. of Information and Communication Engineering, The Univ. of Tokyo and Tsukuba R&D Center, Texas Instruments (Japan)
Takashi Natsume, Dept. of Information and Communication Engineering, The Univ. of Tokyo (Japan)
Keikichi Hirose, Dept. of Information and Communication Engineering, The Univ. of Tokyo (Japan)
We propose a method to generate a database that contains a parametric representation of F0 contours associated with linguistic and acoustic information, to be used by data-driven Japanese text-to-speech (TTS) systems. The configuration of the database includes recorded speech, F0 contours and their parametric labels, phonetic transcription with durations, and other linguistic information such as orthographic transcription, part-of-speech (POS) tags, and accent types. All information that is not available by dictionary lookup is obtained automatically. In this paper, we propose a method to automatically obtain parametric labels that describe F0 contours based on a superpositional model. Preliminary tests on a small data set show that the method can find the parametric representation of F0 contours with acceptable accuracy, and that accuracy can be improved by introducing additional linguistic information.
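The superpositional view underlying the parametric labels can be sketched as follows: log F0 is modelled as a baseline plus phrase and accent components, in the style of Fujisaki-type models. All parameter values below are invented for illustration; the paper's contribution is estimating such labels automatically from data.

```python
import numpy as np

def phrase_component(t, t0, ap, alpha=2.0):
    """Second-order critically damped response to a phrase impulse at t0."""
    dt = np.maximum(t - t0, 0.0)
    return ap * alpha**2 * dt * np.exp(-alpha * dt)

def accent_component(t, t1, t2, aa, beta=20.0):
    """Smoothed step up at accent onset t1 and back down at offset t2."""
    def step(dt):
        dt = np.maximum(dt, 0.0)
        return 1 - (1 + beta * dt) * np.exp(-beta * dt)
    return aa * (step(t - t1) - step(t - t2))

# Toy contour: 100 Hz baseline, one phrase command, one accent command.
t = np.linspace(0, 2, 200)
ln_f0 = (np.log(100.0)
         + phrase_component(t, t0=0.0, ap=0.5)
         + accent_component(t, t1=0.4, t2=0.9, aa=0.4))
f0 = np.exp(ln_f0)
print(round(float(f0.max()), 1))
```

In the database described above, the commands (timings and amplitudes) are exactly the "parametric labels" stored alongside the recorded contour, so a data-driven TTS system can learn to predict them from the linguistic information.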
Alexander Kain, Oregon Graduate Institute of Science and Technology (USA)
Michael W. Macon, Oregon Graduate Institute of Science and Technology (USA)
Voice adaptation describes the process of converting the output of a text-to-speech synthesizer voice to sound like a different voice after a training process in which only a small amount of the desired target speaker's speech is seen. We employ a locally linear conversion function based on Gaussian mixture models to map bark-scaled line spectral frequencies. We compare performance for three different estimation methods while varying the number of mixture components and the amount of data used for training. An objective evaluation revealed that all three methods yield similar test results. In perceptual tests, listeners judged the converted speech quality as acceptable and fairly successful in adapting to the target speaker.
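The locally linear conversion function can be sketched as a posterior-weighted sum of per-component linear maps under a Gaussian mixture model: each component m contributes A_m x + b_m, weighted by P(m | x). The toy parameters below stand in for values that would be estimated from aligned source/target features (e.g. bark-scaled line spectral frequencies); this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, M = 2, 3                      # feature dimension, mixture components
weights = np.full(M, 1.0 / M)
means = rng.normal(size=(M, DIM))
variances = np.ones((M, DIM))      # diagonal covariances
A = [np.eye(DIM) * (1 + 0.1 * m) for m in range(M)]   # per-component maps
b = [np.full(DIM, 0.05 * m) for m in range(M)]

def posteriors(x):
    """P(m | x) for a diagonal-covariance GMM, computed in the log domain."""
    log_p = (np.log(weights)
             - 0.5 * np.sum((x - means) ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=1))
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

def convert(x):
    """Posterior-weighted sum of per-component linear transforms."""
    post = posteriors(x)
    return sum(post[m] * (A[m] @ x + b[m]) for m in range(M))

x = np.array([0.3, -0.2])          # a toy source feature vector
y = convert(x)
print(y.shape)                     # (2,)
```

Because the posteriors vary smoothly with x, the overall mapping is continuous yet locally linear, which is what lets a small amount of target-speaker data constrain the conversion.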
Gregor Möhler, University of Stuttgart (Germany)
In this study a data-based approach to intonation modeling is presented. The model incorporates knowledge from intonation theories like the expected types of F0 movements and syllable anchoring. The knowledge is integrated into the model using an appropriate approximation function for F0 parametrization. The F0 parameters that result from the parametrization are predicted from a set of features using neural nets. The quality of the generated contours is assessed by means of numerical measures and perception tests. They show that the basic hypotheses about intonation description and modeling are in principle correct and that they have the potential to be successfully applied to speech synthesis. We argue for a clear interface with a linguistic description (using pitch-accent and boundary labels as input) and discourse structure (using pitch-range normalized F0 parameters).
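The modelling chain described above can be sketched in two steps: each pitch-accent's F0 movement is approximated by a small parametric function, and the fitted parameters become the regression targets a neural net would predict from linguistic features. The peak-shaped approximation function below is an invented stand-in, not the parametrization used in the study.

```python
import numpy as np

def accent_shape(t, height, peak_pos, width):
    """Toy peak-shaped F0 movement over normalized syllable time t in [0,1]."""
    return height * np.exp(-((t - peak_pos) / width) ** 2)

def fit_accent(t, f0, grid=21):
    """Grid-search least-squares fit of the three shape parameters."""
    best, best_err = None, np.inf
    for peak in np.linspace(0.2, 0.8, grid):
        for width in np.linspace(0.1, 0.6, grid):
            basis = np.exp(-((t - peak) / width) ** 2)
            h = float(basis @ f0 / (basis @ basis))  # closed-form height
            err = float(np.sum((f0 - h * basis) ** 2))
            if err < best_err:
                best, best_err = (h, peak, width), err
    return best

t = np.linspace(0, 1, 50)
f0 = accent_shape(t, 30.0, 0.5, 0.3)           # a synthetic accent movement
print(fit_accent(t, f0))                       # recovers (30.0, 0.5, 0.3) closely
```

The fitted (height, position, width) triples play the role of the F0 parameters in the study; pitch-range normalization of the height term is what gives the clean interface to discourse structure argued for above.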