Session WMC: Databases, Tools and Evaluations

Chairperson: Khalid Choukri, ELRA



THE BAVARIAN ARCHIVE FOR SPEECH SIGNALS: RESOURCES FOR THE SPEECH COMMUNITY

Authors: F. Schiel, Ch. Draxler, H.G. Tillmann

Institut für Phonetik und Sprachliche Kommunikation, Schellingstr. 3, 80799 München, Germany. E-mail: bas@phonetik.uni-muenchen.de

Volume 4 pages 1687 - 1690

ABSTRACT

This paper gives an overview of the activities at the Bavarian Archive for Speech Signals (BAS), which was founded as a non-profit organization in 1995. The main purpose of BAS is the development of a Complete Phonetic Theory (CPT) of German based on the empirical exploitation of very large databases of spoken German. On the way to that goal, however, BAS will act as a focal point for all computer-readable speech resources in the German language and distribute these resources to the speech community. These resources are intended to cover the speech part of the German language, i.e. speech data, labeling and segmentations, and knowledge about pronunciation. In the following we give a concise overview of what resources are presently available at BAS, how they were produced, how they can be obtained from BAS, and how we use these resources in various scientific activities, together with a brief summary of ongoing projects.

A0374.pdf



WWWTranscribe - A MODULAR TRANSCRIPTION SYSTEM BASED ON THE WORLD WIDE WEB

Authors: Christoph Draxler

IPSK – Department of Phonetics and Speech Communication University of Munich Schellingstr. 3, D 80799 Munich, Germany Tel. +49/89/2866 9968, Fax +49/89/280 0362, E-mail draxler@phonetik.uni-muenchen.de

Volume 4 pages 1691 - 1694

ABSTRACT

WWWTranscribe is a transcription system based on the WWW. It is platform independent and allows network access to speech databases. Its modular structure makes it flexible, and it connects easily to existing signal processing applications or database management systems. WWWTranscribe consists of static HTML documents containing forms. CGI applications attached to these forms perform the data processing and dynamically create subsequent HTML documents. The system has been developed for the orthographic annotation of the German SpeechDat(II) telephone speech database. In its current implementation, it automatically creates SAM annotation files according to the SpeechDat(II) database specifications [5], [4]. Variants of the system are being used for transcription by other SpeechDat(II) partners.

A0375.pdf



DESIGN, RECORDING AND VERIFICATION OF A DANISH EMOTIONAL SPEECH DATABASE

Authors: Inger S. Engberg, Anya V. Hansen, Ove Andersen and Paul Dalsgaard

Center for PersonKommunikation, Aalborg University, Frederik Bajers Vej 7 A2, 9220 Aalborg Øst, Denmark. Tel. +45 96 35 86 78, FAX: +45 98 15 15 83, E-mail: ise@cpk.auc.dk

Volume 4 pages 1695 - 1698

ABSTRACT

A database of recordings of Danish Emotional Speech, DES, has been recorded and analysed. DES was collected in order to evaluate how well the emotional state in emotional speech is identified by humans. The results set a standard for identifying Danish emotional speech. DES contains recordings from four actors, two of each gender. Actors were used for the recordings as they were believed to be able to realistically convey a number of emotions, namely: neutral, surprise, happiness, sadness and anger. The recordings from each actor consist of two isolated words, nine sentences and two passages. The complete database comprises approximately 30 minutes of speech. A listening test with 20 listeners was conducted. The emotions were on average identified correctly in 67.3% of the cases, with a [66.0 - 68.6] 95% confidence interval. An analysis reveals that most confusion occurred between surprise and happiness and between neutral and sadness.
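A confidence interval of this form can be reproduced with the standard normal approximation for a proportion. A minimal sketch; the number of judgments n below is hypothetical and not taken from the paper:

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    p_hat: observed proportion of correct identifications
    n:     number of independent judgments
    z:     critical value (1.96 for a 95% interval)
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Hypothetical example: 67.3% correct over 5000 judgments
lo, hi = proportion_ci(0.673, 5000)
```

The interval width shrinks with the square root of the number of judgments, which is why a narrow interval like the one reported implies a large number of listener responses.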

A0385.pdf



ISSUES IN DATABASE CREATION: RECORDING NEW POPULATIONS, FASTER AND BETTER LABELLING

Authors: M. Eskenazi, C. Hogan, J. Allen, R. Frederking

Language Technologies Institute Cyert Hall Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, Pa. 15213 USA. Tel. +1 412 268 3858, FAX: +1 412 268 6298, E-mail: max@cs.cmu.edu

Volume 4 pages 1699 - 1702

ABSTRACT

As speech recognition systems become more accurate, they are used for more diverse applications. These applications often involve populations who never used a recogniser before and for whom the standard data for adult male, adult female, or mixed adult speech is not very representative. This paper will deal with issues concerning the collection and processing of data from those new speaker populations and from speakers of different languages. It deals with data collected for various projects, such as the KIDS database [1] and the Diplomat project [2]. It specifically discusses issues related to obtaining quantitatively and qualitatively sufficient amounts of speech from diverse speaker populations. Since the speech of these individuals is very different from the speech collected in the past, we assume that some hand labelling may be necessary and therefore also address the issue of ameliorating the labelling process.

A0626.pdf



Design and Analysis of a German Telephone Speech Database for Phoneme Based Training

Authors: Stefan Feldes*, Bernhard Kaspar* & Denis Jouvet**

email: {feldes, kaspar}@tzd.telekom.de * Research Group Speech Processing, Deutsche Telekom Berkom, 64295 Darmstadt, Germany ** France Télécom - CNET- LAA/TSS/RCP, 2 Avenue Pierre Marzin, 22307 Lannion, France

Volume 4 pages 1703 - 1706

ABSTRACT

Based on the Sotscheck text corpus, we developed a new corpus that was specifically optimised for training phoneme-based recognition systems. Particular attention was paid to good coverage of phone transitions. Even though the resulting corpus is only slightly larger, it shows increased phonetic coverage while maintaining good phonetic balance. Results of a statistical phonetic analysis and of experiments on training an allophone-based recognizer are reported here.
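The abstract does not describe the optimisation procedure itself; a common approach to maximising phone-transition coverage is greedy sentence selection. A hedged sketch of that general technique, assuming sentences are already given as lists of phone symbols (not the paper's actual method):

```python
def greedy_select(sentences, n_select):
    """Greedily pick sentences that add the most unseen phone transitions (diphones)."""
    def diphones(phones):
        return {(phones[i], phones[i + 1]) for i in range(len(phones) - 1)}

    covered, chosen = set(), []
    pool = list(sentences)
    for _ in range(min(n_select, len(pool))):
        # Pick the sentence contributing the most diphones not yet covered
        best = max(pool, key=lambda s: len(diphones(s) - covered))
        if not diphones(best) - covered:
            break  # nothing new to gain; stop early
        chosen.append(best)
        covered |= diphones(best)
        pool.remove(best)
    return chosen, covered
```

Such a greedy pass explains how a corpus can grow only slightly while its transition coverage increases substantially: redundant sentences are never selected.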

A0652.pdf



THE DESIGN OF A LARGE VOCABULARY SPEECH CORPUS FOR PORTUGUESE

Authors: João P. Neto, Ciro A. Martins, Hugo Meinedo and Luís B. Almeida

INESC - IST R. Alves Redol, 9 1000 Lisboa - Portugal E-Mails: jpn@inesc.pt, cam@inesc.pt, hdsm@inesc.pt, lba@inesc.pt

Volume 4 pages 1707 - 1710

ABSTRACT

Recent years have seen great development of large-vocabulary, speaker-independent continuous speech recognition systems, as well as some research into multilingual aspects. To extend that development to the European Portuguese language, we decided to develop and collect a large database of continuous speech based on a large amount of text. In the development of this new Portuguese database our aim was to create a corpus equivalent in size to WSJ0. We selected the database texts from the PUBLICO newspaper, which is characterized by a broad coverage of subjects and different writing styles. The recording population was selected from a large engineering school, assuring a large variability of speakers. The recordings are being made as we write this paper, and we expect to release the database in CD format in September 1997.

A0654.pdf



CONTINUED INVESTIGATIONS OF LARYNGECTOMEE SPEECH IN NOISE - MEASUREMENTS AND INTELLIGIBILITY TESTS

Authors: Lennart Nord, Britta Hammarberg*, and Elisabet Lundström

Department of Speech, Music and Hearing, KTH, STOCKHOLM, SWEDEN *also Dept of Logopedics and Phoniatrics, Karolinska Institute, Huddinge University Hospital Phone +46 8 790 7874, Fax +46 8 790 7854, e-mail: lennord@speech.kth.se

Volume 4 pages 1711 - 1714

ABSTRACT

The speech of nine laryngectomized persons is analysed. Both tracheo-esophageal and esophageal speakers are included. The speech performance of the subjects is evaluated while they are reading texts aloud with varying amounts of noise in their ears. One hypothesis is that the amount of noise will influence the articulation skill and voice behaviour of the speakers. Preliminary results show that some of the tracheo-esophageal speakers were able to raise their voice level as much as the normal laryngeal speakers. The esophageal speakers, on the other hand, were usually not able to produce voice levels as strong during the text readings. Acoustic speech parameters, such as sound pressure and spectral characteristics, were measured and compared among the subjects. When the speakers were forced to use a great deal of effort, they spent air rapidly and had to use shorter stretches of speech with many pauses for inhalation.

A0674.pdf



AN APPRECIATION STUDY OF AN ASR INQUIRY SYSTEM

Authors: L.J.M. Rothkrantz, W.A.Th. Manintveld, M.M.M. Rats, R.J. van Vark, J.P.M. de Vreught and H. Koppelaar

Knowledge Based Systems, Technical Computer Science, Delft University of Technology. E-mail: alparon@kgs.twi.tudelft.nl

Volume 4 pages 1715 - 1718

ABSTRACT

Human factors play an important role in applications of speech technology. In a Wizard of Oz experiment, 64 telephone inquiry systems were simulated by systematically manipulating 6 human factors. To assess the impact of those factors, an appreciation scale was developed based on a questionnaire. In the experiment, 414 respondents were asked to call one of the simulated systems and a system operated by human operators. The respondents rated these systems on the appreciation scale. This paper describes the experiment and presents the results of a statistical analysis of the appreciation scores.

A0708.pdf



OBJECT-ORIENTED MODELING OF ARTICULATORY DATA FOR SPEECH RESEARCH INFORMATION SYSTEMS

Authors: K. Bensaber, P. Munteanu (1), J.F. Serignat (1), P. Perrier

Institut de la Communication Parlée Tel. (33) 04 76 57 45 41, FAX: (33) 04 76 57 47 10, E-mail: bensaber@icp.grenet.fr (1) Now at CLIPS/IMAG

Volume 4 pages 1719 - 1722

ABSTRACT

In this paper we present a general framework, based on the object-oriented paradigm, for modeling speech data representation, and we propose a particular use of it for cineradiographic data, including sagittal views of the vocal tract, frontal pictures of the lips, and acoustic signals. We introduce semantics to represent relationships between speech objects. We adopt the concepts of primary data, meaning either the raw data (recorded signals and images) or their related descriptive data (information on speakers, corpora and recording conditions), and of derived data, such as vocal-tract contours, sagittal distances, area functions, or any other measurements taken from X-ray pictures. Indeed, the notion of a derived data model has proved useful, allowing users to manage raw data and the results of data analysis in the same way.

A0774.pdf



A KOREAN SPEECH CORPUS FOR TRAIN TICKET RESERVATION AID SYSTEM BASED ON SPEECH RECOGNITION

Authors: Woosung Kim and Myoung-Wan Koo

Multimedia Technology Research Laboratory, Korea Telecom, 17 Umyon-dong, Seocho-gu, Seoul, 137-792, Korea. Tel. +82-2-3290-5020, Fax: +82-2-3290-5007, E-mail: {sung,mwkoo}@smm.kotel.co.kr

Volume 4 pages 1723 - 1726

ABSTRACT

This paper describes a Korean speech corpus for a train ticket reservation aid system based on speech recognition. Two sets of speech corpora were collected: one based on human-human (H-H) dialogues and the other on human-computer (H-C) dialogues. A WOZ (Wizard of Oz) experiment was carried out to collect the speech corpus based on H-C spoken dialogue. Data from a total of 298 speakers were collected for the H-C corpus and from a total of 100 speakers for the H-H corpus. Since the basic unit of grammar in Korean is a morpheme, a morpheme-based Korean language model was designed in addition to a word-based language model. Linguistic analysis results show that people respond differently when talking to a computer compared to when talking to a human. Also, language-model analysis results reveal that a morpheme-based language model gives a 50% reduction in perplexity (PP) over a word-based one.
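The perplexity comparison above can be illustrated with a toy model. A minimal sketch of how perplexity is computed over a token sequence, here for a Laplace-smoothed unigram model (the paper's actual models and Korean data are not reproduced; token choice — words vs. morphemes — simply changes what the sequences contain):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of a Laplace-smoothed unigram model on held-out tokens."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total, V = len(train_tokens), len(vocab)
    log2_prob = 0.0
    for tok in test_tokens:
        p = (counts[tok] + 1) / (total + V)  # add-one smoothing
        log2_prob += math.log2(p)
    # Perplexity = 2 ** (average negative log2 probability per token)
    return 2.0 ** (-log2_prob / len(test_tokens))
```

Smaller units such as morphemes yield a smaller, denser vocabulary, which is one reason a morpheme-based model can achieve markedly lower perplexity than a word-based one on the same text.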

A0967.pdf



RECALL MEMORY FOR EARCONS

Authors: Dawn Dutton, Candace Kamm, Susan Boyce

AT&T 2K-330, 101 Crawfords Corner Road, Holmdel, NJ 07733 dldutton@att.com cak@research.att.com sjboyce@att.com

Volume 4 pages 1727 - 1730

ABSTRACT

Our voice-enabled telecommunications service, Annie, is a prototyping system that gives users the ability to access a variety of telephone-based services by voice. The user interface of Annie uses an anthropomorphic "personal assistant" metaphor. The user can maintain a "conversation-like" dialog with Annie, but user input is limited by the grammar-constrained automatic speech recognition (ASR) technology used in the service. Because the grammars change depending on the state the user is in, the system must provide clear recognition feedback and orienting information throughout the dialog. Verbal recognition feedback is tedious and time-consuming for the frequent, expert user. This paper describes an experiment that explores the feasibility of providing non-verbal recognition feedback and orienting information through the use of earcons, or auditory icons. Users of Annie were exposed to five earcons presented in parallel with verbal recognition feedback for a minimum of five days. Subsequently, users were asked to recall the identity of each of the five earcons alone. Subjects were able to reliably recall each of the earcons. Since users could recall the earcons, it is feasible that the non-verbal earcons could replace the lengthier verbal recognition feedback.

A0974.pdf



SEMI-AUTOMATIC PHONETIC LABELLING OF LARGE CORPORA

Authors: O. Mella and D. Fohr

CRIN-CNRS & INRIA Lorraine, Bâtiment LORIA, B.P. 239, F54506 Vandoeuvre-lès-Nancy, France. Tel. +33 3.83.59.20.80, Fax. +33 3.83.41.30.79, E-mail: {mella,fohr}@loria.fr

Volume 4 pages 1731 - 1734

ABSTRACT

The aim of the present paper is to present a methodology for semi-automatically labelling large corpora. This methodology is based on three main points: using several concurrent automatic stochastic labellers, decomposing the labelling of the whole corpus into an iterative refining process, and building a labelling comparison procedure that takes phonologic and acoustic-phonetic rules into account to evaluate the similarity of the various labellings of one sentence. After detailing these three points, we describe our HMM-based labelling tool and the application of this methodology to the Swiss French POLYPHON database.
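A labelling comparison of the kind described is commonly grounded in sequence alignment. A minimal sketch using plain edit distance between two labellers' phone sequences; the paper's actual procedure additionally weights substitutions with phonologic and acoustic-phonetic rules, which this sketch omits:

```python
def label_agreement(seq_a, seq_b):
    """Edit-distance-based similarity (1.0 = identical) between two label sequences."""
    m, n = len(seq_a), len(seq_b)
    # Standard Levenshtein dynamic-programming table
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    return 1.0 - d[m][n] / max(m, n, 1)
```

Sentences on which concurrent labellers agree above a threshold can then be accepted automatically, leaving only the disputed ones for manual refinement in the iterative process.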

A1037.pdf



CORPORA - SPEECH DATABASE FOR POLISH DIPHONES

Authors: S.Grocholewski

Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland. E-mail: grocholew@pozn1v.put.poznan.pl

Volume 4 pages 1735 - 1738

ABSTRACT

This paper presents attempts to create the first speech databases for Polish. Of two such databases, one supported by the Polish National Research Committee and one by COPERNICUS project 1304 ("BABEL: a Multi-Language Database" for Polish, Bulgarian, Estonian, Hungarian and Romanian), the first is presented in detail. The speech material contains 365 utterances (alphabet letters, digits, 200 first names, 114 sentences) uttered by 45 speakers. The paper presents the design ideas, recording conditions, annotation rules, and the method of automatic segmentation and labelling used in CORPORA.

A1053.pdf



Multilingual Speech Interfaces (MSI) and Dialogue Design Environments for Computer Telephony Services

Authors: Christel Müller and Thomas Ziem

Deutsche Telekom Berkom GmbH Research Group Speech Systems and Computer Telephony 10589 Berlin, Goslarer Ufer 35, Germany Tel. +49 30 3497 2310, FAX: +49 30 3497 2967, E-mail: c.mueller@berkom.de

Volume 4 pages 1739 - 1742

ABSTRACT

Today, voice processing systems, especially in the area of telecommunications, are used by a wide range of customers in more and more countries. The most important requirements for people who create applications based on voice processing systems are rapid prototyping, updating, modification and extension of dialogues, as well as standardised components. At the Deutsche Telekom research lab for speech systems and computer telephony, a graphical user interface for designing computer telephony applications was developed as part of a CTI (Computer Telephony Integration) model, on top of standardised interfaces for speech and other CT components in a client-server architecture. This approach to the CTI model ensures independence from different resources, e.g. ASR, TTS and others. The most complicated part of this CTI model was the realisation of the resource management layer for multilingual speech interfaces, because different ASR technologies do not support unified recognition functions and resource parameters.

A1108.pdf



Getting Started with SUSAS: A Speech Under Simulated and Actual Stress Database

Authors: John H.L. Hansen and Sahar E. Bou-Ghazale

Robust Speech Processing Laboratory Duke University, Department of Electrical Engineering Box 90291, Durham, North Carolina 27708-0291, U.S.A. http://www.ee.duke.edu/Research/Speech

Volume 4 pages 1743 - 1746

ABSTRACT

It is well known that the introduction of acoustic background distortion and the variability resulting from environmentally induced stress cause speech recognition algorithms to fail. In this paper, we discuss SUSAS: a speech database collected for analysis and algorithm formulation of speech recognition in noise and stress. The SUSAS database, which stands for Speech Under Simulated and Actual Stress, is intended to be employed in the study of how speech production and recognition vary when speaking under stressed conditions. This paper discusses (i) the formulation of the SUSAS database, (ii) baseline speech recognition using SUSAS data, and (iii) previous research studies that have used the SUSAS database. The motivation for this paper is to familiarize the speech community with SUSAS, which was released in April 1997 on CD-ROM through NATO RSG.10.

A1145.pdf



A MARKUP LANGUAGE FOR TEXT-TO-SPEECH SYNTHESIS

Authors: Richard Sproat, Paul Taylor, Michael Tanenblatt, Amy Isard

(1) Bell Laboratories, Lucent Technologies 700 Mountain Avenue, Room 2d-451 Murray Hill, NJ 07974, USA (2) Centre for Speech Technology Research, University of Edinburgh 80 South Bridge, Edinburgh, U.K

Volume 4 pages 1747 - 1750

ABSTRACT

Text-to-speech synthesizers must process text, and therefore require some knowledge of text structure. While many TTS systems allow for user control by means of ad hoc 'escape sequences', there remains to date no adequate and generally agreed upon system-independent standard for marking up text for the purposes of synthesis. The present paper is a collaborative effort between two speech groups aimed at producing such a standard, in the form of an SGML-based markup language that we call STML — Spoken Text Markup Language. The primary purpose of this paper is not to present STML as a fait accompli, but rather to interest other TTS research groups to collaborate and contribute to the development of this standard.

A1174.pdf



SEVERAL MEASURES FOR SELECTING SUITABLE SPEECH CORPORA

Authors: Shuichi ITAHASHI, Naoko UEDA and Mikio YAMAMOTO

Institute of Information Sciences & Electronics University of Tsukuba 1-1-1 Tennodai, Tsukuba, Ibaraki, 305 JAPAN Fax: +81-298-53-5206, E-mail: itahashi@milab.is.tsukuba.ac.jp

Volume 4 pages 1751 - 1754

ABSTRACT

We make statistical investigations of various speech corpora to extract useful information reflecting the contents of each corpus, so as to create guidelines for selecting the most suitable corpus. Words are not separated by spaces in Japanese text. Accordingly, we adopt n-gram counting methods to extract frequent mora sequences instead of words; a mora roughly corresponds to a syllable. By investigating the frequencies of 1- to 10-mora sequences in six existing corpora, we can distinguish written from spoken language and identify keywords and topics of dialogues. This paper shows that simple statistical investigation makes it possible to represent the contents of a corpus to some extent without a complicated job such as morphological analysis.
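The n-gram counting step described above can be sketched as follows, assuming the text has already been segmented into moras (the mora segmentation itself is outside this sketch):

```python
from collections import Counter

def mora_ngrams(moras, n):
    """Count frequencies of n-mora sequences in an already-segmented mora list."""
    return Counter(tuple(moras[i:i + n]) for i in range(len(moras) - n + 1))

# Illustrative example with placeholder symbols standing in for moras
counts = mora_ngrams(["ka", "i", "sha", "ka", "i"], 2)
frequent = counts.most_common(3)  # most frequent 2-mora sequences
```

Running such counts for n = 1 to 10 and inspecting the most frequent sequences per corpus is what allows keywords and register (written vs. spoken) to surface without morphological analysis.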

A1227.pdf



GREEK SPEECH DATABASE FOR CREATION OF VOICE DRIVEN TELESERVICES

Authors: I. Chatzi(1), N. Fakotakis (2), G. Kokkinakis (2)

(1) KNOWLEDGE S.A., Human-Machine Communication Dept., N.E.O. Patron-Athinon 37, 264 41 Patras, Greece. Tel: +30.61.452.820, Fax: +30.61.453.819, E-mail: echatzi@patra.hol.gr (2) Wire Communications Laboratory (WCL), Electrical & Computer Engineering Dept., University of Patras, 261 10 Patras, Greece. Tel. +30 61 991 ?22, FAX: +30 61 991 855, E-mail: fakotaki@wcl.ee.upatras.gr

Volume 4 pages 1755 - 1758

ABSTRACT

In this paper we present the collection of Greek speech data over the telephone network from 5,000 speakers in order to form a speech database (SpeechDatII.GR). This work is embedded in the Language Engineering Project LE2-4001 SpeechDat, in which all official European languages and some major dialectal variants are represented. The design of the speech database allows the development of word-, phoneme- and syllable-based speech recognizers that can be used for a large variety of real speaker-independent applications. In particular, it will provide a realistic basis for training and assessment of both isolated and continuous speech recognizers for telephone speech, which is a prerequisite for developing voice-driven teleservices.

A1510.pdf
