Session Th1C Assessment Methods

Chairperson: John Makhoul, BBN Systems and Technologies, USA


THE DET CURVE IN ASSESSMENT OF DETECTION TASK PERFORMANCE

Authors: A. Martin*, G. Doddington#, T. Kamm+, M. Ordowski+, M. Przybocki*

*National Institute of Standards and Technology, Bldg. 225-Rm. A216, Gaithersburg, MD 20899, USA; #SRI International/Department of Defense, 1566 Forest Villa Lane, McLean, VA 22101, USA; +Department of Defense, Ft. Meade, MD 20755, USA

Volume 4 pages 1895 - 1898

ABSTRACT

We introduce the DET Curve as a means of representing performance on detection tasks that involve a tradeoff of error types. We discuss why we prefer it to the traditional ROC Curve and offer several examples of its use in speaker recognition and language recognition. We explain why it is likely to produce approximately linear curves. We also note special points that may be included on these curves, how they are used with multiple targets, and possible further applications.
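As an illustration of the kind of plot the abstract describes, here is a minimal sketch of computing DET curve points from detection scores. It assumes the usual construction in which miss rate is plotted against false alarm rate on normal deviate (probit) axes; the function names `det_points`, `probit`, and `det_curve` are hypothetical and not taken from the paper.

```python
from statistics import NormalDist


def det_points(target_scores, nontarget_scores):
    """Return (false_alarm_rate, miss_rate) pairs, one per candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        # Assumption: a trial is accepted when its score is >= the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((fa, miss))
    return points


def probit(p, eps=1e-6):
    """Normal deviate of p, clipped away from 0 and 1 where it is infinite."""
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


def det_curve(target_scores, nontarget_scores):
    """Map each (false alarm, miss) pair onto normal deviate axes."""
    return [(probit(fa), probit(miss))
            for fa, miss in det_points(target_scores, nontarget_scores)]
```

If target and non-target scores are each normally distributed, both transformed coordinates are linear in the threshold, which is one way to see why such curves tend to look approximately straight.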

A0086.pdf


Speech Quality Evaluation of Hands-Free Terminals

Authors: H. Klaus, E. Diedrich, A. Dehnel, J. Berger

Deutsche Telekom Berkom GmbH Goslarer Ufer 35, 10589 Berlin, Germany Tel. (+49 30) 34 97-23 82, Fax: (+49 30) 34 97-29 62, E-mail: H.Klaus@Berkom.De

Volume 4 pages 1899 - 1902

ABSTRACT

This paper describes a new methodology for the speech quality assessment of hands-free terminals and discusses the results of a pilot study performed in 1996 at the Berlin laboratories for speech quality assessment at the Technology Centre of Deutsche Telekom. Until now, critical speech quality aspects of hands-free terminals have usually been assessed with conversational tests. With the test method proposed here, much more efficient listening-only tests can be applied to evaluate various speech quality aspects of hands-free terminals. In the pilot study, a series of conversational tests, specific double-talk tests and listening-only experiments were performed. The paper describes the recording environment and equipment, the auditory test methodology and the results of the listening-only experiments.

A0282.pdf


USE OF BROADCAST NEWS MATERIALS FOR SPEECH RECOGNITION BENCHMARK TESTS

Authors: David S. Pallett, Jonathan G. Fiscus, William M. Fisher, and John S. Garofolo

Spoken Natural Language Group, Information Technology Laboratory Room A 216 Technology Building National Institute of Standards and Technology (NIST) Gaithersburg, MD 20899 E-mail: david.pallett@nist.gov

Volume 4 pages 1903 - 1906

ABSTRACT

This paper reports on the use of materials derived from radio and television news broadcasts for research and testing purposes for large vocabulary Continuous Speech Recognition (CSR) technology. Tests using these materials have been implemented by NIST on behalf of the DARPA-funded speech recognition research community in 1995 and 1996, and are expected to continue for the next several years. Four research groups participated in the 1995 tests, and nine groups (at eight sites) participated in the 1996 tests. This paper documents properties of the training and test materials, describes a detailed annotation and transcription protocol that has been used for more than 100 hours of recorded data that has been made available through the Linguistic Data Consortium (LDC), and discusses test protocols and results of both the 1995 and 1996 Benchmark Tests.

A0922.pdf


SPOKEN DIALOGUE SYSTEM EVALUATION: A FIRST FRAMEWORK FOR REPORTING RESULTS

Authors: Norman M. Fraser

Department of Linguistic and International Studies, University of Surrey, Guildford, Surrey GU2 5XH, United Kingdom E-mail: n.fraser@surrey.ac.uk

Volume 4 pages 1907 - 1910

ABSTRACT

There are no agreed standards for reporting the performance of spoken dialogue systems. This paper proposes a core set of metrics to be used for this purpose. For this set, operational definitions are supplied, to regularise their application. The intention in proposing this framework is not that it should be exhaustive, nor that it should be perfect, but rather that it should provide a practical starting point, thereby allowing initial system comparison to be achieved quickly and with some measure of confidence.

A1047.pdf


GENERALITY AND TRANSFERABILITY. TWO ISSUES IN PUTTING A DIALOGUE EVALUATION TOOL INTO PRACTICAL USE

Authors: Niels Ole Bernsen, Hans Dybkjaer, Laila Dybkjaer and Vytautas Zinkevicius

The Maersk Mc-Kinney Moller Institute for Production Technology Odense University, Campusvej 55, 5230 Odense M, Denmark emails: nob@mip.ou.dk, dybkjaer@mip.ou.dk, laila@mip.ou.dk, vytasz@ktl.mii.lt phone: (+45) 65 57 35 44 fax: (+45) 66 15 76 97

Volume 4 pages 1911 - 1914

ABSTRACT

This paper presents a first set of test results on the generality and transferability of an evaluation tool which can ensure the habitability and usability of spoken dialogues. Building on the assumption that most, if not all, dialogue design errors can be viewed as problems of non-cooperative system behaviour, the tool has two closely related aspects to its use. Firstly, it may be used for the diagnostic evaluation of spoken human-machine dialogue. Secondly, it can be used to guide early dialogue design in order to prevent dialogue design errors from occurring in the implemented system. We describe the development and in-house testing of the tool, and present results of ongoing work on testing its generality and transferability on an external corpus, i.e. an early Wizard of Oz corpus from the development of the Sundial spoken language dialogue system.

A1157.pdf


WITHIN-SPEAKER VARIABILITY OF THE WORD ERROR RATE FOR A CONTINUOUS SPEECH RECOGNITION SYSTEM

Authors: David A. van Leeuwen and Herman J. M. Steeneken

Electronic mail: {vanLeeuwen; Steeneken}@tm.tno.nl TNO Human Factors Research Institute, Postbus 23, 3769 ZG Soesterberg, The Netherlands.

Volume 4 pages 1915 - 1918

ABSTRACT

The variance of the performance of a continuous speech recognition system subjected to replica utterances of the same sentence spoken by the same speaker has been investigated. In an experiment with three different speech recognition systems in three different languages with two different grammar conditions, it is shown that the sentence word error rate has a variance that can be described in terms of binomial statistics. The distribution of the measured variance shows a remarkable correspondence to the parameter-free theoretical distribution. It is therefore concluded that binomial statistics apply to the word error rate of a continuous speech recognition system.
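To make the binomial claim concrete, the following is a minimal sketch of the standard deviation that a binomial model predicts for a measured word error rate. It assumes each of `n_words` words is an independent trial with the same error probability `p`, and it ignores insertion errors; the function name `wer_binomial_std` is hypothetical and the abstract itself gives no formula.

```python
import math


def wer_binomial_std(p, n_words):
    """Standard deviation of the word error rate under a binomial model.

    Assumption: the number of word errors follows Binomial(n_words, p), so
    the error rate (errors / n_words) has variance p * (1 - p) / n_words.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return math.sqrt(p * (1.0 - p) / n_words)
```

For example, under these assumptions a true error rate of 10% measured over 100 words has a predicted standard deviation of about 3 percentage points.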

A1218.pdf
