We introduce the DET Curve as a means of representing performance on detection tasks that involve a tradeoff of error types. We discuss why we prefer it to the traditional ROC Curve and offer several examples of its use in speaker recognition and language recognition. We explain why it is likely to produce approximately linear curves. We also note special points that may be included on these curves, how they are used with multiple targets, and possible further applications.
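As an illustrative sketch only, not the authors' implementation, the points of a DET curve can be computed by sweeping a decision threshold over the detector scores and mapping the miss and false-alarm probabilities through the inverse of the standard normal cumulative distribution (the normal deviate scale). When target and non-target scores are each roughly Gaussian, the resulting points fall close to a straight line. The function name `det_points` and the convention that a score at or above the threshold counts as a detection are assumptions made for this example.

```python
from statistics import NormalDist


def det_points(target_scores, nontarget_scores):
    """Return (probit(P_fa), probit(P_miss)) pairs, one per usable threshold.

    Assumed convention (not from the source): a trial is accepted as a
    target when its score is greater than or equal to the threshold.
    """
    probit = NormalDist().inv_cdf  # inverse CDF of the standard normal
    points = []
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    for threshold in thresholds:
        # Miss: a target trial scored below the threshold.
        p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # The probit is undefined at 0 and 1, so those end points are skipped.
        if 0.0 < p_miss < 1.0 and 0.0 < p_fa < 1.0:
            points.append((probit(p_fa), probit(p_miss)))
    return points


if __name__ == "__main__":
    targets = [0.2, 0.5, 0.8, 1.1, 1.4]
    nontargets = [-0.9, -0.4, 0.1, 0.6, 0.9]
    for x, y in det_points(targets, nontargets):
        print(f"probit(P_fa)={x:+.3f}  probit(P_miss)={y:+.3f}")
```

Plotting the second coordinate against the first gives the DET curve; an ROC curve would instead plot the untransformed detection probability against the false-alarm probability.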
This paper describes a new methodology for the speech quality assessment of hands-free terminals and discusses the results of a pilot study performed in 1996 at the Berlin laboratories for speech quality assessment at the Technology Centre of Deutsche Telekom. Up to now, critical speech quality aspects of hands-free terminals have usually been assessed with conversational tests. With the test method proposed here, much more efficient listening-only tests can be applied to evaluate various speech quality aspects of hands-free terminals. In the pilot study, a series of conversational tests, specific double-talk tests and listening-only experiments were performed. The paper describes the recording environment and equipment, the auditory test methodology and the results of the listening-only experiments.
This paper reports on the use of materials derived from radio and television news broadcasts for research on and testing of large-vocabulary Continuous Speech Recognition (CSR) technology. Tests using these materials were implemented by NIST on behalf of the DARPA-funded speech recognition research community in 1995 and 1996, and are expected to continue for the next several years. Four research groups participated in the 1995 tests, and nine groups (at eight sites) participated in the 1996 tests. This paper documents properties of the training and test materials, describes the detailed annotation and transcription protocol applied to more than 100 hours of recorded data made available through the Linguistic Data Consortium (LDC), and discusses test protocols and results of both the 1995 and 1996 Benchmark Tests.
There are no agreed standards for reporting the performance of spoken dialogue systems. This paper proposes a core set of metrics to be used for this purpose. For this set, operational definitions are supplied, to regularise their application. The intention in proposing this framework is not that it should be exhaustive, nor that it should be perfect, but rather that it should provide a practical starting point, thereby allowing initial system comparison to be achieved quickly and with some measure of confidence.
This paper presents a first set of test results on the generality and transferability of an evaluation tool which can ensure the habitability and usability of spoken dialogues. Building on the assumption that most, if not all, dialogue design errors can be viewed as problems of non-cooperative system behaviour, the tool has two closely related aspects to its use. Firstly, it may be used for the diagnostic evaluation of spoken human-machine dialogue. Secondly, it can be used to guide early dialogue design in order to prevent dialogue design errors from occurring in the implemented system. We describe the development and in-house testing of the tool, and present results of ongoing work on testing its generality and transferability on an external corpus, i.e. an early Wizard of Oz corpus from the development of the Sundial spoken language dialogue system.
The variance of the performance of a continuous speech recognition system subjected to replica utterances of the same sentence spoken by the same speaker has been investigated. In an experiment with three different speech recognition systems in three different languages with two different grammar conditions, it is shown that the sentence word error rate has a variance that can be described in terms of binomial statistics. The distribution of the measured variance shows a remarkable correspondence to the parameter-free theoretical distribution. It is therefore concluded that binomial statistics apply to the word error rate of a continuous speech recognition system.
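A minimal sketch of what the binomial description implies, under the assumption (made here for illustration, not stated in the abstract) that each of the N words in a sentence is an independent trial with the same error probability p: the observed word error rate then has variance p(1 - p)/N. The function name `wer_std_binomial` is hypothetical.

```python
from math import sqrt


def wer_std_binomial(error_rate, n_words):
    """Standard deviation of an observed word error rate under a binomial model.

    Assumes n_words independent word trials, each wrong with probability
    error_rate; this independence is an assumption of the sketch.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    # Variance of a binomial proportion: p * (1 - p) / N.
    return sqrt(error_rate * (1.0 - error_rate) / n_words)


if __name__ == "__main__":
    # A 16-word sentence recognised with a 20% word error rate.
    print(wer_std_binomial(0.2, 16))
```

Because the expression depends only on the error rate and the number of words, the predicted spread contains no fitted parameters, which is what allows a direct comparison with the variance measured over replica utterances.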