ABSTRACT
Blackboards allow various knowledge sources to be triggered in an opportunistic way, but do not allow higher-level modules to feed information back to lower-level modules. The solution presented here remedies this shortcoming: our Sketchboard implements reactive feedback loops. Within the Sketchboard, modules are considered from two points of view: either they build a result (a sketch, possibly rough and vague) or they give back a response to the modules from which they received their input data. This response signals the degree of confidence the module has in its own result. These relations are generalized across all the modules that interact when solving a problem. As higher and higher level modules are triggered, the initial sketch becomes more and more precise, taking the higher modules' knowledge into account. Conceived for natural language processing, the Sketchboard is also useful for spoken language understanding, as a detailed example shows.
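To make the feedback relation concrete, here is a minimal Python sketch of a Sketchboard-style loop between a hypothetical lexical module and a hypothetical syntactic module. The module names, the scalar confidence value, and the feedback dictionary are our illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a Sketchboard-style reactive feedback loop.
# All names and the confidence protocol are hypothetical toy choices.

def lexical_module(feedback=None):
    """Lower-level module: builds a word sketch; feedback from above
    reshapes it on the next pass."""
    candidates = {"ship": 0.5, "sheep": 0.5}   # acoustically ambiguous
    if feedback:
        for word, bias in feedback.items():
            candidates[word] = candidates.get(word, 0.0) + bias
    total = sum(candidates.values())
    return {w: s / total for w, s in candidates.items()}

def syntactic_module(word_sketch):
    """Higher-level module: returns its own result, a confidence value
    for the module below, and a feedback hint."""
    best = max(word_sketch, key=word_sketch.get)
    confidence = word_sketch[best]
    feedback = {"sheep": 0.4}                  # e.g. context favours 'sheep'
    return best, confidence, feedback

sketch = lexical_module()
for _ in range(3):                             # iterate until stable/confident
    word, conf, fb = syntactic_module(sketch)
    if conf > 0.9:
        break
    sketch = lexical_module(feedback=fb)
print(word, round(conf, 2))                    # the sketch sharpens to 'sheep'
```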
ABSTRACT
We present a generic speech technology integration platform for application development and research across different domains. The goal of the design is two-fold. On the application development side, the system provides an intuitive developer's interface defined by a high-level application definition language and a set of convenient speech application building tools, allowing a novice developer to rapidly deploy and modify a spoken language dialogue application. On the system research and development side, the system uses a thin 'broker' layer to separate the system application programming interface from the service provider interface, which makes it easy to incorporate new technologies and new functional components. We also use a domain-independent acoustic model set covering US English phones for general speech applications. The system grammar and lexicon engine creates grammars and lexicon dictionaries on the fly to enable a practically unrestricted vocabulary for many recognition and synthesis applications.
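The broker idea can be pictured with a short Python sketch: applications talk only to the broker, engines implement a service provider interface, and swapping in a new technology means registering a different provider. All class and method names here are assumptions for illustration, not the platform's actual API.

```python
# Hypothetical sketch of a thin "broker" layer decoupling the
# application-facing API from service provider interfaces (SPI).

class RecognizerSPI:
    """Contract every recognition engine must implement."""
    def recognize(self, audio: bytes) -> str:
        raise NotImplementedError

class Broker:
    def __init__(self):
        self._providers = {}

    def register(self, service: str, provider: RecognizerSPI):
        self._providers[service] = provider

    def request(self, service: str, audio: bytes) -> str:
        # Applications never see the engine directly, so a new engine is
        # incorporated by registering a different SPI implementation.
        return self._providers[service].recognize(audio)

class DummyEngine(RecognizerSPI):
    def recognize(self, audio: bytes) -> str:
        return "<transcript>"

broker = Broker()
broker.register("asr", DummyEngine())
print(broker.request("asr", b"\x00\x01"))
```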
ABSTRACT
SPHERIC is a new IC designed specifically for automatic speech recognition applications in consumer electronics with a vocabulary of up to 126 words. It allows real-time recognition of both speaker-dependent and speaker-independent words spoken continuously or in isolation. Keyword spotting and playback of coded messages and user-trained words are additional features. After a short system overview, the hardware architecture and software structure are presented in this paper. The techniques for reducing computation time and the necessary memory size are examined in more detail. Finally, the implemented speech recognition algorithm is described.
ABSTRACT
Speech research is a complex endeavor, as reflected in the numerous tools and specialized languages the modern researcher needs to learn. These tools, while adequate for the purposes they were designed for, are difficult to customize or extend in new directions, even though this is often required. We feel this situation can be improved and propose a new scripting language, MUSE, designed explicitly for speech research, in order to facilitate exploration of new ideas. MUSE is designed to support many modes of research, from interactive speech analysis through compute-intensive speech understanding systems, and has facilities for automating some of the more difficult requirements of speech tools: user interactivity, distributed computation, and caching. In this paper we describe the design of the MUSE language and our current prototype MUSE interpreter.
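As one example of what such facilities might automate, the following Python sketch shows transparent result caching of the kind a speech scripting language could provide; it is our own illustration, not MUSE code or syntax.

```python
import hashlib, os, pickle

# Illustration of automated result caching: expensive analyses (e.g.
# spectrograms over a corpus) are recomputed only when inputs change.
# This decorator is a hypothetical stand-in, not part of MUSE.

def cached(func, cache_dir=".muse_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    def wrapper(*args):
        key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(cache_dir, key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)          # reuse earlier computation
        result = func(*args)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

@cached
def expensive_analysis(utterance_id):
    return sum(range(10**6))                   # stand-in for real DSP work

print(expensive_analysis("utt-001"))           # computed once, then cached
```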
ABSTRACT
This work presents methods of assessing non-native speech to aid computer-assisted pronunciation teaching. These methods are based on automatic speech recognition (ASR) techniques using Hidden Markov Models. Confidence scores at the phoneme level are calculated to provide detailed information about the pronunciation quality of a foreign language student. Experimental results are given based on both artificial data and a database of non-native speech, the latter being recorded specifically for this purpose. The presented results demonstrate the metrics' capability to locate and assess mispronunciations at the phoneme level.
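One common HMM-based formulation of such a confidence score, sketched here in Python under our own assumptions (the paper's exact metric may differ), normalizes the likelihood of the intended phone by the summed likelihoods of all competing phone models and averages the log posterior over the segment's frames:

```python
import numpy as np

# Hedged sketch of a posterior-based phoneme confidence score.

def phone_confidence(frame_likelihoods, target):
    """frame_likelihoods: (T, N) array of p(o_t | phone_i) per frame t.
    target: column index of the phone the student was asked to produce."""
    posteriors = frame_likelihoods[:, target] / frame_likelihoods.sum(axis=1)
    return np.mean(np.log(posteriors))   # near 0 = confident; very negative = suspect

# Toy example: 3 frames, likelihoods under 4 phone models.
lik = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.6, 0.2, 0.1, 0.1],
                [0.2, 0.5, 0.2, 0.1]])   # last frame looks mispronounced
print(phone_confidence(lik, target=0))
```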
ABSTRACT
Speech recognition applications always face the problem of changing vocabulary and functionality. The use of speech recognition systems will become more attractive if the system user is able to define or redefine the task himself in a suitable manner. Modelling a new task normally requires the experience of a human expert and a lot of time. Additionally, the expert always has to be contacted if system changes become necessary. In this paper we present a fully operational system for continuous speech recognition with a powerful user interface. Most of the internal aspects of the speech recognition system are hidden. The task may be divided into different subtasks corresponding to dialogue states. Each subtask is defined by a set of expected user utterances based on sentence templates. This definition is automatically transformed into a lexicon and a language model used by the speech recognition system.
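The transformation can be pictured with a small Python sketch; the template syntax (alternatives in parentheses, slot names with a leading $) is invented here for illustration and is not the system's actual definition language:

```python
import itertools

# Illustrative expansion of user-defined sentence templates into the
# expected utterance set and a lexicon for the recognizer.

templates = ["(turn|switch) the $device (on|off)"]
slots = {"$device": ["light", "heater"]}

def expand(template):
    parts = []
    for tok in template.split():
        if tok.startswith("(") and tok.endswith(")"):
            parts.append(tok[1:-1].split("|"))   # alternatives
        elif tok in slots:
            parts.append(slots[tok])             # slot fillers
        else:
            parts.append([tok])                  # literal word
    return [" ".join(p) for p in itertools.product(*parts)]

sentences = [s for t in templates for s in expand(t)]
lexicon = sorted({w for s in sentences for w in s.split()})
print(sentences)   # every utterance the subtask expects
print(lexicon)     # vocabulary handed to the recognizer
```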
ABSTRACT
Speech speed is measured and displayed with our specific algorithm, TEMAX (Temporal Evaluation and Measurement Algorithm by KS). The TEMAX-gram, a sonagraphic display of the speech envelope obtained by DFT with a 1-second window, is convenient for setting off isosyllabic characteristics. For Japanese, it traces two dark bars, called rhythmic formants RF1 and RF2: the first around 8 Hz, and the second at about half that frequency. RF1 corresponds to speech rate; RF2 represents the bimoraic rhythmic foot. As for English, its isochronic characteristics are observable as RF1 with a 2-second window. Furthermore, with a 1-second window, the periodicity of syllables between stresses is displayed as RF2.
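The following numpy sketch shows the TEMAX-gram idea as we read it: a sliding DFT over the amplitude envelope with a 1-second window, so that syllable-rate periodicities appear as low-frequency peaks. The window type, hop size, and toy signal are our assumptions, not the authors' exact settings.

```python
import numpy as np

# Rough sketch of an envelope spectrogram: rhythmic periodicities in the
# speech envelope show up as low-frequency "rhythmic formants".

def envelope_spectrogram(signal, fs, win_sec=1.0, hop_sec=0.1):
    envelope = np.abs(signal)                  # crude amplitude envelope
    win, hop = int(win_sec * fs), int(hop_sec * fs)
    frames = [envelope[i:i + win] * np.hanning(win)
              for i in range(0, len(envelope) - win, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(win, 1.0 / fs)     # RF1 expected near 8 Hz
    return freqs, spec

fs = 1000                                      # envelope sampled at 1 kHz
t = np.arange(0, 5, 1.0 / fs)
toy = (1 + np.sin(2 * np.pi * 8 * t)) * np.random.randn(t.size)  # 8 Hz "syllables"
freqs, spec = envelope_spectrogram(toy, fs)
print(freqs[np.argmax(spec[0][1:]) + 1])       # should be close to 8 Hz
```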
ABSTRACT
The aim of the work described in this paper is to develop methods for automatically assessing the pronunciation quality of specific phone segments uttered by students learning a foreign language. From the phonetic time alignments generated by SRI's Decipher^TM HMM-based speech recognition system, we use various probabilistic models to produce pronunciation scores for the phone utterance. We evaluate the performance of the proposed algorithms by measuring how well the machine-produced scores correlate with human judgments on a large database. Of the various algorithms considered, the one based on phone log-posterior probability produced the highest correlation (r = 0.72) with the human ratings, which was comparable to the correlations between human raters.
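The evaluation step itself is straightforward; here is a toy Python example of correlating machine scores with human ratings. The numbers below are fabricated purely for illustration.

```python
import numpy as np

# Pearson correlation between machine pronunciation scores and human
# ratings; the vectors here are invented toy values.

machine = np.array([2.1, 3.4, 4.0, 1.5, 3.8])   # e.g. rescaled phone log-posteriors
human   = np.array([2.0, 3.0, 4.5, 1.0, 4.0])   # expert ratings, same speakers

r = np.corrcoef(machine, human)[0, 1]
print(f"r = {r:.2f}")   # the paper reports r = 0.72 on real data
```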
ABSTRACT
This work is part of a project aimed at developing a speech recognition system for language instruction that can assess the quality of pronunciation, identify pronunciation problems, and provide the student with accurate feedback about specific mistakes. Previous work was mainly concerned with scoring the quality of pronunciation. In this work we focus on automatic detection of mispronunciation. While scoring quantifies the mispronunciation, detection identifies the occurrence of a specific problem. Detecting pronunciation problems is necessary for providing feedback to the student. We use pronunciation scoring techniques to evaluate the performance of our mispronunciation model.
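The scoring-versus-detection distinction can be shown in a few lines of Python: a continuous phone-level score is thresholded to flag a specific problem. The scores and threshold below are invented for illustration, not the project's measured values.

```python
# Toy illustration: the same phone-level score used for *scoring* can be
# thresholded to *detect* a specific mispronunciation.

def detect_mispronunciation(phone_scores, threshold=-1.0):
    """Return the phones whose confidence falls below the threshold."""
    return [phone for phone, score in phone_scores if score < threshold]

scores = [("r", -1.8), ("ae", -0.3), ("t", -0.2)]   # scores for the phones of "rat"
print(detect_mispronunciation(scores))              # -> ['r']
```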
ABSTRACT
In the present paper, a methodology to create Visual Representations of Speech for Speech Perception Enhancement Applications, based on the use of a Continuous Formant-Tracking Algorithm, is presented. The specific mathematical and computational issues introduced for such treatment are given, and a specific case of Computer-Aided Language Learning oriented to the phonetic specificities of English for Spanish speakers is also presented. This technique may also be used to statistically normalize speech data for speech recognition systems. In this context, an example of a noise-robust speech recognizer, which uses formant dynamic information, is shown.
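As background, here is a rough single-frame LPC formant estimate in Python, a common basis for formant tracking; the paper's continuous tracking algorithm adds inter-frame continuity handling that this sketch omits, and all parameter values are our assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Single-frame LPC formant estimate (autocorrelation method); a real
# tracker smooths these estimates across frames.

def lpc_formants(frame, fs, order=6):
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # LPC coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]               # one root per resonance
    return np.sort(np.angle(roots) * fs / (2 * np.pi))

fs = 8000
t = np.arange(400) / fs
vowel = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
         + 0.01 * np.random.randn(t.size))          # toy "formants" + noise
print(np.round(lpc_formants(vowel, fs)))            # peaks near 700 and 1200 Hz
```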
ABSTRACT
We developed a CALL (computer-aided language learning) system for teaching the pronunciation of Japanese long vowels, the mora nasal and mora obstruents to non-native speakers of Japanese. Long vowels and short vowels are spectrally almost identical but their phone durations differ significantly. Similar conditions exist between mora nasals and non-mora nasals, and between mora and non-mora obstruents. Our system uses speech recognition to measure the durations of each phone and compares them with distributions of native speakers while correcting for different speech rates. Results show that learners quickly capture the relevant duration cues. The amount of learning time spent on acquiring these durational skills is well within the time constraints of TJSL (teaching Japanese as a second language) curricula.
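A hedged sketch of the duration comparison: the learner's phone duration is normalized by a crude speech-rate estimate and compared against a native distribution. The statistics and normalization scheme below are illustrative assumptions, not the system's measured values.

```python
# Toy duration scoring for Japanese long vs. short vowels, with a simple
# speech-rate correction. All numbers are invented for illustration.

def duration_zscore(phone_dur, utterance_dur, n_morae, native_mean, native_sd):
    rate = utterance_dur / n_morae        # seconds per mora: crude rate estimate
    normalized = phone_dur / rate         # duration expressed in "mora units"
    return (normalized - native_mean) / native_sd

# e.g. a long vowel lasting 0.21 s inside a 1.4 s, 7-mora utterance;
# assume natives realize it at 1.8 +/- 0.3 mora units.
z = duration_zscore(0.21, 1.4, 7, native_mean=1.8, native_sd=0.3)
print(f"z = {z:.2f}")   # a large |z| would flag the vowel as too short/long
```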
ABSTRACT
This article focuses on educational aspects of speech processing science. A set of tools that have been developed with the aim of presenting, visualizing and explaining basic topics of speech recognition is described. The set consists of programs, such as a signal analysis unit, a dynamic time warping (DTW) algorithm explorer and hidden Markov model (HMM) investigation tools, that are integrated into a single environment and allow for easy and highly illustrative learning through experiments with real speech data.
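For instance, the DTW explorer presumably centers on the textbook recurrence; a minimal Python version of that algorithm (our own code, not the toolkit's) looks like this:

```python
import numpy as np

# Minimal textbook dynamic time warping between two 1-D feature sequences.

def dtw(x, y):
    """Return the DTW distance between sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

print(dtw([1, 2, 3, 3, 2], [1, 2, 2, 3, 2]))   # small distance: similar shapes
```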
ABSTRACT
This paper presents a bit-rate conversion system for efficient communication between two CVSD systems with different bit-rates. To ensure robustness to external noise, the presented system is implemented in the digital domain using a general-purpose digital signal processor (DSP). In order to overcome the problems caused by the different bit-rates and time-constants, several methods are considered in this study. In addition, a significant simplification of the system complexity is obtained by introducing an IIR filter into the decimation/interpolation process. The use of the IIR filter provides computational advantages over a conversion system employing FIR filters, because linear phase is not a critical issue in this application. By modifying the algorithm based on the IIR filter, a 3-channel full-duplex conversion algorithm was successfully implemented on a single DSP. Experimental results are presented to exhibit the consistent and reliable performance of the bit-rate conversion system.
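The design choice can be sketched with scipy: because the voice path does not require linear phase, a compact elliptic IIR low-pass can serve as the anti-aliasing filter before decimation. The filter order, ripple figures, and rates below are assumptions for illustration.

```python
import numpy as np
from scipy import signal

# Sketch of IIR-based rate conversion: a short elliptic IIR replaces a
# much longer linear-phase FIR as the anti-aliasing filter.

fs_in, fs_out = 32000, 16000                      # example bit-rates (1 bit/sample)
x = np.random.randn(fs_in // 10)                  # 100 ms of toy input

# 8th-order elliptic low-pass at the new Nyquist frequency.
sos = signal.ellip(8, 0.5, 60, fs_out / 2, fs=fs_in, output='sos')
y = signal.sosfilt(sos, x)[::2]                   # filter, then decimate by 2

print(len(x), len(y))                             # 3200 -> 1600 samples
```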
ABSTRACT
The Prosodic Module of SLIM has been created to solve problems related to segmental and suprasegmental features of spoken English in a courseware for computer-assisted foreign language learning called SLIM (an acronym for Multimedia Interactive Linguistic Software), developed at the University of Venice. It is composed of two different sets of Learning Activities, the first dealing with phonetic and prosodic problems at the word segmental level, the second dealing with prosodic problems at the utterance suprasegmental level. The main goal of the Prosodic Activities is to ensure feedback to the student intending to improve his/her pronunciation in a foreign language. The programme works by comparing two signals, the master's and the student's, where the master signal has been previously edited by a human tutor, who inserts orthographic syllabic information at segmentation marks automatically computed by the underlying acoustic segmenter called Prosodics (see 1). When a student, after listening to and evaluating the master signal, tries to mimic the original utterance or word, the system assigns a score and, if needed, spots a mistake and indicates what it consists of. The elements of comparison are constituted by the acoustic correlates of prosodic features such as intonational contour, sentence accent and word stress, rhythm, and duration at word and sentence level.
ABSTRACT
We consider speech dialogues allowing for simultaneous input (via speech recognition) and output (via speech synthesis or pre-recorded prompts), often referred to as "barge-in". We start with a collection of dialogue situations where simultaneous input and output is useful. It is argued that a variety of possible system behaviours is necessary in order to handle these situations adequately. We then define a formalism that allows this system behaviour to be controlled. We conclude by reporting some experience gathered both in lab tests and in a real-world pilot.
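A minimal Python sketch of what such a control formalism might express, with invented attribute names (the paper defines its own formalism): each prompt declares whether, and from when, user speech may interrupt it.

```python
# Hypothetical barge-in control: per-prompt interruptibility rules.

PROMPTS = [
    {"text": "Welcome to the service.",          "interruptible": False},
    {"text": "Please say the destination city.", "interruptible": True,
     "after_ms": 500},   # ignore echo/noise in the first half second
]

def on_speech_detected(prompt, elapsed_ms, stop_playback):
    if prompt["interruptible"] and elapsed_ms >= prompt.get("after_ms", 0):
        stop_playback()          # classic barge-in: output yields to input
        return "listen"
    return "keep_playing"        # e.g. legal notices must be heard in full

print(on_speech_detected(PROMPTS[1], 800, stop_playback=lambda: None))
```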
ABSTRACT
This paper presents a new interactive speech processing environment designed for Microsoft Windows platforms. We show how an integrated speech processing environment was built following the Windows Interface Design Guidelines. The environment integrates many traditional time and frequency domain analysis algorithms as well as basic functions like recording, listening and labeling. Choosing the Component Object Model (COM) as the architectural framework assures high maintainability, scripting capability and further expandability of this environment. Extensive use of the system in the laboratory has shown how this interactive environment improves users' performance in their everyday speech processing tasks.
ABSTRACT
Subarashii is a system that uses automatic speech recognition (ASR) to offer first-level, computer-based exercises in the Japanese language for beginning high school students. Building the Subarashii system has identified strengths and limitations of ASR technology and has led to some novel methods in the development of materials for computer-based interactive spoken language education.
ABSTRACT
At Digital Equipment Corporation's Cambridge Research Lab (CRL), the Speech Interaction Group has been focusing on building speech applications for deployment over the World-Wide Web. Web-based speech applications require the browser to capture and transmit speech to remote servers for back-end processing, maintain application state, and present multi-media responses. This paper describes the group's strategy for delivering speech applications built around a mechanism, the digital Voice Plugin, for capturing and transmitting audio from a browser. It describes a conversational application implemented within this framework and discusses the problems of delivering these systems on the Web. In addition, we briefly touch upon some other Web-based speech applications that have been developed at CRL.
ABSTRACT
The CSLU shell (CSLUsh) is a collection of modular building blocks that aim to provide the user with a powerful, extendible research, development and implementation environment. Implemented in C with standardized Tcl/Tk interfaces to provide a scripting and visualization environment, it allows a flexible cast for both research algorithms and system deployment. This shell is the architecture on which the CSLU Toolkit is built and may be downloaded for non-commercial use from http://www.cse.ogi.edu/CSLU/toolkit.
ABSTRACT
The efficiency of the development of CTS/TTS systems is influenced by the features and services of the software development tools used in the development process. A development system should be highly flexible, informative and user-friendly to fulfil all or almost all the requirements the researcher could have. In this paper we present a development system, MVoxDev, that can provide an informative and flexible environment for the development of multilingual CTS/TTS systems. The development system helps the researcher inspect and modify all the constituent parts of the CTS/TTS system as a client of the developed CTS/TTS system.
ABSTRACT
A new block recursive algorithm is introduced for effective FAMlet transform implementation. When the Fourier transform is combined with the algorithm, a nonuniform-resolution filterbank is created. The algorithm allows frequency resolutions of any type to be approximated, including the ERB-rate scale. The signals can be critically downsampled on a vector basis, which allows perfect reconstruction.
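For reference, the ERB-rate scale mentioned above has a standard closed form (Glasberg & Moore); the short Python snippet below computes ERB-spaced center frequencies. The block recursive FAMlet algorithm itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

# The standard Glasberg & Moore ERB-rate mapping and its inverse, used
# here only to generate ERB-spaced filterbank center frequencies.

def erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate (Cam) scale."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

# 24 center frequencies spaced uniformly on the ERB-rate scale, 100 Hz-8 kHz:
e = np.linspace(erb_rate(100), erb_rate(8000), 24)
centers = (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
print(np.round(centers[:5]))
```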
ABSTRACT
Acoustic realization of word accent differs among languages. While in Japanese it is fully represented by the F0 contour of a word, English word accent is characterized by power, duration, F0, vowel quality and so forth. In addition to the difference in syllable structure between the two languages, the difference in word accent makes it even more difficult for Japanese students to master correct pronunciation of English words. This indicates that the development of an automatic evaluation method for English word accent, as one of a set of English teaching tools, will be helpful especially to Japanese students. In this paper, as a first step towards this development, a method for detecting accent in English words spoken by Japanese speakers is proposed, where syllable-size HMMs are built using positional information about the syllables, and adequately detected syllable boundaries are used for the detection. Results of accent detection experiments show detection rates of 90% for Japanese students and 93% for native speakers.
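A toy version of the detection rule: score each syllable under accented and unaccented syllable models and flag the syllable where the accented model wins by the largest margin. The log scores below are invented; the real system uses syllable-size HMMs with positional information.

```python
import numpy as np

# Toy accent detection from per-syllable model scores (invented values).

log_acc   = np.array([-12.0, -9.5, -11.0])   # log P(syllable | accented model)
log_unacc = np.array([-10.0, -11.5, -10.5])  # log P(syllable | unaccented model)

detected = int(np.argmax(log_acc - log_unacc))
print(f"accent detected on syllable {detected}")   # -> syllable 1
```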
ABSTRACT
This paper describes an English conversation and pronunciation CAI system using speech recognition techniques. The system is intended to recognize a user's utterances and to respond properly according to the recognition results. In the case of a learner with unskilled pronunciation, differences in the phonemic system between the mother tongue and the second language prevent the speech recognition system from running normally, so the system was improved to cope with such learners. After this improvement, evaluation experiments were conducted. The results indicate that learners' ability in speaking and in listening to English is improved by using the system.
ABSTRACT
Currently, there are few opportunities for people to learn about and experiment with the latest spoken language technology. Furthermore, most research and development activities are restricted to a handful of academic and industrial labs. In order to make the technology less exclusive, it must become more accessible to the general population. This is now feasible with the development of the CSLU Toolkit, which combines easy-to-use authoring tools with state-of-the-art human language technology. In this paper, we focus on the educational role of the toolkit and describe how it is being used in several local schools.
ABSTRACT
The aim of the research reported on here is to develop a system for automatic assessment of foreign speakers' pronunciation of Dutch. In this paper similar studies carried out for English are first examined. Subsequently, suggestions are made for partly improving the methodology that is usually adopted in research on automatic pronunciation assessment. Finally, an experiment is presented in which automatic scores of telephone speech produced by native and nonnative speakers are compared with scores assigned by human raters. The approach used in this experiment is compared with those of previous studies.
ABSTRACT
Very low power electromagnetic (EM) wave sensors are being used to measure speech articulator motions such as vocal fold oscillations and movements of the jaw, tongue, and soft palate. Data on vocal fold motions that correlate well with established laboratory techniques, as well as data on the jaw, tongue and soft palate, are shown. The vocal fold measurements, together with a volume air flow model, are being used to perform pitch-synchronous estimates of the voiced transfer function using ARMA techniques.
ABSTRACT
The formants of speech sounds are usually attributed to resonances of the vocal tract. Formant frequencies are usually estimated by inspection of spectrograms or by automated techniques such as linear prediction. In this paper we measure the frequencies of the first two resonances of the vocal tract directly, in real time, using acoustic impedance spectrometry. The vocal tract is excited by a carefully calibrated, broadband acoustic current signal applied outside the lips while the subject is speaking. The sound pressure response is analysed to give the resonant frequencies. We compare this new method (Real-time Acoustic Vocal tract Excitation, or RAVE) with linear prediction and report the vocal tract resonances for eleven vowels of Australian English. We also report preliminary results of using feedback from vocal tract excitation as a speech trainer, and its effect on improving the pronunciation of foreign vowel sounds by monolingual anglophones.