Authors:
Matti Karjalainen, Helsinki University of Technology (Finland)
Toomas Altosaar, Helsinki University of Technology (Finland)
Miikka Huttunen, Helsinki University of Technology (Finland)
Paper number 885
Abstract:
An automated speech signal labeling tool, developed for the QuickSig
speech database environment, is described. It is based primarily on
the use of neural networks as diphone event detectors. For robustness,
only coarse categories of diphones, such as stop-vowel and vowel-nasal,
are used. 64 such detectors are implemented to cover all of the Finnish
diphones. The preprocessing of speech signals is carried out using
warped linear prediction and the diphone events from neural network
outputs are matched to the given text transcription using a simple
rule-based parser. For isolated-word labeling of single-speaker signals,
a well-trained system makes about 1-2% coarse labeling errors, and the
deviation of boundary positions from careful manual labeling averages
about 10 ms. Generalization to other speakers appears promising.
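As a rough illustration of how such detector events might be matched against a transcription, the Python sketch below greedily aligns time-ordered coarse-diphone detections with the expected coarse-diphone sequence. The data format, threshold, and function names are assumptions for illustration, not the authors' implementation.

# Illustrative sketch (not the authors' parser): greedily match coarse-diphone
# detector events, sorted by time, against the expected coarse-diphone sequence
# derived from the text transcription.
def match_events_to_transcription(events, expected):
    """events: list of (time_sec, coarse_class, score) from the detectors.
    expected: list of coarse diphone classes, e.g. ["stop-vowel", "vowel-nasal"].
    Returns a list of (coarse_class, time_sec or None) label anchors."""
    events = sorted(events, key=lambda e: e[0])
    labels, i = [], 0
    for target in expected:
        hit = None
        while i < len(events):
            t, cls, score = events[i]
            i += 1
            if cls == target and score > 0.5:   # detection threshold (assumed)
                hit = t
                break
        labels.append((target, hit))            # None marks a missed event
    return labels

if __name__ == "__main__":
    detections = [(0.12, "stop-vowel", 0.9), (0.31, "vowel-nasal", 0.8)]
    print(match_events_to_transcription(detections, ["stop-vowel", "vowel-nasal"]))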
Authors:
Harry Bratt, SRI International (USA)
Leonardo Neumeyer, SRI International (USA)
Elizabeth Shriberg, SRI International (USA)
Horacio Franco, SRI International (USA)
Paper number 926
Abstract:
We describe the methodologies for collecting and annotating a Latin-American
Spanish speech database. The database includes recordings by native
and nonnative speakers. The nonnative recordings are annotated with
ratings of pronunciation quality and detailed phonetic transcriptions.
We use the annotated database to investigate rater reliability, the
effect of each phone on overall perceived nonnativeness, and the frequency
of specific pronunciation errors.
Authors:
Neeraj Deshmukh, Institute for Signal and Information Processing, Mississippi State University (USA)
Aravind Ganapathiraju, Institute for Signal and Information Processing, Mississippi State University (USA)
Andi Gleeson, Institute for Signal and Information Processing, Mississippi State University (USA)
Jonathan Hamaker, Institute for Signal and Information Processing, Mississippi State University (USA)
Joseph Picone, Institute for Signal and Information Processing, Mississippi State University (USA)
Paper number 685
Abstract:
The SWITCHBOARD (SWB) corpus is one of the most important benchmarks
for large vocabulary conversational speech recognition (LVCSR) tasks.
The high error rates on SWB are largely attributable to an
acoustic model mismatch, the high frequency of poorly articulated monosyllabic
words, and large variations in pronunciations. It is imperative to
improve the quality of segmentations and transcriptions of the training
data to achieve better acoustic modeling. By adapting existing acoustic
models to only a small subset of such improved transcriptions, we have
achieved a 2% absolute improvement in performance.
Authors:
Demetrio Aiello, Fondazione Ugo Bordoni (Italy)
Cristina Delogu, Fondazione Ugo Bordoni (Italy)
Renato De Mori, Université d'Avignon et des Pays de Vaucluse (France)
Andrea Di Carlo, Fondazione Ugo Bordoni (Italy)
Marina Nisi, Fondazione Ugo Bordoni (Italy)
Silvia Tummeacciu, Fondazione Ugo Bordoni (Italy)
Paper number 499
Abstract:
The paper describes a system, written in Java, for generating textual
and visual scenarios used to collect speech corpora in the framework of
a Tourism Information System. Methods and experimental results are also presented
for evaluating the degree of understanding of the proposed scenarios.
The corpus generated from visual scenarios appears to be much richer
than the one generated from textual descriptions.
Authors:
Mauro Cettolo, ITC-IRST (Italy)
Daniele Falavigna, ITC-IRST (Italy)
Paper number 333
Abstract:
In spoken dialogue systems, the minimal unit of analysis does not necessarily
correspond to a full sentence. A possible approach for language processing
is to split the sentence into a sequence of units that can be
successively processed by linguistic modules. The goal of the Semantic
Boundary (SB) detector is to locate boundaries inside a sentence in
order to obtain such minimal units. Useful information for SB detection
can be extracted both from the acoustic signal of the utterance and
from its corresponding word sequence. In this paper, techniques for semantic
boundary prediction based on both acoustic and lexical knowledge are
presented, together with a method for combining the two knowledge sources.
Finally, performance on a corpus of hundreds of person-to-person dialogues
is reported; the best result gives 62.8% recall and 71.8% precision.
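A minimal sketch of one way the two knowledge sources could be combined, assuming each yields a boundary probability for every word gap; the linear interpolation, weight, and threshold are illustrative assumptions, not the paper's exact method.

# Illustrative sketch (assumed formulation): combine an acoustic boundary score
# (e.g. from pause/prosody features) with a lexical score (e.g. from an n-gram
# boundary model) by linear interpolation, then threshold.
def combine_scores(acoustic, lexical, weight=0.5, threshold=0.5):
    """acoustic, lexical: per word-gap boundary probabilities in [0, 1]."""
    return [weight * a + (1.0 - weight) * l >= threshold
            for a, l in zip(acoustic, lexical)]

def recall_precision(predicted, reference):
    tp = sum(p and r for p, r in zip(predicted, reference))
    return tp / max(1, sum(reference)), tp / max(1, sum(predicted))

if __name__ == "__main__":
    pred = combine_scores([0.9, 0.2, 0.7], [0.8, 0.1, 0.4])
    print(pred, recall_precision(pred, [True, False, True]))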
Authors:
Iman Gholampour, Electrical Engineering Department, Sharif University of Technology (Iran)
Kambiz Nayebi, Electrical Engineering Department, Sharif University of Technology (Iran)
Paper number 182
Abstract:
In this paper, a new method for the automatic segmentation of continuous
speech into phone-like units is presented. Our method is based on a
very fast presegmentation algorithm that uses a new statistical model
of speech and a search in a multilevel structure, called a dendrogram,
to decrease the insertion rate. The performance of the algorithms has
been tested on a large set of TIMIT sentences. According to these tests,
our final segmentation algorithm detects nearly 97% of segments with
an average boundary position error of less than 7 ms and an average
insertion rate of less than 12.7%. In addition to its acceptable precision,
our overall segmentation scheme has a very low computational cost and
can be implemented in real time on an average Pentium PC. The major
advantage of the presented algorithms is that they require no training
or threshold estimation. Details of the proposed algorithms and their
performance results are included in the paper.
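The sketch below illustrates the general idea of a dendrogram built over adjacent speech frames: the closest adjacent pair is merged first, and cutting the merge sequence at a chosen level yields phone-like segments. This is a generic formulation for illustration; the paper's statistical model and search are not reproduced here.

# Illustrative sketch of a dendrogram over adjacent speech frames (assumed,
# generic formulation). Adjacent segments with the smallest mean-feature
# distance are merged first; stopping at a chosen level yields segments.
import numpy as np

def dendrogram_segments(features, n_segments):
    """features: (n_frames, n_dims) array; returns frame indices of segment starts."""
    segs = [[i] for i in range(len(features))]          # one segment per frame
    means = [features[i].astype(float) for i in range(len(features))]
    while len(segs) > n_segments:
        dists = [np.linalg.norm(means[i] - means[i + 1]) for i in range(len(segs) - 1)]
        j = int(np.argmin(dists))                       # closest adjacent pair
        segs[j] = segs[j] + segs.pop(j + 1)
        means[j] = features[segs[j]].mean(axis=0)
        means.pop(j + 1)
    return [s[0] for s in segs]

if __name__ == "__main__":
    feats = np.random.rand(200, 12)                     # e.g. 12 cepstral coefficients
    print(dendrogram_segments(feats, 20))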
Authors:
Akemi Iida, Graduate School of Media and Governance, Keio University (Japan)
Nick Campbell, ATR Interpreting Telecommunications Research Laboratories (Japan)
Soichiro Iga, Graduate School of Media and Governance, Keio University (Japan)
Fumito Higuchi, Graduate School of Media and Governance, Keio University (Japan)
Michiaki Yasumura, Graduate School of Media and Governance, Keio University (Japan)
Paper number 818
Abstract:
This paper proposes three corpora of emotional speech in Japanese, each
maximizing the expression of one emotion (joy, anger, or sadness), for
use with CHATR, the concatenative speech synthesis system being developed
at ATR. A perceptual experiment was conducted using synthesized speech
generated from each emotion corpus, and the intended emotions proved
to be identifiable at a significant level. The authors' current work
is to identify the local acoustic features relevant for specifying a
particular emotion type. F0 and duration showed significant differences
among emotion types; AV (amplitude of voicing source) and GN (glottal
noise) also showed differences. This paper reports on the corpus design,
the perceptual experiment, and the results of the acoustic analysis.
Authors:
Pyungsu Kang, DSP Lab, Dept. of Electronics Engineering, Chonnam National University (Korea)
Jiyoung Kang, DSP Lab, Dept. of Electronics Engineering, Chonnam National University (Korea)
Jinyoung Kim, DSP Lab, Dept. of Electronics Engineering, Chonnam National University (Korea)
Paper number 264
Abstract:
We present a new mixed LDA-VQ method for predicting Korean prosodic
break indices (PBI) for a given utterance. PBI can be used as an important
cue to syntactic discontinuity in continuous speech recognition (CSR).
Our proposed LDA-VQ model consists of three steps. In the first step,
PBI is predicted from syllable and pause duration information using
linear discriminant analysis (LDA). In the second step, syllable tone
information is used to estimate PBI: the syllable tones are coded by
vector quantization (VQ) and PBI is estimated with a tri-tone model.
In the last step, the two PBI predictors are integrated by a weighting
factor. The LDA-VQ method was tested on 200 literal-style spoken sentences.
The experimental results showed 72% accuracy.
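A minimal sketch of the final integration step under the assumption that each predictor outputs per-syllable posteriors over break indices and that the combination is a single-weight interpolation; the array shapes and weight value are illustrative, not the paper's.

# Illustrative sketch (assumed form) of merging the duration-based LDA predictor
# and the tone-based VQ/tri-tone predictor with one weighting factor.
import numpy as np

def combine_pbi(p_lda, p_vq, weight=0.6):
    """p_lda, p_vq: (n_syllables, n_break_indices) posterior arrays."""
    combined = weight * np.asarray(p_lda) + (1.0 - weight) * np.asarray(p_vq)
    return combined.argmax(axis=1)          # predicted break index per syllable

if __name__ == "__main__":
    p_lda = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
    p_vq  = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
    print(combine_pbi(p_lda, p_vq))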
Authors:
Mark Laws, University of Otago (New Zealand)
Richard Kilgour, University of Otago (New Zealand)
Paper number 1116
Abstract:
With the advent of spoken language computer interface systems, the
storage and management of speech corpora is becoming more of an issue
in the development of such systems. Until recently, even large corpora
were stored as individual text and speech files, or as a single, monolithic
file. The issues involved in management and retrieval of the data have
been, to a large extent, overlooked. Relational database management
systems (RDBMS) are proposed as an ideal tool for the management of
speech corpora. Relationships between words and phonemes, and the realisations
of these, can be stored and retrieved efficiently. An RDBMS may be constructed
with various levels to store speaker, language, label transcription,
and phonetic information, as well as speech as isolated words and derived
segmented units. An implementation of such a system, currently called
the Management Of Otago Speech Environment (MOOSE), is presented for
managing the Otago Speech Corpus. The applicability of MOOSE to other
corpora is currently under investigation.
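To make the idea of a multi-level relational layout concrete, the sketch below defines a hypothetical schema linking speakers, recordings, words, and phone segments and retrieves realisations of a phoneme. The table and column names are assumptions for illustration, not the actual MOOSE design.

# Illustrative sketch (hypothetical schema, not the actual MOOSE design).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speaker   (speaker_id INTEGER PRIMARY KEY, name TEXT, dialect TEXT);
CREATE TABLE recording (recording_id INTEGER PRIMARY KEY, speaker_id INTEGER,
                        path TEXT, sample_rate INTEGER,
                        FOREIGN KEY (speaker_id) REFERENCES speaker);
CREATE TABLE word      (word_id INTEGER PRIMARY KEY, recording_id INTEGER,
                        orthography TEXT, start_ms INTEGER, end_ms INTEGER,
                        FOREIGN KEY (recording_id) REFERENCES recording);
CREATE TABLE phone     (phone_id INTEGER PRIMARY KEY, word_id INTEGER,
                        label TEXT, start_ms INTEGER, end_ms INTEGER,
                        FOREIGN KEY (word_id) REFERENCES word);
""")

# Example query: all realisations of the phoneme /i/ by a given speaker.
rows = conn.execute("""
    SELECT r.path, p.start_ms, p.end_ms
    FROM phone p JOIN word w ON p.word_id = w.word_id
                 JOIN recording r ON w.recording_id = r.recording_id
    WHERE p.label = ? AND r.speaker_id = ?""", ("i", 1)).fetchall()
print(rows)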
Authors:
Fabrice Malfrère, Faculté Polytechnique de Mons (Belgium)
Olivier Deroo, Faculté Polytechnique de Mons (Belgium)
Thierry Dutoit, Faculté Polytechnique de Mons (Belgium)
Paper number 354
Abstract:
In this paper we compare two different methods for phonetically labeling
a speech database. The first approach is based on the alignment of
the speech signal on a high quality synthetic speech pattern, and the
second one uses a hybrid HMM/ANN system. Both systems have been evaluated
on manually segmented French read utterances from a speaker never seen
in the training stage of the HMM/ANN system. This study outlines the
advantages and drawbacks of both methods. The high-quality speech synthesis
system has the great advantage that no training stage is needed, while
the classical HMM/ANN system easily allows multiple phonetic transcriptions.
We derive a method for automatically building phonetically labeled
speech databases, using the synthesis-based segmentation tool to bootstrap
the training process of our hybrid HMM/ANN system. Such segmentation
tools will be a key element in the development of improved speech synthesis
and recognition systems.
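The first approach can be pictured as a dynamic time warping (DTW) alignment between the natural utterance and a synthetic rendition of the same text, with the synthesiser's known phone boundaries mapped through the warping path. The sketch below shows this generic idea; the feature representation and function names are assumptions, not the paper's system.

# Illustrative sketch (assumed formulation): align natural speech to a synthetic
# rendition with DTW, then map the synthesiser's known phone boundaries through
# the warping path to label the natural signal.
import numpy as np

def dtw_path(a, b):
    """a, b: (n, d) and (m, d) feature arrays; returns the optimal warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

def map_boundaries(synthetic_boundaries, path):
    """synthetic_boundaries: frame indices in the synthetic signal."""
    mapping = {}
    for syn_frame, nat_frame in path:        # keep first natural frame per syn frame
        mapping.setdefault(syn_frame, nat_frame)
    return [mapping.get(b) for b in synthetic_boundaries]

if __name__ == "__main__":
    syn = np.random.rand(40, 12)   # synthetic-speech features (e.g. cepstra)
    nat = np.random.rand(55, 12)   # natural-speech features for the same text
    print(map_boundaries([10, 25, 39], dtw_path(syn, nat)))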
Authors:
J. Bruce Millar, Australian National University (Australia)
Paper number 705
Abstract:
This paper describes a World Wide Web-based system that allows a speech
corpus developer to build a comprehensive description of a spoken language
data corpus. The browser interface allows the user to define the speaking
environment to be described and then, in a modular fashion, to progressively
describe it and all the data files arising from it. The quality of the
description can be tested incrementally, and the feedback generated
can be used to further refine the description. The user is guided through
the necessary modules in a way that prompts, but does not demand, adherence
to a strict pattern. The complete data description can be displayed
on screen and also downloaded in a simple text form. Users may register
as clients of the system, in which case they can store descriptions
on the system and update them at any time.
Authors:
Claude Montacié, Laboratoire d'Informatique de Paris 6 (France)
Marie-José Caraty, Laboratoire d'Informatique de Paris 6 (France)
Paper number 1141
Abstract:
In this paper, we present techniques for warping the audio data of a
movie onto its script. In order to improve this script warping, a new
algorithm has been developed to split the audio data into silence, noise,
music, and speech segments without a training step. This segmentation
uses multiple techniques such as voiced/unvoiced segmentation, pitch
detection, pitch tracking, and speaker and speech recognition. The 102.47
minutes of the film "Contes de Printemps" by E. Rohmer have been indexed
with these techniques, with an average shift of less than one second
between the time-coded script and the audio data.
Authors:
David Pye, ORL (U.K.)
Nicholas J. Hollinghurst, ORL (U.K.)
Timothy J. Mills, ORL (U.K.)
Kenneth R. Wood, ORL (U.K.)
Paper number 517
Abstract:
This paper reports recent work at ORL on segmentation of digital audio/video
recordings. Firstly, we describe an audio segmentation algorithm that
partitions a soundtrack into manageably sized segments for speech recognition.
Secondly, we present an algorithm for detecting camera shot-break locations
in the video. The output of these two algorithms is combined to produce
a semantically meaningful segmentation of audio/video content, appropriate
for information retrieval. We report the success of the algorithms
in the context of television news retrieval.
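One simple way to combine the two boundary streams is to snap audio-segment boundaries to nearby camera shot breaks, as in the sketch below. This combination rule, tolerance, and function name are assumptions for illustration, not necessarily the rule used at ORL.

# Illustrative sketch (assumed combination rule): audio segment boundaries are
# snapped to the nearest shot break within a small tolerance, so that retrieval
# units respect both soundtrack and visual structure.
def combine_boundaries(audio_bounds, shot_breaks, tolerance=1.0):
    """audio_bounds, shot_breaks: boundary times in seconds (sorted)."""
    combined = []
    for t in audio_bounds:
        nearest = min(shot_breaks, key=lambda s: abs(s - t), default=None)
        combined.append(nearest if nearest is not None and abs(nearest - t) <= tolerance
                        else t)
    return sorted(set(combined))

if __name__ == "__main__":
    print(combine_boundaries([10.2, 33.7, 61.0], [9.8, 34.5, 80.0]))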
Authors:
Stefan Rapp, Sony International (Europe) GmbH (Germany)
Grzegorz Dogil, Institut f. Maschinelle Sprachverarbeitung, Univ. Stuttgart (Germany)
Paper number 906
Abstract:
We present methods for finding identical or nearly identical news stories
in hourly radio news broadcasts read by the same or different announcers.
They make it possible to build, at low cost, a large database of repeated,
professionally read speech that is especially interesting for prosody
research, but also, e.g., for concept-to-speech and sociolinguistic
studies. An automatically recorded complete radio news broadcast is
first segmented into individual news stories using HMM recognition.
Then, the word sequence estimates of the stories are either compared
directly (naive method) or realigned with the signals of other stories
(realignment method) to find out which stories have been read before
and which have not. Both methods can be further improved by computing
"meta distances" that also take into account distances to other stories.
We find that, on real-life data, the realignment method combined with
meta distances is the most reliable.
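One way to picture a "meta distance" is to compare two stories through their distance profiles to all other stories rather than only directly, as in the sketch below. This particular formulation is an assumption for illustration; the paper's exact definition is not given in the abstract.

# Illustrative sketch (assumed formulation) of meta distances derived from a
# matrix of direct story-to-story distances.
import numpy as np

def meta_distances(dist):
    """dist: (n, n) symmetric matrix of direct story-to-story distances."""
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    meta = np.zeros_like(dist)
    for i in range(n):
        for j in range(n):
            others = [k for k in range(n) if k not in (i, j)]
            # repeated stories should lie at similar distances from every other
            # story, so compare their distance profiles
            meta[i, j] = np.linalg.norm(dist[i, others] - dist[j, others])
    return meta

if __name__ == "__main__":
    direct = np.array([[0.0, 0.2, 0.9], [0.2, 0.0, 0.8], [0.9, 0.8, 0.0]])
    print(meta_distances(direct))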
Authors:
Christel Brindöpke, University Bielefeld (Germany)
Brigitte Schaffranietz, University Bielefeld (Germany)
Paper number 352
Abstract:
This article presents a phonetically defined annotation system for
German speech melody whose descriptive units cover the perceptually
relevant pitch movements. The units for the melodic annotation are
based on a model of read German utterances. The implementation of our
descriptive melodic units as part of a recently developed labelling
and testing environment for melodic aspects of speech allows the units
to be applied comfortably and intersubjectively in the annotation of
speech. An experimental evaluation based on a rating experiment confirms
that the melodic units describe spontaneous speech as adequately as
read speech.
Authors:
Karlheinz Stöber, IKP, Bonn University (Germany)
Wolfgang Hess, IKP, Bonn University (Germany)
Paper number 239
Abstract:
We describe a new approach to speaker-independent automatic phoneme
alignment. Typical algorithms for this task use only phoneme-to-frame
similarity measures, which are maximised or minimised in some way. In
addition to such similarity measures, we use phoneme duration hypotheses
generated by the speech synthesis system HADIFIX. Duration hypotheses
are difficult to incorporate into algorithms based on dynamic programming,
so we construct a cost function combining phoneme-to-frame similarity
with agreement between segment durations and the duration hypotheses,
and minimise this cost function with a genetic algorithm. The results
show that the accuracy of automatically determined phoneme boundaries
increases, especially for speakers not used in the training phase.
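The sketch below shows the general shape of such an approach: a cost function that trades off acoustic fit against deviation from predicted durations, searched by a simple genetic algorithm over duration vectors. The cost weighting, chromosome encoding, and GA operators are assumptions for illustration, not the HADIFIX-based system itself.

# Illustrative sketch (assumed formulation): score candidate segmentations with
# a cost combining phoneme-to-frame similarity and deviation from predicted
# durations, and search for a low-cost alignment with a simple genetic algorithm.
import random

def cost(durations, frame_scores, predicted_durations, alpha=0.5):
    """durations: frames assigned to each phoneme (sums to the utterance length).
    frame_scores[p][t]: similarity of frame t to phoneme p (higher is better).
    predicted_durations[p]: duration hypothesis (frames) from the synthesiser."""
    total, start = 0.0, 0
    for p, dur in enumerate(durations):
        acoustic = -sum(frame_scores[p][start:start + dur])   # reward matching frames
        total += alpha * acoustic + (1 - alpha) * abs(dur - predicted_durations[p])
        start += dur
    return total

def repair(durations, n_frames):
    """Force positive integer durations that sum exactly to n_frames."""
    durations = [max(1, int(round(d))) for d in durations]
    while sum(durations) != n_frames:
        i = random.randrange(len(durations))
        if sum(durations) > n_frames and durations[i] > 1:
            durations[i] -= 1
        elif sum(durations) < n_frames:
            durations[i] += 1
    return durations

def genetic_align(frame_scores, predicted_durations, n_frames,
                  pop_size=40, generations=300, mutation=0.2):
    n_phones = len(predicted_durations)
    pop = [repair([random.randint(1, 2 * d) for d in predicted_durations], n_frames)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: cost(c, frame_scores, predicted_durations))
        parents = pop[: pop_size // 2]                         # keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_phones)                # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation:
                child[random.randrange(n_phones)] += random.choice([-2, -1, 1, 2])
            children.append(repair(child, n_frames))
        pop = parents + children
    return min(pop, key=lambda c: cost(c, frame_scores, predicted_durations))

if __name__ == "__main__":
    scores = [[random.random() for _ in range(100)] for _ in range(5)]
    print(genetic_align(scores, [20, 15, 25, 20, 20], n_frames=100))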
Authors:
Amy Isard, Human Communication Research Centre, University of Edinburgh (Scotland)
David McKelvie, Human Communication Research Centre, University of Edinburgh (Scotland)
Henry S. Thompson, Human Communication Research Centre, University of Edinburgh (Scotland)
Paper number 322
Abstract:
The rapid growth in availability of high-quality recordings of natural
spoken dialogue (and natural spoken material more generally) has encouraged
us to improve the interchange of transcripts of such material, so that
these resources are easy for the scientific community as a whole to
exploit. In this paper, we describe a new SGML architecture which
we have recently adopted for the HCRC Map Task corpus (a corpus of
spontaneous task-oriented dialogues) with precisely these issues in
view. This architecture is oriented towards ease of processing and
update.