Authors:
Ingrid Ahmer, University of South Australia (Australia)
Robin W. King, University of South Australia (Australia)
Paper number 419
Abstract:
The purpose of this research is to investigate methods for applying
speech recognition techniques to improve the productivity of off-line
captioning for television. We posit that existing corpora for training
continuous speech recognisers are unrepresentative of the acoustic
conditions of television soundtracks. To evaluate the use of application-specific
models for this task, we have developed a soundtrack corpus
(representing a single genre of television programming) for acoustic
analysis and a text corpus (from the same genre) for language modelling.
These corpora are built from components of the manual captioning process.
Captions were used to automatically segment and label the acoustic
soundtrack data at sentence level, with manual post-processing to classify
and verify the data. The text corpus was derived using automatic processing
from approximately 1 million words of caption text. The results confirm
the acoustic profile of the task to be characteristically different
to that of most other speech recognition tasks (with the soundtrack
corpus being almost devoid of clean speech). The text corpus indicates
that application-specific language modelling will be effective for
the chosen genre, although a lexicon providing complete lexical coverage
is unattainable. There is a high correspondence between captions and
soundtrack speech for the chosen genre, confirming that closed-captions
can be a useful data source for generating labelled acoustic data.
The corpora provide a high quality resource to support further research
into automated speech recognition.
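The caption-driven segmentation step described above can be outlined as follows. This is an illustrative sketch only, not the authors' code; the caption format, with start and end times in seconds, is an assumption.

```python
def segment_by_captions(samples, sample_rate, captions):
    """Cut a soundtrack into sentence-level, text-labelled segments.

    captions: list of (start_seconds, end_seconds, text) tuples.
    Returns a list of (text, samples) pairs, one per caption.
    """
    segments = []
    for start_s, end_s, text in captions:
        lo = int(start_s * sample_rate)
        hi = min(int(end_s * sample_rate), len(samples))
        segments.append((text, samples[lo:hi]))
    return segments

# Toy example: one second of "audio" at 10 Hz, two captions.
audio = list(range(10))
caps = [(0.0, 0.5, "first sentence"), (0.5, 1.0, "second sentence")]
labelled = segment_by_captions(audio, 10, caps)
```

In practice caption timings are only approximately aligned with the speech, which is why the manual post-processing and verification step described in the abstract remains necessary.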
Authors:
Fabrice Lefèvre, LIP6 (France)
Claude Montacié, LIP6 (France)
Marie-José Caraty, LIP6 (France)
Paper number 573
Abstract:
Delta coefficients are a conventional way to include temporal information
in speech recognition systems; in particular, they are widely used in
Gaussian HMM-based systems. Some attempts were made to introduce the
delta coefficients into the K-Nearest Neighbours (K-NN) HMM-based system
that we recently developed. Introducing the delta coefficients directly
into the representation space is shown to be unsuitable for the K-NN
probability density function (pdf) estimator. We therefore investigate
whether the delta coefficients could be used to improve the K-NN HMM-based
system in other ways. For this purpose, an analysis of the delta coefficients
in Gaussian HMM-based systems is proposed. It leads to the conclusion
that the delta coefficients also influence the recognition process.
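For reference, conventional delta coefficients are usually computed with a regression over neighbouring frames. The abstract does not give a formula; the sketch below uses the standard HTK-style definition, d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum n^2), shown for a single cepstral coefficient per frame:

```python
def delta(frames, N=2):
    """Delta coefficients via the standard regression formula,
    with edge frames replicated. `frames` is a list of scalars
    (one cepstral coefficient per frame)."""
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = []
    for t in range(T):
        num = sum(n * (frames[min(t + n, T - 1)] - frames[max(t - n, 0)])
                  for n in range(1, N + 1))
        deltas.append(num / denom)
    return deltas
```

On a linearly rising coefficient the interior deltas recover the slope, which is the sense in which the coefficients capture local temporal dynamics.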
Authors:
Raymond Low, The University of Western Australia (Australia)
Roberto Togneri, The University of Western Australia (Australia)
Paper number 645
Abstract:
A novel technique for speaker-independent automated speech recognition
is proposed. We take a segment-model approach to Automated Speech Recognition
(ASR), considering the trajectory of an utterance in vector space and
classifying it using a modified Probabilistic Neural Network (PNN) and
a maximum-likelihood rule. The system compares favourably with established
techniques, achieving in excess of 94% accuracy on isolated digit recognition,
88% on isolated alphabetic letters, and 83% on the confusable /e/ set.
A favourable compromise between recognition accuracy, memory usage and
speed can also be reached by performing clustering
on the training data for the PNN.
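The modified PNN and the segment-model details are specific to the paper, but the basic PNN idea, scoring each class with a Gaussian-kernel (Parzen) density estimate over its training samples and choosing the best-scoring class, can be sketched as follows. This is a minimal one-dimensional illustration, not the authors' system:

```python
import math

def pnn_classify(x, train, sigma=1.0):
    """train: dict mapping class label -> list of 1-D training samples.
    Score each class with a Parzen (Gaussian-kernel) density estimate
    at x and return the highest-scoring class."""
    best_cls, best_score = None, -1.0
    for cls, samples in train.items():
        score = sum(math.exp(-(x - s) ** 2 / (2 * sigma ** 2))
                    for s in samples) / len(samples)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```

Clustering the training data, as the abstract suggests, would replace each class's sample list with a smaller set of centroids, trading some accuracy for reduced memory and computation.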
Authors:
Imed Zitouni, LORIA / INRIA-Lorraine (France)
Paper number 727
Abstract:
In contrast to conventional n-gram approaches, which are the most widely
used language models in continuous speech recognition systems, the multigram
approach models a stream of variable-length sequences. To overcome
the independence assumption of the classical multigram model, we propose
in this paper a hierarchical model, called Mnv, which successively relaxes
this assumption. The estimation of the model parameters can be formulated
as a maximum likelihood estimation problem from incomplete data used
at different levels (j in 1...v). We show that estimates of the model
parameters can be computed through an iterative Expectation-Maximization
algorithm. A few experimental tests were carried out on a corpus extracted
from the French newspaper ``Le Monde''. Results show that Mnv outperforms
the baseline multigram and interpolated bigram models but is comparable
to the interpolated trigram model.
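As background, the classical multigram model the abstract builds on segments a token stream into independent variable-length sequences, and the most likely segmentation can be found with a Viterbi-style dynamic program. The sketch below shows only this baseline decoding, not the hierarchical Mnv; the sequence probabilities are toy values assumed for illustration:

```python
import math

def best_segmentation(tokens, probs, max_len=3):
    """Most likely split of `tokens` into variable-length sequences,
    assuming (as in the classical multigram model) that sequences are
    independent. probs: dict mapping token tuples to probabilities.
    Assumes the stream is coverable by sequences in `probs`."""
    n = len(tokens)
    best = [(-math.inf, None)] * (n + 1)   # (log-likelihood, backpointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for L in range(1, min(max_len, i) + 1):
            seq = tuple(tokens[i - L:i])
            if seq in probs and best[i - L][0] > -math.inf:
                score = best[i - L][0] + math.log(probs[seq])
                if score > best[i][0]:
                    best[i] = (score, i - L)
    # Backtrack from the end of the stream.
    segs, i = [], n
    while i > 0:
        j = best[i][1]
        segs.append(tuple(tokens[j:i]))
        i = j
    return segs[::-1]
```

In a full multigram system these probabilities are themselves re-estimated with EM over all possible segmentations, which is the incomplete-data formulation the abstract refers to.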
Authors:
Michiko Watanabe, University of Tokyo (Japan)
Paper number 812
Abstract:
In second language input studies, speaking speed is regarded as one
of the most influential factors in comprehension. However, research
in this area has mainly been conducted on written texts read aloud.
The present study investigated temporal variables, such as articulation
rate and ratio and frequency of fillers and silent pauses, in three
university lectures given in Japanese. It was found that the total
duration ratio of fillers was as great as that of silent pauses. It
also became clear that, for individual speakers, articulation rate
and frequency of fillers are relatively constant, while frequency of
silent pauses varies depending on discourse section. Of total pause
ratio, pause frequency and articulation rate, the last correlated
best with listener ratings of speech speed. The findings suggest that
spontaneous speech requires methods of speech speed measurement different
from those for read speech.
Authors:
Matthew Aylett, Human Communication Research Centre, University of Edinburgh (U.K.)
Paper number 823
Abstract:
Vowel space data (a two-dimensional F1/F2 plot) is of interest to phoneticians
for comparing different accents, languages, speaker styles and individual
speakers. Current automatic methods used by speech technologists do
not generally produce traditional vowel space models; instead they
tend to produce hyper-dimensional codebooks covering the speaker's
entire speech stream. This makes it difficult to relate results generated
by these methods to observations in laboratory phonetics.
To address these problems, a model was developed based on a Gaussian
mixture density function fitted using expectation maximisation
on F1/F2 data, producing a probability distribution in F1/F2 space.
Speech was pre-processed using voicing to automatically excerpt vowel
data without any need for segmentation and a parametric fit algorithm
was applied to calculate likely vowel targets. The result was a clear
visualisation of a speaker's vowel space requiring no segmented or
labelled speech.
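The core fitting step, a Gaussian mixture estimated on F1/F2 points by expectation maximisation, can be sketched as below. This is a generic two-component, diagonal-covariance EM on synthetic formant-like data, not the authors' parametric-fit algorithm; the initialisation scheme and the vowel formant values are assumptions for illustration.

```python
import math, random

def fit_gmm2(points, iters=30):
    """EM fit of a two-component, diagonal-covariance Gaussian mixture
    to 2-D (F1, F2) points. Returns (weights, means, variances)."""
    # Deterministic initialisation: the extreme points along F1.
    means = [list(min(points)), list(max(points))]
    varis = [[1e4, 1e4], [1e4, 1e4]]
    weights = [0.5, 0.5]

    def pdf(p, m, v):
        z = 1.0
        for d in range(2):
            z *= (math.exp(-(p[d] - m[d]) ** 2 / (2 * v[d]))
                  / math.sqrt(2 * math.pi * v[d]))
        return z

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            ws = [weights[j] * pdf(p, means[j], varis[j]) for j in range(2)]
            tot = sum(ws) or 1e-300
            resp.append([w / tot for w in ws])
        # M-step: re-estimate weights, means and variances.
        for j in range(2):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(points)
            for d in range(2):
                means[j][d] = sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                varis[j][d] = sum(r[j] * (p[d] - means[j][d]) ** 2
                                  for r, p in zip(resp, points)) / nj + 1e-6
    return weights, means, varis

# Synthetic two-vowel cloud: an /i/-like and an /a/-like cluster (values assumed).
rng = random.Random(1)
pts = ([(rng.gauss(300, 30), rng.gauss(2300, 100)) for _ in range(100)] +
       [(rng.gauss(700, 30), rng.gauss(1200, 100)) for _ in range(100)])
w, m, v = fit_gmm2(pts)
```

The fitted component means then serve as candidate vowel targets, which is the role the parametric fit plays in the abstract.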
Authors:
Michelle Minnick Fox, Department of Linguistics, University of Pennsylvania (USA)
Paper number 911
Abstract:
Programs for testing and training of difficult vowel distinctions in
American English were created for subjects to access via the Internet
using a web browser. The testing and training data include many likely
vowel confusions for speakers of different L1s. The training program
focuses on one distinction at a time, and adjusts to concentrate on
particular contexts or exemplars that are difficult for the individual
subject. In the current study, 52 subjects participated in testing
and 2 subjects participated in training. In the testing portion, results
indicate that the L1 and the fluency level in English, as well as individual
variability, have an effect on perceptual ability. In the training
portion, subjects showed improvement on the contrasts on which they
trained. Because these programs make extensive data collection over
large populations and large distances easy, this method of research
will facilitate further investigation of questions regarding second
language acquisition.
Authors:
Najam Malik, School of Electrical Engineering, The University of New South Wales, Sydney (Australia)
W. Harvey Holmes, School of Electrical Engineering, The University of New South Wales, Sydney (Australia)
Paper number 1026
Abstract:
Over frames of short time duration, filtered speech may be described
as a finite linear combination of sinusoidal components. In the case
of a frame of voiced speech the frequencies are considered to be harmonics
of a fundamental frequency. It can be assumed further that the speech
samples are observed in additive white noise of zero mean, resulting
in a standard signal-plus-noise model. This model has a nonlinear dependence
on the frequencies of the sinusoids but is linear in their coefficients.
We use subspace line spectral estimation methods of Pisarenko and Prony
type to estimate the frequencies and use the results in voiced-unvoiced
classification and pitch estimation, followed by analysis of the speech
waveform into its sinusoidal components.
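As an illustration of the Pisarenko-type estimation mentioned above, the single-sinusoid case has a closed form: with lag-1 and lag-2 sample autocorrelations r1 and r2 of a real sinusoid in white noise, cos(w) = (r2 + sqrt(r2^2 + 8*r1^2)) / (4*r1). The sketch below covers only this one-sinusoid case, not the authors' subspace implementation for multiple harmonics:

```python
import math

def pisarenko_freq(x):
    """Pisarenko frequency estimate (radians/sample) for a single real
    sinusoid in white noise, from lag-1 and lag-2 autocorrelations."""
    n = len(x)
    r1 = sum(x[i] * x[i + 1] for i in range(n - 1)) / (n - 1)
    r2 = sum(x[i] * x[i + 2] for i in range(n - 2)) / (n - 2)
    c = (r2 + math.sqrt(r2 ** 2 + 8 * r1 ** 2)) / (4 * r1)
    return math.acos(max(-1.0, min(1.0, c)))   # clamp against rounding
```

For voiced speech the estimated frequencies of several such components would be compared against a harmonic pattern, supporting the voiced/unvoiced decision and pitch estimate described in the abstract.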
Authors:
Petra Hansson, Lund University (Sweden)
Paper number 1042
Abstract:
Pauses in spontaneous speech have a less restricted distribution than
pauses in read discourse; however, they are not distributed in a haphazard
way. The majority of the perceived pauses in the examined Swedish spontaneous
speech material, 73%, occurred in one of the following positions: between
sentences, after discourse markers and conjunctions, and before accented
content words. There is a range of acoustic correlates of perceived
pauses in spontaneous speech, such as silent intervals, hesitation
sounds, prepausal lengthening, glottalization and specific F0 patterns.
The acoustic manifestation of a pause, e.g. the duration of the pause
and the F0 pattern associated with the pause, is to some extent dependent
on the pause's position and function.
Authors:
Elisabeth Zetterholm, Lund University, Dept. of Linguistics and Phonetics (Sweden)
Paper number 1043
Abstract:
Terms for voice quality or phonation types in normal speech often
come from studies of pathological speech (laryngeal settings), and
it is hard to describe voice quality, especially the variations of
a normal voice. In normal speech we use different voice qualities
for linguistic distinctions in some languages, prosodically as a boundary
signal, socially depending on social and regional variants, and
paralinguistically in attitudes and emotions. This paper shows
some reference types of voice qualities, recorded by a trained phonetician,
and their acoustic correlates. In a pilot study, a male actor recorded
four attitudinally neutral sentences using five different emotions,
which are compared to his neutral voice. It is evident that voice
quality, as well as rhythm and intonation, plays an important role
in giving the impression of different emotions.
Authors:
Julie Lunn, Queen Margaret College (U.K.)
Alan A. Wrench, Queen Margaret College (U.K.)
Janet Mackenzie Beck, Queen Margaret College (U.K.)
Paper number 1118
Abstract:
The production of /l/ is examined for pre- and post-operative patients
who have undergone surgery in three distinct areas (anterior, posterior
or lateral tongue) followed by radiotherapy and reconstruction. Results
show F1 and F2 to be raised after surgery in all cases. Normalised
measures of tongue height (F1-F0) and extension (F2-F1) revealed no
significant change after surgery to the side of the tongue but in the
other two categories, results indicated a change normally associated
with both raising and fronting of the tongue. The paper compares these
results with findings from other studies and considers possible mechanisms
for the observed changes.