Session ThAD: Auditory Modelling and Psychoacoustics; Neural Networks for Speech Processing and Recognition

Chairperson: Phil D. Green, Univ. of Sheffield, UK



A PROBABILISTIC MODEL OF DOUBLE-VOWEL SEGREGATION

Authors: Laurent Varin and Frédéric Berthommier

Institut de la Communication Parlée/INPG Grenoble, FRANCE {varin,bertho}@icp.grenet.fr

Volume 5 pages 2791 - 2794

ABSTRACT

The decomposition principle was first proposed by Varga and Moore [] and applied to Automatic Speech Recognition (ASR) in noise. We show a new adaptation of this principle to model the schema-based streaming process which was inferred from psychoacoustical studies []. We address here the classical problem of double-vowel segregation. The signal decomposition is enabled by an internal, statistical model of vowel spectra. We apply this decomposition model, which is able to reconstruct the spectra of superimposed signals after identification of only the dominant member or of both members of the pair. Three stages are involved. The first is a module performing identification when the input is a mixture of interfering signals; prior identification of the dominant spectrum prevents combinatorial reconstruction. The second step is an evaluation of the mixture coefficient, also based on an internal representation of spectra. Finally, the reconstruction of the spectra is probabilistic, by way of likelihood maximisation, using the labels and the mixture coefficient. This is tested on a large database of synthetic vowels.
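The reconstruction stage described above can be sketched in miniature. The least-squares fit below is an illustrative stand-in for the paper's likelihood maximisation, and all names are hypothetical; note that prior identification of the dominant vowel would restrict the outer loop and avoid the full combinatorial search.

```python
import numpy as np

def segregate(mixture, prototypes, alphas=np.linspace(0.1, 0.9, 9)):
    """Find the vowel pair (i, j) and mixing coefficient alpha whose
    combination alpha*S_i + (1-alpha)*S_j best explains the observed
    mixture spectrum.  A least-squares toy version of the paper's
    probabilistic reconstruction."""
    best = (None, None, None, np.inf)
    n = len(prototypes)
    for i in range(n):            # with a dominant-vowel label, i is fixed
        for j in range(i + 1, n):
            for a in alphas:      # candidate mixture coefficients
                recon = a * prototypes[i] + (1 - a) * prototypes[j]
                err = float(np.sum((mixture - recon) ** 2))
                if err < best[3]:
                    best = (i, j, a, err)
    return best
```

A usage example: mix two prototype spectra at 0.7/0.3 and recover the pair and coefficient.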

A0113.pdf



STIMULUS SIGNAL ESTIMATION FROM AUDITORY-NEURAL TRANSDUCTION INVERSE PROCESSING

Authors: Habibzadeh V. Houshang , Kitazawa Shigeyoshi

Graduate School of Science and Engineering, Shizuoka University 3-5-1 Johoku, Hamamatsu 432, JAPAN E-mail: houshang@cs.inf.shizuoka.ac.jp

Volume 5 pages 2795 - 2798

ABSTRACT

Inverse processing techniques for auditory models would have very useful applications in speech perception and in the evaluation of auditory models. This paper examines how an Inner Hair Cell (IHC) model can be of benefit, as a compression and envelope detection section, in the inverse processing of a cochlear model. Our proposed inversion method combines the inverse of Meddis's auditory neural transduction model with Lyon's cochlear model to estimate the input signal to the inner ear from its auditory nerve firings, with acceptable quality. Since this method uses neural firings or cleft contents as input and regenerates the original acoustic stimulus, it is useful with any system generating auditory neural firings. For example, using this method, we are able to estimate the stimulus signal of the Nucleus Cochlear Implant systems to investigate the transferred speech quality without involving real patients.

A0139.pdf



FDVQ Based Keyword Spotter Which Incorporates A Semi-Supervised Learning for Primary Processing

Authors: Chakib Tadj (1), Pierre Dumouchel (2), Franck Poirier (3)

(1) Ecole de Technologie Superieure 1100 rue Notre Dame Ouest Montreal (Qc) - H3C 1K3 - Canada (2) Centre de Recherche Informatique de Montreal 1801, avenue McGill College, bureau 800 Montreal (Qc) - H3A 2N4 - Canada (3) Institut Universitaire Professionnalise 8, rue Montaigne BP 1104 Vannes - 56014 - France

Volume 5 pages 2799 - 2802

ABSTRACT

In this paper, we present a novel hybrid keyword spotting system that combines supervised and semi-supervised competitive learning algorithms. The first stage is an S-SOM (Semi-Supervised Self-Organizing Map) module which is specifically designed for discrimination between keywords (KWs) and non-keywords (NKWs). The second stage is an FDVQ (Fuzzy Dynamic Vector Quantization) module which discriminates between the KWs detected by the first stage. Experiments on the Switchboard database show an improvement of about 6% in the accuracy of the system compared to our best previous keyword spotter.

A0267.pdf



The initial time span of auditory processing used for speaker attribution of the speech signal

Authors: V. V. Lublinskaja, Ch. Sappok

Pavlov Institute of Physiology, Saint-Petersburg, Tel. +7 812 529 09 58, Fax: +7 812 218 05 01, E-mail: chi@physiology.spb.su; Institute of Slavonic Studies, Ruhr Universität Bochum, Tel. +49 234 700 6664, Fax: +49 234 7094 337, E-mail: sappokc@slf.ruhr-uni-bochum.de

Volume 5 pages 2803 - 2806

ABSTRACT

Research on the temporal organisation of speech perception is focussed mostly on the linguistic categories of the input. What is the role of non-grammatical categories in these processes? What kind of mechanisms integrate both kinds of features within the online process of perception? Individual voice qualities and the position of a sentence within the text were chosen to test the time interval within which decisions as to speaker belongingness are made. The results favour a model with a relatively fixed time span within which a familiar voice, or a deviation from an inherent context expectancy, is detected.

A0270.pdf



Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks

Authors: Nikko Ström

Department of Speech, Music and Hearing KTH (Royal Institute of Technology), Stockholm, Sweden Tel. +46 8 790 75 63, FAX: +46 8 790 78 54, E-mail: nikko@speech.kth.se

Volume 5 pages 2807 - 2810

ABSTRACT

This paper presents new methods for training large neural networks for phoneme probability estimation. A combination of the time-delay architecture and the recurrent network architecture is used to capture the important dynamic information of the speech signal. Motivated by the fact that the number of connections in fully connected recurrent networks grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal or smaller number of connections. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database. The achieved phoneme error rate, 28.3%, for the standard 39-phoneme set on the core test set of the TIMIT database is not far from the lowest reported. All training and simulation software used is made freely available by the author, making reproduction of the results feasible.
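The pruning idea can be sketched generically: remove the fraction of connections with the smallest magnitude and keep a mask of survivors. This is a minimal illustration of magnitude-based pruning, not the paper's exact scheme; the function name and tie-breaking are assumptions.

```python
import numpy as np

def prune_smallest(weights, fraction):
    """Zero out the given fraction of connections with the smallest
    magnitude; return the pruned matrix and a boolean mask of the
    surviving connections."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask
```

In a training loop this would be applied periodically, with the mask re-applied after each gradient update so pruned connections stay zero.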

A0316.pdf



A MODULAR INITIALIZATION SCHEME FOR BETTER SPEECH RECOGNITION PERFORMANCE USING HYBRID SYSTEMS OF MLPs/HMMs

Authors: Roxana Teodorescu, Dirk Van Compernolle and Ioannis Dologlou

K. U. Leuven - E.S.A.T., Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium E-mail: Roxana.Teodorescu@esat.kuleuven.ac.be

Volume 5 pages 2811 - 2814

ABSTRACT

This paper proposes a novel modular initialization scheme for Multilayer Perceptrons (MLPs) trained for phoneme classification. Small MLPs are trained to discriminate between one phoneme and all the others. In the next step they are merged, using our novel initialization scheme, into broad classes and trained further. In the last step we merge the broad phonetic MLPs using the same scheme to generate the final phonetic MLP. Experiments on a Dutch-language isolated-word database showed that the scheme gives faster and better estimates of Bayesian a posteriori probabilities compared to random initialization. Moreover, given its modularity, the method offers the possibility of dealing with high-dimensional problems.

A0367.pdf



LATERALIZATION FOR AUDITORY PERCEPTION OF FOREIGN WORDS

Authors: Tatiana V.Chernigovskaya

I.M.Sechenov Institute of Evolutionary Physiology, Russian Academy of Sciences, 194223 St. Petersburg FAX:7 812 552 30 12, E-mail: chern@ief.spb.su

Volume 5 pages 2815 - 2818

ABSTRACT

This paper presents an experimental study of cerebral hemispheric engagement in the auditory recognition of words depending on a set of linguistic factors. Words were native and foreign to the subjects. Listeners were normal right-handed adults with symmetrical hearing, native speakers of Russian; English was acquired as a second language at school. The stimuli were linguistically balanced lists of natural Russian and English words presented monaurally, with white noise as contralateral masking. The data show a strong overall left-hemispheric advantage. The most significant factor for both hemispheres appeared to be 'frequency of usage' (as opposed to 'word length', which characterizes the perception of native words). The second most important factor was 'consonant ratio' for the RH and 'word length' for the LH. 'Part of speech' was shown to be of minimal importance for both hemispheres, with 'stress position' slightly more significant.

A0463.pdf



THE STRUCTURAL WEIGHTED SETS METHOD FOR CONTINUOUS SPEECH AND TEXT RECOGNITION

Authors: Yuri Kosarev, Pavel Jarov, Alexander Osipov

Russian Academy of Sciences Institute for Informatics and Automation St. Petersburg E-mail: kosarev@mail.jias.spb.su

Volume 5 pages 2819 - 2822

ABSTRACT

In known approaches to speech recognition based on Dynamic Programming (DP) or Hidden Markov Modelling (HMM), time sequences of elements (feature vectors, sounds, letters, etc.) are used directly as the objects of evaluation or matching. Both approaches share the same shortcoming: they can be realised only as a recurrent sequential process and cannot be parallelised. In addition, their complexity is relatively high. In the Structural Weighted Sets (SWS) method proposed below, such a sequence is first mapped onto a structure, namely a set of relations between its elements, and recognition is then reduced to matching the corresponding sets. Word matching can thus be realised by finding the intersection of two sets and evaluating its relative weight, which makes parallel processing possible. Simulation results are presented.
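The set-based matching idea can be illustrated with a toy relation: map each sequence to its set of ordered-precedence pairs, then score two words by the relative weight of the intersection. The paper's actual structures and weighting may differ; this sketch just shows how sequence matching reduces to set intersection.

```python
def relation_set(seq):
    """Map a sequence to the set of ordered-precedence relations
    between its elements (a simple stand-in for the SWS structure)."""
    return {(a, b) for i, a in enumerate(seq) for b in seq[i + 1:]}

def sws_score(s1, s2):
    """Relative weight of the intersection of the two relation sets.
    Set intersection has no sequential dependency, so the comparisons
    could be carried out in parallel."""
    r1, r2 = relation_set(s1), relation_set(s2)
    if not r1 or not r2:
        return 0.0
    return len(r1 & r2) / max(len(r1), len(r2))
```

For example, "cat" matched against itself scores 1.0, while the transposition "cta" keeps only the relations ('c','a') and ('c','t') and scores 2/3.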

A0473.pdf



Lateral Inhibitory Networks for Auditory Processing

Authors: C.J. Sumner and D.F. Gillies

Department of Computing, Imperial College of Science Technology and Medicine, 180 Queens Gate, London SW7 2BZ, United Kingdom Tel: +44 171 589 5111 ext 58378, E-mail: cjs2@doc.ic.ac.uk

Volume 5 pages 2823 - 2826

ABSTRACT

A neural-network model is described that produces a rate-place representation from auditory nerve output that is of considerably higher frequency resolution than that from a standard auditory peripheral model. The neural circuits used are called Lateral Inhibitory Networks. They have long been known to be responsible for early spatio-temporal processing in the visual system. Here we investigate the use of such networks for early auditory processing. We describe the analytical basis, problems with various variants of the model, and show some initial results yielded by the research.
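A minimal feed-forward sketch of lateral inhibition over a bank of auditory channels: each channel's output is its own input minus a fraction of its neighbours' inputs, which sharpens peaks in the rate-place profile. The paper also considers other variants; the function name and coefficient here are illustrative.

```python
import numpy as np

def lateral_inhibition(rates, inhibition=0.4):
    """One feed-forward lateral-inhibitory stage over a vector of
    channel firing rates.  Edge channels are handled by replicating
    the border values."""
    padded = np.pad(rates, 1, mode="edge")
    # subtract a weighted sum of the two immediate neighbours
    return rates - inhibition * (padded[:-2] + padded[2:])
```

Applied to a broad rate profile such as [0, 1, 2, 1, 0], the peak-to-flank contrast increases, which is the frequency-resolution enhancement the abstract refers to.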

A0641.pdf



MISSING FUNDAMENTALS: A PROBLEM OF AUDITORY OR MENTAL PROCESSING?

Authors: Henning Reetz

Allgemeine Sprachwissenschaft, University of Konstanz, D-78464 Konstanz, Germany Phone: +49 7531 882928, FAX: +49 7531 883095, E-mail: henning.reetz@uni-konstanz.de

Volume 5 pages 2827 - 2830

ABSTRACT

Subjects were presented with signal pairs forming different musical intervals. Signals were sine tones, complex tones with a fundamental, and complex tones without a fundamental. Subjects had to decide which signal pairs form a specific musical interval. Reaction times indicate that the perception of the 'missing fundamental' is a kind of musical processing and not necessarily a part of normal auditory processing in pitch perception.

A0647.pdf



PREDICTIVE NEURAL NETWORKS APPLIED TO PHONEME RECOGNITION

Authors: F. Freitag, E. Monte, J. Salavedra

Polytechnic University of Catalunya Department of Signal Theory and Communications C/Gran Capità, s/n, E - 08034 Barcelona E-mail: felix@gps.tsc.upc.es Fax: 34-3-4016447 Phone: 34-3-4016435

Volume 5 pages 2831 - 2834

ABSTRACT

In this paper a phoneme recognition system based on predictive neural networks is proposed. Neural networks are used to predict the observation vectors of speech frames. The resulting prediction error is used for phoneme recognition 1) as a distortion measure at the frame level and 2) as a feature, which is statistically modeled by the Rayleigh distribution. Continuous-speech phoneme recognition experiments are performed, and different settings of the system are evaluated.
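The use of prediction error as a distortion measure can be sketched with a linear predictor standing in for the neural network: one predictor is trained per phoneme class, and a frame sequence is scored by how well each class's predictor anticipates the next frame. Class and method names are hypothetical.

```python
import numpy as np

class PhonemePredictor:
    """Per-phoneme linear predictor of the next observation vector,
    a linear stand-in for the paper's predictive neural networks."""

    def __init__(self):
        self.W = None

    def fit(self, frames):
        # learn a map from each frame to its successor
        X, Y = frames[:-1], frames[1:]
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def distortion(self, frames):
        """Mean squared prediction error over the sequence, used as
        the frame-level distortion measure for recognition."""
        pred = frames[:-1] @ self.W
        return float(np.mean((frames[1:] - pred) ** 2))
```

Recognition then amounts to picking the class whose predictor yields the smallest distortion on the observed frames.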

A0777.pdf



EMPIRICAL COMPARISON OF TWO MULTILAYER PERCEPTRON-BASED KEYWORD SPEECH RECOGNITION ALGORITHMS

Authors: Suhardi, Klaus Fellbaum

Institute for Telecommunication and Theoretical Electrical Engineering Technical University of Berlin, Germany suhardi@ft.ee.tu-berlin.de Communication Engineering Brandenburg Technical University of Cottbus, Germany fellbaum@kt.tu-cottbus.de

Volume 5 pages 2835 - 2838

ABSTRACT

In this paper, an empirical comparison of two multilayer perceptron (MLP)-based techniques for keyword speech recognition (wordspotting) is described. The techniques are predictive neural model (PNM)-based wordspotting, in which the MLP is applied as a speech pattern predictor to compute a local distance between the acoustic vector and the phone model, and hybrid HMM/MLP-based wordspotting, where the MLP is used as a state (phone) probability estimator given acoustic vectors. The comparison was performed on the same database. According to our experiments, the hybrid HMM/MLP-based technique outperforms the PNM-based technique (by ~6.2%).

A0794.pdf



SEGMENT BOUNDARY ESTIMATION USING RECURRENT NEURAL NETWORKS

Authors: Toshiaki Fukada, Sophie Aveline, Mike Schuster, Yoshinori Sagisaka

ATR Interpreting Telecommunications Research Laboratories 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02 Japan Tel: +81 774 95 1301, FAX: +81 774 95 1308, E-mail: fukada@itl.atr.co.jp

Volume 5 pages 2839 - 2842

ABSTRACT

This paper describes a segment (e.g. phoneme) boundary estimation method based on recurrent neural networks (RNNs). The proposed method only requires acoustic observations to accurately estimate segment boundaries. Experimental results show that the proposed method can estimate segment boundaries significantly better than an HMM-based method. Furthermore, we incorporate the RNN-based segment boundary estimator into HMM-based and segment-based recognition systems. As a result, the segment boundary estimates give useful information for reducing computational complexity and improving recognition performance.

A0797.pdf



INCORPORATION OF HMM OUTPUT CONSTRAINTS IN HYBRID NN/HMM SYSTEMS DURING TRAINING

Authors: Mike Schuster

ATR, Interpreting Telecommunications Research Lab. 2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-02, JAPAN gustl@itl.atr.co.jp http://www.itl.atr.co.jp/

Volume 5 pages 2843 - 2846

ABSTRACT

This paper describes a method to incorporate the HMM output constraints in frame-based hybrid NN/HMM systems during training. While usually the NN parameters are adjusted to maximize the cross-entropy between the frame target probabilities and the network predictions, assuming statistically independent outputs in time, the approach described here maximizes the full likelihood of the utterance(s), also using the HMM output constraints, as in conventional HMM systems. This is achieved by maximizing the state occupation probabilities after a forward/backward pass using the scaled likelihoods coming from the network. Making a simplifying approximation for the derivative in the back-propagation through the forward/backward pass, tests show that the proposed method gives consistently higher string (phoneme) recognition rates than the conventional approach, which maximizes cross-entropy at the frame level.
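The forward/backward computation of state occupation probabilities from per-frame network likelihoods can be sketched as follows. This is a generic (unscaled) forward-backward pass over a small HMM, not the authors' training code; for long utterances per-frame rescaling would be needed to avoid underflow.

```python
import numpy as np

def state_occupation(likelihoods, trans, init):
    """Forward-backward pass: given per-frame state likelihoods
    (e.g. scaled network outputs), a transition matrix and an initial
    state distribution, return the state occupation probabilities
    gamma[t, s]."""
    T, S = likelihoods.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = init * likelihoods[0]
    for t in range(1, T):                      # forward recursion
        alpha[t] = (alpha[t - 1] @ trans) * likelihoods[t]
    for t in range(T - 2, -1, -1):             # backward recursion
        beta[t] = trans @ (beta[t + 1] * likelihoods[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Training would then adjust the network so as to increase the occupation probability of each frame's correct state, instead of fitting independent frame targets.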

A0802.pdf



PRINCIPLES OF THE HEARING PERIPHERY FUNCTIONING IN NEW METHODS OF PITCH DETECTION AND SPEECH ENHANCEMENT

Authors: Ludmila Babkina*, Sergey Koval**, Alexander Molchanov*

* Research Institute of Ear, Nose, Throat and Speech Disorders, St. Petersburg, Russia ** Speech Technology Center, St. Petersburg, Russia Tel./fax: +7(812)3279297, E-mail: master@stc.rus.net

Volume 5 pages 2847 - 2850

ABSTRACT

Our research shows that one of the mechanisms of the human auditory system ensuring highly noise-resistant recognition of voiced speech sounds is an electromechanical envelope feedback operating in the structures of the human inner ear. Digital modelling of the peripheral section of the hearing system with a similar multichannel envelope feedback has proved useful for pitch determination of vowels in noisy environments. The proposed model provides robust pitch detection for signal-to-noise ratios down to -12 to -14 dB. In a number of cases this noise robustness is better than that of other existing methods and systems.

A0835.pdf



THE LOCUS OF THE SYLLABLE EFFECT: PRELEXICAL OR LEXICAL?

Authors: Christine Meunier (1), Alain Content (1), (2), Uli H. Frauenfelder (1), Ruth Kearns (3)

(1) Laboratory of Experimental Psycholinguistics, University of Geneva, Switzerland email: meunier@fapse.unige.ch, frauenfe@uni2a.unige.ch (2) Laboratoire de Psychologie Expérimentale, Université Libre de Bruxelles, acontent@ulb.ac.be (3) Medical Research Council, Applied Psychology Unit, Cambridge, ruth.kearns@mrc-apu.cam.ac.uk

Volume 5 pages 2851 - 2854

ABSTRACT

The claim that the syllable constitutes a basic perceptual unit in French is commonly accepted. It is based in part on the syllable effect [1] obtained with words. The present study extends these syllable detection experiments to pseudowords. Four experiments failed to replicate the syllable effect observed on words. Detection responses in pseudowords are made as soon as sufficient information becomes available in the signal. The different pattern of results obtained with words and pseudowords suggests that the syllable effect is post-lexical rather than pre-lexical.

A0879.pdf



ON NOT REMEMBERING DISFLUENCIES

Authors: E. G. Bard and R. J. Lickley

Human Communication Research Centre and Department of Linguistics University of Edinburgh, Edinburgh EH8 9LL, UK Tel. +44 131 650 3951, E-mail: ellen@ling.ed.ac.uk

Volume 5 pages 2855 - 2858

ABSTRACT

Disfluencies - repetitions and reformulations mid-sentence in normal spontaneous speech - are problematic for both psychological and computational models of speech understanding. Much effort is being applied to finding ways of adapting computational systems to detect and delete disfluencies. The input to such systems is usually an accurate transcription. We present results of an experiment in which human listeners are asked to give verbatim transcriptions of disfluent and fluent utterances. These suggest that listeners are seldom able to identify all the words "deleted" in disfluencies. While all types suffer, identification rates for repetitions are even worse than for other types. We attribute the results to difficulties in recalling, or coding for recall, items which cannot be identified with certainty. This inability seems to make human speech recognition more robust than current computational models.

A0981.pdf



Using an Auditory Model and Leaky Autocorrelators to Tune In to Speech

Authors: T. Andringa

Department of Biophysics, University of Groningen, Postbus 72, 9700 AB Groningen, The Netherlands, E-mail: tjeerd@bcn.rug.nl

Volume 5 pages 2859 - 2862

ABSTRACT

This paper introduces a method to estimate the spectrum of voiced speech in noise, based on an estimate of the fundamental frequency. The method uses the output of an auditory model that imitates the mechanics of the basilar membrane. The output of the segments of the model is used as input to a set of leaky autocorrelator units (simple neuron models), each sensitive to a certain periodicity (delay). If a noisy vowel is presented to the system, the units sensitive to the fundamental period of that vowel respond most actively. The activity of the responding autocorrelator units as a function of segment number is a direct measure of the spectrum of the vowel. This technique is very robust and can, like humans, detect the presence of a vowel at an SNR of -10 dB in aperiodic speech noise, and estimate formant frequencies at -3 to -6 dB. With this technique it is possible to split a mixture of sound sources into auditory entities (percepts) on the basis of pitch.
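A toy version of the leaky autocorrelator bank, with illustrative parameter values: each unit leakily integrates the product of the signal with a delayed copy of itself, so the unit whose delay matches the fundamental period accumulates the most activity.

```python
import numpy as np

def leaky_autocorrelators(signal, delays, leak=0.99):
    """Run a bank of leaky autocorrelator units, one per candidate
    delay, over a single channel; return the final activity of each
    unit.  The update is act = leak * act + x(t) * x(t - delay)."""
    act = np.zeros(len(delays))
    for t in range(max(delays), len(signal)):
        for k, d in enumerate(delays):
            act[k] = leak * act[k] + signal[t] * signal[t - d]
    return act
```

For a sine tone with a period of 20 samples, the unit tuned to delay 20 integrates a non-negative product and dominates; in the full model this is computed per basilar-membrane segment, so the winning units' activity across segments traces out the vowel spectrum.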

A0987.pdf
