ABSTRACT
The decomposition principle was first proposed by Varga and Moore [] and applied to Automatic Speech Recognition (ASR) in noise. We present a new adaptation of this principle to model the schema-based streaming process inferred from psychoacoustical studies []. We address the classical problem of double vowel segregation. The signal decomposition is enabled by an internal statistical model of vowel spectra. This decomposition model is able to reconstruct the spectra of superimposed signals after identification of only the dominant member, or of both members, of the pair. Three stages are involved. The first is a module performing identification when the input is a mixture of interfering signals; prior identification of the dominant spectrum prevents combinatorial reconstruction. The second is an estimation of the mixture coefficient, also based on an internal representation of spectra. Finally, the reconstruction of the spectra is probabilistic, by way of likelihood maximisation, using the labels and the mixture coefficient. The method is tested on a large database of synthetic vowels.
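A minimal sketch of the three-stage idea, under strong simplifying assumptions: the internal model is reduced to one spectral template per vowel with a Gaussian error model (so likelihood maximisation becomes squared-error minimisation), and the templates are synthetic stand-ins, not the paper's statistical model or database.

```python
# Sketch: identify dominant vowel, then search second label and mixture
# coefficient by maximising a Gaussian likelihood (min squared error).
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 32
VOWELS = ["a", "e", "i", "o", "u"]
# Hypothetical internal model: one mean spectrum per vowel.
templates = {v: rng.random(N_BINS) for v in VOWELS}

def identify_dominant(mixture):
    """Stage 1: label the dominant vowel as the closest template."""
    return min(VOWELS, key=lambda v: np.sum((mixture - templates[v]) ** 2))

def decompose(mixture):
    """Stages 2 and 3: estimate mixture coefficient and second label."""
    dom = identify_dominant(mixture)          # prevents combinatorial search
    best = None
    for other in VOWELS:
        if other == dom:
            continue
        for a in np.linspace(0.5, 1.0, 51):   # dominant weight >= 0.5
            model = a * templates[dom] + (1 - a) * templates[other]
            err = np.sum((mixture - model) ** 2)
            if best is None or err < best[0]:
                best = (err, dom, other, a)
    return best[1:]  # (dominant label, other label, mixture coefficient)

# Example: a synthetic /a/ + /i/ mixture.
mix = 0.7 * templates["a"] + 0.3 * templates["i"]
print(decompose(mix))  # -> approximately ('a', 'i', 0.7)
```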
ABSTRACT
Techniques for inverting auditory models would have very useful applications in speech perception research and in the evaluation of auditory models. This paper examines how an Inner Hair Cell (IHC) model can be exploited as a compression and envelope-detection section in inverse cochlear-model processing. Our proposed inversion method combines the inverse of Meddis's auditory neural transduction model with Lyon's cochlear model to estimate, with acceptable quality, the input signal to the inner ear from its auditory nerve firings. Since the method takes neural firings or cleft contents as input and regenerates the original acoustic stimulus, it can be used with any system that generates auditory neural firings. For example, it allows us to estimate the stimulus signal of Nucleus Cochlear Implant systems and so investigate the quality of the transferred speech without involving real patients.
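To make the inversion idea concrete, here is a toy sketch only: each channel is a band split followed by a static compressive nonlinearity (a crude stand-in for the Meddis transduction stage, which is far richer), and inversion expands the nonlinearity and sums the channels back. The FFT filterbank and power-law compression are illustrative assumptions, not Lyon's or Meddis's actual models.

```python
# Toy forward/inverse pair: bandsplit + compress, then expand + sum.
import numpy as np

FS = 16000

def compress(x, p=0.3):
    return np.sign(x) * np.abs(x) ** p       # compressive transduction (toy)

def expand(y, p=0.3):
    return np.sign(y) * np.abs(y) ** (1 / p) # its exact inverse

def bandsplit(x, n_bands=8):
    """Partition the spectrum into bands (stand-in for a cochlear filterbank)."""
    X = np.fft.rfft(x)
    edges = np.linspace(0, len(X), n_bands + 1, dtype=int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        B = np.zeros_like(X)
        B[lo:hi] = X[lo:hi]
        bands.append(np.fft.irfft(B, n=len(x)))
    return bands

def forward(x):
    return [compress(b) for b in bandsplit(x)]

def inverse(channels):
    # Expanding each channel and summing reconstructs the input here,
    # because the toy filterbank exactly partitions the spectrum.
    return sum(expand(c) for c in channels)

t = np.arange(FS // 10) / FS
x = np.sin(2 * np.pi * 440 * t)
print(np.max(np.abs(x - inverse(forward(x)))))  # small reconstruction error
```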
ABSTRACT
In this paper, we present a novel hybrid keyword spotting system that combines supervised and semi-supervised competitive learning algorithms. The first stage is an S-SOM (Semi-supervised Self-Organizing Map) module specifically designed to discriminate between keywords (KWs) and non-keywords (NKWs). The second stage is an FDVQ (Fuzzy Dynamic Vector Quantization) module that discriminates among the KWs detected by the first stage. Experiments on the Switchboard database show an improvement of about 6% in accuracy compared with our best previous keyword spotter.
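A minimal two-stage sketch of this architecture, with heavy simplifications: a labelled prototype map stands in for the trained S-SOM, and a per-keyword codebook with inverse-distance memberships stands in for FDVQ. All prototypes, dimensions, and keywords are synthetic illustrations, not the paper's trained models.

```python
# Stage 1: KW/NKW decision by nearest labelled prototype.
# Stage 2: fuzzy codebook scoring among the keywords.
import numpy as np

rng = np.random.default_rng(1)
KEYWORDS = ["yes", "no"]

map_protos = rng.normal(size=(20, 8))             # stage-1 map (stand-in)
map_labels = np.array(["KW"] * 10 + ["NKW"] * 10)
codebooks = {kw: rng.normal(size=(4, 8)) for kw in KEYWORDS}  # stage 2

def stage1_is_keyword(x):
    best = np.argmin(np.sum((map_protos - x) ** 2, axis=1))
    return map_labels[best] == "KW"

def stage2_which_keyword(x):
    # Soft (fuzzy) scores: inverse distance to each keyword's codebook.
    scores = {kw: 1.0 / (1e-9 + np.min(np.sum((cb - x) ** 2, axis=1)))
              for kw, cb in codebooks.items()}
    return max(scores, key=scores.get)

def spot(x):
    return stage2_which_keyword(x) if stage1_is_keyword(x) else None

print(spot(rng.normal(size=8)))
```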
ABSTRACT
Research on the temporal organisation of speech perception is focussed mostly on the linguistic categories of the input. What is the role of non-grammatical categories in this process? What kinds of mechanisms integrate both kinds of features within the online process of perception? Individual voice qualities and the position of a sentence within the text were chosen to test the time interval within which decisions about speaker belongingness are made. The results favour a model with a relatively fixed time span within which a familiar voice, or a deviation from an inherent context expectancy, is detected.
ABSTRACT
This paper presents new methods for training large neural networks for phoneme probability estimation. A combination of the time-delay architecture and the recurrent network architecture is used to capture the important dynamic information of the speech signal. Motivated by the fact that the number of connections in fully connected recurrent networks grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal or smaller number of connections. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database. The achieved phoneme error rate, 28.3%, for the standard 39-phoneme set on the core test set of the TIMIT database is not far from the lowest reported. All training and simulation software used is made freely available by the author, making reproduction of the results feasible.
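A small sketch of the two connection schemes mentioned, under assumed sizes and sparsity levels (not the paper's configuration): a fixed binary mask implements sparse connectivity, so the recurrent parameter count no longer grows quadratically with the hidden size, and magnitude pruning removes the smallest surviving weights.

```python
# Sparse recurrent connectivity via a fixed mask, plus magnitude pruning.
import numpy as np

rng = np.random.default_rng(2)
H, DENSITY = 64, 0.15

W = rng.normal(scale=0.1, size=(H, H))
mask = rng.random((H, H)) < DENSITY          # sparse connection scheme
W *= mask

def step(h, x_proj):
    """One recurrent step; only the masked connections contribute."""
    return np.tanh(x_proj + W @ h)

def prune(W, keep_fraction=0.5):
    """Connection pruning: drop the smallest-magnitude surviving weights."""
    vals = np.abs(W[W != 0])
    thresh = np.quantile(vals, 1 - keep_fraction)
    return np.where(np.abs(W) >= thresh, W, 0.0)

print("connections before:", np.count_nonzero(W))
W = prune(W)
print("connections after :", np.count_nonzero(W))
```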
ABSTRACT
This paper proposes a novel modular initialization scheme for Multilayer Perceptrons (MLPs) trained for phoneme classification. Small MLPs are trained to discriminate between one phoneme and all the others. In the next step they are merged into broad phonetic classes using our novel initialization scheme and trained further. In the last step we merge the broad phonetic MLPs using the same scheme to generate the final phonetic MLP. Experiments on a Dutch-language isolated-word database showed that the scheme gives faster and better estimates of Bayesian a posteriori probabilities than random initialization. Moreover, given its modularity, the method offers the possibility of dealing with high-dimensional problems.
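One plausible way to picture the merging step, as a sketch only (the paper's exact scheme is not reproduced): two one-vs-rest MLPs sharing the same input are combined by stacking their hidden layers and building a block-diagonal output layer, so the merged network starts from the sub-networks' solutions rather than random weights. Layer sizes are illustrative assumptions.

```python
# Merge two single-output MLPs (W1: input->hidden, W2: hidden->output)
# into one two-output MLP initialized from the trained sub-nets.
import numpy as np

def merge(net_a, net_b):
    W1a, W2a = net_a
    W1b, W2b = net_b
    W1 = np.vstack([W1a, W1b])                 # stack hidden units
    # Block-diagonal output layer: each output keeps its own hidden units.
    W2 = np.zeros((W2a.shape[0] + W2b.shape[0], W1.shape[0]))
    W2[:W2a.shape[0], :W1a.shape[0]] = W2a
    W2[W2a.shape[0]:, W1a.shape[0]:] = W2b
    return W1, W2

rng = np.random.default_rng(3)
small = lambda: (rng.normal(size=(5, 12)), rng.normal(size=(1, 5)))
W1, W2 = merge(small(), small())   # initialization of a broad-class MLP
print(W1.shape, W2.shape)          # (10, 12) (2, 10)
```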
ABSTRACT
This paper presents an experimental study of cerebral hemispheric engagement in the auditory recognition of words depending on a set of linguistic factors. The words were either native or foreign to the subjects. The listeners were normal right-handed adults with symmetrical hearing, native speakers of Russian; English had been acquired as a second language at school. The stimuli were linguistically balanced lists of natural Russian and English words presented monaurally, with white noise as contralateral masking. The data show a strong overall left-hemisphere advantage. The most significant factor for both hemispheres appeared to be 'frequency of usage' (as opposed to 'word length', which characterizes the perception of native words). The second most important factor was 'consonant ratio' for the right hemisphere and 'word length' for the left. 'Part of speech' was shown to be of minimal importance for both hemispheres, and 'stress position' only slightly more significant.
ABSTRACT
In known approaches to speech recognition based on Dynamic Programming (DP) or Hidden Markov Modelling (HMM), time sequences of elements (feature vectors, sounds, letters, etc.) are used directly as the objects of evaluation or matching. Both approaches share the same drawback: they can be realised only as a recurrent sequential process and cannot be parallelised. In addition, their complexity is relatively high. In the Structural Weighted Sets (SWS) method proposed below, such a sequence is first mapped into a structure, namely a set of relations between its elements, and recognition is then reduced to matching the corresponding sets. Word matching can thus be realised as finding the intersection of two sets and evaluating its relative weight, which opens the possibility of parallel processing. Simulation results are presented.
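A minimal sketch of the set-matching idea, with assumed specifics: the relations are taken to be pairwise precedence relations between elements, and the "relative weight" is approximated by uniform weighting (a Jaccard-style ratio); the paper's actual relation structure and weighting may differ.

```python
# Map a sequence to its set of precedence relations, then match two
# words by the relative weight of the intersection of their sets.
from itertools import combinations

def relation_set(seq):
    """All ordered 'a precedes b' relations in the sequence."""
    return {(a, b) for a, b in combinations(seq, 2)}

def similarity(w1, w2):
    s1, s2 = relation_set(w1), relation_set(w2)
    common = s1 & s2                      # set intersection: parallelisable
    return len(common) / max(len(s1 | s2), 1)

print(similarity("speech", "speed"))      # high overlap
print(similarity("speech", "church"))     # lower overlap
```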
ABSTRACT
A neural-network model is described that produces a rate-place representation from auditory nerve output with considerably higher frequency resolution than that of a standard auditory peripheral model. The neural circuits used are called Lateral Inhibitory Networks; they have long been known to be responsible for early spatio-temporal processing in the visual system. Here we investigate the use of such networks for early auditory processing. We describe the analytical basis, discuss problems with various variants of the model, and present some initial results of the research.
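The sharpening effect of lateral inhibition is easy to illustrate; the following is a sketch with an assumed nearest-neighbour kernel and strength, not the paper's network, showing how inhibition from neighbouring channels suppresses flanks while preserving peaks in a rate-place profile.

```python
# Feed-forward lateral inhibition over a rate-place profile.
import numpy as np

def lateral_inhibition(rates, strength=0.5):
    # Each channel is inhibited by the mean of its two neighbours;
    # the output is half-wave rectified. np.roll wraps at the edges,
    # a simplification acceptable for this sketch.
    left = np.roll(rates, 1)
    right = np.roll(rates, -1)
    out = rates - strength * (left + right) / 2
    return np.maximum(out, 0.0)

x = np.array([1.0, 1.2, 4.0, 1.3, 1.0, 0.9, 3.5, 1.1])
print(lateral_inhibition(x))   # peaks preserved, flanks suppressed
```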
ABSTRACT
Subjects were presented with signal pairs forming different musical intervals. The signals were sine tones, complex tones with a fundamental, and complex tones without a fundamental. Subjects had to decide which signal pairs formed a specific musical interval. Reaction times indicate that the perception of the 'missing fundamental' is a form of musical processing and not necessarily a part of normal auditory processing in pitch perception.
ABSTRACT
In this paper a phoneme recognition system based on predictive neural networks is proposed. Neural networks are used to predict the observation vectors of speech frames. The resulting prediction error is used for phoneme recognition 1) as a distortion measure at the frame level and 2) as a feature, which is statistically modeled by the Rayleigh distribution. Continuous-speech phoneme recognition experiments are performed, and different settings of the system are evaluated.
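The second use of the prediction error can be sketched as follows, with synthetic error values standing in for real predictor outputs: each phoneme's prediction errors are fitted with a Rayleigh distribution (maximum-likelihood scale estimate), and a frame's error is then scored by log-likelihood under each model.

```python
# Fit per-phoneme Rayleigh models to prediction errors and score a frame.
import numpy as np

def rayleigh_fit(errors):
    """ML estimate of the Rayleigh scale: sigma = sqrt(mean(e^2) / 2)."""
    return np.sqrt(np.mean(np.asarray(errors) ** 2) / 2)

def rayleigh_loglik(e, sigma):
    # log of (e / sigma^2) * exp(-e^2 / (2 sigma^2))
    return np.log(e / sigma**2) - e**2 / (2 * sigma**2)

rng = np.random.default_rng(4)
# Synthetic training errors for two hypothetical phoneme predictors.
models = {"a": rayleigh_fit(rng.rayleigh(0.2, 500)),
          "i": rayleigh_fit(rng.rayleigh(0.6, 500))}

e = 0.25  # prediction error of one frame
scores = {ph: rayleigh_loglik(e, s) for ph, s in models.items()}
print(max(scores, key=scores.get))  # most likely phoneme for this error
```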
ABSTRACT
In this paper, an empirical comparison of two multilayer perceptron (MLP)-based techniques for keyword speech recognition (wordspotting) is described. The techniques are predictive neural model (PNM)-based wordspotting, in which the MLP is applied as a speech pattern predictor to compute a local distance between the acoustic vector and the phone model, and hybrid HMM/MLP-based wordspotting, where the MLP is used as a state (phone) probability estimator given the acoustic vectors. The comparison was performed on the same database. According to our experiments, the hybrid HMM/MLP-based technique outperforms the PNM-based technique (by ~6.2%).
ABSTRACT
This paper describes a segment (e.g. phoneme) boundary estimation method based on recurrent neural networks (RNNs). The proposed method requires only acoustic observations to accurately estimate segment boundaries. Experimental results show that the proposed method estimates segment boundaries significantly better than an HMM-based method. Furthermore, we incorporate the RNN-based segment boundary estimator into HMM-based and segment-based recognition systems. As a result, the segment boundary estimates provide useful information for reducing computational complexity and improving recognition performance.
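A toy sketch of how such an estimator can be used: an RNN maps acoustic frames to a per-frame boundary probability, and boundaries are placed at local maxima above a threshold. The weights below are random, untrained stand-ins; the architecture, sizes, and threshold are assumptions.

```python
# Per-frame boundary probabilities from a simple recurrent net, then
# peak-picking to obtain segment boundary estimates.
import numpy as np

rng = np.random.default_rng(5)
D, H = 12, 16
Wx, Wh, Wo = (rng.normal(scale=0.3, size=s) for s in [(H, D), (H, H), (1, H)])

def boundary_probs(frames):
    h = np.zeros(H)
    probs = []
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)
        probs.append(1 / (1 + np.exp(-(Wo @ h)[0])))  # sigmoid output
    return np.array(probs)

def pick_boundaries(p, thresh=0.5):
    """Local maxima above threshold become boundary estimates."""
    return [t for t in range(1, len(p) - 1)
            if p[t] > thresh and p[t] >= p[t - 1] and p[t] >= p[t + 1]]

frames = rng.normal(size=(40, D))
print(pick_boundaries(boundary_probs(frames)))
```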
ABSTRACT
This paper describes a method to incorporate the HMM output constraints into frame-based hybrid NN/HMM systems during training. Whereas the NN parameters are usually adjusted to maximize the cross-entropy between the frame target probabilities and the network predictions, assuming statistically independent outputs over time, the approach described here maximizes the full likelihood of the utterance(s), also using the HMM output constraints, as in conventional HMM systems. This is achieved by maximizing the state occupation probabilities obtained from a forward/backward pass over the scaled likelihoods coming from the network. With a simplifying approximation of the derivative for back-propagation through the forward/backward pass, tests show that the proposed method gives consistently higher string (phoneme) recognition rates than the conventional approach, which maximizes cross-entropy at the frame level.
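The forward/backward computation of state occupation probabilities can be sketched directly; here a 3-state left-to-right model and random stand-in scaled likelihoods illustrate how the resulting gammas serve as per-frame training targets in place of fixed labels. The transition values and inputs are illustrative assumptions.

```python
# Forward/backward over scaled likelihoods -> state occupation probs.
import numpy as np

def forward_backward(scaled_lik, trans, init):
    T, S = scaled_lik.shape
    alpha = np.zeros((T, S)); beta = np.zeros((T, S))
    alpha[0] = init * scaled_lik[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * scaled_lik[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (scaled_lik[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # occupation probs

trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])   # left-to-right phone model
init = np.array([1.0, 0.0, 0.0])
rng = np.random.default_rng(6)
lik = rng.random((8, 3)) + 0.1        # stand-in scaled likelihoods p(x|s)/p(s)
targets = forward_backward(lik, trans, init)
print(targets.round(2))               # per-frame NN training targets
```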
ABSTRACT
Past research shows that one of the mechanisms of the human auditory system ensuring highly noise-resistant recognition of voiced speech sounds is an electromechanical envelope feedback acting in the structures of the inner ear. Digital modeling of the peripheral section of the hearing system with a similar multichannel envelope feedback has proven useful for pitch determination of vowels in noisy environments. The proposed model provides robust pitch detection for signal-to-noise ratios down to -12 to -14 dB. In a number of cases this noise robustness is better than that of other existing methods and systems.
ABSTRACT
The claim that the syllable constitutes a basic perceptual unit in French is commonly accepted. It is based in part on the syllable effect [1] obtained with words. The present study extends these syllable detection experiments to pseudowords. Four experiments failed to replicate the syllable effect observed on words. Detection responses in pseudowords are made as soon as sufficient information becomes available in the signal. The different pattern of results obtained with words and pseudowords suggests that the syllable effect is post-lexical rather than pre-lexical.
ABSTRACT
Disfluencies (repetitions and reformulations mid-sentence in normal spontaneous speech) are problematic for both psychological and computational models of speech understanding. Much effort is being applied to finding ways of adapting computational systems to detect and delete disfluencies. The input to such systems is usually an accurate transcription. We present the results of an experiment in which human listeners were asked to give verbatim transcriptions of disfluent and fluent utterances. The results suggest that listeners are seldom able to identify all the words "deleted" in disfluencies. While all types suffer, identification rates for repetitions are even worse than for other types. We attribute the results to difficulties in recalling, or coding for recall, items which cannot be identified with certainty. This inability seems to make human speech recognition more robust than current computational models.
ABSTRACT
This paper introduces a method to estimate the spectrum of voiced speech in noise, based on an estimate of the fundamental frequency. The method uses the output of an auditory model that imitates the mechanics of the basilar membrane. The output of the segments of the model is fed into a set of leaky autocorrelator units (simple neuron models), each sensitive to a certain periodicity (delay). If a noisy vowel is presented to the system, the units sensitive to the fundamental period of that vowel respond most actively. The activity of the responding autocorrelator units as a function of segment number is a direct measure of the spectrum of the vowel. The technique is very robust and can, like human listeners, detect the presence of a vowel at an SNR of -10 dB in aperiodic speech noise, and estimate formant frequencies at -3 to -6 dB. With this technique it is possible to split a mixture of sound sources into auditory entities (percepts) on the basis of pitch.
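A toy sketch of a leaky autocorrelator unit: it integrates x(t) * x(t - d) with a leak, and the delay d whose unit responds most strongly indicates the fundamental period. A raw noisy sine stands in for the basilar-membrane channel outputs; the leak constant and delay range are assumptions.

```python
# Leaky autocorrelator units scanning candidate pitch periods.
import numpy as np

FS = 8000

def leaky_autocorr(x, delay, leak=0.995):
    """Leaky integration of the delayed product x(t) * x(t - delay)."""
    act = 0.0
    for t in range(delay, len(x)):
        act = leak * act + (1 - leak) * x[t] * x[t - delay]
    return act

t = np.arange(FS // 5) / FS
x = (np.sin(2 * np.pi * 125 * t)
     + 0.8 * np.random.default_rng(7).normal(size=t.size))  # noisy "vowel"

delays = np.arange(20, 100)                      # ~80 Hz .. 400 Hz
acts = [leaky_autocorr(x, d) for d in delays]
best = delays[int(np.argmax(acts))]
print(FS / best, "Hz")                           # ~125 Hz despite the noise
```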