Chair: Shigeru Katagiri, ATR, Japan
Ni Ma, South China University of Technology (China)
Gang Wei, South China University of Technology (China)
A new signal processing method based on a nonlinear local prediction model (NLLP) is presented and applied to speech coding. With the same implementation, speech coding based on the NLLP gives improved performance compared to reference versions of the standard ITU-T G.728 and a linear local scheme. The computational effort of the NLLP analysis does not exceed that of conventional linear prediction (LP), and the NLLP yields better prediction performance than both LP and linear local prediction.
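The core idea of local prediction can be sketched as follows: embed the signal into delay vectors, find the past states nearest to the current one, and predict the next sample from their successors. This is a generic nearest-neighbour sketch, not the paper's specific NLLP model; the embedding dimension `m` and neighbour count `k` are illustrative choices.

```python
import numpy as np

def local_predict(signal, m=4, k=8):
    """Predict the next sample of `signal` from the successors of its
    k nearest delay-embedded neighbours (a generic local-prediction
    sketch; m and k are illustrative, not the paper's settings)."""
    # Build delay-embedded state vectors x_t = [s_{t-m+1}, ..., s_t]
    states = np.array([signal[t - m + 1:t + 1]
                       for t in range(m - 1, len(signal) - 1)])
    targets = np.array([signal[t + 1]
                        for t in range(m - 1, len(signal) - 1)])
    query = signal[-m:]                          # current state
    dists = np.linalg.norm(states - query, axis=1)
    nn = np.argsort(dists)[:k]                   # k nearest past states
    # Average the successors of the nearest neighbours
    return float(np.mean(targets[nn]))

# Example on a pure sinusoid: the local predictor tracks the next sample.
t = np.arange(200)
s = np.sin(0.2 * t)
pred = local_predict(s, m=4, k=8)
```

A linear local scheme would instead fit a linear model to the neighbour pairs; the nonlinear variant in the paper replaces this averaging step with a nonlinear mapping.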
Klaus Reinhard, Cambridge University (U.K.)
Mahesan Niranjan, Cambridge University (U.K.)
In this paper we report on an attempt to capture segmental transition information for speech recognition tasks. The slowly varying dynamics of spectral trajectories carry much discriminant information that is only crudely modelled by traditional approaches such as HMMs. In approaches such as recurrent neural networks there is the hope, but no convincing demonstration, that such transitional information can be captured. We start from the very different position of explicitly capturing the trajectory of short-time spectral parameter vectors on a subspace in which the temporal sequence information is preserved (Time-Constrained Principal Component Analysis). On this subspace, we attempt a parametric modelling of the trajectory and compute a distance metric to perform classification of diphones. Much of the discriminant information is retained in this subspace, as illustrated on the isolated transitions /bee/, /dee/ and /gee/.
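The subspace step can be illustrated with ordinary PCA: project a time-ordered sequence of spectral vectors onto its leading principal components, so the trajectory itself lives in a low-dimensional space. This is only a generic sketch; the time-constrained variant in the paper adds temporal ordering constraints not reproduced here, and the synthetic "diphone" data below is purely illustrative.

```python
import numpy as np

def trajectory_subspace(frames, dim=2):
    """Project a sequence of spectral parameter vectors onto its
    leading principal components, keeping the temporal order of
    the projected points (generic PCA sketch)."""
    X = frames - frames.mean(axis=0)            # centre the frames
    # SVD gives the principal axes; rows of Vt are components
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:dim].T                       # T x dim trajectory

# Two synthetic "diphone" trajectories compared by a simple
# frame-wise Euclidean distance after projection.
t = np.linspace(0, 1, 20)[:, None]
traj_a = trajectory_subspace(np.hstack([np.sin(3 * t), np.cos(3 * t), t]))
traj_b = trajectory_subspace(np.hstack([np.sin(3 * t + 1), np.cos(3 * t), t]))
dist = np.linalg.norm(traj_a - traj_b, axis=1).mean()
```

Classification then reduces to comparing such projected trajectories under a suitable distance metric.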
Elaine Tsiang, Monowave Corporation (U.S.A.)
The proposed neural architecture consists of an analytic lower net and a synthetic upper net. This paper focuses on the upper net. The lower net performs a 2D multiresolution wavelet decomposition of an initial spectral representation to yield a multichannel representation of local frequency modulations at multiple scales. From this representation, the upper net synthesizes increasingly complex features, resulting in a set of acoustic observables at the top layer with multiscale context dependence. The upper net also provides invariance under frequency shifts and dilatations of tone and time intervals by building these transformations into the architecture. Application of this architecture to the recognition of gross and fine phonetic categories from continuous speech of diverse speakers shows that it provides high accuracy and strong generalization from modest amounts of training data.
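The multiresolution step can be sketched with a single level of a 2D Haar wavelet decomposition, a minimal stand-in for the lower net's analysis; the paper's actual wavelet and spectral front end are not specified here, and the toy input below is illustrative.

```python
import numpy as np

def haar2d(x):
    """One level of a 2D Haar wavelet decomposition (generic sketch).
    Input height and width must be even."""
    # Pairwise averages/differences along rows, then columns
    lo_r = (x[:, ::2] + x[:, 1::2]) / 2
    hi_r = (x[:, ::2] - x[:, 1::2]) / 2
    def cols(y):
        return (y[::2] + y[1::2]) / 2, (y[::2] - y[1::2]) / 2
    ll, lh = cols(lo_r)        # approximation, horizontal detail
    hl, hh = cols(hi_r)        # vertical detail, diagonal detail
    return ll, lh, hl, hh

# A toy "spectrogram": repeated decomposition of the LL band would
# yield a multichannel, multiscale representation as described above.
spec = np.arange(64, dtype=float).reshape(8, 8)
ll, lh, hl, hh = haar2d(spec)
```

Each detail band localizes modulations at one orientation and scale; recursing on `ll` adds coarser scales.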
Hossein Sedarat, Stanford University (U.S.A.)
Rasool Khadem, Stanford University (U.S.A.)
Horacio Franco, SRI International (U.S.A.)
Recent studies suggest that a hybrid speech recognition system based on a hidden Markov model (HMM) with a neural network (NN) subsystem as the estimator of the state-conditional observation probability may have some advantages over conventional HMMs with Gaussian mixture models for the observation probabilities. The HMM and NN modules are typically treated as separate entities in a hybrid system. This paper, however, suggests that a priori knowledge of the HMM structure can be beneficial in the design of the NN subsystem. A case of isolated word recognition is studied to demonstrate that a substantially simplified NN can be achieved in a structured HMM by applying a Bayesian factorization and pre-classification. The results indicate performance similar to that obtained with the classical approach, with much less complexity in the NN structure.
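The Bayes-rule factorization commonly used in hybrid HMM/NN systems converts the NN's state posteriors P(q|x) into scaled likelihoods P(x|q) ∝ P(q|x)/P(q) for use in HMM decoding. The sketch below shows this conversion with hypothetical values; the paper's specific factorization and pre-classification scheme are not reproduced.

```python
import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Convert NN state posteriors P(q|x) into scaled likelihoods
    P(x|q) ~ P(q|x) / P(q) via Bayes' rule (the common hybrid
    HMM/NN trick; values below are illustrative)."""
    return posteriors / priors

# Hypothetical 3-state example: posteriors from a softmax output,
# priors estimated from state occupancy counts in training data.
post = np.array([0.7, 0.2, 0.1])
priors = np.array([0.5, 0.3, 0.2])
sl = scaled_likelihoods(post, priors)
```

Dividing by the priors removes the training-set state frequencies so the HMM's own transition structure supplies them instead.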
Søren Kamaric Riis, Technical University of Denmark (Denmark)
In this paper we evaluate the Hidden Neural Network HMM/NN hybrid presented at last year's ICASSP on two speech recognition benchmark tasks: 1) task-independent isolated word recognition on the PHONEBOOK database, and 2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how Hidden Neural Networks (HNNs) with far fewer parameters than conventional HMMs and other hybrids can obtain comparable performance, and for the broad-class task it is illustrated how the HNN can be applied as a purely transition-based system, where acoustic context-dependent transition probabilities are estimated by neural networks.
Yuan-Fu Liao, National Chiao Tung University (Taiwan)
Sin-Horng Chen, National Chiao Tung University (Taiwan)
A new MRNN-based method for continuous Mandarin speech recognition is proposed. The system uses five RNNs to accomplish separate subtasks and then combines them to solve the overall problem. They include two RNNs for discriminating the two sub-syllable groups of 100 RFD initials and 39 CI finals, two RNNs for generating dynamic weighting functions for sub-syllable integration, and one RNN for syllable boundary detection. All RNN modules are combined using a delayed-decision Viterbi search. The method differs from the ANN/HMM hybrid approach in using ANNs to perform not only sub-syllable discrimination but also temporal structure modeling of the speech signal. The system is trained using a three-stage training method embedded with the MCE/GPD algorithm. In addition, a fast recognition method using multi-level pruning is proposed. Experimental results show that the method outperforms the HMM method in both recognition accuracy and computational complexity.
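The combination step rests on Viterbi decoding over the module scores. The sketch below is the standard Viterbi algorithm in log-probabilities; the paper's delayed-decision variant, which defers path decisions across syllable boundaries, is not reproduced, and the toy 2-state example is illustrative.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Standard Viterbi decoding over log-probabilities.
    log_emit: T x S frame scores, log_trans: S x S, log_init: S."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)          # best predecessor of j
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state example: state 0 fits the first two frames, state 1 the rest.
emit = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
init = np.log(np.array([0.5, 0.5]))
path = viterbi(emit, trans, init)   # -> [0, 0, 1, 1]
```

In the MRNN system the frame scores would come from the sub-syllable RNNs, weighted by the dynamic weighting RNNs.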
Liqing Zhou, Beijing University of Posts & Telecommunications (China)
This paper introduces an off-line speech recognition hardware system. A new compound neural network structure is proposed, and fuzzy logic is adopted to implement the system. As a result, the system is able to perform speaker-independent real-time speech recognition in real environments with heavy noise.
Kevin R. Farrell, T-NETIX Inc. (U.S.A.)
Ravi P. Ramachandran, Rowan University (U.S.A.)
Richard J. Mammone, Rutgers University (U.S.A.)
In this paper, we analyze the diversity of information provided by several modeling approaches for speaker verification. This information is used to facilitate the fusion of the individual results into an overall result that provides advantages in accuracy over the individual models. The modeling methods evaluated consist of the neural tree network (NTN), Gaussian mixture model (GMM), hidden Markov model (HMM), and dynamic time warping (DTW). With the exception of DTW, all methods utilize subword-based approaches. The phrase-level scores for each modeling approach are used for combination. Several data fusion methods are evaluated for combining the model results, including the linear and log opinion pool approaches along with voting. The results of the above analysis have been integrated into a system that has been tested with several databases collected in landline and cellular environments. We have found the linear and log opinion pool methods to consistently reduce the error rate from that obtained when the models are used individually.
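The two pooling rules can be sketched directly: a linear opinion pool is a weighted arithmetic mean of the model scores, and a log opinion pool is a weighted geometric mean. The equal weights and four-model scores below are hypothetical, not the paper's values.

```python
import numpy as np

def linear_opinion_pool(scores, weights=None):
    """Combine per-model scores with a weighted arithmetic mean."""
    w = np.ones(len(scores)) / len(scores) if weights is None else np.asarray(weights)
    return float(np.dot(w, scores))

def log_opinion_pool(scores, weights=None):
    """Combine the same scores with a weighted geometric mean
    (equal weights here; in practice the weights can be trained)."""
    w = np.ones(len(scores)) / len(scores) if weights is None else np.asarray(weights)
    return float(np.exp(np.dot(w, np.log(scores))))

# Hypothetical phrase-level scores from four models (NTN, GMM, HMM, DTW)
scores = np.array([0.8, 0.7, 0.9, 0.6])
lin = linear_opinion_pool(scores)   # arithmetic mean, ~0.75
log_ = log_opinion_pool(scores)     # geometric mean, slightly lower
```

The log pool penalizes a single dissenting low score more heavily than the linear pool, which is one reason the two behave differently under fusion.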