Tim Fingscheidt, RWTH Aachen (Germany)
Peter Vary, RWTH Aachen (Germany)
In digital mobile communication systems, error concealment techniques are needed to reduce the subjective effects of residual bit errors that have not been eliminated by channel decoding. Since most standards do not specify these algorithms bit-exactly, there is room for new solutions that improve the speech quality. This contribution develops a new approach to the optimum estimation of speech codec parameters. It can be applied to any speech codec standard provided that bit reliability information is available from the demodulator (e.g., DECT) or from the channel decoder (e.g., the soft-output Viterbi algorithm, SOVA, in GSM). The proposed method includes an inherent muting mechanism, leading to a graceful degradation of speech quality under adverse transmission conditions. In particular, the additional exploitation of residual source redundancy, i.e. a priori knowledge about the codec parameters, yields a significant enhancement of the output speech quality. In the case of an error-free channel, bit exactness as required by the standards is preserved.
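As an illustration of the estimation principle, the following sketch (names and quantizer layout are hypothetical; a scalar-quantized parameter with known a priori level probabilities is assumed) computes an MMSE parameter estimate from the soft bits delivered by the demodulator or channel decoder:

```python
import numpy as np

def mmse_parameter_estimate(bit_llrs, levels, codewords, prior):
    """MMSE estimate of one codec parameter from soft bits.

    bit_llrs : log-likelihood ratios L_i = log P(b_i=0)/P(b_i=1),
               e.g. from a SOVA channel decoder.
    levels   : quantizer reconstruction levels, shape (N,).
    codewords: bit patterns of the levels, shape (N, M), entries 0/1.
    prior    : a priori level probabilities (residual source
               redundancy), shape (N,).
    """
    p0 = 1.0 / (1.0 + np.exp(-bit_llrs))           # P(b_i = 0) per bit
    p_bit = np.where(codewords == 0, p0, 1.0 - p0)
    likelihood = np.prod(p_bit, axis=1)            # per-codeword likelihood
    post = likelihood * prior                      # combine with a priori
    post /= post.sum()
    # For a totally unreliable channel (all LLRs near 0) the estimate
    # collapses toward the prior mean -- the inherent muting mechanism.
    return np.dot(post, levels)
```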
Bastiaan Kleijn, Delft University of Technology (The Netherlands)
In the quantization of a signal in speech coding, dependencies between its samples are often neglected. Generally, these dependencies are then also neglected at the decoder. However, a priori information about these dependencies is usually available, making it possible to improve decoder performance by means of enhanced decoding. An attractive feature of enhanced decoding is that it can be applied to existing coding standards. This paper describes several enhanced decoding methods, including a vector decoding method and a method that aims at reducing the differential entropy rate of the decoded signal. Experimental results confirm that both decoding procedures can outperform conventional decoding for common signal/encoder combinations.
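As a minimal sketch of enhanced decoding, assume a scalar-quantized AR(1) source: the decoder below replaces each reconstruction level by the conditional mean of its quantization cell given a Gaussian prediction from the previous decoded sample. All names are hypothetical, and this is a generic illustration rather than either of the paper's two methods:

```python
import numpy as np
from scipy.stats import norm

def enhanced_decode(indices, edges, rho, sigma):
    """Enhanced decoding of a uniformly scalar-quantized AR(1) signal.

    indices : received quantizer indices.
    edges   : cell boundaries, shape (N+1,).
    rho     : AR(1) coefficient; sigma : innovation std (a priori model).
    """
    out = np.empty(len(indices))
    prev = 0.0
    for t, k in enumerate(indices):
        a, b = edges[k], edges[k + 1]              # quantization cell
        mu = rho * prev                            # prediction from the past
        alpha, beta = (a - mu) / sigma, (b - mu) / sigma
        Z = max(norm.cdf(beta) - norm.cdf(alpha), 1e-12)
        # Mean of the Gaussian prior truncated to the cell:
        out[t] = mu + sigma * (norm.pdf(alpha) - norm.pdf(beta)) / Z
        prev = out[t]
    return out
```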
Sassan Ahmadi, ASU (U.S.A.)
Andreas S. Spanias, ASU (U.S.A.)
A new phase modeling algorithm for sinusoidal analysis and synthesis of speech signals is presented. Short-time sinusoidal phases are efficiently approximated by incorporating linear prediction, spectral sampling, delay compensation, and phase correction techniques. The algorithm differs from the phase compensation methods proposed for multi-pulse LPC in that it has been tailored to sinusoidal transform coding of speech signals. Performance analysis on a large speech database indicates considerable improvement in temporal and spectral matching between the original and reconstructed signals as compared to other sinusoidal phase models, as well as improved subjective quality of the reproduced speech.
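A rough sketch of the general idea, assuming the phase is approximated by the LP envelope sampled at the pitch harmonics (spectral sampling) plus a linear term for the pulse onset (delay compensation); the paper's phase correction step is not reproduced, and all names are hypothetical:

```python
import numpy as np

def harmonic_phases(a, f0, fs, n_harm, onset_delay):
    """Approximate short-time sinusoidal phases from LP coefficients.

    a : LP coefficients [1, a1, ..., ap] of the analysis filter A(z).
    """
    a = np.asarray(a, dtype=float)
    k = np.arange(1, n_harm + 1)
    w = 2.0 * np.pi * f0 * k / fs                  # harmonic frequencies
    n = np.arange(len(a))
    A = np.exp(-1j * np.outer(w, n)) @ a           # A(e^{jw}) at harmonics
    phase_env = -np.angle(A)                       # phase of 1/A: minimum phase
    phase_lin = -w * onset_delay                   # delay compensation
    return phase_env + phase_lin
```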
Kazuo Nakata, Chiba Institute of Technology (Japan)
Kin-ich Higure, Chiba Institute of Technology (Japan)
A new speech coding algorithm, 'recursive and adaptive prediction', is proposed and tested. Adaptive linear prediction of the input is carried out sample by sample, and only the prediction residuals are quantized and transmitted as binary codes. The predictor coefficients are adaptively controlled by the quantized prediction error. A segmental SNR of almost 22 dB is obtained at 16 kb/s with a cascade of two prediction stages. The algorithm can also handle mixed voices and can easily be implemented on a single DSP.
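A single-stage sketch of the scheme (the paper cascades two prediction stages; the quantizer step, adaptation constant, and names below are illustrative assumptions). The predictor is adapted from the quantized residual only, so the decoder can track the same coefficients without side information:

```python
import numpy as np

def adaptive_predictive_encode(x, p=2, mu=0.01, step=0.05):
    """Sample-by-sample adaptive prediction with quantized residual."""
    w = np.zeros(p)                        # predictor coefficients
    buf = np.zeros(p)                      # past reconstructed samples
    codes, recon = [], []
    for s in x:
        pred = w @ buf
        e = s - pred                       # prediction residual
        q = step * np.round(e / step)      # quantized residual (transmitted)
        y = pred + q                       # reconstructed sample
        w += mu * q * buf                  # LMS update from quantized error
        buf = np.roll(buf, 1)
        buf[0] = y
        codes.append(int(round(q / step)))
        recon.append(y)
    return np.array(codes), np.array(recon)
```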
Takahiro Unno, Asahi Chemical (Japan)
Thomas P. Barnwell, Georgia Institute of Technology (U.S.A.)
Mark A. Clements, Georgia Institute of Technology (U.S.A.)
This paper presents a new high-quality, variable-rate vocoder in which the average bit rate is parametrically controllable. The new vocoder is intended for data-voice simultaneous channel (DVSC) applications, in which speech is transmitted simultaneously with video and other types of data. The vocoder presented in this paper achieves state-of-the-art quality at several bit rates between 5.5 kbps and 10 kbps. Further, it achieves this performance at acceptable levels of complexity and delay.
Manohar N. Murthi, University of California, San Diego (U.S.A.)
Bhaskar D. Rao, University of California, San Diego (U.S.A.)
In this paper we propose the MVDR method, based upon Minimum Variance Distortionless Response (MVDR) spectrum estimation, for modeling voiced speech. Developed to overcome some of the shortcomings of Linear Prediction models, the MVDR method provides better models for medium- and high-pitch voiced speech. The MVDR model is an all-pole model whose spectrum is obtained from a modest non-iterative computation involving the Linear Prediction coefficients, thereby retaining some of the computational attractiveness of LPC methods. With the proper choice of filter order, which depends on the number of harmonics, the MVDR spectrum models the formants and spectral powers of voiced speech exactly. An efficient reduced-order MVDR method is developed to further enhance its applicability. An extension of the reduced-order MVDR method for recovering the correct amplitudes of the harmonics of voiced speech is also presented.
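The non-iterative computation can be sketched as follows, using the well-known closed form that obtains the MVDR spectrum directly from the LP coefficients and the prediction error power (the paper's reduced-order method and harmonic amplitude recovery are not shown):

```python
import numpy as np

def mvdr_spectrum(a, pe, n_freq=512):
    """MVDR spectrum S(w) = 1 / sum_{k=-p}^{p} mu_k e^{-jwk} from the
    LP coefficients a = [1, a1, ..., ap] and error power pe."""
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    mu = np.zeros(p + 1)
    for k in range(p + 1):
        i = np.arange(p - k + 1)
        mu[k] = np.sum((p + 1 - k - 2 * i) * a[i] * a[i + k]) / pe
    coeffs = np.concatenate([mu[::-1], mu[1:]])     # mu_{-p} ... mu_p
    w = np.linspace(0.0, np.pi, n_freq)
    k = np.arange(-p, p + 1)
    denom = np.real(np.exp(-1j * np.outer(w, k)) @ coeffs)
    return 1.0 / denom
```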
Xiaoqin Sun, University of Liverpool (U.K.)
Fabrice Plante, University of Liverpool (U.K.)
Barry M.G. Cheetham, University of Liverpool (U.K.)
Kenneth W.T. Wong, B.T. Laboratories (U.K.)
Sinusoidal transform coding (STC) techniques model speech as a sum of sine waves whose frequencies, amplitudes, and phases are specified at regular intervals. To achieve a low bit-rate representation, only the spectral envelope is encoded and the phases are regenerated according to a minimum-phase assumption. In this paper, the inaccuracy of the minimum-phase model is demonstrated. It is shown that the phase spectra of decoded speech segments may be corrected using either the parameters of a Rosenberg pulse model or a second-order all-pass filter. Experiments have shown that applying this correction increases the phase accuracy and improves the speech quality.
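For the all-pass variant, the correction amounts to adding the phase response of a second-order all-pass section (here parameterized by an assumed pole radius r and angle theta) to the minimum-phase spectrum:

```python
import numpy as np

def allpass_phase(r, theta, w):
    """Phase response of the 2nd-order all-pass section
    H(z) = (r^2 - 2 r cos(theta) z^-1 + z^-2) /
           (1   - 2 r cos(theta) z^-1 + r^2 z^-2),
    which has unit magnitude and only modifies the phase."""
    z1 = np.exp(-1j * w)
    z2 = np.exp(-2j * w)
    num = r**2 - 2.0 * r * np.cos(theta) * z1 + z2
    den = 1.0 - 2.0 * r * np.cos(theta) * z1 + r**2 * z2
    return np.angle(num / den)

# Corrected phase at harmonic frequency w_k:
#   phase_k = minimum_phase(w_k) + allpass_phase(r, theta, w_k)
```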
John E. Kleider, Motorola SSTG, Speech and Signal Processing Lab (U.S.A.)
William M. Campbell, Motorola SSTG, Speech and Signal Processing Lab (U.S.A.)
Current digital voice communication systems allow only modest levels of protection of the coded speech and often do not follow the dynamic changes that occur in the transmission channel. We present a method that provides optimal voice quality and intelligibility for any given transmission channel condition. The approach is based on adaptive rate voice (ARV) coding using an adaptive-rate modem, channel coding, and a multimode sinusoidal transform coder. In general, the receiver utilizes channel state information not only to optimally demodulate and decode the currently corrupted symbols from the channel, but also to inform the transmitter, via a feedback channel, of the optimal strategy for voice/channel coding and modulation format. We compare several source-channel coding schemes at multiple transmission symbol rates against fixed aggregate-rate, channel-controlled variable-rate voice coding systems.
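The feedback-driven adaptation can be pictured as a mode table indexed by the estimated channel state; the thresholds, rates, and modulation formats below are purely illustrative assumptions, not the configurations studied in the paper:

```python
# (min Es/N0 in dB, voice rate in bps, channel code rate, modulation)
ARV_MODES = [
    (12.0, 9600, 3 / 4, "QPSK"),
    (6.0,  4800, 1 / 2, "QPSK"),
    (0.0,  2400, 1 / 3, "BPSK"),
]

def select_mode(esn0_db):
    """Return the highest-quality mode the current channel supports;
    the receiver would feed this choice back to the transmitter."""
    for threshold, rate, code_rate, mod in ARV_MODES:
        if esn0_db >= threshold:
            return rate, code_rate, mod
    return ARV_MODES[-1][1:]               # worst case: most robust mode
```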
Mohammad Reza Zad-Issa, McGill University (Canada)
Peter Kabal, McGill University (Canada)
Linear prediction (LP) coefficients are used to describe the formant structure of a speech waveform. Many factors contribute to the frame-to-frame fluctuation of these parameters. These variations adversely affect the performance of the LP quantizer and the quality of the synthesized speech. For voiced speech, efficient coding of the pitch pulses at the output of the inverse formant filter relies on the similarity of successive pitch waveforms; the performance of this coding stage is also jeopardized by LP variations. In this paper, we propose a new method that smooths the evolution of the LP parameters. Our algorithm is based on matching the output of the formant predictor to a target signal constructed from smoothed pitch pulses. With this approach we have successfully reduced the frame-to-frame variation of the LP coefficients while increasing the similarity of the pitch pulses.
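The target-matching algorithm itself is specific to the paper; as a generic illustration of reducing frame-to-frame LP variation, one can instead smooth ordered LSF vectors across frames (a deliberately simpler, swapped-in technique):

```python
import numpy as np

def smooth_lsf_tracks(lsf, alpha=0.5):
    """Exponential smoothing of LSF vectors, shape (frames, order).

    LSFs are used because a convex combination of ordered LSF vectors
    stays ordered, so the smoothed filters remain stable.
    """
    out = np.copy(lsf)
    for t in range(1, len(out)):
        out[t] = alpha * out[t - 1] + (1.0 - alpha) * lsf[t]
    return out
```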
Shahrokh Ghaemmaghami, Queensland University of Technology (Australia)
Mohamed Deriche, Queensland University of Technology (Australia)
Boualem Boashash, Queensland University of Technology (Australia)
Temporal decomposition (TD) is an effective technique for compressing the spectral information of speech through orthogonalization of the matrix of spectral parameters, leading to an efficient rate reduction in speech coding applications. The performance of TD is basically a function of the properties of the parameter set used. Although the "decomposition suitability" of a parameter set is typically defined on the basis of a "phonetic relevance" criterion, this criterion cannot be used directly in speech coding; quality evaluation of the reconstructed speech is more appropriate. In this paper, we extend our earlier work in this area and assess several popular spectral parameter sets from the viewpoint of decomposition suitability in very low-rate speech coding, using parametric, perceptually based spectral, and energy distance measures.
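As a rough sketch of the decomposition step, a truncated SVD yields the best unconstrained rank-k factorization of the spectral-parameter matrix; true TD additionally constrains the temporal functions to be compact and overlapping, which this sketch omits:

```python
import numpy as np

def td_approximation(Y, k):
    """Rank-k factorization of Y (frames x parameters) into temporal
    functions and spectral event targets via truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    phi = U[:, :k] * s[:k]          # temporal event functions (unconstrained)
    targets = Vt[:k]                # spectral event targets
    return phi @ targets, phi, targets
```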
Sara Grassi, University of Neuchâtel (Switzerland)
Alain Dufaux, University of Neuchâtel (Switzerland)
Michael Ansorge, University of Neuchâtel (Switzerland)
Fausto Pellandini, University of Neuchâtel (Switzerland)
Line Spectrum Pair (LSP) representation of Linear Predictive Coding (LPC) parameters is widely used in speech coding applications. An efficient method for LPC-to-LSP conversion is Kabal's method, in which the LSPs are the roots of two polynomials $P'_{p}(x)$ and $Q'_{p}(x)$, found by a zero-crossing search followed by successive bisections and a final interpolation. The precision of the LSPs obtained is higher than most applications require, but the number of bisections cannot be decreased without compromising the zero-crossing search. In this paper, it is shown that, in the case of $10^{th}$-order LPC, five intervals, each containing exactly one zero crossing of $P'_{10}(x)$ and one zero crossing of $Q'_{10}(x)$, can be calculated in advance, making the zero-crossing search unnecessary. This allows a trade-off between LSP precision and computational complexity, resulting in considerable computational savings.
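A sketch of the evaluation and root-refinement steps, assuming the Chebyshev-series coefficients of $P'_{10}(x)$ or $Q'_{10}(x)$ in $x = \cos\omega$ and one of the precomputed intervals are given (function names are hypothetical):

```python
def chebyshev_eval(c, x):
    """Evaluate sum_k c[k] T_k(x) by the Clenshaw recurrence."""
    b1 = b2 = 0.0
    for ck in c[:0:-1]:                        # c[n], ..., c[1]
        b1, b2 = 2.0 * x * b1 - b2 + ck, b1
    return x * b1 - b2 + c[0]

def root_in_interval(c, lo, hi, n_bisect=4):
    """Bisection for the single root known to lie in [lo, hi],
    finished with one interpolation step as in Kabal's method."""
    flo, fhi = chebyshev_eval(c, lo), chebyshev_eval(c, hi)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        fmid = chebyshev_eval(c, mid)
        if flo * fmid <= 0.0:
            hi, fhi = mid, fmid                # root in lower half
        else:
            lo, flo = mid, fmid                # root in upper half
    return lo - flo * (hi - lo) / (fhi - flo)  # final interpolation
```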
John Leis, USQ (Australia)
Mark Phythian, USQ (Australia)
Sridha Sridharan, Queensland University of Technology (Australia)
Although much effort has recently been directed towards speech compression at rates below 4 kb/s, the primary metric for comparison has, understandably, been the amount of spectral distortion in the decompressed speech. However, an aspect that is becoming important in some applications is the ability to algorithmically identify the original speaker from the coded speech. We investigate the effect of speech compression using multistage vector quantization of the short-term (formant) filter parameters on text-independent speaker identification. It is demonstrated that, in cases where the speech is stored in a compressed database for retrieval, the speaker model should be constructed from the raw speech before spectral compression. Additionally, Gaussian models of sufficiently high order are able to reduce the negative effects of spectral vector quantization on speaker identification accuracy.
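A minimal sketch of the identification setup, assuming per-speaker matrices of cepstral features and using scikit-learn's GaussianMixture (the paper's exact features and model orders are not reproduced):

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=32):
    """Fit one GMM per speaker on features from the *uncoded* speech,
    as the paper recommends for compressed-database retrieval."""
    return {spk: GaussianMixture(n_components, covariance_type="diag").fit(f)
            for spk, f in features_by_speaker.items()}

def identify(models, test_features):
    """Pick the speaker whose model gives the highest average
    log-likelihood on the (possibly coded) test features."""
    return max(models, key=lambda spk: models[spk].score(test_features))
```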
Juha Backman, Nokia Mobile Phones (Finland)
The paper presents a method for measuring the quality of speech transmitted through a channel with linear or nonlinear distortion and arbitrary noise. The method generalizes the well-established technique of measuring speech intelligibility using the modulation transfer function: instead of only measuring the amount of modulation in the received signal and comparing it against the amount of modulation in the transmitted signal within given carrier and modulation frequency bands, the proposed method cross-correlates the envelopes of the transmitted and received signals.
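For a single carrier band, the core computation might look as follows (the split into carrier and modulation frequency bands and the mapping to an intelligibility index are omitted; the cutoff and names are assumptions):

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def envelope_correlation(tx, rx, fs, fc=32.0):
    """Normalized cross-correlation of the modulation envelopes of the
    transmitted (tx) and received (rx) signals in one band."""
    b, a = butter(2, fc / (fs / 2.0))               # envelope low-pass
    env_tx = filtfilt(b, a, np.abs(hilbert(tx)))    # Hilbert envelope
    env_rx = filtfilt(b, a, np.abs(hilbert(rx)))
    env_tx = env_tx - env_tx.mean()
    env_rx = env_rx - env_rx.mean()
    denom = np.linalg.norm(env_tx) * np.linalg.norm(env_rx) + 1e-12
    return float(np.dot(env_tx, env_rx) / denom)
```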
Vincent Van de Laar, Delft University of Technology (The Netherlands)
Bastiaan Kleijn, Delft University of Technology (The Netherlands)
Ed F. Deprettere, Delft University of Technology (The Netherlands)
We estimated the perceptual entropy rate of the phonemes of American English and found that the upper limit of the perceptual entropy of voiced phonemes is approximately 1.4 bits/sample, whereas the perceptual entropy of unvoiced phonemes is approximately 0.9 bits/sample. These results indicate that a simple voiced/unvoiced classification is suboptimal when trying to minimize the bit rate. We used two different methods for the entropy estimation, and the results of both show that short segments of unvoiced speech are approximately Gaussian.
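For the Gaussian finding, the differential entropy rate of a segment can be estimated from its power spectrum via Kolmogorov's formula; the sketch below is this generic estimator, not the perceptually weighted measure used in the paper:

```python
import numpy as np

def gaussian_entropy_rate(x, nfft=1024):
    """Differential entropy rate in bits/sample of a stationary
    Gaussian process with power spectrum S(w):
    h = 0.5*log2(2*pi*e) + 0.5 * mean_w log2 S(w)."""
    X = np.fft.rfft(x * np.hanning(len(x)), nfft)
    S = np.maximum(np.abs(X) ** 2 / len(x), 1e-12)  # periodogram, floored
    return 0.5 * np.log2(2.0 * np.pi * np.e) + 0.5 * np.mean(np.log2(S))
```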