Chair: R. Salami, University of Sherbrooke, Canada
Sean A Ramprashad, Bell Labs (U.S.A.)
A two-stage hybrid embedded speech/audio coding structure is proposed. The structure uses a speech coder as a core to provide the minimal bitrate and acceptable performance on speech inputs. The second stage is a transform coder using an MDCT and perceptual coding principles. This stage is itself embedded in both complexity and bitrate, and provides various levels of enhancement of the core output, particularly for general audio signals such as music. Informal A-B comparison tests show that the performance of the structure at 16 kb/s lies between that of the GSM Enhanced Full Rate coder at 12.2 kb/s and that of the G.728 LD-CELP coder at 16 kb/s.
Toshiyuki Nomura, NEC Corp - C&C Media Research Labs (Japan)
Masahiro Iwadare, NEC Corp - C&C Media Research Labs (Japan)
Masahiro Serizawa, NEC Corp - C&C Media Research Labs (Japan)
Kazunori Ozawa, NEC Corp - C&C Media Research Labs (Japan)
This paper proposes a bitrate and bandwidth scalable CELP speech coder. The proposed coder is based on multi-pulse-based CELP coding and consists of a bitrate scalable base-band coder and a bandwidth extension tool. The bitrate scalable base-band CELP coder employs multi-stage excitation coding based on an embedded-coding approach. The multi-pulse excitation codebook at each stage is adaptively produced depending on the excitation signal selected at the previous stage. Bandwidth scalability is realized by bandwidth conversion from base-band CELP parameters to those for wideband, without the widely used subband structure. The bandwidth conversion simultaneously improves base-band coding quality and expands bandwidth. The comparison test results show that the bitrate scalable coder is equivalent in speech quality to the fixed-bitrate CELP coder at the same bitrate for narrowband speech. In the MOS tests, the proposed 16 kbit/s coder with bandwidth scalability achieves coding quality equivalent to ITU-T G.722 at 56 kbit/s.
Marcos Faundez-Zanuy, Escola Universitaria Politecnica de Mataro (Spain)
Francesc Vallverdu, Signal Theory & Communications Dept. (Spain)
Enric Monte, Signal Theory & Communications Dept. (Spain)
In recent years there has been growing interest in nonlinear speech models. Several works have been published showing the better performance of nonlinear techniques, but little attention has been paid to implementing the nonlinear model in real applications. This work focuses on the behaviour of a nonlinear predictive model based on neural nets in a speech waveform coder. Our novel scheme obtains an improvement in SEGSNR of between 1 and 2 dB for adaptive quantization ranging from 2 to 5 bits.
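The SEGSNR figure quoted above is a frame-averaged signal-to-noise ratio. A minimal sketch of one common way to compute it follows; the function name, the frame length of 160 samples, and the choice to skip degenerate frames are assumptions made for illustration, not details taken from the paper.

```python
import math


def segmental_snr_db(reference, decoded, frame_len=160):
    """Average the per-frame SNR (in dB) over all complete frames.

    frame_len=160 is an assumed value (20 ms at 8 kHz); frames with zero
    signal energy or zero error energy are skipped, which is one of
    several conventions in use.
    """
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        dec = decoded[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, dec))
        if signal_energy == 0 or noise_energy == 0:
            continue  # SNR is undefined for this frame
        frame_snrs.append(10 * math.log10(signal_energy / noise_energy))
    if not frame_snrs:
        return float("nan")
    return sum(frame_snrs) / len(frame_snrs)
```

Because the average is taken over per-frame dB values, quiet frames weigh as much as loud ones, which is why this measure is often preferred to a single global SNR for speech.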
Michele M Covell, Interval Research Corporation (U.S.A.)
Margaret M Withgott, Electric Planet (U.S.A.)
Malcolm G. Slaney, Interval Research Corporation (U.S.A.)
We propose a new approach to nonuniform time compression, called Mach1, designed to mimic the natural timing of fast speech. At identical overall compression rates, listener comprehension for Mach1-compressed speech increased between 5 and 31 percentage points over that for linearly compressed speech, and response times dropped by 15%. For rates between 2.5 and 4.2 times real time, there was no significant comprehension loss with increasing Mach1 compression rates. In A-B preference tests, Mach1-compressed speech was chosen 95% of the time. This paper describes the Mach1 technique and our listener-test results. Audio examples can be found on http://www.interval.com/papers/1997-061/.
Philippe Lemmerling, Katholieke Universiteit Leuven (Belgium)
Ioannis Dologlou, Katholieke Universiteit Leuven (Belgium)
Sabine Van Huffel, Katholieke Universiteit Leuven (Belgium)
We present a new speech coding algorithm, based on an all-pole model of the vocal tract. Whereas current autoregressive (AR) modeling techniques (e.g. CELP, LPC-10) minimize a prediction error, our approach determines the closest signal (in the L2 norm) that exactly satisfies an all-pole model. Each frame is then encoded by storing the parameters of the complex damped exponentials deduced from the all-pole model and its initial conditions. Decoding is performed by adding the complex damped exponentials based on the transmitted parameters. The new algorithm is demonstrated on a speech signal. The quality is compared with that of a standard coding algorithm at comparable compression ratios, using the segmental Signal-to-Noise Ratio (SNR).
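The decoding step described above, summing complex damped exponentials, can be sketched as follows. This is only an illustration under assumptions: the parameterisation by one complex amplitude and one complex pole per exponential, and the function name `decode_frame`, are hypothetical and may differ from the parameters the authors actually transmit.

```python
def decode_frame(amplitudes, poles, n_samples):
    """Rebuild a frame as x[n] = Re( sum_k c_k * z_k**n ).

    amplitudes: complex amplitudes c_k (assumed parameterisation).
    poles: complex poles z_k; |z_k| < 1 gives a decaying exponential.
    For a real signal, poles and amplitudes come in conjugate pairs.
    """
    frame = []
    for n in range(n_samples):
        sample = sum(c * z ** n for c, z in zip(amplitudes, poles))
        frame.append(sample.real)
    return frame
```

For example, a single real pole at 0.5 with unit amplitude yields the decaying sequence 1, 0.5, 0.25, and a conjugate pole pair on the unit circle yields an undamped cosine.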
David F Marston, Ensigma Ltd (U.K.)
Speech coders that are optimized to the characteristics of a particular set of speakers will outperform a speech coder that caters for all speakers, provided that the speaker using it is one of that particular set. This paper describes how speech coders that are optimized for either male or female speech can improve on unoptimised coders. The improvements are in bit-rate reduction, speech quality and robustness. A reliable gender identifier is described, which would be practical for the most demanding applications, achieving 95% accuracy after 1 second of speech. The improvements from gender-specific speech coding are shown in LSF quantisation, with bit-saving, and in pitch detection, with both bit-saving and robustness.
Mikael Skoglund, Royal Institute of Technology (Sweden)
Jan Skoglund, Chalmers University of Technology (Sweden)
This paper presents an approach to vector quantization of sources exhibiting intervector dependency. We present the optimal decoder based on a collection of received indices. We also present the optimal encoder for such decoding. The optimal decoder can be implemented as a table look-up decoder; however, the size of the decoder codebook grows very fast with the size of the collection of utilized indices. This leads us to introduce a method for storing an approximation to the set of optimal decoder vectors, based on linear mapping of a block code vector quantization. In this approach a heavily reduced set of parameters is employed to represent the codebook. Furthermore, we illustrate that the proposed scheme has an interpretation as nonlinear predictive quantization. Numerical results indicate high gain over memoryless coding and memory quantization based on linear predictive coding. The results also show that the sub-optimal approach performs close to the optimal.
Jongseo Sohn, Seoul National University (Korea)
Wonyong Sung, Seoul National University (Korea)
In this paper, a voice activity detector (VAD) for variable rate speech coding is decomposed into two parts, a decision rule and a background noise statistic estimator, which are analysed separately by applying a statistical model. A robust decision rule is derived from the generalized likelihood ratio test by assuming that the noise statistics are known a priori. To estimate the time-varying noise statistics, allowing for the occasional presence of the speech signal, a novel noise spectrum adaptation algorithm using the soft decision information of the proposed decision rule is developed. The algorithm is robust, especially for time-varying noise such as babble noise.
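A decision rule of the kind described above can be sketched as follows. The sketch assumes independent complex Gaussian spectral coefficients with known noise variance per frequency bin and an unknown speech variance replaced by its maximum-likelihood estimate; under those assumptions the per-bin log likelihood ratio reduces to gamma - ln(gamma) - 1, where gamma is the ratio of observed power to noise variance. The function names and the threshold value are hypothetical, and the abstract does not confirm that this is the exact statistic the authors use.

```python
import math


def glrt_statistic(power_spectrum, noise_variance):
    """Mean per-bin generalized log likelihood ratio for one frame.

    power_spectrum: |X_k|^2 for each frequency bin of the frame.
    noise_variance: assumed-known noise variance for each bin (> 0).
    """
    total = 0.0
    for power, noise in zip(power_spectrum, noise_variance):
        # The ML speech variance is max(power - noise, 0), so the
        # effective ratio is clamped at 1 (a bin with no excess power
        # contributes nothing to the statistic).
        gamma = max(power / noise, 1.0)
        total += gamma - math.log(gamma) - 1.0
    return total / len(power_spectrum)


def is_speech(power_spectrum, noise_variance, threshold=0.5):
    """Declare speech when the statistic exceeds an assumed threshold."""
    return glrt_statistic(power_spectrum, noise_variance) > threshold
```

The statistic is zero when the observed power equals the noise variance and grows as the frame power rises above the noise floor, so the threshold trades false alarms against missed speech.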
Manohar N Murthi, University of California, San Diego (U.S.A.)
Bhaskar D. Rao, University of California, San Diego (U.S.A.)
In this paper, we propose some new modeling techniques that provide a more synergistic approach to multistage time-domain speech compression. In particular, we propose a new error criterion for determining all-pole filters, and a unique method for jointly coding the pulse information in excitation vectors. The new error criterion for determining all-pole filters is based upon minimizing the sum of the residual signal's absolute values raised to a power less than one. It is shown to be a desirable cost function for yielding residual signals that are more sparse, and consequently better suited for multistage compression than Linear Prediction residuals. Statistical reasons supporting the new criterion are also provided. Furthermore, exploiting the properties of, and the relationship between, the Linear Prediction and Minimum Variance spectra, we propose a novel parameter set for jointly coding the excitation vector's pulse position, sign, and gain information.
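The error criterion described above, the sum of the residual's absolute values raised to a power below one, can be illustrated with a short sketch. The function names and the default exponent p = 0.5 are assumptions for illustration; the abstract only states that the power is less than one, and the optimisation procedure for the filter coefficients is not shown here.

```python
def prediction_residual(signal, coeffs):
    """Residual e[n] = s[n] - sum_k a_k * s[n-1-k] of an all-pole model.

    Samples before the start of the signal are treated as zero.
    """
    residual = []
    for n in range(len(signal)):
        prediction = sum(
            a * signal[n - 1 - k]
            for k, a in enumerate(coeffs)
            if n - 1 - k >= 0
        )
        residual.append(signal[n] - prediction)
    return residual


def sparse_cost(residual, p=0.5):
    """Sum of |e[n]|**p; p < 1 favours residuals with few large samples."""
    return sum(abs(e) ** p for e in residual)
```

To see why this cost favours sparsity, compare two residuals with equal energy: `[2, 0, 0, 0]` and `[1, 1, 1, 1]` both have squared-error energy 4, yet with p = 0.5 the first costs about 1.41 and the second costs 4, so a minimiser of this criterion prefers the sparse one, while a least-squares criterion cannot distinguish them.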
Tim Fingscheidt, Aachen University of Technology (Germany)
Peter Vary, Aachen University of Technology (Germany)
Jesus A. Andonegui, Aachen University of Technology (Germany)
Digital speech transmission systems use source coding to reduce the bit rate and channel coding to correct transmission errors. Furthermore, in periods of very poor channel quality, error concealment of residual bit errors becomes necessary because channel decoding fails. However, if the channel is clear, channel coding would not be required at all, and the speech quality could be improved by allowing a higher bit rate for source encoding. Usually a compromise is made between speech quality on a clear channel and error robustness on a poor channel. This paper addresses the problem of jointly optimizing error concealment and source/channel coding. Under the premise of a minimum mean square error criterion for signal reconstruction, it turns out that error concealment instead of error correction may be the best choice if source coding, through less bit rate reduction, leaves sufficient residual parameter correlations.