Charalampos Papanastasiou, University of Manchester (U.K.)
Costas Xydeas, University of Manchester (U.K.)
This paper presents a new and efficient method for modelling voiced, mixed excitation spectra in Sinusoidal (SC) and Prototype Interpolation Coding (PIC) systems. Speech harmonics are classified as weak-voiced or strong-voiced by simply examining the short-term residual magnitude spectrum. This information is encoded effectively in terms of fixed width frequency bands and is used to control sets of periodic and random sine wave oscillators which model the short-term mixed excitation nature of speech. In this way the model allows the mixing of periodic and random signal energy on a harmonic basis. The proposed methodology has been used in a 2.4 kbits/sec speech coder, whose recovered speech quality is better than that of the 4.8 kbits/sec DoD standard.
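The idea of per-band voicing flags controlling periodic and random sine wave oscillators can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mixed_excitation`, the 500 Hz band width, the unit harmonic amplitudes, and the choice to model a "random" oscillator as a sine with a random phase drawn once per frame are all assumptions made for illustration.

```python
import math
import random

def mixed_excitation(f0_hz, band_voiced, band_width_hz=500.0,
                     fs_hz=8000.0, n_samples=160, seed=0):
    """Sum one unit-amplitude sine oscillator per harmonic of f0_hz.

    band_voiced[i] is True when the i-th fixed-width band is strongly
    voiced. Harmonics in such bands keep a coherent zero phase (periodic
    oscillator); harmonics in other bands get a random phase (assumed
    stand-in for a random oscillator).
    """
    rng = random.Random(seed)
    out = [0.0] * n_samples
    k = 1
    while k * f0_hz < fs_hz / 2:          # harmonics strictly below Nyquist
        freq = k * f0_hz
        band = min(int(freq // band_width_hz), len(band_voiced) - 1)
        if band_voiced[band]:
            phase = 0.0
        else:
            phase = rng.uniform(-math.pi, math.pi)
        for n in range(n_samples):
            out[n] += math.cos(2 * math.pi * freq * n / fs_hz + phase)
        k += 1
    return out

# Example: bands below 2 kHz strongly voiced, bands above weakly voiced.
frame = mixed_excitation(100.0, [True] * 4 + [False] * 4)
```

Because the voicing decision is made per band rather than per frame, periodic and noise-like energy can coexist in one frame, which is the property the abstract emphasises.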
Ian Atkinson, University of Surrey Centre for Satellite Eng. Research (U.K.)
Suat Yeldener, University of Surrey Centre for Satellite Eng. Research (U.K.)
Ahmet Kondoz, University of Surrey Centre for Satellite Eng. Research (U.K.)
LPC based speech coders operating at bit rates below 3.0 kbits/sec are usually associated with buzzy or metallic artefacts in the synthetic speech. These are mainly attributable to the simplifying assumptions made about the excitation source, which are usually required to maintain such low bit rates. In this paper a new LPC vocoder is presented which splits the LPC excitation into two frequency bands using a variable cut-off frequency. The lower band is responsible for representing the voiced parts of speech, whilst the upper band represents unvoiced speech. In doing so the coder's performance during both mixed voicing speech and speech containing acoustic noise is greatly improved, producing soft, natural sounding speech. The paper also describes new parameter determination and quantisation techniques vital to the operation of this coder at such low bit rates.
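A split-band excitation of this kind can be sketched as a pulse train confined to the band below the cut-off plus noise confined to the band above it. The sketch below is only an illustration under stated assumptions: the names `lowpass_taps`, `fir`, and `split_band_excitation` are hypothetical, the 31-tap Hamming-windowed sinc filter and its spectral complement are a convenient stand-in for whatever band-splitting the paper uses, the relative levels of the two bands are arbitrary, and the cut-off is assumed to lie strictly between 0 Hz and half the sampling rate.

```python
import math
import random

def lowpass_taps(cutoff_hz, fs_hz, n_taps=31):
    """Hamming-windowed sinc low-pass FIR, normalised to unit gain at DC."""
    mid = (n_taps - 1) / 2
    fc = cutoff_hz / fs_hz                      # cut-off in cycles per sample
    taps = []
    for n in range(n_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * fc
        else:
            sinc = math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir(signal, taps):
    """Causal FIR filtering with zero initial state; same length as input."""
    return [sum(taps[k] * signal[n - k]
                for k in range(len(taps)) if n - k >= 0)
            for n in range(len(signal))]

def split_band_excitation(pitch_period, cutoff_hz, fs_hz=8000.0,
                          n_samples=240, seed=0):
    """Low band: pitch pulse train. High band: Gaussian noise."""
    rng = random.Random(seed)
    pulses = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    lp = lowpass_taps(cutoff_hz, fs_hz)
    hp = [-t for t in lp]
    hp[len(lp) // 2] += 1.0                     # complement: delayed delta minus low-pass
    low = fir(pulses, lp)
    high = fir(noise, hp)
    return [a + b for a, b in zip(low, high)]

# Example: 160 Hz pitch at 8 kHz sampling, voiced below 1.5 kHz.
excitation = split_band_excitation(pitch_period=50, cutoff_hz=1500.0)
```

Raising `cutoff_hz` makes the frame more voiced and lowering it makes it more noise-like, so a single transmitted parameter covers the range between fully voiced and fully unvoiced excitation.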
Hui Li, University of Leeds (U.K.)
Gordon B. Lockhart, University of Leeds (U.K.)
Two non-linear interpolation techniques are introduced for enhancing speech reproduction in Prototype Waveform Interpolation (PWI) and similar encoders. A Temporal Differential Rate (TDR) vector is used to characterise the non-uniform evolution of pitch cycle temporal structure during interpolation. Experimental results show a clear improvement in the accuracy of decoded pitch cycle lengths and in the reproduction of periodicity in general. It is also shown that waveform reproduction can be significantly improved by vector quantising sets of Optimal Combination Coefficients (OCC) aimed at maximising the similarity between interpolated and target signal segments. OCC derived from both time domain waveform similarity and frequency domain spectral envelope similarity are tested. Subjective assessment suggests a general preference for non-linear interpolation methods, and the scheme using frequency domain derived OCC with perceptual weighting was the most preferred.
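One way to picture combination coefficients chosen for time domain waveform similarity is as a least-squares fit of the target pitch cycle by a weighted sum of the two neighbouring prototypes. The sketch below takes that reading as an assumption; the abstract does not give the actual similarity measure, and the function name `combination_coefficients` is hypothetical. All three cycles are assumed to have been time-normalised to a common length.

```python
def combination_coefficients(p0, p1, target):
    """Return (a, b) minimising sum((target - (a*p0 + b*p1))**2).

    Solves the 2x2 normal equations directly. Plain linear interpolation
    would instead fix a = 1 - t and b = t for a position t between the
    two prototypes.
    """
    r00 = sum(x * x for x in p0)
    r11 = sum(y * y for y in p1)
    r01 = sum(x * y for x, y in zip(p0, p1))
    c0 = sum(x * t for x, t in zip(p0, target))
    c1 = sum(y * t for y, t in zip(p1, target))
    det = r00 * r11 - r01 * r01
    if abs(det) < 1e-12:
        raise ValueError("prototypes are (nearly) collinear")
    a = (c0 * r11 - c1 * r01) / det
    b = (c1 * r00 - c0 * r01) / det
    return a, b

# Example: a target that is an exact mix of the two prototypes.
p0 = [1.0, 0.0, -1.0, 0.0]
p1 = [0.0, 1.0, 0.0, -1.0]
target = [0.25, 0.75, -0.25, -0.75]
a, b = combination_coefficients(p0, p1, target)
```

In a coder the resulting pairs would then be vector quantised, as the abstract describes; that step is not shown here.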
Ian S. Burnett, University of Wollongong (Australia)
Duong H. Pham, University of Wollongong (Australia)
A new mechanism for using Analysis-by-Synthesis techniques in low rate Waveform Interpolation based coders is introduced. The algorithm, implemented as part of a Multi-Prototype Waveform coder, exploits the high quality speech produced by interpolating unquantised speech-domain Prototype Waveforms. In the new scheme, a frame of Prototype Waveforms is quantised using two sets of codebook searches, one representing the slowly evolving prototype shape and the other the rapid, noisy components. The scheme offers performance advantages over the previous open-loop Multi-Prototype Waveform coder, particularly when perceptual weighting is incorporated in the search. Reductions in search complexity and the use of the scheme for quantisation at higher rates are also considered. This results in a generalised Analysis-by-Synthesis Waveform Interpolation architecture with closed-loop optimisation of all Prototype Waveform properties.
Khashayar Yaghmaie, University of Surrey (U.K.)
Ahmet Kondoz, University of Surrey (U.K.)
Prototype waveform interpolation (PWI) is one of the most efficient compression techniques for coding the speech signal at bit rates below 4 kb/s. Most PWI coders employ prototype waveforms of the linear predictive residual signal for coding purposes. In the latest PWI systems, decomposition methods are used to separate the voiced and unvoiced components of the prototype waveforms prior to coding. This has resulted in high quality speech at very low bit rates. This paper presents a novel combination of Multiband voicing analysis and a PWI coding system, in which the Multiband analysis is exploited to identify the voiced and unvoiced spectral components of the prototype waveforms of the original speech signal. To produce high quality synthetic speech, the energy variation of the original signal is recovered by transmitting its energy envelope. This method resulted in a high quality and low complexity coder operating at 2.55 kb/s.
Parham Zolfaghari, Cambridge University (U.K.)
Tony Robinson, Cambridge University (U.K.)
This paper describes a new low bit-rate formant vocoder. The formant parameters are represented by Gaussian mixture distributions, which are estimated from the discrete Fourier transform (DFT) magnitude spectrum of the speech signal. A voiced/unvoiced classification mechanism has been developed based on the harmonic nature of each formant in the DFT spectrum modulated by the Gaussian Mixture distribution. Using a magnitude-only sinusoidal synthesiser, intelligible synthetic speech has been obtained. Vector quantisation of the vocal tract parameters enables this formant vocoder to operate at bit-rates down to 1248 bps.
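As a rough illustration of representing formants with a Gaussian mixture, the sketch below evaluates a spectral envelope as a weighted sum of Gaussians over frequency. It covers only the evaluation step, not the estimation from the DFT magnitude spectrum. The function name `gmm_envelope`, the `(weight, centre_hz, std_hz)` parameterisation, and the example formant values are assumptions chosen for illustration and are not taken from the paper.

```python
import math

def gmm_envelope(freqs_hz, components):
    """Evaluate a Gaussian mixture at each frequency in freqs_hz.

    components is a list of (weight, centre_hz, std_hz) tuples, one per
    formant (hypothetical parameterisation).
    """
    envelope = []
    for f in freqs_hz:
        total = 0.0
        for weight, centre, std in components:
            z = (f - centre) / std
            total += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
        envelope.append(total)
    return envelope

# Example with made-up formants near 500, 1500 and 2500 Hz.
formants = [(0.6, 500.0, 80.0), (0.3, 1500.0, 120.0), (0.1, 2500.0, 150.0)]
freqs = [k * 31.25 for k in range(129)]      # 0 to 4000 Hz
env = gmm_envelope(freqs, formants)
```

Each formant is then described by three numbers, which is what makes the parameter set compact enough for vector quantisation at very low bit rates.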
Engin Erzin, Lucent Technologies (U.S.A.)
Arun Kumar, UC, Santa Barbara (U.S.A.)
Allen Gersho, UC, Santa Barbara (U.S.A.)
We propose new techniques for natural quality variable rate spectral speech coding at an average rate of 2.2 kbps for dialog speech and 2.8 kbps for monolog speech. The coder models the Fourier spectrum of each frame and it builds on recent enhancements to the classical multiband excitation (MBE) approach. New techniques for robust pitch estimation and tracking, for efficient quantization of voiced and unvoiced spectra and encoding of partial phase information are the key features that result in improved quality over earlier spectral vocoders. Subjective performance results are reported which show that the coder is very close in quality to the ITU-T G.723.1 algorithm at 5.3 kbps.
Yuusuke Hiwasaki, NTT Human Interface Labs (Japan)
Kazunori Mano, NTT Human Interface Labs (Japan)
Speech coding at very low bit rates is useful for purposes such as voice communication over computer networks. However, it is difficult for CELP coders to maintain high quality at around 2.0 kbit/s. In this paper, a speech coding model called `normalized pitch waveform' and its quantization scheme are presented, aiming for effective compression coding of `voiced' speech. Listening tests have shown that efficient and high quality coding has been achieved at a bit rate of 2.0 kbit/s, less than half that of FS1016. Furthermore, this paper discusses the disadvantage of the normalized pitch waveform and presents an alternative method using a non-normalized pitch waveform. Encoding of a transitional `mixed' state between the `voiced' and the `unvoiced' states is discussed for further improvements.
Mary A. Kohler, U.S. D.o.D. (U.S.A.)
In 1996, the U.S. Department of Defense Digital Voice Processing Consortium (DDVPC) selected Texas Instruments' mixed excitation linear prediction (MELP) algorithm as the recommended new federal standard for 2400 bps voice communications. The algorithm selection process involved quality, intelligibility, communicability, and recognizability testing in many acoustic noise, error, and tandem conditions. Algorithm complexity was also measured. This paper compares the performance scores, diagnostic information, and complexity of MELP to the 4800 bps federal standard (FS1016) code excited linear prediction (CELP) algorithm, the 16 kbps continuously variable slope delta modulation (CVSD) algorithm, and the venerable federal standard (FIPS Pub. 137) 2400 bps linear predictive coding (LPC-10) algorithm.
Lynn M. Supplee, Department of Defense (U.S.A.)
Ronald P. Cohn, Department of Defense (U.S.A.)
John S. Collura, Department of Defense (U.S.A.)
Alan V. McCree, Texas Instruments (U.S.A.)
This paper describes the new U.S. Federal Standard at 2400 bps. The Mixed Excitation Linear Prediction (MELP) coder was chosen by the DoD Digital Voice Processing Consortium to replace the existing 2400 bps Federal Standard FS 1015 (LPC-10). This new standard provides equal or improved performance over the 4800 bps Federal Standard FS 1016 (CELP) at a rate equivalent to LPC-10. The MELP coder is based on the traditional LPC model, but includes additional features to improve its performance.
Jes Thyssen, AT&T Labs (U.S.A.)
Bastiaan Kleijn, AT&T Labs (U.S.A.)
Roar Hagen, AT&T Labs (U.S.A.)
In speech coding it is important to focus the coding effort on the perceptually important features of the speech signal. This paper describes new quantization techniques which take advantage of current knowledge of human perception in speech coders. The new procedures exploit the frequency-dependent frequency resolution of the human auditory system. The methods are applied to the waveform interpolation (WI) coder, and their effectiveness is confirmed with experimental results. The principles described in the paper are not restricted to the WI coder, but are also applicable to many other speech coding algorithms.
Yair Shoham, Bell Laboratories, Lucent Technologies (U.S.A.)
The recently-introduced waveform interpolation (WI) coders provide good-quality speech at low rates but may be too complex for commercial use. This paper proposes new approaches to low-complexity WI speech coding at rates of 1.2 and 2.4 kbps. The proposed coders are 4 to 5 times faster than the previously reported ones. At 2.4 kbps, the complexity is about 7.5 and 2.5 MFLOPS for the encoder and decoder, respectively. At 1.2 kbps, the complexity is about 6 and 2.3 MFLOPS for the encoder and decoder, respectively. Informal subjective evaluation shows that, at 2.4 kbps, the quality is close to that of the high-complexity coders. The quality does not significantly degrade at 1.2 kbps and it is considered sufficient for messaging applications.
Michele Jamrozik, Clemson University (U.S.A.)
John Gowdy, Clemson University (U.S.A.)
This paper presents the Modified Multiband Excitation (MMBE) model used for speech coding. In many MBE model coders, speech quality is degraded when incorrect voicing decisions are made, particularly for high-pitched female speakers. The MMBE addresses this issue with a modified voiced/unvoiced decision algorithm and a more robust pitch estimate. The listening quality of speech produced using the MMBE model is superior to that of the FS-1016 CELP coder and is at least comparable with that of the MELP coder chosen as the new 2400 bps Federal Standard.
Eric W.M. Yu, City University of Hong Kong (Hong Kong)
Cheung-Fat Chan, City University of Hong Kong (Hong Kong)
A variable bit rate multiband excited linear predictive speech coder is proposed in this paper. The speech signal is compressed at bit rates ranging from 0.88 kbps to 2.6 kbps according to the mode of operation and the optimum V/UV transition frequency. An average bit rate of 1.24 kbps is achieved. The proposed speech coder improves the speech quality by splitting the non-stationary speech segments for analysis. The V/UV distribution of a short-time speech spectrum is represented efficiently by using a closed-loop minimised V/UV transition frequency. Depending on the V/UV transition frequency, the spectrum envelope is quantized at a variable bit rate through embedded differential predictive scalar and vector quantization of the LSP parameters. The proposed spectral quantization scheme results in a spectral distortion comparable to that of a fixed 24-bit 2-dimensional differential scalar quantization scheme.