Session TAD Speech Coding II

Chairperson John Mourjopoulos Univ. of Patras, Greece

HIGH QUALITY SPLIT BAND LPC VOCODER AND ITS FIXED POINT REAL TIME IMPLEMENTATION

Authors: S. Villette, M. Stefanovic, I. Atkinson, A. M. Kondoz

Centre for Communication Systems Research, University of Surrey, Guildford, Surrey, UK. a.kondoz@ee.surrey.ac.uk

Volume 3 pages 1243 - 1246

ABSTRACT

A split-band vocoder in which the LP excitation is split into voiced and unvoiced frequency bands is presented. This greatly improves the coder's performance both during mixed voicing and for speech containing acoustic noise, producing soft, natural-sounding speech. In addition, a variable-rate version that achieves an average rate of 1.4 kb/s is detailed. The fixed-point real-time implementation of this coder is also discussed.

A0002.pdf

Recordings


Missing Packet Recovery Techniques for DM Coded Speech

Authors: Wen-Whei Chang (1), Hwai-Tsu Chang (2), Wan-Yu Meng (2)

(1) Department of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan, ROC (2) Advance Technology Center, Computer and Communication Research Laboratories, Industrial Technology Research Institute, Hsinchu, Taiwan, ROC

Volume 3 pages 1247 - 1250

ABSTRACT

The effects of packet loss on delta modulation (DM) coded speech can be mitigated either by using an embedded DM (EDM) system or by using a tree search interpolator. This paper provides theoretical and experimental results for EDM coding of autoregressive sources under random error conditions. For the tree interpolation, we explore the benefits of delayed decoding by using an interpolative DM code generator to form a tree of sample possibilities given the remaining adjacent samples.
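
For readers unfamiliar with the underlying codec, the following is a minimal sketch of plain linear delta modulation with a fixed step size. It is not the embedded DM system or the tree search interpolator of the paper; the function names and the step value are assumptions chosen for illustration.

```python
def dm_encode(samples, step=0.1):
    """Encode each sample as one bit: 1 = staircase steps up, 0 = steps down."""
    bits = []
    estimate = 0.0  # staircase approximation, tracked identically by the decoder
    for x in samples:
        bit = 1 if x >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def dm_decode(bits, step=0.1):
    """Rebuild the staircase approximation from the bit stream."""
    estimate = 0.0
    out = []
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


bits = dm_encode([0.05, 0.15, 0.25, 0.2, 0.1])
decoded = dm_decode(bits)
```

Because every decoded sample depends on all earlier bits, a lost packet shifts the whole staircase that follows, which is why recovery schemes such as those in the abstract are of interest.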


A0060.pdf 


SPECTRAL SENSITIVITY OF LSP PARAMETERS AND THEIR TRANSFORMED COEFFICIENTS

Authors: Hai Le Vu and Laszlo Lois

Department of Telecommunications, Technical University of Budapest, Sztoczek 2, 1111 Budapest, Hungary. Tel. +36 1 463 2093, FAX: +36 1 463 3266, E-mail: hai@hit.bme.hu

Volume 3 pages 1251 - 1254

ABSTRACT

This paper addresses the optimal transformation and quantization of Line Spectrum Pair (LSP) parameters. Based upon the interframe and intraframe correlation properties of the LSPs, the Karhunen-Loeve (KL) transformation is implemented with a Principal Component Analysis (PCA) neural network. The spectral sensitivity of the LSPs and of the transformed coefficients is investigated in order to develop better scalar and vector quantizers for these coefficients. Using the PCA network with spectral-sensitivity-guided quantizers, we show that this new approach yields distortion as low as or lower than that of other methods for speech coding.


A0065.pdf 


REDUCING THE COMPLEXITY OF THE LPC VECTOR QUANTIZER USING THE K-D TREE SEARCH ALGORITHM

Authors: V. Ramasubramanian and K. K. Paliwal

ATR Interpreting Telecommunications Res. Labs. 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02 Japan

Volume 3 pages 1255 - 1258

ABSTRACT

Linear predictive coding (LPC) parameters are widely used in various speech coding applications for representing the spectral envelope information of speech. Transparent quantization of the LPC parameters (average spectral distortion of 1 dB) can be achieved at 24 bits/frame using the split vector LPC quantizer (SVLPC), which quantizes 10-dimensional line spectral frequency (LSF) vectors in two parts. However, SVLPC suffers from a high computational complexity in quantizing each part (one of dimension 4 and the other of dimension 6) using independent codebooks of size 4096 (corresponding to a rate of 12 bits/part). This limits the practical real-time application of the coder. In this paper, we reduce the computational complexity of the split vector quantizer by two orders of magnitude using the fast K-dimensional (K-d) tree search algorithm under the bucket-Voronoi intersection (BVI) search framework. This is of significant importance in rendering the SVLPC amenable to practical real-time coding applications.
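
The baseline that the paper speeds up can be sketched as an exhaustive split vector quantizer search. The sketch below covers only that full-search baseline, not the K-d tree or BVI method. The helper names, the squared Euclidean distance, and the assumption that the first part holds the lowest four LSFs are all illustrative choices, not details from the paper.

```python
def nearest(vector, codebook):
    """Exhaustive nearest-neighbour search under squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def split_vq_encode(lsf, codebook_low, codebook_high, split=4):
    """Quantize the two parts of an LSF vector independently (split point assumed)."""
    return nearest(lsf[:split], codebook_low), nearest(lsf[split:], codebook_high)


def exhaustive_search_cost(bits_per_part, parts):
    """Number of codevectors compared per frame in a full search."""
    return parts * 2 ** bits_per_part


split_cost = exhaustive_search_cost(12, 2)    # 2 codebooks of 4096 entries -> 8192
unsplit_cost = exhaustive_search_cost(24, 1)  # one 24-bit codebook -> 16777216
```

Splitting already brings the full search down from 2^24 to 8192 distance computations per frame; the reduction reported in the abstract is a further saving relative to this split-search baseline.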
A0074.pdf 


LPC QUANTIZATION USING WAVELET BASED TEMPORAL DECOMPOSITION OF THE LSF

Authors: Aweke N. Lemma (1), W. Bastiaan Kleijn (2), and Ed. F. Deprettere (1)

(1) Department of Electrical Engineering, Delft University of Technology, 2628 CD Delft, The Netherlands (2) Department of Speech, Music and Hearing, KTH (Royal Institute of Technology), Box 700 14, 100 44 Stockholm, Sweden

Volume 3 pages 1259 - 1262

ABSTRACT

The quantization of linear prediction coefficients (LPC) is an important aspect in low bit rate speech coding. In this work, we introduce a new approach, which exploits the temporal dependencies in the line spectral frequencies (LSF). We approximate each LSF track using expansion into wavelet basis functions. As the LSF vary fairly smoothly as functions of time, they perform very well when interpolated. By vector quantizing the resulting wavelet expansion coefficients, the interpolated LSF tracks could be quantized with a distortion of 0.91 dB using only 15.6 bits per 20 ms update (780 bits per second). This is about 4 bits per update less than the results obtained with previously described procedures.
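
The quoted bit rate follows directly from the update interval. The second figure below is only an inference from the "about 4 bits per update" statement and assumes the earlier procedures used the same 20 ms update; it is not a number taken from the paper.

```latex
R = \frac{15.6\ \text{bits}}{20\ \text{ms}} = \frac{15.6\ \text{bits}}{0.020\ \text{s}} = 780\ \text{bit/s},
\qquad
R_{\text{prev}} \approx \frac{(15.6 + 4)\ \text{bits}}{0.020\ \text{s}} = 980\ \text{bit/s}.
```
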
A0292.pdf 


A Novel 1.7/2.4 Kb/s DCT Based Prototype Interpolation Speech Coding System

Authors: Prof. C.S Xydeas and H. Gokhan Ilk

Speech Processing Research Laboratory, Electrical Engineering Division, Manchester School of Engineering, University of Manchester, Manchester M13 9PL, U.K.

Volume 3 pages 1263 - 1266

ABSTRACT

In this paper, a novel DCT prototype interpolation synthesis process is presented and used to model the input speech signal. The compression efficiency of the DCT when applied to prototype pitch segments leads to 1.7/2.4 kb/s DCT-PIC systems that can deliver decoded speech of high communication quality.
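
The compression efficiency mentioned above rests on the energy compaction of the DCT: a smooth segment is described by a few low-order coefficients. The sketch below illustrates only that general property with an orthonormal DCT-II; it is not the DCT-PIC interpolation scheme itself, and the function names and the toy segment are assumptions.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct2(coeffs):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k in range(n)))
    return out


def truncate(coeffs, keep):
    """Keep the first `keep` coefficients and zero the rest."""
    return coeffs[:keep] + [0.0] * (len(coeffs) - keep)


# Toy "prototype segment": a single smooth half-cycle over 8 samples.
segment = [math.cos(math.pi * (i + 0.5) / 8) for i in range(8)]
coeffs = dct2(segment)
rebuilt = idct2(truncate(coeffs, 2))  # only two coefficients retained
```

For this toy segment all the energy falls in one coefficient, so keeping two of the eight coefficients rebuilds it to within rounding error. Real pitch prototypes are less extreme, but the same effect is what makes low bit rates plausible.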
A0300.pdf 


IMPROVED REGULAR PULSE VSELP CODING OF SPEECH AT LOW BIT-RATES

Authors: Yong-Soo Choi (1), Hong-Goo Kang (2), Sang-Wook Park (1), Jae-Ha Yoo*, Dae-Hee Youn (1)

(1) ASSP Lab., Dept. of Electronic Eng., Yonsei University, Seoul 120-749, Korea. E-mail: cando@caas.yonsei.ac.kr (2) AT&T-Labs Research, Murray Hill, NJ 07974, USA *LG Electronic Inc., Seoul, Korea

Volume 3 pages 1267 - 1270

ABSTRACT

This paper describes an improved RP-VSELP (IRP-VSELP) speech coder. The RP-VSELP is classified as a fast VSELP since it produces speech quality comparable to that of the VSELP with much lower system complexity. The new RP-VSELP coder proposed in this paper has additional features, such as a fast codebook search obtained by employing backward filtering, and pitch-adaptive regular pulse excitation. Owing to these additions, the proposed method not only reduces the complexity of the original RP-VSELP but also provides improved speech quality. In objective and subjective tests, IRP-VSELP outperformed the reference RP-VSELP coder. Simulation results are presented to verify the performance of the proposed method.


A0338.pdf 


JOINT ESTIMATION OF PITCH, BAND MAGNITUDES AND V/UV DECISIONS FOR MBE VOCODER

Authors: Yong Duk Cho, Hong Kook Kim, Moo Young Kim, and Sang Ryong Kim

Human & Computer Interaction Lab., Samsung Advanced Institute of Technology, San 14, Nongseo-Ri, Kiheung-Eup, Yongin City, Kyungki-Do, 449-712, Korea. {ydcho, kimhk, moo, srkim}@saitgw.sait.samsung.co.kr

Volume 3 pages 1271 - 1274

ABSTRACT

The multiband excitation (MBE) vocoder represents a speech signal with a pitch, band magnitudes, and a voiced/unvoiced (V/UV) decision for each spectral band. In the conventional MBE model, the model parameters are estimated sequentially in two steps. The pitch and band magnitudes are first estimated by analysis-by-synthesis (AbS) in the frequency domain under the assumption of a voiced speech model, and then the V/UV decisions are made. However, the synthetic spectrum obtained under this assumption may have large spectral distortion if the speech frame is strongly unvoiced, as in a transient region. In this paper, we propose a joint estimation method that estimates and decides all the model parameters within the AbS loop. To this end, both voiced and unvoiced speech models are used for each band during the analysis procedure. After the parameters have been estimated with the two speech models, the model that produces the smaller spectral estimation error is selected for each band. Analysis of the short-time spectrum and the long-time spectrogram shows that the speech reproduced by the proposed model is superior to that of the conventional one. Informal listening tests also confirm the superiority of the proposed model.
A0339.pdf 


A NEW DISTANCE MEASURE IN LPC CODING APPLICATION FOR REAL TIME SITUATIONS

Authors: Balazs KOVESI*, Samir SAOUDI*, Jean Marc BOUCHER* and Gabor HORVATH (1)

* ENST-Br, Dept. SC, Technopole de Brest Iroise, BP 832, 29285 Brest Cedex, France. (1) Technical University of Budapest, Dept. MMT, Muegyetem rkp. 9, 1521 Budapest, Hungary

Volume 3 pages 1275 - 1278

ABSTRACT

The distance measure is of great importance both in the construction of a vector quantizer for LSP parameters and in the coding phase. Because of its complexity, the meaningful spectral distance is seldom used for quantization. Weighted squared Euclidean distances are mathematically more tractable and are commonly used. Significant differences can be found in the performance of different distances. The aim of this paper is to study different distance measures used in the field of LSP coding. A new weighted Euclidean distance is proposed that not only replaces the spectral distance but also estimates its exact value well. The use of squared distances is justified as well. In a real-time application, the weights often cannot be calculated from the input vector; they must be computed from the codewords before coding. This causes some problems in the case of split vector quantization or multi-stage vector quantization. Some solutions are given at the end of the paper.
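
As a rough illustration of the codeword-based weighting described above, the sketch below precomputes one weight vector per codeword and then searches with a weighted squared Euclidean distance. The weighting rule used here is a common inverse-spacing heuristic swapped in for illustration; it is not the new distance proposed in the paper, and all function names are assumptions.

```python
import math


def inverse_spacing_weights(lsf, low=0.0, high=math.pi):
    """Illustrative weights: larger where neighbouring LSFs lie close together."""
    padded = [low] + list(lsf) + [high]
    return [1.0 / (padded[i] - padded[i - 1]) + 1.0 / (padded[i + 1] - padded[i])
            for i in range(1, len(padded) - 1)]


def weighted_sq_distance(x, y, weights):
    """Weighted squared Euclidean distance between two vectors."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


def encode(lsf, codebook, codeword_weights):
    """Return the index of the codeword with the smallest weighted distance."""
    distances = [weighted_sq_distance(lsf, code, w)
                 for code, w in zip(codebook, codeword_weights)]
    return distances.index(min(distances))


codebook = [[1.0, 2.0, 3.0], [0.5, 2.5, 3.5]]
# Weights depend only on the codewords, so they can be stored before coding starts.
codeword_weights = [inverse_spacing_weights(c, high=4.0) for c in codebook]
index = encode([0.9, 2.1, 3.0], codebook, codeword_weights)
```

Storing the weights with the codebook removes the per-frame weight computation, at the cost of weights that no longer reflect the actual input vector.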


A0353.pdf 


CONSIDERATION OF PROCESSING STRATEGIES FOR VERY-LOW-RATE COMPRESSION OF WIDEBAND SPEECH SIGNALS WITH KNOWN TEXT TRANSCRIPTION

Authors: Peter Vepyek, Prof. Alan B. Bradley

Department of Communication and Electronic Engineering RMIT, PO Box 2476 V, Melbourne 3001, Australia. Tel: 61 3 96602455, FAX: 61 3 96621060 E-mail: peterv@icpdd.neca.nec.com.au, alanb@rmit.edu.au

Volume 3 pages 1279 - 1282

ABSTRACT

This paper addresses the problem of very-low-rate compression of digitized wideband speech signals for storage. It concentrates on applications where the text transcription of the speech corpus is available and where high quality of recovered speech is required. Following the problem statement, all unique features of the task are analysed and possible methods of implementation discussed. As a result, a novel speech compression technique is proposed, its general structure is presented, and its characteristics are considered. The new compression technique - hybrid speech compression - takes full advantage of the available text transcription. The proposed hybrid compression approach utilises an optimum balance of Text To Speech (TTS) synthesis technology with dynamic speech conversion to yield a data stream comprising original text enriched by prosodic features and conversion control information. The proposed speech compression method aims to achieve an extremely low data rate while preserving a high quality of the compressed wideband speech.


A0436.pdf 


ZERO-REDUNDANCY ERROR PROTECTION FOR CELP SPEECH CODECS

Authors: Norbert Gortz

Institute for Network and System Theory University of Kiel, Germany Tel.: +49 431 77572 406, Fax: +49 431 77572 403 E-Mail: ng@techfak.uni-kiel.de

Volume 3 pages 1283 - 1286

ABSTRACT

This paper discusses the possibilities for channel-error protection when CELP-coded speech is transmitted over highly disturbed channels without additional bits for error control. Algorithms are given that do not require explicit channel models and that work without additional delay and with almost no additional complexity. Time-based and mutual dependencies of the speech codec parameters are exploited for channel-error detection and parameter extrapolation at the decoder. The algorithms are optimized by informal listening tests rather than by maximization of a mathematically tractable measure.
A0440.pdf 


LOW BIT RATE SPEECH CODING USING AN IMPROVED HSX MODEL

Authors: Ridha Matmti, Milan Jelinek and Jean-Pierre Adoul

Department of Electrical Engineering, University of Sherbrooke Sherbrooke, Quebec, Canada, J1K 2R1 E-mail:ridha@gel.usherb.ca

Volume 3 pages 1287 - 1290

ABSTRACT

This paper presents some improvements to the mixed Harmonic and Stochastic eXcitation (HSX) algorithm in the context of low bit rate speech coding (around 2.4 kbit/s). The dominant issue is the modeling of the excitation signal in order to improve the quality of the synthesized speech signal without increasing either the bit rate or the complexity. The pitch tracking algorithm is revised in order to increase the robustness and to reduce the complexity. The voicing analysis algorithm is also refined. Informal listening to the synthesized speech at 2.4 kbit/s shows a significant improvement.


A0550.pdf 


PHONETIC VOCODING WITH SPEAKER ADAPTATION

Authors: Carlos M. Ribeiro and Isabel M. Trancoso

INESC/ISEL-CEDET cmr@inesc.pt INESC/IST Isabel.Trancoso@inesc.pt INESC, Rua Alves Redol, 9, 1000 Lisbon, Portugal Phone: +351 1 3100314; FAX: +351 1 3145843

Volume 3 pages 1291 - 1294

ABSTRACT

This paper describes a phonetic vocoding scheme which relies on speaker adaptation to capture important speaker characteristics. These are typically lost in phonetic vocoders, which transmit only information about the phones that are recognized, together with some prosodic information. In our scheme, however, additional speaker characteristics are transmitted in vowel regions (average values of LSP coefficients for each phone). In informal listening tests, this additional information yielded potentially good speaker recognizability, while still achieving a rather low average bit rate, suitable for many transmission and storage applications. This work extends our previous phonetic vocoding scheme described in [5]. The vocoder is now fully quantized and the number of transmitted parameters has been significantly reduced.
A0656.pdf 


QUANTIZATION OF SPECTRAL SEQUENCES USING VARIABLE LENGTH SPECTRAL SEGMENTS FOR SPEECH CODING AT VERY LOW BIT RATE

Authors: Genevieve Baudoin (1), Jan Cernocky (1,2), Gerard Chollet (3)

(1) ESIEE, Dpt Signaux-telecommunications, BP 99, Noisy Le Grand, 93162 CEDEX, France, baudoing@esiee.fr (2) FEI VUT Brno, Purkynova 118, 61200 Brno, Czech Republic, cernocky@urel.fee.vutbr.cz (3) ENST, Dpt Signal, 46 rue Barrault, 75013 Paris, France, chollet@sig.enst.fr

Volume 3 pages 1295 - 1298

ABSTRACT

This paper deals with the coding of spectral envelope parameters for very low bit rate speech coding (below 500 bps). In order to obtain sufficient intelligibility, segmental techniques are necessary. Variable dimension vector quantization is one of these. We propose a new interpretation of previously published research by Chou-Lookabaugh [2] and Cernocky-Baudoin-Chollet [4,6] on the quantization of variable length sequences of spectral vectors, named respectively Variable to Variable length Vector Quantization (VVVQ) and Multigrams Quantization (MGQ). This interpretation gives a meaning to the Lagrange multiplier used in the optimization criterion of the VVVQ, and should allow new developments such as new models of the probability density of the source. We have also studied the influence of limiting the delay introduced by the method. It was found that a maximal delay of 400 ms is generally sufficient. Finally, we propose the introduction of long sequences in the segmental codebook by linear interpolation of shorter ones.


A0761.pdf 


ON MODELING EVENT FUNCTIONS IN TEMPORAL DECOMPOSITION BASED SPEECH CODING

Authors: S. Ghaemmaghami, M. Deriche, B. Boashash

Signal Processing and Avionic Research Centre, Queensland University of Technology, 2 George st, Brisbane, Q 4001, Australia. shahrokh@markov.eese.qut.edu.au, m.deriche@qut.edu.au, b.boashash@qut.edu.au

Volume 3 pages 1299 - 1302

ABSTRACT

Temporal Decomposition (TD) is an efficient technique for modeling speech spectral evolution through orthogonalization of the matrix of spectral parameters, which reduces the amount of spectral information in TD-based speech coding. We have shown in earlier work that "event" functions can be approximated by fixed-width Gaussian functions with a minor degradation in the reconstructed speech, leading to further bit-rate reduction in such systems. In this paper, through perceptually based spectral distortion measurement, we show the impact of event shape on speech quality. We also propose a new composite function and discuss its effect on coder performance using different combinations of spectral parameters in event detection and speech synthesis.
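
The fixed-width Gaussian approximation from the earlier work can be pictured with a small sketch in which one spectral parameter track is rebuilt as a sum of target values weighted by Gaussian event functions. This is a simplified scalar illustration under assumed names and values; it does not implement the composite function proposed in the paper, and the event functions are left unnormalised.

```python
import math


def gaussian_event(n, centre, width):
    """Fixed-width Gaussian event function evaluated at frame n."""
    return math.exp(-((n - centre) ** 2) / (2.0 * width ** 2))


def reconstruct_track(targets, centres, width, num_frames):
    """Approximate a parameter track as a weighted sum of event targets."""
    return [sum(a * gaussian_event(n, c, width) for a, c in zip(targets, centres))
            for n in range(num_frames)]


# Two well-separated events: the track passes through each target at its centre.
track = reconstruct_track(targets=[1.0, 2.0], centres=[0, 100], width=2.0,
                          num_frames=101)
```

With a fixed width, only the event locations and targets need to be transmitted, which is the source of the bit-rate saving mentioned in the abstract.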
A0764.pdf 


PHASE QUANTIZATION BY PITCH-CYCLE WAVEFORM CODING IN LOW BIT RATE SINUSOIDAL CODERS

Authors: Soledad Torres, F. Javier Casajús-Quirós

e-mail: martor@tel.uva.es ETSI Telecomunicación Universidad de Valladolid SPAIN e-mail: javier@gaps.ssr.upm.es ETSI Telecomunicación Universidad Politécnica de Madrid SPAIN

Volume 3 pages 1303 - 1306

ABSTRACT

A new phase coding algorithm is introduced in this paper, which works in the pitch-cycle waveform domain. It provides accurate phase coding at low bit cost. Its performance is analyzed inside a multiband excitation coder with improved onset representation. In this context, the introduction of original phase information by means of the proposed coding algorithm provides noticeable quality improvement without increasing the total bit rate of the coder.
A0772.pdf 


A PERCEPTUAL STUDY OF THE GREEK VOWEL SPACE USING SYNTHETIC STIMULI

Authors: A. Botinis*, M. Fourakis**, and J. W. Hawks***

*Linguistics Department, Athens University ** Department of Speech and Hearing Science, The Ohio State University *** School of Speech Pathology and Audiology, Kent State University E-mail: fourakis.l@osu.edu

Volume 3 pages 1307 - 1310

ABSTRACT

Four female native speakers of Modern Greek listened to 465 synthetic vowel tokens with F1 frequencies ranging from 250 to 800 Hz and F2 frequencies ranging from 900 to 2900 Hz in 50 Hz steps. They were asked to identify each stimulus as one of the five vowels of Modern Greek or to reject it if they thought it could not be a vowel of their language. The subjects rejected about 64 percent of the tokens as not possible vowels. The remaining points were plotted in an F1 by F2 space with the codes assigned by each subject and in a composite space, where only the points identified with the same response by at least three subjects were used. The results replicated those of Hawks and Fourakis [1], except that the code for the vowel [e] was assigned to many more points than the codes for the other vowels.


A0851.pdf 


MIXED MULTI-BAND EXCITATION CODER USING FREQUENCY DOMAIN MIXTURE FUNCTION (FDMF) FOR A LOW BIT-RATE SPEECH CODING

Authors: Woo-Jin Han, Sung-Joo Kim, Yung-Hwan Oh

Computer Science Dept. Korea Advanced Institute of Science and Technology Ku-song dong, Yu-song ku, Taejun, 305-701 Korea. E-mail: hwjketel@bulsai.kaist.ac.kr

Volume 3 pages 1311 - 1314

ABSTRACT

This paper describes the Mixed Multi-Band Excitation (MMBE) coder for low bit-rate speech coding. In MBE coders, there are significant differences in fine structure between the original and the synthetic spectrum. They are mainly due to the exclusive partition of voiced and unvoiced regions in the frequency domain and to the decision procedure based on an experimental threshold. The MMBE uses a frequency domain mixture function (FDMF) to overcome these drawbacks of the MBE coder. In addition, two analysis methods that do not need any threshold-based decision procedure are presented. The performance evaluation results show that the 2.6 kbps MMBE coder reduces the average spectral distortion by a clear margin compared with the 2.9 kbps MBE coder. The computational load of the proposed coder is sufficiently small for a real-time implementation on a modern DSP chip.
A0860.pdf 


Robust GSM Speech Decoding Using the Channel Decoder's Soft Output

Authors: Tim Fingscheidt, Olaf Scheufen

Institute of Communication Systems and Data Processing, Aachen University of Technology, Templergraben 55, D-52056 Aachen, Germany. Tel.: ++49 241 806963, Fax: ++49 241 8888186, E-Mail: Tim.Fingscheidt@ind.rwth-aachen.de http://www.ind.rwth-aachen.de

Volume 3 pages 1315 - 1318

ABSTRACT

In the digital mobile radio system GSM (Global System for Mobile Communications), there is a need to reduce the subjective effects of residual bit errors by error concealment techniques. Because the standard does not specify these algorithms bit exactly, there is room for new solutions to improve the decoding process. This contribution presents a new approach for optimum estimation of speech codec parameters [7] applied to the GSM system. It requires a soft-output channel decoder (e.g. the soft-output Viterbi algorithm, SOVA [8]) providing bit reliability information for the proposed parameter estimation process. Additionally, a priori knowledge about the residual redundancy in the sequence of codec parameters is exploited. The new method includes an inherent muting mechanism leading to a graceful degradation of speech quality in case of adverse transmission conditions. If the channel is error free, bit exactness as required by the GSM standard is preserved.
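
One way to picture how bit reliabilities can drive a parameter estimate is a mean-square estimator over all quantizer indices, sketched below. This is an assumed reading of "optimum estimation", not the paper's algorithm: the function name, the MSB-first bit order, the independent bit errors, and the toy codebook are all illustrative.

```python
def soft_estimate(received_bits, error_probs, codebook, prior=None):
    """Mean-square estimate of a quantized parameter from soft bit information.

    received_bits: hard bit decisions (0/1), MSB first
    error_probs:   per-bit error probability reported by the channel decoder
    codebook:      reconstruction value per index, len == 2 ** len(received_bits)
    prior:         optional a priori index probabilities (uniform if omitted)
    """
    n = len(received_bits)
    size = 1 << n
    if prior is None:
        prior = [1.0 / size] * size
    weights = []
    for index in range(size):
        # Probability of observing the received bits if `index` had been sent.
        likelihood = 1.0
        for pos in range(n):
            sent_bit = (index >> (n - 1 - pos)) & 1
            p_err = error_probs[pos]
            likelihood *= (1.0 - p_err) if sent_bit == received_bits[pos] else p_err
        weights.append(likelihood * prior[index])
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, codebook)) / total


codebook = [-3.0, -1.0, 1.0, 3.0]
clean = soft_estimate([1, 0], [0.0, 0.0], codebook)  # reliable bits
noisy = soft_estimate([1, 0], [0.5, 0.5], codebook)  # completely unreliable bits
```

With fully reliable bits the estimate equals the ordinary table lookup, matching the bit-exactness property in the abstract. With completely unreliable bits it falls back to the a priori mean, which for a zero-mean parameter behaves like muting.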
A1185.pdf 

Recordings


A LOW-BIT-RATE SPEECH CODER USING ADAPTIVE LINE SPECTRAL FREQUENCY PREDICTION

Authors: C. W. Seymour and A. J. Robinson

Cambridge University Engineering Department, Cambridge, CB2 1PZ, UK. {cws, ajr}@eng.cam.ac.uk

Volume 3 pages 1319 - 1322

ABSTRACT

This paper describes two aspects of a linear predictive coding (LPC) vocoder developed for operation on wide-band speech. The method for encoding the LPC parameters, based on the use of an adaptive predictor, is presented together with an extension to the vocoder model which enables it to operate on speech sampled at 16 kHz rather than 8 kHz. Good-quality operation on wide-band speech is achieved with an increase in bit rate of about 500 bits/s. Diagnostic rhyme test (DRT) results demonstrate the improvement in intelligibility gained through coding speech at the higher sample rate.
A1333.pdf 
