ABSTRACT
Most state-of-the-art TTS synthesizers are based on a technique known as synthesis by concatenation, in which speech is produced by concatenating elementary speech units. The design of a high-quality TTS system implies the storage of a large number of segments. To ease this storage requirement, this paper proposes a very low complexity coder that compresses unit databases at toll quality. Particular attention has been paid to the databases used by the MBROLA synthesizer, which are composed of fixed-length pitch periods with constrained harmonic phases. The coder developed here exploits this special characteristic to reach compression ratios of 7 to 9 without degrading the speech quality produced by the synthesizer, and at very limited computational cost.
ABSTRACT
This paper describes a low bit-rate segmental formant vocoder. The formants are estimated using a mixture of Gaussians whose means are constrained to vary linearly with time within a segment. A new method of smoothing the power spectrum has been used to improve the Gaussian-mixture modelling. Pitch is estimated using the autocorrelation function, and voicing is detected from the autocorrelation function and the energy in the spectrum. Optimal segment boundaries are obtained with a dynamic programming procedure based on the power-normalised log-likelihood of each segment. Magnitude-only sinusoidal synthesis is then used to synthesise speech from the estimated spectrum. Using multiple codebooks, an average bit-rate of 500 bps has been obtained.
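The autocorrelation pitch estimator mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the search range, frame length, and synthetic test signal are assumptions.

```python
import numpy as np

def estimate_pitch_autocorr(x, fs, fmin=60.0, fmax=400.0):
    """Autocorrelation pitch estimate: the lag of the largest
    autocorrelation peak within a plausible pitch-period range."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lag_min = int(fs / fmax)                          # shortest period
    lag_max = int(fs / fmin)                          # longest period
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / lag

# Synthetic voiced frame: 200 Hz fundamental with decaying harmonics
fs = 8000
t = np.arange(0, 0.04, 1 / fs)
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 5))
f0 = estimate_pitch_autocorr(x, fs)  # close to 200 Hz
```

In practice the same autocorrelation peak height (relative to the zero-lag energy) also serves as a simple voicing cue, which matches the abstract's use of the autocorrelation function for both tasks.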
ABSTRACT
Voice mimic systems using articulatory codebooks require an initial estimate of the vocal tract shape in the vicinity of the global optimum. For this purpose, a large set of corresponding articulatory and acoustic data must be gathered in the articulatory codebook, so searching and accessing the codebook becomes a difficult task. In this paper, the design of an articulatory codebook is presented in which an acoustic network sub-samples the acoustic space such that vocal tract model shapes are ordered and clustered in the network according to acoustic parameters. Another issue addressed in this paper concerns estimating the trajectory of vocal tract shapes as they change with time. Since the inverse mapping from acoustic parameters to model shape does not have a unique solution, several vocal tract shape variations are possible. Therefore, a dynamic optimization of trajectories has been developed. This optimization uses the dynamic properties of each articulatory parameter to estimate its next position.
ABSTRACT
In voice coding applications where there is no constraint on the encoding delay, such as store-and-forward message systems or voice storage, segment coding techniques can be used to reduce the data rate without increasing the level of distortion. For low data rate linear predictive coding schemes, increasing the encoding delay allows one to exploit long-term temporal stationarities on an inter-frame basis, thus reducing the transmission bandwidth or storage needs of the speech signal. Transform coding has previously been applied to exploit both inter- and intra-frame correlation, but has been limited to short-term spectral redundancy [1][2]. This paper investigates the potential for data rate reduction by extending segment coding techniques to identify redundancies in the LPC residual. Initial tests indicate a potential 40% average reduction in data rate for a given subjective speech quality.
ABSTRACT
A performance evaluation method for objective measures that estimate the subjective quality of coded speech is proposed and applied to the comparison of existing objective quality measures. The measure based on Bark spectrum distortion performs best. Comparing its estimation error with the statistical reliability of subjective quality assessment shows that objective quality measurement can be as reliable as subjective measurement for some testing conditions.
ABSTRACT
This paper describes a system for speech coding designed to operate at 300 bits/sec and below. A continuous speech recogniser is used to transcribe incoming speech as a sequence of sub-word units termed acoustic segments. Prosodic information is combined with segment identity to form a serial data stream suitable for transmission. A rule-based system maps segment identity and prosodic information to parameters suitable for driving a parallel formant speech synthesiser. Acoustic segment Hidden Markov Models (HMMs) are shown to perform as well as conventional phone HMMs during recognition. A segment error rate of 3.8% was achieved in a speaker-dependent, task-dependent configuration. An average data rate of 262 bits/sec was obtained. Speech from the synthesiser was better than that obtainable from a purely textual representation, though not as good as 2400 bits/sec Linear Predictive Coding (LPC) vocoded speech.
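A back-of-the-envelope bit budget shows how a rate of this order can arise from transmitting segment identities plus prosody. All counts below are illustrative assumptions, not figures from the paper.

```python
import math

# Hypothetical bit budget for a recognition-based coder
# (inventory size, prosody bits, and segment rate are assumed values)
n_segment_types = 64                                  # assumed unit inventory
bits_per_identity = math.ceil(math.log2(n_segment_types))  # 6 bits/segment
bits_per_prosody = 10                                 # assumed pitch + duration
segments_per_sec = 16                                 # assumed segment rate

rate = segments_per_sec * (bits_per_identity + bits_per_prosody)
# rate = 256 bits/sec, the same order as the 262 bits/sec reported
```

The point of the sketch is that the rate is dominated by the segment rate times a small per-segment payload, which is why such schemes sit far below frame-by-frame vocoders like 2400 bits/sec LPC.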