Prosodic Prominence Detection in Speech

F. Tamburini

published in International Symposium on Signal Processing and its Applications, 2003

summarized by Davidson, 2011/08/29

1. Introduction

Prominence: a word or part of a word made prominent is perceived as standing out from its environment
Prosodic prominence involves two phonetic features: pitch accent and stress
Acoustic correlates of prominence (according to past studies):
- pitch movement
- overall syllable energy
- syllable duration
- spectral emphasis

2. The acoustic parameters

All acoustic parameters must be normalized to reduce variation across speakers

Duration

Syllable duration vs. Syllable nucleus duration:

Figure 1 (panels: syllable duration, syllable nucleus duration): Gaussian approximation of the duration measures for non-prominent (solid lines) and prominent (dash-dotted lines) syllables.

Almost the same recognition rates are obtained with either measure --> the rather reliable syllable nucleus duration can be used in place of the syllable duration
Normalized by ROS (rate of speech, i.e. the number of phonemes per unit time)
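
The ROS normalization can be sketched as below. The summary does not give the exact formula, so multiplying the nucleus duration by the ROS (which expresses the duration in units of the utterance's average phoneme length) is an assumption, as are the function names and the choice of seconds as the time unit.

```python
def rate_of_speech(num_phonemes, utterance_seconds):
    """ROS as phonemes per second (seconds as the time unit is an assumption)."""
    if utterance_seconds <= 0:
        raise ValueError("utterance duration must be positive")
    return num_phonemes / utterance_seconds


def normalize_duration(nucleus_seconds, ros):
    """Assumed normalization: duration * ROS, i.e. the nucleus duration
    relative to the average phoneme length (1 / ROS) of the utterance."""
    return nucleus_seconds * ros


# The same 0.12 s nucleus in a fast and in a slow utterance (made-up values).
fast = normalize_duration(0.12, rate_of_speech(30, 2.0))  # ROS = 15 phonemes/s
slow = normalize_duration(0.12, rate_of_speech(30, 3.0))  # ROS = 10 phonemes/s
```

Under this assumption the same absolute duration counts as relatively longer in fast speech than in slow speech.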

Energy
- Syllable nucleus RMS energy
- Normalized by dividing it by the mean energy of the utterance
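
A minimal sketch of the energy parameter follows. It assumes that the nucleus boundaries are already known as sample indices and that "mean energy of the utterance" means the RMS over the whole utterance; both points, and the function names, are illustrative assumptions.

```python
import math


def rms(samples):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def normalized_nucleus_energy(utterance, start, end):
    """RMS energy of utterance[start:end] divided by the utterance-level RMS
    (taking the utterance RMS as the 'mean energy' is an assumption)."""
    utterance_rms = rms(utterance)
    if utterance_rms == 0:
        return 0.0
    return rms(utterance[start:end]) / utterance_rms


# Toy utterance: a loud segment (the "nucleus") between two quiet ones.
utterance = [0.1] * 100 + [0.4] * 100 + [0.1] * 100
loud = normalized_nucleus_energy(utterance, 100, 200)
quiet = normalized_nucleus_energy(utterance, 0, 100)

# Scaling the whole recording (e.g. a louder speaker) leaves the ratio unchanged.
louder = [3 * x for x in utterance]
loud_scaled = normalized_nucleus_energy(louder, 100, 200)
```
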
Fundamental frequency (F0) contour
- Post-processed by smoothing the contour, followed by a final interpolation between voiced regions to obtain a continuous profile
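
The F0 post-processing can be sketched as follows. The summary does not say which smoothing method is used, so the 3-point median filter here is an assumption, as is the representation of the contour as one F0 value per frame with 0.0 marking unvoiced frames. Leading and trailing unvoiced frames are held at the nearest voiced value.

```python
from statistics import median


def smooth_f0(f0, width=3):
    """Median-smooth voiced frames only; unvoiced frames (0.0) stay 0.0."""
    half = width // 2
    out = []
    for i, value in enumerate(f0):
        if value <= 0:
            out.append(0.0)
            continue
        window = [v for v in f0[max(0, i - half): i + half + 1] if v > 0]
        out.append(median(window))
    return out


def interpolate_unvoiced(f0):
    """Linearly interpolate across unvoiced gaps to get a continuous profile."""
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)
    out = list(f0)
    for i in range(voiced[0]):                  # leading unvoiced frames
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):    # trailing unvoiced frames
        out[i] = f0[voiced[-1]]
    for left, right in zip(voiced, voiced[1:]):
        for i in range(left + 1, right):        # frames inside a gap
            t = (i - left) / (right - left)
            out[i] = f0[left] + t * (f0[right] - f0[left])
    return out


# Made-up contour in Hz with one outlier (180) and an unvoiced gap.
contour = [0.0, 100.0, 102.0, 180.0, 104.0, 106.0, 0.0, 0.0, 0.0, 120.0, 122.0, 0.0]
continuous = interpolate_unvoiced(smooth_f0(contour))
```
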
Spectral emphasis
- RMS energy of the signal in the 500-2000 Hz band, obtained with a bandpass FIR (finite impulse response) filter
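
One way to compute such a band-limited RMS energy is sketched below. The filter design (a Hamming-windowed sinc with 101 taps) is an assumption, since the summary only states that a bandpass FIR filter is used; the function names and the 16 kHz sample rate in the example are also illustrative.

```python
import math


def bandpass_fir(low_hz, high_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc band-pass filter, built as the difference of
    two ideal low-pass responses (cutoffs at high_hz and low_hz)."""
    mid = (num_taps - 1) / 2

    def lowpass_tap(n, cutoff_hz):
        fc = cutoff_hz / sample_rate          # cutoff in cycles per sample
        x = n - mid
        if x == 0:
            return 2 * fc
        return math.sin(2 * math.pi * fc * x) / (math.pi * x)

    taps = []
    for n in range(num_taps):
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append((lowpass_tap(n, high_hz) - lowpass_tap(n, low_hz)) * window)
    return taps


def fir_filter(signal, taps):
    """Direct convolution, keeping only fully overlapped output samples
    (the signal must be at least as long as the filter)."""
    k = len(taps)
    return [sum(taps[j] * signal[i - j] for j in range(k))
            for i in range(k - 1, len(signal))]


def spectral_emphasis(signal, sample_rate):
    """RMS energy of the signal after 500-2000 Hz band-pass filtering."""
    band = fir_filter(signal, bandpass_fir(500.0, 2000.0, sample_rate))
    return math.sqrt(sum(x * x for x in band) / len(band))


def tone(freq_hz, sample_rate, num_samples=1600):
    """Unit-amplitude sine wave, used only as test input."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]


in_band = spectral_emphasis(tone(1000.0, 16000), 16000)   # inside the band
out_band = spectral_emphasis(tone(4000.0, 16000), 16000)  # outside the band
```

A tone inside the band keeps most of its energy, while a tone well outside the band is strongly attenuated.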

3. The prosodic parameters

Stress detector

Past literature showed that the most reliable acoustic correlates of syllable stress are syllable duration and energy

Stressed syllables exhibit a longer duration and greater energy in the mid-to-high frequency band

Figure 2 (panels: prominent syllables, non-prominent syllables): Scatter plots of prominent and non-prominent syllables in terms of log-normalized duration and log-normalized spectral emphasis

Overlapping regions exist because stress is the only contributing parameter here, whereas prominence depends on both stress and pitch accent.
Dashed lines represent the decision boundary formed by the multivariate Gaussian distributions of the prominent and non-prominent syllables.
The logarithm of both acoustic parameters is taken to achieve a better fit to a Gaussian distribution.
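
A decision rule of this kind can be sketched as follows: fit one two-dimensional Gaussian per class and assign a syllable to the class with the higher density, so that the decision boundary is the curve where the two densities are equal. Equal class priors are an assumption, the data points are invented toy values rather than results from the paper, and all names are illustrative.

```python
import math


def fit_gaussian(points):
    """Maximum-likelihood mean and 2x2 covariance of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def log_density(point, model):
    """Log of the bivariate Gaussian density at `point`."""
    (mx, my), (sxx, sxy, syy) = model
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mx, point[1] - my
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (maha + math.log(det)) - math.log(2 * math.pi)


def is_stressed(point, stressed_model, unstressed_model):
    """True if the stressed-class density is higher (equal priors assumed)."""
    return log_density(point, stressed_model) > log_density(point, unstressed_model)


# Invented toy features: (log-normalized duration, log-normalized spectral emphasis).
stressed = [(0.4, 0.5), (0.6, 0.3), (0.5, 0.7), (0.7, 0.6), (0.3, 0.4)]
unstressed = [(-0.4, -0.5), (-0.6, -0.2), (-0.5, -0.6), (-0.2, -0.4), (-0.3, -0.7)]

stressed_model = fit_gaussian(stressed)
unstressed_model = fit_gaussian(unstressed)

long_and_emphatic = is_stressed((0.5, 0.5), stressed_model, unstressed_model)
short_and_weak = is_stressed((-0.4, -0.5), stressed_model, unstressed_model)
```
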

Pitch accent detector

Intonation events

Figure 3 (panels: prominent syllables, non-prominent syllables): Scatter plots of prominent and non-prominent syllables in terms of log-normalized overall syllable energy and log-normalized intonational event parameters

As before, dashed lines represent the decision boundary formed by the multivariate Gaussian distributions of the prominent and non-prominent syllables.

Prominence detector
- Prominence detector = stress detector + pitch accent detector
- Prominent syllable = stressed syllable or pitch-accented syllable
- The experiment was done on a subset of the TIMIT corpus
  - Training set: 3637 syllables, 25 speakers
  - Test set: 3643 syllables, 26 speakers
- Detection result (rows: actual syllable type; columns: detected as)

  Syllable type   Stressed   Pitch Accented   Stressed + Pitch Accented   None
  Prominent       650        53               280                         271
  Non-Prominent   314        41               50                          1984

- Recognition rate = (650+53+280+1984) / 3643 = 81.44%
- Insertion rate = (314+41+50) / 3643 = 11.12%
- Deletion rate = 271 / 3643 = 7.44%
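
The three rates can be recomputed from the detection table with a few lines of Python. The counts are taken from the table; the dictionary layout and variable names are only illustrative.

```python
# Rows: actual syllable type; columns: what the detector reported.
confusion = {
    "Prominent": {
        "Stressed": 650, "Pitch Accented": 53,
        "Stressed + Pitch Accented": 280, "None": 271,
    },
    "Non-Prominent": {
        "Stressed": 314, "Pitch Accented": 41,
        "Stressed + Pitch Accented": 50, "None": 1984,
    },
}

total = sum(sum(row.values()) for row in confusion.values())

# A syllable is detected as prominent if it is flagged as stressed,
# pitch accented, or both, i.e. any column other than "None".
hits = sum(n for label, n in confusion["Prominent"].items() if label != "None")
insertions = sum(n for label, n in confusion["Non-Prominent"].items() if label != "None")
deletions = confusion["Prominent"]["None"]
correct_rejections = confusion["Non-Prominent"]["None"]

recognition_rate = 100 * (hits + correct_rejections) / total
insertion_rate = 100 * insertions / total
deletion_rate = 100 * deletions / total

print(f"{recognition_rate:.2f}% {insertion_rate:.2f}% {deletion_rate:.2f}%")
```

The three rates partition the test set, so they sum to 100%.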

4. Conclusions

Performance agrees with the inter-rater agreement rate of around 80% in past research
- 82% for distinguishing high/low/neutral tone, rated by two human raters by inspecting pitch height only
- 77% for distinguishing high/low tone only

last updated: 2011/08/29