Prosodic Prominence Detection in Speech
F. Tamburini
published in International Symposium on Signal Processing and its Applications, 2003
summarized by Davidson, 2011/08/29
back to homepage
1. Introduction
- Prominence: a word or part of a word made prominent is perceived as standing out from its environment
- Prosodic prominence involves two phonetic features: pitch accent and stress
- Acoustic correlates of prominence (according to past studies):
- pitch movement
- overall syllable energy
- syllable duration
- spectral emphasis
2. The acoustic parameters
- All acoustic parameters must be normalized to avoid variations among different speakers
- Duration
- Syllable duration vs. Syllable nucleus duration:
|
|
Syllable duration |
Syllable nucleus duration |
Figure 1: Gaussian approximation of duration measures of non-prominent (solid lines) and prominent syllables (dash-dotted lines). |
- Almost the same recognition rates are obtained --> we can substitute the syllable duration to a rather reliable syllable nucleus duration
- Normalized by ROS (rate of speech, or simply speech rate) (# of phonemes per unit time)
- Energy
- Syllable nucleus RMS energy
- Normalized by dividing it over the mean energy of the utterance
- Fundamental frequency (F0) contour
- Post-process to smooth out the contour and a final interpolation between voiced regions to obtain a continuous profile
- Spectral emphasis
- RMS energy of the signal in 500~2000Hz obtained from a bandpass FIR (finite impulse response) filter
3. The prosodic Parameters
- Stress detector
- Past literature showed that the most reliable acoustic correlates of syllable stress are syllable duration and energy
- Stressed syllables exhibit a longer duration and greater energy in the mid-to-high frequency band
|
|
Prominent syllables |
Non-prominent syllables |
Figure 2: Scattered plots of prominent and non-prominent syllables in terms of log-normalized duration and log-normalized spectral emphasis |
- Overlapping regions exist due to the reason that stress is the only contributin parameters here, but prominence is dependent on both stress and pitch accent.
- Dashed lines represents the decision boundary formed by multivariate Gaussian distributions of the prominent and non-prominent syllables.
- Both acoustic parameters are logged to achieve a better fit to a Gaussian distribution.
- Pitch accent detector
- Intonation event method
- Use RFC model (rise/fall/connection) parameters to uniquely describe pitch accent shapes, called the TILT parameter set
- The F0 contour is divided into 25ms frames and data in each frame is linearly interpolated using a least median squares method to obtain robust regression and deletion of outliers
- Classify each frame as R/F/C depending on its gradient
- Merge all adjacent frames into one interval (an intonational event) if they have the same RFC classification
- Possible candidates of pitch accent are the intonational events that exhibits a rise followed by a fall profile, where the rise section is more relevant to prominence.
- Sluijter and van Heuven suggested that pitch accent can be detected by overall syllable energy and pitch variation.
- Pitch variation --> event amplitude (EvAmp) = |rise amplitude| + |fall amplitude| --> EvAmp * EvDur * EvRel where EvDur = event duration and EvRel = relevance of the event along the utterance (???)
- Intonation events
|
|
Prominent syllables |
Non-prominent syllables |
Figure 3: Scattered plots of prominent and non-prominent syllables in terms of log-normalized overall syllable energy and log-normalized intonational event parameters |
- As before, dashed lines represents the decision boundary formed by multivariate Gaussian distributions of the prominent and non-prominent syllables.
- Prominence detector
- Prominence detector = stress detector + pitch accent detector
- Prominence syllable = stress syllable or pitch accented syllable
- Experiment was done on a subset of TIMIT corpus
- Training set: 3637 syllables, 25 speakers
- Test set: 3643 syllables, 26 speakers
- Detection result
ˇ@ˇ@ˇ@ˇ@ˇ@Detected as Syllable typeˇ@ˇ@ˇ@ˇ@ˇ@ | Stressed | Pitch Accented | Stressed+ Pitch Accented | None |
Prominent | 650 | 53 | 280 | 271 |
Non-Prominent | 314 | 41 | 50 | 1984 |
- Recognition rate = (650+53+280+1983) / 3643 = 81.44%
- Insertion rate = (314+41+50) / 3643 = 11.12%
- Deletion rate = 271 / 3643 = 7.44%
4. Conclusions
- Performance agrees with the inter-rater agreement rate of around 80% in past research
- 82% for distinguishing high/low/neutral tone, rated by two human raters by inspecting pitch height only
- 77% for distinguishing high/low tone only
back to homepage
last updated: 2011/08/29