Word Stress Assessment for Computer Aided Language Learning
J. P. Arias, N. B. Yoma, and H. Vivanco
published in Interspeech, 2009
summarized by Davidson, 2011/09/09
1. Introduction
- Language and text independent
- No transcript is required
- But requires a reference utterance
- The reference and student utterances are compared to decide whether the stress patterns are the same.
- F0 contour and frame energy contour are used for the comparison.
- A DTW algorithm over MFCCs is used to align the two utterances (a minimal alignment sketch follows this list).
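A minimal sketch of this alignment step, assuming librosa is available; the file names, MFCC settings, and the energy-contour comparison at the end are illustrative choices, not the paper's exact configuration.

```python
# Sketch: align a reference and a student utterance with DTW over MFCCs,
# then read contours off the warping path for comparison.
# Assumptions: librosa is installed; 'reference.wav' and 'student.wav' are
# hypothetical file names; MFCC settings are generic defaults, not the paper's.
import librosa
import numpy as np

sr = 16000
ref, _ = librosa.load("reference.wav", sr=sr)
stu, _ = librosa.load("student.wav", sr=sr)

# Frame-level MFCCs (rows = coefficients, columns = frames)
mfcc_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
mfcc_stu = librosa.feature.mfcc(y=stu, sr=sr, n_mfcc=13)

# DTW on the MFCC sequences; wp is the warping path of (ref_frame, stu_frame) pairs
D, wp = librosa.sequence.dtw(X=mfcc_ref, Y=mfcc_stu, metric="euclidean")
wp = wp[::-1]  # the path is returned end-to-start

# Example comparison along the path: correlation of the log-energy contours
rms_ref = librosa.feature.rms(y=ref)[0]
rms_stu = librosa.feature.rms(y=stu)[0]
e_ref = np.log(rms_ref[wp[:, 0]] + 1e-8)
e_stu = np.log(rms_stu[wp[:, 1]] + 1e-8)
print("energy-contour correlation:", np.corrcoef(e_ref, e_stu)[0, 1])
```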
2. The proposed system
Figure 1: Block diagram of the proposed system
- Both utterances are aligned by DTW with MFCCs.
- The F0 and frame energy contours along the alignment are compared for trend similarity to make the stress decision.
- Duration
- Syllable duration vs. Syllable nucleus duration:
Figure: Gaussian approximation of duration measures of non-prominent (solid lines) and prominent (dash-dotted lines) syllables; left panel: syllable duration, right panel: syllable nucleus duration.
- Nearly identical recognition rates are obtained with either measure, so the more reliably segmented syllable nucleus duration can be substituted for the full syllable duration.
- Duration is normalized by the ROS (rate of speech, or simply speech rate), i.e., the number of phonemes per unit time.
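A small sketch of this normalization, assuming the nucleus boundaries and phoneme count are already known from the alignment; all variable names are hypothetical.

```python
# Sketch: normalize syllable nucleus duration by rate of speech (ROS).
# Assumed inputs (hypothetical): nucleus start/end times in seconds,
# total number of phonemes, and total utterance duration in seconds.
def ros_normalized_duration(nucleus_start, nucleus_end, n_phonemes, utt_duration):
    ros = n_phonemes / utt_duration          # phonemes per second
    nucleus_duration = nucleus_end - nucleus_start
    return nucleus_duration * ros            # duration expressed in "phoneme units"

print(ros_normalized_duration(1.20, 1.35, 32, 2.5))  # 0.15 s * 12.8 phonemes/s = 1.92
```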
- Energy
- Syllable nucleus RMS energy
- Normalized by dividing by the mean energy of the utterance
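A corresponding sketch for the energy feature, again with hypothetical inputs.

```python
import numpy as np

# Sketch: syllable nucleus RMS energy normalized by the utterance mean energy.
# 'signal' holds the utterance samples; (start, end) index the nucleus region.
def normalized_nucleus_energy(signal, start, end):
    nucleus_rms = np.sqrt(np.mean(signal[start:end] ** 2))
    utterance_rms = np.sqrt(np.mean(signal ** 2))
    return nucleus_rms / utterance_rms
```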
- Fundamental frequency (F0) contour
- The contour is post-processed to smooth it, with a final interpolation between voiced regions to obtain a continuous profile
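A rough sketch of this post-processing, assuming an F0 track with zeros in unvoiced frames; median filtering stands in for whatever smoothing the paper actually uses.

```python
import numpy as np
from scipy.signal import medfilt

# Sketch: smooth an F0 track and interpolate across unvoiced frames so the
# contour is continuous. f0 is a 1-D array with 0 in unvoiced frames.
def continuous_f0(f0, kernel=5):
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    smoothed = f0.copy()
    smoothed[voiced] = medfilt(f0[voiced], kernel_size=kernel)
    # Linear interpolation between voiced regions fills the unvoiced gaps
    idx = np.arange(len(f0))
    smoothed[~voiced] = np.interp(idx[~voiced], idx[voiced], smoothed[voiced])
    return smoothed
```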
- Spectral emphasis
- RMS energy of the signal in the 500-2000 Hz band, obtained with a bandpass FIR (finite impulse response) filter
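A sketch of the spectral-emphasis measure, assuming a 16 kHz signal; the filter order is an arbitrary illustrative choice, not taken from the paper.

```python
import numpy as np
from scipy.signal import firwin, lfilter

# Sketch: RMS energy in the 500-2000 Hz band via a bandpass FIR filter.
def spectral_emphasis(signal, sr=16000, numtaps=101):
    taps = firwin(numtaps, [500.0, 2000.0], pass_zero=False, fs=sr)
    band = lfilter(taps, 1.0, signal)
    return np.sqrt(np.mean(band ** 2))
```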
3. Experiments
- Stress detector
- Past literature showed that the most reliable acoustic correlates of syllable stress are syllable duration and energy
- Stressed syllables exhibit a longer duration and greater energy in the mid-to-high frequency band
Figure 2: Scatter plots of prominent and non-prominent syllables in terms of log-normalized duration and log-normalized spectral emphasis
- Overlapping regions exist because stress is the only contributing parameter here, whereas prominence depends on both stress and pitch accent.
- Dashed lines represent the decision boundary formed by multivariate Gaussian distributions of the prominent and non-prominent syllables.
- Both acoustic parameters are log-transformed to better fit a Gaussian distribution (a minimal classifier sketch follows).
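A minimal sketch of this two-Gaussian decision rule, assuming equal class priors and hypothetical training arrays of (log duration, log spectral emphasis) pairs.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: fit one multivariate Gaussian per class on (log duration, log spectral
# emphasis) and classify a syllable by the higher likelihood. Training arrays
# X_stressed / X_unstressed (shape [n, 2]) are hypothetical placeholders.
def fit_gaussian(X):
    return multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))

def is_stressed(x, g_stressed, g_unstressed):
    # Equal priors assumed; the decision boundary is where the two densities are equal.
    return g_stressed.pdf(x) > g_unstressed.pdf(x)

# g_s = fit_gaussian(np.log(X_stressed))
# g_u = fit_gaussian(np.log(X_unstressed))
# stressed = is_stressed(np.log([duration_norm, emphasis_norm]), g_s, g_u)
```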
- Pitch accent detector
- Intonation event method
- Use RFC model (rise/fall/connection) parameters to uniquely describe pitch accent shapes, called the TILT parameter set
- The F0 contour is divided into 25 ms frames, and the data in each frame are fitted with a line using a least-median-of-squares method, which gives a robust regression and removes outliers
- Classify each frame as R/F/C depending on its gradient
- Merge all adjacent frames into one interval (an intonational event) if they have the same RFC classification
- Candidate pitch accents are the intonational events that exhibit a rise followed by a fall, where the rise section is more relevant to prominence (a detection sketch follows this subsection).
- Sluijter and van Heuven suggested that pitch accent can be detected by overall syllable energy and pitch variation.
- Pitch variation is captured by the event amplitude, EvAmp = |rise amplitude| + |fall amplitude|, combined into the score EvAmp * EvDur * EvRel, where EvDur is the event duration and EvRel is the relevance of the event along the utterance (the exact definition of EvRel is not spelled out).
- Intonation events
Figure 3: Scatter plots of prominent and non-prominent syllables in terms of log-normalized overall syllable energy and log-normalized intonational event parameters
- As before, dashed lines represent the decision boundary formed by multivariate Gaussian distributions of the prominent and non-prominent syllables.
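A rough sketch of the rise/fall/connection classification and event merging described above, using plain least squares instead of least-median-of-squares for brevity; the frame length and slope threshold are illustrative values, not the paper's.

```python
import numpy as np

# Sketch: classify fixed-length F0 frames as Rise/Fall/Connection by the slope
# of a fitted line, then merge runs of equal labels into intonational events.
# f0 is a continuous contour; with a 5 ms F0 step, frame_len=5 gives 25 ms frames.
def rfc_events(f0, frame_len=5, slope_thr=0.5):
    labels = []
    for i in range(0, len(f0) - frame_len + 1, frame_len):
        seg = f0[i:i + frame_len]
        slope = np.polyfit(np.arange(len(seg)), seg, 1)[0]  # plain least squares here
        labels.append("R" if slope > slope_thr else "F" if slope < -slope_thr else "C")
    # Merge adjacent frames with the same label into (label, start_frame, end_frame)
    events, start = [], 0
    for j in range(1, len(labels) + 1):
        if j == len(labels) or labels[j] != labels[start]:
            events.append((labels[start], start, j - 1))
            start = j
    return events

# Candidate pitch accents: a Rise event immediately followed by a Fall event.
```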
- Prominence detector
- Prominence detector = stress detector + pitch accent detector
- Prominent syllable = stressed syllable or pitch-accented syllable (a one-line sketch follows)
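The combination rule itself is just a logical OR of the two detector outputs; a trivially small sketch (function name hypothetical):

```python
# Sketch: a syllable is prominent if either detector fires.
def is_prominent(stress_detected: bool, pitch_accent_detected: bool) -> bool:
    return stress_detected or pitch_accent_detected
```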
- Experiment was done on a subset of TIMIT corpus
- Training set: 3637 syllables, 25 speakers
- Test set: 3643 syllables, 26 speakers
- Detection result
Syllable type \ Detected as | Stressed | Pitch Accented | Stressed + Pitch Accented | None
Prominent | 650 | 53 | 280 | 271
Non-Prominent | 314 | 41 | 50 | 1984
- Recognition rate = (650+53+280+1984) / 3643 = 81.44%
- Insertion rate = (314+41+50) / 3643 = 11.12%
- Deletion rate = 271 / 3643 = 7.44%
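The three rates follow directly from the table; a short check of the arithmetic:

```python
# Reproduce the rates from the detection table (3643 test syllables).
prominent     = {"stressed": 650, "pitch": 53, "both": 280, "none": 271}
non_prominent = {"stressed": 314, "pitch": 41, "both": 50, "none": 1984}
total = 3643

correct = prominent["stressed"] + prominent["pitch"] + prominent["both"] + non_prominent["none"]
insertions = non_prominent["stressed"] + non_prominent["pitch"] + non_prominent["both"]
deletions = prominent["none"]

print(f"recognition rate: {correct / total:.2%}")    # 81.44%
print(f"insertion rate:   {insertions / total:.2%}")  # 11.12%
print(f"deletion rate:    {deletions / total:.2%}")   # 7.44%
```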
4. Results and discussion
- The performance is in line with inter-rater agreement rates of around 80% reported in past research
- 82% for distinguishing high/low/neutral tone, rated by two human raters by inspecting pitch height only
- 77% for distinguishing high/low tone only
last updated: 2011/08/29