The procedure from human voice production to voice recognition involves the following steps:

  1. Rapid opening and closing of the vocal cords (and thus of the glottis, the opening between them), which generates vibration in the airflow.
  2. Resonance of the pharyngeal cavity, nasal cavity, and oral cavity.
  3. Propagation of the vibration through the air.
  4. The vibration of the ear drum (or tympanum).
  5. The reception of the inner ear.
  6. The recognition by the brain.
The following diagram demonstrates the production mechanism of the human voice.

The production mechanism of human voices.
Driven by the air pushed from the lungs and the resulting pressure at the glottis, the vocal cords open and close very quickly, which generates vibrations in the air. These vibrations are modulated by the resonances of the pharyngeal, nasal, and oral cavities, which gives each voice its distinct timbre.

The following figure demonstrates the airflow velocity around the glottis and the voice signals measured around the mouth.

Airflow velocity around the glottis and the resultant voice signals

You can observe the movement of the vocal cords from the following link:

http://www.humnet.ucla.edu/humnet/linguistics/faciliti/demos/vocalfolds/vocalfolds.htm

In fact, it is not easy to capture the movements of the vocal cords because they vibrate at a high frequency. High-speed cameras are needed for this purpose, for instance:

http://www.kayelemetrics.com/Product%20Info/9700/9700.htm

We can describe the production of human voices with a source-filter model, where the source is the airflow caused by the vocal cords, and the filter consists of the pharyngeal, nasal, and oral cavities. The following figure shows a representative spectrum for each stage:

Source-filter model and the corresponding spectra

We can also use the following block diagram to represent the source-filter model of human voice production:

Block diagram representation of source-filter model
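
To make the source-filter idea concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions, not a model taken from the figures above: the source is idealized as a train of unit pulses (one per glottal cycle), and the filter is a cascade of two-pole resonators, each standing in for one cavity resonance. The function names, the 120 Hz source rate, and the resonance frequencies and bandwidths are all hypothetical values chosen for illustration.

```python
import math

def impulse_train(f0, sample_rate, duration):
    """Idealized source: one unit pulse per glottal cycle (assumption)."""
    n = int(round(sample_rate * duration))
    period = int(round(sample_rate / f0))  # samples per glottal cycle
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_freq, bandwidth, sample_rate):
    """Two-pole resonator standing in for a single cavity resonance."""
    r = math.exp(-math.pi * bandwidth / sample_rate)   # pole radius (< 1, stable)
    theta = 2.0 * math.pi * center_freq / sample_rate  # pole angle
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0=120.0,
                     resonances=((700.0, 130.0), (1200.0, 70.0)),
                     sample_rate=16000,
                     duration=0.1):
    """Pass the source through each resonator in turn (the 'filter')."""
    source = impulse_train(f0, sample_rate, duration)
    voice = source
    for center_freq, bandwidth in resonances:
        voice = resonator(voice, center_freq, bandwidth, sample_rate)
    return source, voice

source, voice = synthesize_vowel()
```

Changing `f0` alters only the spacing of the pulses in `source`, while changing `resonances` alters only how the filter shapes them. This separation is the point of the source-filter view; a realistic glottal waveform and vocal-tract filter would be considerably more complicated.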

In general, regular vibration of the glottis generates quasi-periodic voiced sounds. If the source is instead an irregular airflow, the result is an unvoiced sound. Take the utterance of "six" for example:

Unvoiced and voiced sounds

We can clearly observe that "s" and "k" are unvoiced sounds, while "i" is a voiced sound.
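
The difference between a quasi-periodic and an irregular source can also be seen numerically. The sketch below uses the zero-crossing rate, a common heuristic that is not part of the discussion above, and applies it to two synthetic stand-ins rather than to real speech: a 150 Hz sine wave for a voiced sound and uniform random noise for an unvoiced sound. The signal parameters and variable names are assumptions made for illustration.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)

sample_rate = 16000  # samples per second (assumed)
n = 1600             # 0.1 second of signal

# Stand-in for a voiced sound: a regular, periodic waveform.
voiced_like = [math.sin(2.0 * math.pi * 150.0 * i / sample_rate)
               for i in range(n)]

# Stand-in for an unvoiced sound: irregular noise (seeded for repeatability).
rng = random.Random(0)
unvoiced_like = [rng.uniform(-1.0, 1.0) for _ in range(n)]

zcr_voiced = zero_crossing_rate(voiced_like)
zcr_unvoiced = zero_crossing_rate(unvoiced_like)
print("voiced-like ZCR:", zcr_voiced)
print("unvoiced-like ZCR:", zcr_unvoiced)
```

The periodic signal changes sign only about twice per cycle, so its zero-crossing rate is low, whereas the noise changes sign roughly every other sample. Real speech is less clear-cut, so this measure is usually combined with others such as energy.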

For Mandarin, almost all unvoiced sounds occur at the beginning of a syllable. Take the utterance of "清" as in "清華大學" (Tsing Hua University) for example:

  1. No vibration from the glottis. Close your teeth and push your tongue tip forward against the lower teeth to generate the unvoiced sound "ㄑ" with a jet of airflow.
  2. Keep almost the same position but start the glottis vibrating to pronounce the voiced "ㄧ".
  3. Keep the glottis vibrating but retract your tongue to pronounce the final voiced "ㄥ".

If you put your hand on your throat, you can feel the vibration of the glottis.

Here are some terms in both English and Chinese for your reference:

  1. Cochlea:耳蝸
  2. Phoneme:音素、音位
  3. Phonics:聲學;聲音基礎教學法(以聲音為基礎進而教拼字的教學法)
  4. Phonetics:語音學
  5. Phonology:音系學、語音體系
  6. Prosody:韻律學;作詩法
  7. Syllable:音節
  8. Tone:音調
  9. Alveolar:齒槽音
  10. Silence:靜音
  11. Noise:雜訊
  12. Glottis:聲門
  13. Larynx:喉頭
  14. Pharynx:咽頭
  15. Pharyngeal:咽部的,喉音的
  16. Velum:軟顎
  17. Vocal cords:聲帶
  18. Esophagus:食管
  19. Diaphragm:橫隔膜
  20. Trachea:氣管

Audio Signal Processing and Recognition (音訊處理與辨識)