The process from human voice production to voice recognition involves the following steps:
- Rapid opening and closing of the vocal cords (or glottis), which generates vibration in the airflow.
- Resonance of the pharyngeal, nasal, and oral cavities.
- Vibration of the air.
- Vibration of the eardrum (or tympanum).
- Reception by the inner ear.
- Recognition by the brain.
The following diagram demonstrates the production mechanism of the human voice.
The production mechanism of human voices.
Due to the pressure at the glottis and the air pushed from the lungs, the vocal cords open and close very quickly, which generates vibrations in the air. These vibrations are modulated by the resonances of the pharyngeal, nasal, and oral cavities, which gives the voice its timbre. In other words:
- The vibration frequency of the vocal cords determines the pitch of the voice.
- The positions and shapes of the lips, tongue, and nose determine the timbre.
- The compression from the lungs determines the loudness of the voice.
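The three attributes above can be illustrated on synthetic tones. The following is a minimal sketch, not a method taken from this text: it assumes an 8 kHz sampling rate, uses RMS amplitude as a rough stand-in for loudness, uses the relative strength of harmonics as a stand-in for timbre, and estimates pitch with a plain autocorrelation search. The function names (`synth_tone`, `rms`, `estimate_pitch`) and all parameter values are hypothetical.

```python
import math

def synth_tone(f0_hz, amplitude, harmonics=(1.0,), sample_rate=8000, duration_s=0.1):
    """Sum of sinusoids at multiples of f0_hz; `harmonics` holds relative amplitudes."""
    n = int(sample_rate * duration_s)
    return [
        amplitude * sum(
            h * math.sin(2 * math.pi * (k + 1) * f0_hz * i / sample_rate)
            for k, h in enumerate(harmonics)
        )
        for i in range(n)
    ]

def rms(signal):
    """Root-mean-square amplitude, used here as a rough loudness measure."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def estimate_pitch(signal, sample_rate=8000, fmin_hz=50.0, fmax_hz=500.0):
    """Pick the lag with the largest autocorrelation and convert it to Hz."""
    min_lag = int(sample_rate / fmax_hz)
    max_lag = int(sample_rate / fmin_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Same pitch (200 Hz), different loudness and timbre. All values are assumptions.
soft = synth_tone(200.0, amplitude=0.1)
loud = synth_tone(200.0, amplitude=0.5)
bright = synth_tone(200.0, amplitude=0.5, harmonics=(1.0, 0.5, 0.25))

for name, tone in (("soft", soft), ("loud", loud), ("bright", bright)):
    print(name, "pitch (Hz):", estimate_pitch(tone), "rms:", rms(tone))
```

In this sketch the pitch estimate stays the same across the three tones, while the RMS value follows the amplitude and the added harmonics change only the waveform shape. Real speech is far less regular, so a practical pitch tracker would need framing and additional safeguards.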
The following figure demonstrates the airflow velocity around the glottis and the voice signals measured around the mouth.
Airflow velocity around the glottis and the resulting voice signals
You can observe the movement of the vocal cords at the following link:
http://www.humnet.ucla.edu/humnet/linguistics/faciliti/demos/vocalfolds/vocalfolds.htm
In fact, the movement of the vocal cords is difficult to capture because they vibrate at high frequency, so high-speed cameras are needed for this purpose, for instance:
http://www.kayelemetrics.com/Product%20Info/9700/9700.htm
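To see why an ordinary camera falls short, it helps to count how many frames land within one open-close cycle of the vocal cords. The sketch below is only a back-of-the-envelope calculation: the 200 Hz vibration frequency and both frame rates are assumed, illustrative values, not figures from this text or from any product specification, and `frames_per_cycle` is a hypothetical helper.

```python
def frames_per_cycle(frame_rate_fps, vibration_hz):
    """Number of camera frames captured during one open-close cycle."""
    return frame_rate_fps / vibration_hz

# Assumed, illustrative values (not taken from the text or any product spec).
assumed_vibration_hz = 200.0
for fps in (30.0, 4000.0):
    print(fps, "fps ->", frames_per_cycle(fps, assumed_vibration_hz), "frames per cycle")
```

Under these assumptions, a 30 fps camera records less than one frame per cycle, so individual cycles cannot be resolved, whereas a camera running at a few thousand frames per second records many frames per cycle.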
We can describe the production of the human voice with a source-filter model, in which the source is the airflow through the vibrating vocal cords and the filter comprises the pharyngeal, nasal, and oral cavities. The following figure shows a representative spectrum for each stage:
Source-filter model and the corresponding spectra
We can also use the following block diagram to represent the source-filter model of human voice production:
Block diagram representation of source-filter model
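The source-filter idea can be sketched in a few lines of code. The following is a simplified illustration, not the model used in this text: the source is approximated by an impulse train at the pitch period, and the filter by a cascade of two-pole resonators, one per resonance. The resonance frequencies and bandwidths, the 8 kHz sampling rate, and all function names are assumptions chosen only for illustration.

```python
import math

def impulse_train(f0_hz, sample_rate, n_samples):
    """Crude glottal source: one unit impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius (< 1, stable)
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0_hz=120.0,
                     formants=((700.0, 130.0), (1200.0, 70.0)),
                     sample_rate=8000,
                     duration_s=0.2):
    """Pass the source through one resonator per (center_hz, bandwidth_hz) pair."""
    n = int(sample_rate * duration_s)
    signal = impulse_train(f0_hz, sample_rate, n)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

vowel = synthesize_vowel()
print(len(vowel), "samples synthesized")
```

Changing `f0_hz` changes the spacing of the source impulses (the pitch), while changing `formants` changes only the filter (the timbre), which mirrors the separation of source and filter in the block diagram.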
In general, regular vibration of the glottis generates quasi-periodic voiced sounds, whereas an irregular airflow source produces unvoiced sounds. Take the utterance of "six" as an example:
Unvoiced and voiced sounds
We can clearly observe that "s" and "k" are unvoiced sounds, while "i" is a voiced sound.
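One common heuristic for telling the two kinds of sound apart (not necessarily the approach used in this text) is the zero-crossing rate: a quasi-periodic voiced frame changes sign only a few times per pitch period, while a noise-like unvoiced frame changes sign much more often. The sketch below uses a sine wave and uniform random noise as stand-ins for voiced and unvoiced frames. The signals, the 8 kHz sampling rate, and the name `zero_crossing_rate` are assumptions for illustration.

```python
import math
import random

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(signal) - 1)

sample_rate = 8000
n = 800  # 0.1 s of signal

# Stand-in for a voiced frame: a 150 Hz sine wave (assumed value).
voiced_like = [math.sin(2 * math.pi * 150.0 * i / sample_rate) for i in range(n)]

# Stand-in for an unvoiced frame: uniform noise with a fixed seed.
rng = random.Random(0)
unvoiced_like = [rng.uniform(-1.0, 1.0) for _ in range(n)]

print("voiced-like ZCR:  ", zero_crossing_rate(voiced_like))
print("unvoiced-like ZCR:", zero_crossing_rate(unvoiced_like))
```

On real recordings the zero-crossing rate is usually combined with frame energy, and any decision threshold has to be tuned to the data.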
For Mandarin, almost all unvoiced sounds occur at the beginning of a syllable. Take the utterance of "清" (qīng) as in "清華大學" (Tsing Hua University) as an example:
- No vibration of the glottis. Close your teeth and push your tongue tip forward against the lower teeth to generate the unvoiced sound "ㄑ" with a jet of airflow.
- Keep almost the same position but start the glottis vibrating to pronounce the voiced "ㄧ".
- Keep the glottis vibrating but retract your tongue to pronounce the final voiced "ㄥ".
Here are some terms in both English and Chinese for your reference:
- Vocal cords: 聲帶
- Glottis: 聲門
Audio Signal Processing and Recognition (音訊處理與辨識)