Estimation of the glottal source and the vocal tract filter from the voiced sound for the singing synthesis application


June 22, 1999


Abstract

To improve the naturalness of synthesized sustained vowels, we propose a source-filter synthesis model. The associated parameter estimation procedure is shown to perform better than the LPC (linear predictive coding) method. With this estimation procedure, the proposed synthesis model can reproduce the desired recording almost perfectly.

In this preliminary study, we consider only non-nasal voiced sounds. To balance modeling complexity against the difficulty of the analysis procedure that acquires the model parameters, we propose a source-filter synthesis model based on a simplified human voice production system. The source-filter model decomposes the voice production system into three linear systems: glottal source, vocal tract, and radiation. The radiation is simplified as a differencing filter. The vocal tract filter is assumed to be all-pole for non-nasal sounds. After combining the glottal source and the radiation, we model the resulting derivative glottal wave as a 10th-order polynomial. We also add spectral tilt to the vocal tract filter to allow more flexibility in changing the sound quality.
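The model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the normalization of each period to t in [0, 1), the ascending coefficient order, and the use of one fixed period length are all hypothetical choices, and the spectral tilt term is omitted. Because the source is already the derivative glottal wave, the differencing (radiation) filter does not appear as a separate stage.

```python
def polyval(coeffs, t):
    """Evaluate a polynomial with ascending-order coefficients (Horner's rule)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


def derivative_glottal_period(coeffs, period_len):
    """One period of the derivative glottal wave, sampled on t in [0, 1).

    Assumption: time is normalized to the period; the page does not say
    how the polynomial is parameterized.
    """
    return [polyval(coeffs, n / period_len) for n in range(period_len)]


def all_pole_filter(x, a):
    """Apply H(z) = 1 / A(z) with a[0] == 1: y[n] = x[n] - sum_k a[k] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def synthesize(coeffs, a, f0, fs, n_periods):
    """Repeat one source period at a fixed pitch and filter it with the vocal tract."""
    period_len = round(fs / f0)  # assumption: integer period length
    source = derivative_glottal_period(coeffs, period_len) * n_periods
    return all_pole_filter(source, a)
```

For example, `synthesize(coeffs, a, 440.0, 44100, 5)` with 11 polynomial coefficients (a 10th-order polynomial) and a set of all-pole coefficients `a` would return five periods of a vowel-like waveform; both inputs would come from the estimation procedure.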

The parameters of the vocal tract filter and the glottal source are estimated from the recordings in two phases. The first phase estimates the vocal tract filter via the joint estimation method proposed in a recent paper submitted to MOHONK '99. The second phase estimates the polynomial coefficients of the inverse-filtered derivative glottal source.
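The second phase can be illustrated with a short sketch. It assumes the all-pole coefficients `a` from the first phase are already available (the joint estimation method itself is not reproduced here), and it fits one period with an ordinary least-squares polynomial via the normal equations. The function names and the period normalization to t in [0, 1) are hypothetical, and the fitting method is a stand-in: the page does not state how the coefficients are actually computed.

```python
def inverse_filter(x, a):
    """FIR inverse of the all-pole filter 1/A(z): e[n] = sum_k a[k] * x[n-k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def solve_linear(m, b):
    """Solve m c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = s / aug[r][r]
    return coeffs


def fit_polynomial(period, order):
    """Least-squares polynomial fit of one period; ascending coefficients."""
    n = len(period)
    ts = [i / n for i in range(n)]
    # Normal equations (V^T V) c = V^T y with V[i][j] = ts[i] ** j.
    gram = [[sum(t ** (j + k) for t in ts) for k in range(order + 1)]
            for j in range(order + 1)]
    rhs = [sum(y * t ** j for t, y in zip(ts, period))
           for j in range(order + 1)]
    return solve_linear(gram, rhs)
```

A typical call would be `fit_polynomial(inverse_filter(frame, a)[start:start + period_len], 10)`, where `frame`, `start`, and `period_len` are placeholders for one analysis segment. Normal equations in the monomial basis become ill-conditioned as the order grows, so a careful implementation at order 10 would likely use an orthogonal basis or a QR-based solver instead.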


Sound Examples

Section Ia:

The following sound (*.wav) examples show that the proposed estimation method is effective. All sound files are sampled at 44.1 kHz with 16-bit resolution (CD quality). The original recording (vowel /e/ at pitch F4, 349.228 Hz) was sung by a coloratura soprano (age 40).
The following sound files are reconstructed via the GLPC method or the proposed method. The desired recording is the original recording after amplitude normalization. A single vocal tract filter is used for the entire duration.
Sound files after adding the amplitude variation to the reconstructed ones:

Section Ib:

The original recording (vowel /e/ at pitch A4, 440 Hz) was sung by a coloratura soprano (age 40).
The following sound files are reconstructed via the GLPC method or the proposed method. The desired recording is the original recording after amplitude normalization. A single vocal tract filter is used for the entire duration.

Section Ic:

The original recording (vowel /e/ at pitch C5, 523.251 Hz) was sung by a coloratura soprano (age 40).
The following sound files are reconstructed via the GLPC method or the proposed method. The desired recording is the original recording after amplitude normalization. A single vocal tract filter is used for the entire duration.

Section II: 

The following example shows that a fixed set of glottal parameters (polynomial coefficients) and vocal tract filter coefficients, extracted from the above example, can be used to generate a glissando phrase without too much degradation. The pitch and amplitude envelopes that generate this glissando phrase are obtained from a real recording. The hybrid fundamental frequency estimation method is used to extract the pitch contour at a rate of about 172 Hz. The amplitude contour (also at a 172 Hz rate) is the RMS value of each windowed frame. Pitch-synchronous pitch and amplitude envelopes are obtained by linear interpolation of these 172 Hz data points.
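The envelope processing described above can be sketched as follows. This is an illustrative outline only: the rectangular window, the frame and hop sizes, and all function names are assumptions, and the pitch contour is taken as given because the hybrid fundamental frequency estimator is not described here. A hop of 256 samples at a 44.1 kHz sampling rate would give 44100 / 256 ≈ 172.3 frames per second, which matches the quoted rate, but the actual hop size is not stated.

```python
import math


def frame_rms(x, frame_len, hop):
    """RMS value of each frame (rectangular window, frames `hop` samples apart)."""
    out = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        out.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return out


def interp_linear(t, times, values):
    """Piecewise-linear interpolation, clamped at both ends."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for i in range(1, len(times)):
        if t <= times[i]:
            w = (t - times[i - 1]) / (times[i] - times[i - 1])
            return values[i - 1] + w * (values[i] - values[i - 1])


def pitch_synchronous_envelopes(f0_frames, amp_frames, frame_rate, duration):
    """Resample frame-rate contours at the start of each pitch period.

    Returns a list of (time, f0, amplitude) tuples, one per period.
    """
    times = [i / frame_rate for i in range(len(f0_frames))]
    t, marks = 0.0, []
    while t < duration:
        f0 = interp_linear(t, times, f0_frames)
        amp = interp_linear(t, times, amp_frames)
        marks.append((t, f0, amp))
        t += 1.0 / f0  # advance by one period at the local pitch
    return marks
```

Each returned tuple gives the onset time, pitch, and amplitude for one period, which is the information a period-by-period synthesizer needs to set the length and gain of each source period.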

What if the vocal tract and glottal source parameters are extracted from a non-singer male speaking the vowel /a/?
Next, we use the envelope contours extracted from this file (a sound example from CHANT) while still applying the non-singer vocal tract and glottal source parameters.


Last modified: 8/7/99 11:27AM

email: Hui-Ling Lu