Estimation of the glottal
source and the vocal tract filter from the voiced sound for the singing
synthesis application
June 22, 1999
Abstract
To improve the naturalness of synthesized sustained vowels,
a source-filter type synthesis model is suggested. The associated procedure
for estimating the model parameters is shown to perform better than
the LPC (linear predictive coding) method. The proposed synthesis model
can reproduce the desired recording almost perfectly via the suggested
estimation procedure.
In this preliminary study, we consider only non-nasal voiced
sounds. To trade off the complexity of the model against that of the analysis
procedure needed to acquire the model parameters, we propose a source-filter type
synthesis model based on a simplified human voice production system. The
source-filter model decomposes the human voice production system into three
linear systems: glottal source, vocal tract and radiation. The radiation
is simplified as a differencing filter. The vocal tract filter is assumed
to be all-pole for non-nasal sounds. After combining the glottal source and the
radiation, we model the resulting derivative glottal wave as a 10th-order
polynomial. We also add spectral tilt to the vocal tract filter to increase
the flexibility for changing the sound quality.
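As a rough illustration of the model structure described above, the following Python sketch evaluates one period of a polynomial derivative glottal wave, repeats it, and passes it through an all-pole filter. The function names, the normalized time axis t in [0, 1), and the coefficient ordering (lowest power first) are assumptions made for this example and are not taken from the study; spectral tilt is omitted.

```python
def polyval(coeffs, t):
    """Evaluate c[0] + c[1]*t + ... + c[p]*t**p with Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


def derivative_glottal_period(coeffs, n_samples):
    """One pitch period of the source, sampled at t = n / n_samples."""
    return [polyval(coeffs, n / n_samples) for n in range(n_samples)]


def all_pole_filter(x, a):
    """Vocal tract as an all-pole filter: y[n] = x[n] - sum_k a[k] * y[n-k].

    a[0] is assumed to be 1; a[1:] are the denominator coefficients.
    """
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def synthesize(coeffs, a, period_len, n_periods):
    """Repeat the source period at a fixed pitch and filter the excitation."""
    period = derivative_glottal_period(coeffs, period_len)
    return all_pole_filter(period * n_periods, a)
```

For a 10th-order source polynomial, `coeffs` would hold 11 values; the filter order (the length of `a` minus one) is left to the caller.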
The parameters of the vocal tract filter and the glottal source are
estimated from the recordings in two phases. The first phase estimates
the vocal tract filter via the joint estimation method proposed in
a recent paper
submitted to MOHONK '99. The second phase estimates the polynomial
coefficients for the inverse-filtered derivative glottal source.
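The second phase can be sketched as follows, assuming the all-pole coefficients from the first phase are already available (the joint estimation method itself is not reproduced here). The recording is inverse filtered with the corresponding FIR filter, and a polynomial is then fitted to one period of the result by ordinary least squares. All names are hypothetical, and solving the normal equations directly is a simplification: it behaves well for low orders, whereas a 10th-order fit would likely need a better-conditioned solver such as QR or an orthogonal polynomial basis.

```python
def inverse_filter(s, a):
    """FIR inverse of an all-pole filter: e[n] = sum_k a[k] * s[n-k], a[0] == 1."""
    return [sum(a[k] * s[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(s))]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_polynomial(period, order):
    """Least-squares coefficients (lowest power first) on t = n / len(period)."""
    N = len(period)
    t = [n / N for n in range(N)]
    A = [[sum(ti ** (i + j) for ti in t) for j in range(order + 1)]
         for i in range(order + 1)]
    b = [sum(y * ti ** i for y, ti in zip(period, t)) for i in range(order + 1)]
    return solve(A, b)
```

In use, one would call `inverse_filter` on the recording, cut the residual into pitch periods, and pass a period to `fit_polynomial`; how the periods are segmented is outside the scope of this sketch.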
Sound Examples
Section Ia:
The following sound (*.wav) examples show that the
proposed estimation method is effective. All sound files are sampled at
44.1 kHz with 16-bit resolution (CD quality). The original recording (vowel
/e/ with pitch F4, 349.228 Hz) is sung by a coloratura soprano (age 40).
The following sound files are reconstructed via
the GLPC method or the proposed method. The desired recording is the one with
amplitude normalization. Only one vocal tract filter is used for the entire
time duration.
Sound files after adding the amplitude variation
to the reconstructed ones:
Section Ib:
The original recording (vowel /e/ with pitch A4,
440 Hz) is sung by a coloratura soprano (age 40).
The following sound files are reconstructed via
the GLPC method or the proposed method. The desired recording is the one with
amplitude normalization. Only one vocal tract filter is used for the entire
time duration.
Section Ic:
The original recording (vowel /e/ with pitch C5,
523.251 Hz) is sung by a coloratura soprano (age 40).
The following sound files are reconstructed via
the GLPC method or the proposed method. The desired recording is the one with
amplitude normalization. Only one vocal tract filter is used for the entire
time duration.
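The pitch values quoted in Sections Ia to Ic follow from twelve-tone equal temperament with A4 = 440 Hz. The short Python sketch below reproduces them and also reports the approximate pitch period in samples, assuming a 44.1 kHz sampling rate (standard CD quality). The helper name and the semitone offsets are introduced here for illustration only.

```python
A4_HZ = 440.0
SAMPLE_RATE = 44100.0  # assumed CD-quality sampling rate


def note_frequency(semitones_from_a4):
    """Equal-tempered frequency of a note a given number of semitones from A4."""
    return A4_HZ * 2.0 ** (semitones_from_a4 / 12.0)


# F4 is 4 semitones below A4, C5 is 3 semitones above.
for name, offset in (("F4", -4), ("A4", 0), ("C5", 3)):
    f0 = note_frequency(offset)
    print(f"{name}: {f0:.3f} Hz, about {SAMPLE_RATE / f0:.1f} samples per period")
```

The period length shrinks from roughly 126 samples at F4 to roughly 84 samples at C5, so each pitch period provides fewer samples for fitting the source polynomial as the pitch rises.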
Section II:
The following example shows that a fixed set of
glottal parameters (polynomial coefficients) and vocal tract filter
coefficients extracted from the above example can be used to generate
a glissando phrase without too much degradation. The pitch and amplitude
envelopes that generate this glissando phrase are obtained from a real
recording. The hybrid fundamental frequency estimation method is used to
extract the pitch contour at a rate of about 172 Hz. The amplitude contour
(also at a 172 Hz rate) is the RMS value of each windowed frame.
Pitch-synchronous pitch and amplitude envelopes are obtained by linear
interpolation from these 172 Hz data points.
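A minimal sketch of the envelope handling described above: a frame-wise RMS amplitude contour, and linear interpolation of frame-rate contours at successive pitch marks. The frame rate of about 172 Hz is consistent with a hop of 256 samples at 44.1 kHz (44100 / 256 ≈ 172.3), but that hop size is an assumption, as are all function names. The hybrid fundamental frequency estimator is not reproduced; the pitch contour is taken as given.

```python
import math


def frame_rms(x, frame_len, hop):
    """RMS value of each frame of length frame_len, advancing by hop samples."""
    return [math.sqrt(sum(v * v for v in x[i:i + frame_len]) / frame_len)
            for i in range(0, len(x) - frame_len + 1, hop)]


def interp_uniform(track, frame_rate, t):
    """Linearly interpolate a contour sampled at frame_rate (Hz) at time t (s)."""
    pos = t * frame_rate
    i = int(pos)
    if i >= len(track) - 1:
        return track[-1]  # hold the last value past the end of the contour
    frac = pos - i
    return track[i] * (1.0 - frac) + track[i + 1] * frac


def pitch_synchronous_envelopes(f0_track, amp_track, frame_rate, duration):
    """Step through time one pitch period at a time.

    Returns a list of (time, f0, amplitude) tuples, one per pitch mark.
    """
    marks = []
    t = 0.0
    while t < duration:
        f0 = interp_uniform(f0_track, frame_rate, t)
        amp = interp_uniform(amp_track, frame_rate, t)
        marks.append((t, f0, amp))
        t += 1.0 / f0  # the next pitch mark is one period later
    return marks
```

Each tuple gives the fundamental frequency and amplitude to apply to one synthesized pitch period, so a fixed source polynomial and vocal tract filter can be driven by contours taken from a different recording.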
Next, we use the envelope contours extracted from
this
file (a sound example from CHANT) while still applying the non-singer
vocal tract and glottal source parameters.
Last modified: 8/7/99 11:27AM
email:
Hui-Ling Lu