TY - GEN
T1 - Continuous F0 in the source-excitation generation for HMM-based TTS
T2 - 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
AU - Latorre, Javier
AU - Gales, Mark J.F.
AU - Buchholz, Sabine
AU - Knill, Kate
AU - Tamura, Masatsune
AU - Ohtani, Yamato
AU - Akamine, Masami
PY - 2011
Y1 - 2011
N2 - Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal which is used for the generation of the source-excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD-prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD-prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
AB - Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal which is used for the generation of the source-excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD-prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD-prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
KW - aperiodicity
KW - Continuous F0
KW - HMM-based synthesis
KW - multi-band mixed excitation
KW - voiced/unvoiced decision
UR - http://www.scopus.com/inward/record.url?scp=80051606114&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80051606114&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2011.5947410
DO - 10.1109/ICASSP.2011.5947410
M3 - Conference contribution
AN - SCOPUS:80051606114
SN - 9781457705397
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 4724
EP - 4727
BT - 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
Y2 - 22 May 2011 through 27 May 2011
ER -