TY - GEN
T1 - Tonal context labeling using quantized F0 symbols for improving tone correctness in average-voice-based speech synthesis
AU - Chunwijitra, Vataya
AU - Nose, Takashi
AU - Kobayashi, Takao
PY - 2011
Y1 - 2011
N2 - This paper proposes a technique for improving tone correctness in Thai speech synthesis based on an average voice model trained with a nonprofessional speech corpus. The proposed technique uses quantized F0 symbols as the tonal context in order to obtain an appropriate F0 model. With this technique, the prosodic context can be extracted directly from real speech, which prevents inconsistency between the speech data and the F0 labels generated from transcription, an inconsistency that degrades the naturalness and tone correctness of synthetic speech. We examine two types of tonal context labeling using the quantized F0 symbols, based on phone and sub-phone boundaries. Experimental results of both objective and subjective tests show that the proposed technique can improve not only the naturalness but also the tone correctness of synthetic speech when only a small amount of speech data from nonprofessional target speakers is used.
AB - This paper proposes a technique for improving tone correctness in Thai speech synthesis based on an average voice model trained with a nonprofessional speech corpus. The proposed technique uses quantized F0 symbols as the tonal context in order to obtain an appropriate F0 model. With this technique, the prosodic context can be extracted directly from real speech, which prevents inconsistency between the speech data and the F0 labels generated from transcription, an inconsistency that degrades the naturalness and tone correctness of synthetic speech. We examine two types of tonal context labeling using the quantized F0 symbols, based on phone and sub-phone boundaries. Experimental results of both objective and subjective tests show that the proposed technique can improve not only the naturalness but also the tone correctness of synthetic speech when only a small amount of speech data from nonprofessional target speakers is used.
KW - average voice model
KW - F0 modeling
KW - F0 quantization
KW - HMM-based speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=80051657104&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80051657104&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2011.5947406
DO - 10.1109/ICASSP.2011.5947406
M3 - Conference contribution
AN - SCOPUS:80051657104
SN - 9781457705397
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 4708
EP - 4711
BT - 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
T2 - 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
Y2 - 22 May 2011 through 27 May 2011
ER -