TY - JOUR
T1 - Dimensional paralinguistic information control based on multiple-regression HSMM for spontaneous dialogue speech synthesis with robust parameter estimation
AU - Nagata, Tomohiro
AU - Mori, Hiroki
AU - Nose, Takashi
N1 - Publisher Copyright: © 2017 Elsevier B.V.
PY - 2017/4/1
Y1 - 2017/4/1
AB - This paper describes spontaneous dialogue speech synthesis based on the multiple regression hidden semi-Markov model (MRHSMM), which enables users to specify paralinguistic information of synthesized speech with a dimensional representation. Paralinguistic aspects of synthesized speech are controlled by multiple regression models whose explanatory variables are abstract dimensions such as pleasant-unpleasant and aroused-sleepy. However, in the training phase of the MRHSMM, estimated regression coefficients may have unreasonably large values, which cause fragility in the parameter generation with respect to paralinguistic information given to the synthesizer. For robust estimation of the regression matrices of the MRHSMM with unbalanced spontaneous dialogue speech samples, the re-estimation formulae were derived in the framework of the maximum a posteriori (MAP) estimation. By examining the synthesized speech, it was confirmed that the acoustic features of synthesized speech are well controlled by the dimensions, especially by the dimension of aroused-sleepy. The result of a perceptual experiment confirmed that the naturalness of synthesized speech was improved by applying the MAP estimation for regression matrices. In addition, a relatively high correlation was observed between given and perceived paralinguistic information, which implies that the proposed method could successfully reflect intended paralinguistic messages on the synthesized speech.
KW - HMM-based speech synthesis
KW - MAP estimation
KW - MRHSMM
KW - Speech emotion
KW - Spontaneous speech
KW - UU Database
UR - http://www.scopus.com/inward/record.url?scp=85009751227&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85009751227&partnerID=8YFLogxK
U2 - 10.1016/j.specom.2017.01.002
DO - 10.1016/j.specom.2017.01.002
M3 - Article
AN - SCOPUS:85009751227
SN - 0167-6393
VL - 88
SP - 137
EP - 148
JO - Speech Communication
JF - Speech Communication
ER -