TY - JOUR
T1 - Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis
AU - Maeno, Yu
AU - Nose, Takashi
AU - Kobayashi, Takao
AU - Koriyama, Tomoki
AU - Ijima, Yusuke
AU - Nakajima, Hideharu
AU - Mizuno, Hideyuki
AU - Yoshioka, Osamu
PY - 2014
Y1 - 2014
N2 - This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance prosodic variations of synthetic speech without degrading naturalness. In the proposed technique, HMMs are first trained using conventional labels that include only linguistic information, and prosodic features are generated from the HMMs. The average difference between the original and generated prosodic features is then calculated for each accent phrase and classified into three classes, e.g., low, neutral, and high in the case of fundamental frequency. The resulting prosodic context label has a practical meaning, such as high or low relative pitch at the phrase level, and hence it is expected that users can modify the prosodic characteristics of synthetic speech intuitively by manually changing the proposed labels. In the experiments, we evaluate the proposed technique in both ideal and practical conditions using sales-talk and fairy-tale speech recorded under a realistic domain. In the evaluation under the practical condition, we examine whether users achieve their intended prosodic modification by changing the proposed context label of a certain accent phrase in a given sentence.
AB - This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance prosodic variations of synthetic speech without degrading naturalness. In the proposed technique, HMMs are first trained using conventional labels that include only linguistic information, and prosodic features are generated from the HMMs. The average difference between the original and generated prosodic features is then calculated for each accent phrase and classified into three classes, e.g., low, neutral, and high in the case of fundamental frequency. The resulting prosodic context label has a practical meaning, such as high or low relative pitch at the phrase level, and hence it is expected that users can modify the prosodic characteristics of synthetic speech intuitively by manually changing the proposed labels. In the experiments, we evaluate the proposed technique in both ideal and practical conditions using sales-talk and fairy-tale speech recorded under a realistic domain. In the evaluation under the practical condition, we examine whether users achieve their intended prosodic modification by changing the proposed context label of a certain accent phrase in a given sentence.
KW - Audiobook
KW - HMM-based expressive speech synthesis
KW - Prosodic context
KW - Prosody control
KW - Unsupervised labeling
UR - http://www.scopus.com/inward/record.url?scp=84887030542&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84887030542&partnerID=8YFLogxK
U2 - 10.1016/j.specom.2013.09.014
DO - 10.1016/j.specom.2013.09.014
M3 - Article
AN - SCOPUS:84887030542
SN - 0167-6393
VL - 57
SP - 144
EP - 154
JO - Speech Communication
JF - Speech Communication
ER -