TY - GEN
T1 - HMM-based expressive speech synthesis based on phrase-level F0 context labeling
AU - Maeno, Yu
AU - Nose, Takashi
AU - Kobayashi, Takao
AU - Koriyama, Tomoki
AU - Ijima, Yusuke
AU - Nakajima, Hideharu
AU - Mizuno, Hideyuki
AU - Yoshioka, Osamu
PY - 2013/10/18
Y1 - 2013/10/18
N2 - This paper proposes a technique for adding prosodic variation to synthetic speech in HMM-based expressive speech synthesis. We create novel phrase-level F0 context labels for the training data from the residual F0 information between the original and synthetic speech. Specifically, we classify the difference of the average log F0 values between the original and synthetic speech into three perceptually meaningful classes, i.e., high, neutral, and low relative pitch at the phrase level. We evaluate both ideal and practical cases using appealing and fairy-tale speech recorded under realistic conditions. In the ideal case, we examine the potential of the technique to modify F0 patterns under the condition that the original F0 contours of the test sentences are known. In the practical case, we show how users can intuitively modify the pitch by changing the initial F0 context labels obtained from the input text.
AB - This paper proposes a technique for adding prosodic variation to synthetic speech in HMM-based expressive speech synthesis. We create novel phrase-level F0 context labels for the training data from the residual F0 information between the original and synthetic speech. Specifically, we classify the difference of the average log F0 values between the original and synthetic speech into three perceptually meaningful classes, i.e., high, neutral, and low relative pitch at the phrase level. We evaluate both ideal and practical cases using appealing and fairy-tale speech recorded under realistic conditions. In the ideal case, we examine the potential of the technique to modify F0 patterns under the condition that the original F0 contours of the test sentences are known. In the practical case, we show how users can intuitively modify the pitch by changing the initial F0 context labels obtained from the input text.
KW - audiobook
KW - HMM-based expressive speech synthesis
KW - prosodic context
KW - prosody control
KW - unsupervised labeling
UR - http://www.scopus.com/inward/record.url?scp=84890491815&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84890491815&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2013.6639194
DO - 10.1109/ICASSP.2013.6639194
M3 - Conference contribution
AN - SCOPUS:84890491815
SN - 9781479903566
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 7859
EP - 7863
BT - 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Proceedings
T2 - 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013
Y2 - 26 May 2013 through 31 May 2013
ER -