TY - GEN
T1 - Integrated automatic expression prediction and speech synthesis from text
AU - Chen, Langzhou
AU - Gales, Mark J.F.
AU - Braunschweiler, Norbert
AU - Akamine, Masami
AU - Knill, Kate
PY - 2013/10/18
Y1 - 2013/10/18
N2 - Getting a text-to-speech (TTS) synthesis system to speak lively animated stories like a human is very difficult. To generate expressive speech, the system can be divided into two parts: predicting expressive information from text; and synthesizing the speech with a particular expression. Traditionally these blocks have been studied separately. This paper proposes an integrated approach, sharing the expressive synthesis space and training data across the two expressive components. There are several advantages to this approach, including a simplified expression labelling process, support of a continuous expressive synthesis space, and joint training of the expression predictor and speech synthesiser to maximise the likelihood of the TTS system given the training data. Synthesis experiments indicated that the proposed approach generated far more expressive speech than both a neutral TTS and one where the expression was randomly selected. The experimental results also showed the advantage of a continuous expressive synthesis space over a discrete space.
KW - audiobook
KW - cluster adaptive training
KW - expressive speech synthesis
KW - hidden Markov model
KW - neural network
UR - http://www.scopus.com/inward/record.url?scp=84887070110&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84887070110&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2013.6639218
DO - 10.1109/ICASSP.2013.6639218
M3 - Conference contribution
AN - SCOPUS:84887070110
SN - 9781479903566
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 7977
EP - 7981
BT - 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Proceedings
T2 - 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013
Y2 - 26 May 2013 through 31 May 2013
ER -