TY - GEN
T1 - Exploring rich expressive information from audiobook data using cluster adaptive training
AU - Chen, Langzhou
AU - Gales, Mark J.F.
AU - Wan, Vincent
AU - Latorre, Javier
AU - Akamine, Masami
PY - 2012/12/1
Y1 - 2012/12/1
N2 - Audiobook data is a freely available source of rich expressive speech data. To accurately generate speech of this form, expressiveness must be incorporated into the synthesis system. This paper investigates two parts of this process: the representation of expressive information in a statistical parametric speech synthesis system; and whether discrete expressive state labels can sufficiently represent the full diversity of expressive speech. Initially a discrete form of expressive information was used. A new form of expressive representation, where each condition maps to a point in an expressive speech space, is described. This cluster adaptively trained (CAT) system is compared to incorporating information in the decision tree construction and a transform based system using CMLLR and CSMAPLR. Experimental results indicate that the CAT system outperformed the contrast systems in both expressiveness and voice quality. The CAT-style representation yields a continuous expressive speech space. Thus, it is possible to treat utterance-level expressiveness as a point in this continuous space, rather than as one of a set of discrete states. This continuous-space representation outperformed discrete clusters, indicating limitations of discrete labels for expressiveness in audiobook data.
KW - Audiobook
KW - Cluster adaptive training
KW - Expressive speech synthesis
KW - Hidden Markov model
UR - http://www.scopus.com/inward/record.url?scp=84878397811&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84878397811&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84878397811
SN - 9781622767595
T3 - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
SP - 958
EP - 961
BT - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
T2 - 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
Y2 - 9 September 2012 through 13 September 2012
ER -