TY - GEN
T1 - Non-native speech conversion with consistency-aware recursive network and generative adversarial network
AU - Oyamada, Keisuke
AU - Kameoka, Hirokazu
AU - Kaneko, Takuhiro
AU - Ando, Hiroyasu
AU - Hiramatsu, Kaoru
AU - Kashino, Kunio
N1 - Funding Information:
ACKNOWLEDGMENT This research was supported by JSPS Grant 26730100.
Publisher Copyright:
© 2017 IEEE.
PY - 2018/2/5
Y1 - 2018/2/5
N2 - This paper deals with the problem of automatically correcting the pronunciation of non-native speakers. Since the pronunciation characteristics of non-native speakers depend heavily on the context (such as words), conversion rules for correcting pronunciation should be learned from a sequence of features rather than a single-frame feature. For the online conversion of local sequences of features, we construct a neural network (NN) that takes a sequence of features as an input/output, generates a sequence of features in a segment-by-segment fashion, and guarantees the consistency of the generated features within overlapped segments. Furthermore, we apply a recently proposed generative adversarial network (GAN)-based postfilter to the generated feature sequence with the aim of synthesizing natural-sounding speech. Through subjective and quantitative evaluations, we confirmed the superiority of our proposed method over a conventional NN approach in terms of conversion quality.
AB - This paper deals with the problem of automatically correcting the pronunciation of non-native speakers. Since the pronunciation characteristics of non-native speakers depend heavily on the context (such as words), conversion rules for correcting pronunciation should be learned from a sequence of features rather than a single-frame feature. For the online conversion of local sequences of features, we construct a neural network (NN) that takes a sequence of features as an input/output, generates a sequence of features in a segment-by-segment fashion, and guarantees the consistency of the generated features within overlapped segments. Furthermore, we apply a recently proposed generative adversarial network (GAN)-based postfilter to the generated feature sequence with the aim of synthesizing natural-sounding speech. Through subjective and quantitative evaluations, we confirmed the superiority of our proposed method over a conventional NN approach in terms of conversion quality.
UR - http://www.scopus.com/inward/record.url?scp=85047505634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85047505634&partnerID=8YFLogxK
U2 - 10.1109/APSIPA.2017.8282025
DO - 10.1109/APSIPA.2017.8282025
M3 - Conference contribution
AN - SCOPUS:85047505634
T3 - Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
SP - 182
EP - 188
BT - Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Y2 - 12 December 2017 through 15 December 2017
ER -