Transform mapping using shared decision tree context clustering for HMM-based cross-lingual speech synthesis

Daiki Nagahama, Takashi Nose, Tomoki Koriyama, Takao Kobayashi

Research output: Contribution to journal › Conference article › peer-review

3 Citations (Scopus)

Abstract

This paper proposes a novel transform mapping technique based on shared decision tree context clustering (STC) for HMM-based cross-lingual speech synthesis. In conventional cross-lingual speaker adaptation based on state mapping, the adaptation performance is not always satisfactory when there are mismatches of languages and speakers between the average voice models of the input and output languages. The proposed technique alleviates the effect of these mismatches on the transform mapping by introducing a language-independent decision tree constructed by STC, and represents the average voice models using language-independent and language-dependent tree structures. We also use a bilingual speech corpus to preserve speaker characteristics across the average voice models of the different languages. The experimental results show that the proposed technique decreases both spectral and prosodic distortions between original and generated parameter trajectories and significantly improves the naturalness of synthetic speech while maintaining speaker similarity, compared with state mapping.
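To make the mapping idea concrete, below is a minimal sketch of how adaptation transforms estimated on the output-language average voice model could be propagated to the input-language model through the leaves of a shared, language-independent tree, rather than through direct state mapping. This is an illustration of the general idea only, not the authors' implementation: the Leaf class, the map_transforms function, and all identifiers are hypothetical, and the transforms are stand-ins for actual adaptation matrices (e.g., CMLLR transforms).

```python
# Hypothetical sketch of transform mapping through a shared
# (language-independent) decision tree; not the paper's implementation.

from dataclasses import dataclass

@dataclass
class Leaf:
    """A leaf of a language-dependent decision tree: one tied HMM state."""
    leaf_id: str
    shared_leaf_id: str       # leaf of the language-independent (STC) tree
                              # that this tied state falls into
    transform: object = None  # adaptation transform (placeholder)

def map_transforms(output_lang_leaves, input_lang_leaves):
    """Propagate transforms estimated on the output-language average voice
    model to the input-language model by matching states through the
    shared-tree leaf they belong to."""
    # Index output-language transforms by shared-tree leaf
    # (first transform seen per shared leaf, for simplicity).
    by_shared = {}
    for leaf in output_lang_leaves:
        by_shared.setdefault(leaf.shared_leaf_id, leaf.transform)
    # Assign each input-language state the transform associated with
    # the same language-independent leaf.
    for leaf in input_lang_leaves:
        leaf.transform = by_shared.get(leaf.shared_leaf_id)
    return input_lang_leaves

# Toy usage: two output-language states and three input-language states
# grouped under two shared (language-independent) leaves.
out_leaves = [Leaf("en_s1", "shared_A", transform="W_A"),
              Leaf("en_s2", "shared_B", transform="W_B")]
in_leaves = [Leaf("ja_s1", "shared_A"),
             Leaf("ja_s2", "shared_A"),
             Leaf("ja_s3", "shared_B")]
for leaf in map_transforms(out_leaves, in_leaves):
    print(leaf.leaf_id, "->", leaf.transform)
```

Because every tied state in either language resolves to some leaf of the shared tree, this lookup is defined even when the two language-dependent trees have very different structures, which is the point of routing the mapping through the language-independent representation.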

Original language: English
Pages (from-to): 770-774
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2014
Event: 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages, INTERSPEECH 2014 - Singapore, Singapore
Duration: 2014 Sept 14 - 2014 Sept 18

Keywords

  • Bilingual speech corpus
  • Cross-lingual TTS
  • HMM-based speech synthesis
  • Shared decision tree context clustering
