Frame-level acoustic modeling based on Gaussian process regression for statistical nonparametric speech synthesis

Tomoki Koriyama, Takashi Nose, Takao Kobayashi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Citations (Scopus)

Abstract

This paper proposes a new approach to text-to-speech synthesis based on Gaussian processes, which are widely used for nonparametric Bayesian regression and classification. A Gaussian process regression model is designed to predict frame-level acoustic features from the corresponding frame information, which includes the relative position of the frame within the phone and the preceding and succeeding phoneme identities obtained from linguistic information. A frame context kernel is proposed as a similarity measure between frames. Experimental results using a small data set show the potential of the proposed approach, which requires neither the state-dependent dynamic features nor the decision-tree clustering used in a conventional HMM-based approach.

Original language: English
Title of host publication: 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Proceedings
Pages: 8007-8011
Number of pages: 5
DOIs
Publication status: Published - 2013 Oct 18
Event: 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Vancouver, BC, Canada
Duration: 2013 May 26 - 2013 May 31

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013
Country/Territory: Canada
City: Vancouver, BC
Period: 13/5/26 - 13/5/31

Keywords

  • acoustic models
  • context kernel
  • Gaussian process regression
  • non-parametric Bayesian model
  • statistical speech synthesis

