This paper proposes a method that automatically generates synthesis units from a speech corpus so as to minimize the distortion that pitch-period and duration modification introduces into the synthesized speech. In the proposed method, a large number of speech segments extracted from the speech corpus are treated as candidate synthesis units. The distortion of the synthetic speech is calculated by modifying the pitch period and duration of each candidate and comparing the result with natural speech. The speech segment that minimizes the sum of the distortions over synthetic speech with various pitch patterns is selected as the synthesis unit. The method is called closed-loop training because the distortion of the synthesized speech is evaluated and the result is fed back into the selection of the synthesis unit. Synthesis-unit generation by closed-loop training can be applied to various synthesizers, regardless of the kind of synthesis unit or the synthesis scheme. In an experiment, synthesis units were generated with the proposed method for a PSOLA-based synthesizer, and the quality of the synthesized speech improved over that of the conventional method, which does not take the distortion caused by prosodic modification into account.
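The selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. All names (`resample_linear`, `squared_error`, `select_unit`) are hypothetical. Linear resampling stands in for the PSOLA-based prosodic modification, and a sum of squared sample differences stands in for the distortion measure, which the abstract does not specify. Only the duration is modified here; a real system would also modify the pitch period.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Sequence[float]


def resample_linear(segment: Segment, target_len: int) -> List[float]:
    """Hypothetical stand-in for prosodic modification (not PSOLA):
    stretch or shrink a segment to target_len samples by linear interpolation."""
    if target_len <= 0 or len(segment) == 0:
        raise ValueError("segment and target length must be non-empty")
    n = len(segment)
    if n == 1 or target_len == 1:
        return [float(segment[0])] * target_len
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # fractional index into segment
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(segment[lo] * (1 - frac) + segment[hi] * frac)
    return out


def squared_error(a: Segment, b: Segment) -> float:
    """Assumed distortion measure: sum of squared sample differences."""
    return float(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_unit(
    candidates: Sequence[Segment],
    natural_targets: Sequence[Segment],
    modify: Callable[[Segment, int], List[float]] = resample_linear,
    distortion: Callable[[Segment, Segment], float] = squared_error,
) -> Tuple[int, List[float]]:
    """Closed-loop selection: modify every candidate toward every natural
    target, sum the distortions, and return the index of the candidate with
    the smallest total together with all totals."""
    totals = []
    for cand in candidates:
        total = 0.0
        for target in natural_targets:
            synthetic = modify(cand, len(target))  # synthesize with target duration
            total += distortion(synthetic, target)  # compare with natural speech
        totals.append(total)
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Toy data: two candidate segments and two natural targets of different durations.
candidates = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
targets = [[0.0, 1.0, 0.0], [0.0, 0.5, 1.0, 0.5, 0.0]]
best, totals = select_unit(candidates, targets)
print(best, totals)
```

Passing `modify` and `distortion` as parameters mirrors the abstract's point that closed-loop training does not depend on a particular synthesizer: any modification scheme and any distortion measure can be substituted without changing the selection loop.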
|Number of pages||7|
|Journal||Systems and Computers in Japan|
|Publication status||Published - 1999 Aug|