Photo-realistic expressive text to talking head synthesis

Vincent Wan, Robert Anderson, Art Blokland, Norbert Braunschweiler, Langzhou Chen, Bala Krishna Kolluru, Javier Latorre, Ranniery Maia, Björn Stenger, Kayoko Yanagisawa, Yannis Stylianou, Masami Akamine, Mark J.F. Gales, Roberto Cipolla

Research output: Contribution to journal › Conference article › peer-review

18 Citations (Scopus)


A controllable computer-animated avatar that could serve as a natural user interface for computers is demonstrated. Driven by text and emotion input, it generates expressive speech with corresponding facial movements. To create the avatar, HMM-based text-to-speech synthesis is combined with active appearance model (AAM)-based facial animation. The novelty lies in the degree of control achieved over the expressiveness of both the speech and the face while keeping the controls simple. Controllability is achieved by training both the speech and facial parameters within a cluster adaptive training (CAT) framework. CAT creates a continuous, low-dimensional eigenspace of expressions, which allows expressions of different intensity to be created (including ones more intense than those in the original recordings) and different expressions to be combined into new ones. Results on an emotion-recognition task show that recognition rates given the synthetic output are comparable to those given the original videos of the speaker.
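The idea behind the CAT eigenspace described above is that each expression corresponds to a point (a weight vector) in a continuous low-dimensional space, so intensifying an expression is extrapolation along a direction and creating a new expression is a linear combination. The sketch below is a hypothetical illustration of that arithmetic only; the vectors, dimensionality, and expression names are invented for the example and are not taken from the paper's trained models.

```python
import numpy as np

# Hypothetical CAT cluster weight vectors, one per basis expression in a
# toy 3-dimensional eigenspace; values are purely illustrative.
neutral = np.array([1.0, 0.0, 0.0])
happy   = np.array([0.2, 0.8, 0.0])
tender  = np.array([0.3, 0.0, 0.7])

def blend(weight_vectors, coefficients):
    """Linearly combine expression weight vectors in the eigenspace."""
    return sum(c * w for c, w in zip(coefficients, weight_vectors))

# Intensify "happy" by extrapolating past the recorded expression
# (scale factor > 1 gives an expression stronger than the recordings).
exaggerated_happy = neutral + 1.5 * (happy - neutral)

# Mix two expressions to create a new one not present in the data.
happy_tender = blend([happy, tender], [0.5, 0.5])
```

In the actual system the resulting weight vector would parameterise both the HMM speech model and the AAM facial model, so a single low-dimensional control drives audio and video jointly.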

Original language: English
Pages (from-to): 2667-2669
Number of pages: 3
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2013
Event: 14th Annual Conference of the International Speech Communication Association, INTERSPEECH 2013 - Lyon, France
Duration: 25 Aug 2013 to 29 Aug 2013


  • Expressive and controllable speech synthesis
  • Visual speech synthesis


