Contribution of the detailed parts around a talker's mouth for speech intelligibility

Shuichi Sakamoto, Gen Hasegawa, Tomoko Ohtani, Yôiti Suzuki, Tom Abe, Tetsuaki Kawase

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Moving images of a talker's face carry much information for speech understanding. Interpreting that information is known as lip-reading, which listeners use effectively when hearing speech, especially under difficult listening conditions. Such information should be carefully considered in the development of advanced multi-modal communication systems. Indeed, moving images of talkers have been applied effectively to, for example, voice activity detection (VAD) and automatic speech recognition (ASR). We have been examining which parts around the talker's mouth contribute most to speech understanding. In this study, we performed audio-visual speech intelligibility tests and investigated how the parts around the talker's mouth affect speech intelligibility. As stimuli, nonsense tri-syllable speech sounds were presented under three conditions: with moving images of the talker's original face, with moving images of the neighborhood of the lips (the mouth part extracted from the original face), and audio only (without video). In the neighborhood of the lips condition, the size of the extracted area around the mouth was varied as a parameter. All possible vowel-consonant combinations in Japanese were included in the presented nonsense tri-syllable speech. The generated audio-visual stimuli were presented with speech-spectrum noise to participants, all of whom had normal hearing and normal or corrected-to-normal vision. Results showed that the intelligibility scores of several phonemes (/n/, /h/, /m/, /w/, /d/, /b/, /p/) increased when visual information was added. Moreover, no significant difference was found between the scores of the original face condition and the neighborhood of the lips condition. This result suggests that the mouth area alone provides sufficient information for speech intelligibility.

Original language: English
Title of host publication: 21st International Congress on Sound and Vibration 2014, ICSV 2014
Publisher: International Institute of Acoustics and Vibrations
Pages: 2553-2559
Number of pages: 7
ISBN (Electronic): 9781634392389
Publication status: Published - 2014
Event: 21st International Congress on Sound and Vibration 2014, ICSV 2014 - Beijing, China
Duration: 2014 Jul 13 to 2014 Jul 17

Publication series

Name: 21st International Congress on Sound and Vibration 2014, ICSV 2014
Volume: 3

Conference

Conference: 21st International Congress on Sound and Vibration 2014, ICSV 2014
Country/Territory: China
City: Beijing
Period: 14/7/13 to 14/7/17
