TY - GEN
T1 - Contribution of the detailed parts around a talker's mouth for speech intelligibility
AU - Sakamoto, Shuichi
AU - Hasegawa, Gen
AU - Ohtani, Tomoko
AU - Suzuki, Yôiti
AU - Abe, Tom
AU - Kawase, Tetsuaki
PY - 2014
Y1 - 2014
N2 - Moving images of a talker's face carry much information for speech understanding. Interpretation of this information is known as lip-reading, and it is used effectively when people listen to speech, especially under difficult listening conditions. Such information should be considered carefully in the development of advanced multi-modal communication systems. Indeed, movies of talkers have been applied effectively to tasks such as voice activity detection (VAD) and automatic speech recognition (ASR). We have been examining which parts around a talker's mouth contribute most to speech understanding. In this study, we performed audio-visual speech intelligibility tests and investigated the relationship between speech intelligibility and the parts around the talker's mouth. As stimuli, nonsense tri-syllable speech sounds were combined with three kinds of moving images of a talker's face: the original face, the neighborhood of the lips (the mouth area extracted from the original face), and audio only (without video). The size of the extracted area around the mouth was varied as a parameter in the neighborhood-of-the-lips condition. All possible vowel-consonant combinations in Japanese were included in the presented nonsense tri-syllable speech. The generated audio-visual stimuli were presented with speech-spectrum noise to participants, all of whom had normal hearing and normal or corrected-to-normal vision. Results showed that the intelligibility scores of several phonemes (/n/, /h/, /m/, /w/, /d/, /b/, /p/) increased when visual information was added. Moreover, no significant difference was found between the score for the original face condition and that for the neighborhood-of-the-lips condition. This result suggests that the mouth area alone provides sufficient information for speech intelligibility.
UR - http://www.scopus.com/inward/record.url?scp=84922612433&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84922612433&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84922612433
T3 - 21st International Congress on Sound and Vibration 2014, ICSV 2014
SP - 2553
EP - 2559
BT - 21st International Congress on Sound and Vibration 2014, ICSV 2014
PB - International Institute of Acoustics and Vibrations
T2 - 21st International Congress on Sound and Vibration 2014, ICSV 2014
Y2 - 13 July 2014 through 17 July 2014
ER -