Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition

Ryo Masumura, Seongjun Hahm, Akinori Ito

Research output: Contribution to journal › Conference article › peer-review

12 Citations (Scopus)

Abstract

This paper describes a language modeling method for spontaneous speech recognition that uses large-scale spoken-language data retrieved from the Web. We downloaded 15 million Web pages covering a comprehensive range of topics. Next, spoken-language-like texts were selected from the downloaded Web data using a naïve Bayes classifier, and typical linguistic phenomena such as fillers and pauses were added using simulation models. A language model trained on the generated data performed as well as one trained on a large-scale spontaneous speech corpus (the Corpus of Spontaneous Japanese, CSJ). By combining the generated data with CSJ, we improved word accuracy.
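
As a rough illustration of the text-selection step described in the abstract, the sketch below trains a naïve Bayes classifier on a few hand-labeled spoken-style and written-style seed sentences and keeps only Web sentences the classifier scores as spoken-language-like. This is not the authors' implementation: the seed sentences, the word n-gram features, the use of scikit-learn, and the 0.5 threshold are all illustrative assumptions.

```python
# Illustrative sketch: selecting spoken-language-like sentences from Web text
# with a naive Bayes classifier. Seed data, features, and threshold are
# hypothetical and only stand in for the paper's actual setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical seed sentences, pre-segmented with spaces:
# spoken-style (e.g. from speech transcripts) vs. written-style (e.g. news text).
spoken_seed = ["ええと それ は たぶん 大丈夫 です", "なんか すごく 面白かった です よ"]
written_seed = ["本日 の 会議 で は 予算案 が 承認 された", "政府 は 新たな 政策 を 発表 した"]

texts = spoken_seed + written_seed
labels = [1] * len(spoken_seed) + [0] * len(written_seed)

# Word unigram/bigram features over the whitespace-tokenized text.
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+")
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

# Score candidate Web sentences and keep those classified as spoken-language-like.
web_sentences = ["これ ほんと に 便利 だ よ ね", "当社 は 第三四半期 の 決算 を 発表 した"]
probs = clf.predict_proba(vectorizer.transform(web_sentences))[:, 1]
selected = [s for s, p in zip(web_sentences, probs) if p > 0.5]
print(selected)
```

In practice such a filter would be trained on far larger seed corpora; the selected sentences would then be augmented with simulated fillers and pauses before language model training, as the abstract describes.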

Original language: English
Pages (from-to): 1465-1468
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2011
Event: 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011 - Florence, Italy
Duration: 2011 Aug 27 - 2011 Aug 31

Keywords

  • Corpus of Spontaneous Japanese
  • Language model
  • Large vocabulary continuous speech recognition
  • Spontaneous speech recognition
  • World Wide Web
