Abstract
This paper describes a language modeling method using large-scale spoken language data retrieved from the Web for spontaneous speech recognition. We downloaded 15 million Web pages on a comprehensive range topics. Next, spoken language- like texts were selected from the downloaded Web data using the naïve Bayes classifier, and typical linguistic phenomena such as fillers and pauses were added using simulation models. A language model trained by the generated data gave as high performance as the large-scale spontaneous speech corpus (Corpus of Spontaneous Japanese, CSJ). By combining the generated data and CSJ, we improved word accuracy.
Original language | English |
---|---|
Pages (from-to) | 1465-1468 |
Number of pages | 4 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Publication status | Published - 2011 |
Event | 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011 - Florence, Italy Duration: 2011 Aug 27 → 2011 Aug 31 |
Keywords
- Corpus of Spontaneous Japanese
- Language model
- Large vocabulary continuous speech recognition
- Spontaneous speech recognition
- World Wide Web