TY - GEN
T1 - Automatic clustering of part-of-speech for vocabulary divided PLSA language model
AU - Suzuki, Motoyuki
AU - Kuriyama, Naoto
AU - Ito, Akinori
AU - Makino, Shozo
PY - 2008
Y1 - 2008
N2 - PLSA is one of the most powerful language models for adaptation to target speech. The vocabulary divided PLSA language model (VD-PLSA) shows higher performance than the conventional PLSA model because it can be adapted to the target topic and the target speaking style individually. However, the entire vocabulary must be manually divided into three categories (topic, speaking style, and general). In this paper, an automatic method for clustering parts of speech (POS) is proposed for VD-PLSA. Several corpora with different styles are prepared, and the distance between corpora in terms of POS is calculated. A "general tendency score" and a "style tendency score" for each POS are calculated based on the distance between corpora. All POS are divided into three categories using the two scores and appropriate thresholds. Experimental results showed that the proposed method formed appropriate clusters, and that VD-PLSA with the acquired categories gave higher performance than all other models. We applied VD-PLSA to a large vocabulary continuous speech recognition system. VD-PLSA improved recognition accuracy for documents with a lower out-of-vocabulary ratio, while for other documents accuracy was unchanged or slightly degraded.
KW - General/style tendency score
KW - Language model
KW - Part-of-speech
KW - Speech recognition
KW - Vocabulary divided PLSA
UR - http://www.scopus.com/inward/record.url?scp=67650400718&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67650400718&partnerID=8YFLogxK
U2 - 10.1109/NLPKE.2008.4906747
DO - 10.1109/NLPKE.2008.4906747
M3 - Conference contribution
AN - SCOPUS:67650400718
SN - 9781424427802
T3 - 2008 International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2008
BT - 2008 International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2008
T2 - 2008 International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2008
Y2 - 19 October 2008 through 22 October 2008
ER -