Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data

Jun Suzuki, Hideki Isozaki

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

110 Citations (Scopus)

Abstract

This paper provides evidence that the use of more unlabeled data in semi-supervised learning can improve the performance of Natural Language Processing (NLP) tasks, such as part-of-speech tagging, syntactic chunking, and named entity recognition. We first propose a simple yet powerful semi-supervised discriminative model appropriate for handling large-scale unlabeled data. Then, we describe experiments performed on widely used test collections, namely, PTB III data and CoNLL'00 and '03 shared task data, for the above three NLP tasks, respectively. We incorporate up to 1G-words (one billion tokens) of unlabeled data, which is the largest amount of unlabeled data ever used for these tasks, to investigate the performance improvement. In addition, our results are superior to the best reported results for all of the above test collections.

Original language: English
Title of host publication: ACL-08
Subtitle of host publication: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
Pages: 665-673
Number of pages: 9
Publication status: Published - 2008
Externally published: Yes
Event: 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-08: HLT - Columbus, OH, United States
Duration: 2008 Jun 15 to 2008 Jun 20

Publication series

Name: ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

Other

Other: 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-08: HLT
Country/Territory: United States
City: Columbus, OH
Period: 08/6/15 to 08/6/20

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Networks and Communications
  • Linguistics and Language
