Semi-supervised text categorization by considering sufficiency and diversity

Shoushan Li, Yat Mei Lee, Wei Gao, Chu-ren Huang

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

2 Citations (Scopus)

Abstract

In text categorization (TC), labeled data is often limited while unlabeled data is plentiful. This motivates semi-supervised learning for TC, which improves performance by exploiting the knowledge in both labeled and unlabeled data. In this paper, we propose a novel bootstrapping approach to semi-supervised TC. First, we identify two basic preferences for successful bootstrapping: sufficiency and diversity. With the diversity preference in mind, we modify the traditional bootstrapping algorithm by training the involved classifiers on random feature subspaces instead of the whole feature space. We then further improve the random feature subspace-based bootstrapping by placing constraints on subspace generation to better satisfy the diversity preference. Experimental evaluation shows the effectiveness of our modified bootstrapping approach in both topic-based and sentiment-based TC tasks.
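
The general idea of bootstrapping over random feature subspaces can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the margin-based confidence score, the function names, and the parameters `n_feats` and `per_round` are all assumptions chosen for brevity, and the constrained subspace generation mentioned in the abstract is not modeled.

```python
import random

def train_centroids(X, y, feats):
    """Nearest-centroid model restricted to the feature indices in `feats`."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += row[f]
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}

def predict(model, feats, row):
    """Return (label, margin); margin is the gap between the two closest centroids."""
    dists = sorted(
        (sum((row[f] - m[j]) ** 2 for j, f in enumerate(feats)), c)
        for c, m in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin

def subspace_bootstrap(X_lab, y_lab, X_unl, n_feats, per_round=2, seed=0):
    """Self-training loop (hypothetical helper): each round trains on a fresh
    random feature subspace and moves the most confident unlabeled rows,
    with their predicted labels, into the labeled set."""
    rng = random.Random(seed)
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)
    dim = len(X_lab[0])
    while pool:
        feats = rng.sample(range(dim), n_feats)  # random subspace for this round
        model = train_centroids(X_lab, y_lab, feats)
        scored = [(predict(model, feats, row), i) for i, row in enumerate(pool)]
        scored.sort(key=lambda t: t[0][1], reverse=True)  # most confident first
        chosen = scored[:per_round]
        for (label, _), i in chosen:
            X_lab.append(pool[i])
            y_lab.append(label)
        taken = {i for _, i in chosen}
        pool = [row for i, row in enumerate(pool) if i not in taken]
    return X_lab, y_lab
```

Calling `subspace_bootstrap(X_lab, y_lab, X_unl, n_feats=2)` returns the enlarged labeled set. Drawing a different subspace each round is one simple way to vary what the classifier sees, which is the intuition behind the diversity preference.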
Original language: English
Title of host publication: Natural Language Processing and Chinese Computing - Second CCF Conference, NLPCC 2013, Proceedings
Publisher: Springer Verlag
Pages: 105-115
Number of pages: 11
ISBN (Print): 9783642416439
Publication status: Published - 1 Jan 2013
Event: 2nd CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2013 - Chongqing, China
Duration: 15 Nov 2013 to 19 Nov 2013

Publication series

Name: Communications in Computer and Information Science
Volume: 400
ISSN (Print): 1865-0929

Conference

Conference: 2nd CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2013
Country/Territory: China
City: Chongqing
Period: 15/11/13 to 19/11/13

Keywords

  • Bootstrapping
  • Semi-supervised learning
  • Sentiment classification

ASJC Scopus subject areas

  • Computer Science (all)
