TY - GEN
T1 - Semi-supervised text categorization by considering sufficiency and diversity
AU - Li, Shoushan
AU - Lee, Yat Mei
AU - Gao, Wei
AU - Huang, Chu-ren
PY - 2013/1/1
Y1 - 2013/1/1
N2 - In text categorization (TC), labeled data is often limited while unlabeled data is ample. This motivates semi-supervised learning for TC to improve the performance by exploring the knowledge in both labeled and unlabeled data. In this paper, we propose a novel bootstrapping approach to semi-supervised TC. First of all, we give two basic preferences, i.e., sufficiency and diversity for a possibly successful bootstrapping. After carefully considering the diversity preference, we modify the traditional bootstrapping algorithm by training the involved classifiers with random feature subspaces instead of the whole feature space. Moreover, we further improve the random feature subspace-based bootstrapping with some constraints on the subspace generation to better satisfy the diversity preference. Experimental evaluation shows the effectiveness of our modified bootstrapping approach in both topic and sentiment-based TC tasks.
KW - Bootstrapping
KW - Semi-supervised learning
KW - Sentiment classification
UR - http://www.scopus.com/inward/record.url?scp=84901489060&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-41644-6_11
DO - 10.1007/978-3-642-41644-6_11
M3 - Conference article published in proceeding or book
SN - 9783642416439
T3 - Communications in Computer and Information Science
SP - 105
EP - 115
BT - Natural Language Processing and Chinese Computing - Second CCF Conference, NLPCC 2013, Proceedings
PB - Springer Verlag
T2 - 2nd CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2013
Y2 - 15 November 2013 through 19 November 2013
ER -