A sentence vector based over-sampling method for imbalanced emotion classification

Tao Chen, Ruifeng Xu, Qin Lu, Bin Liu, Jun Xu, Lin Yao, Zhenyu He

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

10 Citations (Scopus)

Abstract

Imbalanced training data poses a serious problem for supervised learning-based text classification. The problem becomes more serious in emotion classification tasks with multiple emotion categories, as the training data can be quite skewed. This paper presents a novel over-sampling method that forms additional sum sentence vectors for minority classes in order to improve emotion classification on imbalanced data. First, a large corpus is used to train a continuous skip-gram model, which produces word vectors using the word/POS pair as the unit. The sentence vectors of the training data are then constructed as the sum of their word/POS vectors. New minority-class training samples are then generated by randomly adding two sentence vectors from the corresponding class until every class has the same number of training samples, so that classifiers can be trained on a fully balanced training dataset. Evaluations on the NLP&CC2013 Chinese micro blog emotion classification dataset show that the resulting classifier achieves 48.4% average precision, an improvement of 11.9 percentage points over the state-of-the-art performance on this dataset (36.5%). This result shows that the proposed over-sampling method can effectively address the problem of data imbalance and thus achieve much improved performance in emotion classification.
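
The over-sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `sentence_vector` and `oversample_by_vector_sum`, the dictionary-based embedding lookup, and the choice to draw each pair as two distinct original samples are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def sentence_vector(tokens, embeddings, dim):
    """Sum the vectors of the word/POS units that appear in the embedding table.

    `embeddings` is a hypothetical dict mapping "word/POS" strings to vectors;
    units missing from the table are skipped (an assumption).
    """
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            vec += embeddings[tok]
    return vec


def oversample_by_vector_sum(X, y, seed=0):
    """Balance classes by adding pairs of same-class sentence vectors.

    For every class smaller than the largest one, new samples are created as
    the sum of two randomly chosen original vectors of that class until all
    classes have the same size.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    parts_X, parts_y = [X], [y]
    for label, count in zip(classes, counts):
        n_missing = target - count
        if n_missing == 0:
            continue
        members = X[y == label]
        if len(members) < 2:
            raise ValueError(f"class {label!r} needs at least two samples")
        synthetic = []
        for _ in range(n_missing):
            # Assumption: the two vectors are distinct original samples.
            i, j = rng.choice(len(members), size=2, replace=False)
            synthetic.append(members[i] + members[j])
        parts_X.append(np.stack(synthetic))
        parts_y.append(np.full(n_missing, label))

    return np.vstack(parts_X), np.concatenate(parts_y)
```

The balanced arrays returned by `oversample_by_vector_sum` could then be passed to any standard classifier; original samples come first, followed by the synthetic ones for each minority class.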
Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing - 15th International Conference, CICLing 2014, Proceedings
Publisher: Springer Verlag
Pages: 62-72
Number of pages: 11
Edition: PART 2
ISBN (Print): 9783642549021
Publication status: Published - 1 Jan 2014
Event: 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014 - Kathmandu, Nepal
Duration: 6 Apr 2014 to 12 Apr 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 8404 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014
Country/Territory: Nepal
City: Kathmandu
Period: 6/04/14 to 12/04/14

Keywords

  • Emotion classification
  • Imbalanced training data
  • Over-sampling
  • Sentence vector

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
