Topic attention encoder: A self-supervised approach for short text clustering

Jian Jin, Haiyuan Zhao, Ping Ji

Research output: Journal article publication; Journal article; Academic research; peer-review

Abstract

Short text clustering is a challenging and important task in many practical applications. However, many bag-of-words–based methods for short text clustering are limited by the sparsity of the text representation, while many sentence embedding–based methods fail to capture the document structure dependencies within a text corpus. To address these shortcomings, this study proposes a topic attention encoder (TAE). Cross-document information is introduced through topics derived from the corpus by topic modelling. The encoder takes the document-topic vector as the learning target and the concatenation of each word embedding with its corresponding topic-word vector as the input. A self-attention mechanism is also employed in the encoder to adaptively weight the hidden states and encode the semantics of each short text document. By capturing both global dependencies and local semantics, TAE combines the strengths of bag-of-words methods and sentence embedding methods. Finally, several categories of benchmarking experiments were conducted on three public data sets. The results demonstrate that the proposed TAE outperforms many document representation benchmark methods for short text clustering.
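
The data flow described in the abstract (concatenated input, attention over hidden states, document-topic vector as target) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the encoder type, the attention form, or the loss, so the per-token tanh projection, the additive-style attention pooling, the KL-divergence loss, and all names (`encode`, `kl_loss`, `W_h`, `w_a`, `W_out`) are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode(word_emb, topic_word, W_h, w_a, W_out):
    """Forward pass of a toy topic-attention-style encoder (illustrative only).

    word_emb:   (T, d_w) word embeddings for the T tokens of one short text
    topic_word: (T, K)   topic-word vector for each token (from a topic model)
    """
    # Input: concatenation of word embedding and topic-word vector per token.
    x = np.concatenate([word_emb, topic_word], axis=1)   # (T, d_w + K)
    # Assumed stand-in for the encoder: a per-token tanh projection.
    h = np.tanh(x @ W_h)                                  # (T, d_h) hidden states
    # Attention weights over hidden states (assumed scoring form).
    alpha = softmax(h @ w_a, axis=0)                      # (T,)
    # Document representation: attention-weighted sum of hidden states.
    doc_vec = alpha @ h                                   # (d_h,)
    # Predicted document-topic distribution.
    theta_hat = softmax(doc_vec @ W_out)                  # (K,)
    return doc_vec, alpha, theta_hat

def kl_loss(theta, theta_hat, eps=1e-12):
    # Assumed training signal: KL(theta || theta_hat), where theta is the
    # document-topic vector produced by the topic model.
    return float(np.sum(theta * (np.log(theta + eps) - np.log(theta_hat + eps))))

# Usage with random placeholder data (shapes are hypothetical).
rng = np.random.default_rng(0)
T, d_w, K, d_h = 5, 8, 4, 6
word_emb = rng.normal(size=(T, d_w))
topic_word = softmax(rng.normal(size=(T, K)))
W_h = rng.normal(size=(d_w + K, d_h)) * 0.1
w_a = rng.normal(size=d_h)
W_out = rng.normal(size=(d_h, K)) * 0.1
theta = softmax(rng.normal(size=K))

doc_vec, alpha, theta_hat = encode(word_emb, topic_word, W_h, w_a, W_out)
loss = kl_loss(theta, theta_hat)
```

In a sketch like this, `doc_vec` would be the representation passed to a clustering algorithm once the weights have been trained against the topic-model targets; no gradient step is shown here.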

Original language: English
Number of pages: 17
Journal: Journal of Information Science
Publication status: Published - 30 Nov 2020

Keywords

  • Global dependency
  • local semantic
  • self-attention
  • short text clustering
  • text representation
  • topic attention encoder

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
