Topic attention encoder: A self-supervised approach for short text clustering

Jian Jin, Haiyuan Zhao, Ping Ji

Research output: Journal article › Academic research › peer-review


Short text clustering is a challenging and important task in many practical applications. However, many Bag-of-Words–based methods for short text clustering are limited by the sparsity of text representations, while many sentence embedding–based methods fail to capture document structure dependencies within a text corpus. To address these shortcomings, a topic attention encoder (TAE) is proposed in this study. Cross-document information is introduced through topics derived from the corpus by topic modelling techniques. The encoder takes the document-topic vector as its learning target, and the concatenation of each word embedding with its corresponding topic-word vector as its input. A self-attention mechanism is employed in the encoder to adaptively weight the hidden states and encode the semantics of each short text document. By capturing both global dependencies and local semantics, TAE combines the strengths of Bag-of-Words methods and sentence embedding methods. Finally, benchmarking experiments were conducted on three public data sets, demonstrating that the proposed TAE outperforms many document representation benchmark methods for short text clustering.
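The abstract's description of the encoder can be illustrated with a minimal sketch: each token is represented by the concatenation of its word embedding and topic-word vector, and a self-attention layer produces adaptive weights over these hidden states to form a single document vector. All names, dimensions, and the additive-attention parameterisation below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topic_attention_encode(word_emb, topic_word, W_att, v_att):
    """Encode one short document (hypothetical sketch).

    word_emb:   (T, d_w) word embeddings for T tokens
    topic_word: (T, d_t) topic-word vector of each token
    W_att, v_att: attention parameters (assumed additive attention)
    Returns a single document vector of shape (d_w + d_t,).
    """
    # concatenate word embedding with corresponding topic-word vector
    h = np.concatenate([word_emb, topic_word], axis=1)   # (T, d_w + d_t)
    # self-attention scores over hidden states
    scores = np.tanh(h @ W_att) @ v_att                  # (T,)
    alpha = softmax(scores)                              # adaptive weights
    # attention-weighted sum encodes the document semantics;
    # training would regress this vector onto the document-topic vector
    return alpha @ h

# hypothetical dimensions and random parameters for demonstration
T, d_w, d_t, d_a = 6, 50, 10, 32
rng = np.random.default_rng(0)
doc = topic_attention_encode(rng.normal(size=(T, d_w)),
                             rng.normal(size=(T, d_t)),
                             rng.normal(size=(d_w + d_t, d_a)),
                             rng.normal(size=(d_a,)))
```

In this reading, the document-topic vector from the topic model serves as a self-supervised target, so no cluster labels are needed; the learned document vectors would then be fed to a standard clustering algorithm.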

Original languageEnglish
Number of pages17
JournalJournal of Information Science
Publication statusPublished - 30 Nov 2020


Keywords

  • global dependency
  • local semantics
  • self-attention
  • short text clustering
  • text representation
  • topic attention encoder

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
