TY - GEN
T1 - Incorporating concept information into term weighting schemes for topic models
AU - Zhang, Huakui
AU - Cai, Yi
AU - Zhu, Bingshan
AU - Zheng, Changmeng
AU - Yang, Kai
AU - Wong, Raymond Chi Wing
AU - Li, Qing
N1 - Funding Information:
Acknowledgement. This work was supported by the Fundamental Research Funds for the Central Universities, SCUT (No. 2017ZD048, D2182480), the Science and Technology Planning Project of Guangdong Province (No. 2017B050506004), the Science and Technology Programs of Guangzhou (No. 201704030076, 201802010027, 201902010046), the Hong Kong Research Grants Council (project no. PolyU 1121417), and an internal research grant from the Hong Kong Polytechnic University (project 1.9B0V).
Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020/4
Y1 - 2020/4
N2 - Topic models demonstrate outstanding ability in discovering latent topics in text corpora. A coherent topic consists of words or entities related to similar concepts, i.e., abstract ideas of categories of things. To generate more coherent topics, term weighting schemes have been proposed for topic models by assigning weights to terms in text, such as promoting the informative entities. However, in current term weighting schemes, entities are not discriminated by their concepts, which may cause incoherent topics containing entities from unrelated concepts. To solve the problem, in this paper we propose two term weighting schemes for topic models, the CEP scheme and the DCEP scheme, to improve topic coherence by incorporating the concept information of the entities. More specifically, the CEP term weighting scheme gives more weight to entities from the concepts that reveal the topics of the document. The DCEP scheme further reduces the co-occurrence of entities from unrelated concepts by separating them into different duplicates of a document. We develop the CEP-LDA and DCEP-LDA term weighting topic models by applying the two proposed term weighting schemes to LDA. Experimental results on two public datasets show that the CEP-LDA and DCEP-LDA topic models can produce more coherent topics.
AB - Topic models demonstrate outstanding ability in discovering latent topics in text corpora. A coherent topic consists of words or entities related to similar concepts, i.e., abstract ideas of categories of things. To generate more coherent topics, term weighting schemes have been proposed for topic models by assigning weights to terms in text, such as promoting the informative entities. However, in current term weighting schemes, entities are not discriminated by their concepts, which may cause incoherent topics containing entities from unrelated concepts. To solve the problem, in this paper we propose two term weighting schemes for topic models, the CEP scheme and the DCEP scheme, to improve topic coherence by incorporating the concept information of the entities. More specifically, the CEP term weighting scheme gives more weight to entities from the concepts that reveal the topics of the document. The DCEP scheme further reduces the co-occurrence of entities from unrelated concepts by separating them into different duplicates of a document. We develop the CEP-LDA and DCEP-LDA term weighting topic models by applying the two proposed term weighting schemes to LDA. Experimental results on two public datasets show that the CEP-LDA and DCEP-LDA topic models can produce more coherent topics.
KW - Latent Dirichlet Allocation
KW - Term weighting scheme
KW - Topic model
UR - http://www.scopus.com/inward/record.url?scp=85092108989&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-59416-9_14
DO - 10.1007/978-3-030-59416-9_14
M3 - Conference article published in proceeding or book
AN - SCOPUS:85092108989
SN - 9783030594152
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 227
EP - 244
BT - Database Systems for Advanced Applications - 25th International Conference, DASFAA 2020, Proceedings
A2 - Nah, Yunmook
A2 - Cui, Bin
A2 - Lee, Sang-Won
A2 - Yu, Jeffrey Xu
A2 - Moon, Yang-Sae
A2 - Whang, Steven Euijong
PB - Springer Science and Business Media Deutschland GmbH
T2 - 25th International Conference on Database Systems for Advanced Applications, DASFAA 2020
Y2 - 24 September 2020 through 27 September 2020
ER -