TY - JOUR
T1 - Copula Guided Parallel Gibbs Sampling for Nonparametric and Coherent Topic Discovery
AU - Lin, Lihui
AU - Rao, Yanghui
AU - Xie, Haoran
AU - Lau, Raymond Y.K.
AU - Yin, Jian
AU - Wang, Fu Lee
AU - Li, Qing
N1 - Funding Information:
This work was supported in part by the National Natural Science Foundation of China under Grant 61972426, in part by the Interdisciplinary Research Scheme of the Dean's Research Fund 2018-19 under Grant FLASS/DRF/IDS-3, in part by the Departmental Collaborative Research Fund 2019 of The Education University of Hong Kong under Grant MIT/DCRF-R2/18-19, in part by the HKIBS Research Seed Fund 2019/20 of Lingnan University, Hong Kong under Grant 190-009, in part by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China under Grant UGC/FDS16/E01/19, in part by the RGC of the Hong Kong SAR under Grants CityU 11507219 and CityU 11525716, in part by the NSFC Basic Research Program under Grant 71671155 and the CityU Shenzhen Research Institute. The work of J. Yin was supported by the National Natural Science Foundation of China under Grants U1711262, U1611264, U1711261, U1811261, U1811264, and U1911203.
Publisher Copyright:
© 1989-2012 IEEE.
PY - 2022/1/1
Y1 - 2022/1/1
N2 - The Hierarchical Dirichlet Process (HDP) has attracted much attention in the natural language processing research community. Given a corpus, HDP is able to determine the number of topics automatically, possessing an important feature dubbed nonparametric that overcomes the challenging issue of manually specifying a suitable topic number in parametric topic models such as Latent Dirichlet Allocation (LDA). Nevertheless, HDP requires a much higher computational cost than LDA for parameter estimation. By taking advantage of multi-threading, a parallel Gibbs sampling algorithm has been proposed to estimate parameters for HDP based on the equivalence between HDP and the Gamma-Gamma Poisson Process (G2PP) in terms of the generative process. Unfortunately, this parallel Gibbs sampling algorithm requires a finite approximation to be applied to the number of topics manually (i.e., the topic number must be predefined), and thus cannot retain the nonparametric feature of HDP. Another drawback of the above models is that they fail to capture semantic dependencies between words, because the topic assignments of words are independent of each other. Although some work has been done on phrase-based topic modelling, existing methods are still limited by either forcing an entire phrase to share a common topic or relying on complex and time-consuming phrase mining methods. In this paper, we aim to develop a copula guided parallel Gibbs sampling algorithm for HDP that can adjust the number of topics dynamically and capture the latent semantic dependencies between words that compose a coherent segment. Extensive experiments on real-world datasets indicate that our method achieves low perplexities and high topic coherence scores with a small time cost. In addition, we validate the effectiveness of our method in modelling word semantic dependencies by comparing the extracted topical phrases with those learned by state-of-the-art phrase-based baselines.
AB - The Hierarchical Dirichlet Process (HDP) has attracted much attention in the natural language processing research community. Given a corpus, HDP is able to determine the number of topics automatically, possessing an important feature dubbed nonparametric that overcomes the challenging issue of manually specifying a suitable topic number in parametric topic models such as Latent Dirichlet Allocation (LDA). Nevertheless, HDP requires a much higher computational cost than LDA for parameter estimation. By taking advantage of multi-threading, a parallel Gibbs sampling algorithm has been proposed to estimate parameters for HDP based on the equivalence between HDP and the Gamma-Gamma Poisson Process (G2PP) in terms of the generative process. Unfortunately, this parallel Gibbs sampling algorithm requires a finite approximation to be applied to the number of topics manually (i.e., the topic number must be predefined), and thus cannot retain the nonparametric feature of HDP. Another drawback of the above models is that they fail to capture semantic dependencies between words, because the topic assignments of words are independent of each other. Although some work has been done on phrase-based topic modelling, existing methods are still limited by either forcing an entire phrase to share a common topic or relying on complex and time-consuming phrase mining methods. In this paper, we aim to develop a copula guided parallel Gibbs sampling algorithm for HDP that can adjust the number of topics dynamically and capture the latent semantic dependencies between words that compose a coherent segment. Extensive experiments on real-world datasets indicate that our method achieves low perplexities and high topic coherence scores with a small time cost. In addition, we validate the effectiveness of our method in modelling word semantic dependencies by comparing the extracted topical phrases with those learned by state-of-the-art phrase-based baselines.
KW - Copulas
KW - Parallel Gibbs sampling
KW - Topic modelling
UR - http://www.scopus.com/inward/record.url?scp=85120373073&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2020.2976945
DO - 10.1109/TKDE.2020.2976945
M3 - Journal article
AN - SCOPUS:85120373073
SN - 1041-4347
VL - 34
SP - 219
EP - 235
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 1
ER -