Hierarchical clustering on HDP topics to build a semantic tree from text

Jianfeng Si, Qing Li, Tieyun Qian, Xiaotie Deng

Research output: Journal article publication › Conference article › Academic research › peer-review

Abstract

An ideal semantic representation of a text corpus should exhibit a hierarchical topic tree structure, in which topics residing at different levels of the tree exhibit different levels of semantic abstraction (i.e., the deeper the level at which a topic resides, the more specific it is). Instead of learning every node directly, which is quite time-consuming, our approach builds on a nonparametric Bayesian topic model, namely Hierarchical Dirichlet Processes (HDP). By tuning the topics' Dirichlet scale parameter, two topic sets at different levels of abstraction are learned separately from the HDP and then integrated into a hierarchical clustering process. We term our approach HDP Clustering (HDP-C). During the hierarchical clustering process, the lower-level, more specific topics are clustered into higher-level, more general topics in an agglomerative manner to obtain the final topic tree. Evaluation of tree quality on several real-world datasets demonstrates its competitive performance.
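The agglomerative step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the linkage rule, or how the two topic sets are combined, so everything below is an assumption made for illustration rather than the authors' HDP-C procedure: topics are represented as word-probability vectors, Jensen-Shannon divergence compares them, clusters are merged by centroid until their number matches the general topic set, and each cluster is then attached to its nearest general topic. The function names (`js_divergence`, `mean_dist`, `build_topic_tree`) and the toy distributions are hypothetical, and the HDP inference that would produce the two topic sets is not shown.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_dist(dists):
    """Element-wise mean of several distributions (a cluster centroid)."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]


def build_topic_tree(specific, general):
    """Merge specific topics bottom-up, then attach clusters to general topics.

    Assumed scheme (not taken from the paper): centroid linkage under
    JS divergence, stopping when len(clusters) == len(general).
    Returns {general_topic_index: [specific_topic_indices]}.
    """
    clusters = [[i] for i in range(len(specific))]

    def centroid(members):
        return mean_dist([specific[i] for i in members])

    while len(clusters) > len(general):
        # Pick the pair of clusters whose centroids are closest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: js_divergence(
                centroid(clusters[ab[0]]), centroid(clusters[ab[1]])
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected

    tree = {g: [] for g in range(len(general))}
    for members in clusters:
        c = centroid(members)
        parent = min(range(len(general)), key=lambda g: js_divergence(c, general[g]))
        tree[parent].extend(sorted(members))
    return tree


# Toy topic-word distributions over a 4-word vocabulary (invented values).
specific = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.3, 0.7],
]
general = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

print(build_topic_tree(specific, general))  # {0: [0, 1], 1: [2, 3]}
```

In this toy case the two specific topics that share vocabulary with each general topic end up beneath it, giving a two-level tree. A real pipeline would replace the hand-written vectors with topic-word distributions estimated by two HDP runs under different Dirichlet scale settings.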

Original language: English
Pages (from-to): 9-14
Number of pages: 6
Journal: CEUR Workshop Proceedings
Volume: 884
Publication status: Published - 1 Dec 2012
Externally published: Yes
Event: 2nd International Workshop on Searching and Integrating New Web Data Sources: Very Large Data Search, VLDS 2012 - Istanbul, Turkey
Duration: 31 Aug 2012 → 31 Aug 2012

ASJC Scopus subject areas

  • Computer Science (all)
