An ideal semantic representation of a text corpus should exhibit a hierarchical topic-tree structure, where topics residing at different levels of the tree show different degrees of semantic abstraction (i.e., the deeper a topic resides, the more specific it is). Instead of learning every node directly, which is quite time-consuming, our approach builds on a nonparametric Bayesian topic model, the Hierarchical Dirichlet Process (HDP). By tuning the topics' Dirichlet scale parameter, two topic sets at different levels of abstraction are learned from the HDP separately and then integrated into a hierarchical clustering process. We term our approach HDP Clustering (HDP-C). During the clustering process, the lower level of specific topics is clustered into the higher level of more general topics in an agglomerative fashion to produce the final topic tree. Evaluation of tree quality on several real-world datasets demonstrates competitive performance.
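The agglomerative step described above can be illustrated with a minimal sketch. The code below is a hypothetical toy implementation, not the paper's method: it assumes each specific topic is a word distribution (as an HDP run would produce), measures topic similarity with Jensen-Shannon divergence (an assumed choice; the paper does not specify its distance here), and greedily merges the closest pair until a target number of general topics remains.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def merge(p, q):
    """Average two word distributions to form a more general topic."""
    return [(pi + qi) / 2 for pi, qi in zip(p, q)]

def agglomerate(topics, n_general):
    """Greedily cluster specific topics into n_general general topics."""
    clusters = [list(t) for t in topics]
    while len(clusters) > n_general:
        # find the closest pair of current clusters under JS divergence
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: js_divergence(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# toy "specific" topics over a 4-word vocabulary (stand-ins for HDP output)
specific = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.6, 0.2],
]
general = agglomerate(specific, 2)
```

In this toy run the two topics concentrated on word 0 merge into one general topic, and the two concentrated on word 2 merge into the other; each merged distribution still sums to one, so the general topics remain valid word distributions.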
Number of pages: 6
Journal: CEUR Workshop Proceedings
Publication status: Published - 1 Dec 2012
Event: 2nd International Workshop on Searching and Integrating New Web Data Sources: Very Large Data Search, VLDS 2012 - Istanbul, Turkey
Duration: 31 Aug 2012 → 31 Aug 2012

ASJC Scopus subject areas:
- Computer Science (all)