TY - GEN
T1 - Leveraging network structure for incremental document clustering
AU - Qian, Tieyun
AU - Si, Jianfeng
AU - Li, Qing
AU - Yu, Qian
PY - 2012/4/18
Y1 - 2012/4/18
N2 - Recent studies have shown that link-based clustering methods can significantly improve the performance of content-based clustering. However, most previous algorithms are developed for fixed data sets, and are not applicable to the dynamic environments such as data warehouse and online digital library. In this paper, we introduce a novel approach which leverages the network structure for incremental clustering. Under this framework, both the link and content information are incorporated to determine the host cluster of a new document. The combination of two types of information ensures a promising performance of the clustering results. Furthermore, the status of core members is used to quickly determine whether to split or merge a new cluster. This filtering process eliminates the unnecessary and time-consuming checks of textual similarity on the whole corpus, and thus greatly speeds up the entire procedure. We evaluate our proposed approach on several real-world publication data sets and conduct an extensive comparison with both the classic content based and the recent link based algorithms. The experimental results demonstrate the effectiveness and efficiency of our method.
AB - Recent studies have shown that link-based clustering methods can significantly improve the performance of content-based clustering. However, most previous algorithms are developed for fixed data sets, and are not applicable to the dynamic environments such as data warehouse and online digital library. In this paper, we introduce a novel approach which leverages the network structure for incremental clustering. Under this framework, both the link and content information are incorporated to determine the host cluster of a new document. The combination of two types of information ensures a promising performance of the clustering results. Furthermore, the status of core members is used to quickly determine whether to split or merge a new cluster. This filtering process eliminates the unnecessary and time-consuming checks of textual similarity on the whole corpus, and thus greatly speeds up the entire procedure. We evaluate our proposed approach on several real-world publication data sets and conduct an extensive comparison with both the classic content based and the recent link based algorithms. The experimental results demonstrate the effectiveness and efficiency of our method.
UR - http://www.scopus.com/inward/record.url?scp=84859702147&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-29253-8_29
DO - 10.1007/978-3-642-29253-8_29
M3 - Conference article published in proceeding or book
AN - SCOPUS:84859702147
SN - 9783642292521
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 342
EP - 353
BT - Web Technologies and Applications - 14th Asia-Pacific Web Conference, APWeb 2012, Proceedings
T2 - 14th Asia Pacific Web Technology Conference, APWeb 2012
Y2 - 11 April 2012 through 13 April 2012
ER -