XML document clustering using common XPath

Ho Pong Leung, Fu Lai Korris Chung, Stephen C.F. Chan, Wing Pong Robert Luk

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

39 Citations (Scopus)

Abstract

XML is becoming a common way of storing data. The elements and their arrangement in the document's hierarchy not only describe the document structure but also imply the data's semantic meaning, and hence provide valuable information to develop tools for manipulating XML documents. In this paper, we pursue a data mining approach to the problem of XML document clustering. We introduce a novel XML structural representation called common XPath (CXP), which encodes the frequently occurring elements with the hierarchical information, and propose to take the CXPs mined to form the feature vectors for XML document clustering. In other words, data mining acts as a feature extractor in the clustering process. Based on this idea, we devise a path-based XML document clustering algorithm called PBClustering which groups the documents according to their CXPs, i.e. their frequent structures. Encouraging simulation results are observed and reported.
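The pipeline the abstract describes (mine frequent structural paths, use them as features, then group documents) can be illustrated with a minimal Python sketch. This is not the authors' CXP mining or PBClustering algorithm, whose details the abstract does not give. It assumes that a "path" is a root-to-element tag sequence, that a path is "frequent" when it appears in at least a fraction `min_support` of the documents, and that documents are grouped by a simple greedy pass using Jaccard similarity. All function names and both thresholds are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_paths(xml_text):
    """Return the set of root-to-element tag paths found in one XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def frequent_paths(docs_paths, min_support):
    """Keep paths that occur in at least min_support (a fraction) of the documents."""
    counts = Counter(p for paths in docs_paths for p in paths)
    threshold = min_support * len(docs_paths)
    return sorted(p for p, c in counts.items() if c >= threshold)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, min_support=0.3, min_similarity=0.5):
    """Greedy grouping (an assumption, not PBClustering): each document joins
    the first cluster whose representative is similar enough, else starts one."""
    docs_paths = [root_paths(d) for d in docs]
    features = set(frequent_paths(docs_paths, min_support))
    # Each document's feature vector is the set of frequent paths it contains.
    vectors = [paths & features for paths in docs_paths]
    clusters = []  # list of (representative path set, member indices)
    for i, vec in enumerate(vectors):
        for rep, members in clusters:
            if jaccard(vec, rep) >= min_similarity:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<order><item><price/></item></order>",
    "<order><item><price/><qty/></item></order>",
]
groups = cluster(docs)
```

In this toy input, the rare paths `/book/year` and `/order/item/qty` fall below the support threshold, so only the shared structure of each document family contributes to the feature sets that are compared.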
Original language: English
Title of host publication: Proceedings - International Workshop on Challenges in Web Information Retrieval and Integration, WIRI'05
Pages: 91-96
Number of pages: 6
Volume: 2005
Publication status: Published - 1 Dec 2005
Event: International Workshop on Challenges in Web Information Retrieval and Integration, WIRI'05 - Tokyo, Japan
Duration: 8 Apr 2005 → 9 Apr 2005

Conference

Conference: International Workshop on Challenges in Web Information Retrieval and Integration, WIRI'05
Country/Territory: Japan
City: Tokyo
Period: 8/04/05 → 9/04/05

Keywords

  • Frequent structure mining
  • XML document clustering
  • XML mining
  • XPath

ASJC Scopus subject areas

  • General Engineering
