Canonicalization of graph database records using similarity measures

Na Li, Qing Li, Liping Wang

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

1 Citation (Scopus)

Abstract

Information extraction and crawling from the Web have become increasingly common, yet raw data are often noisy and redundant due to heterogeneous sources. Although much work has focused on duplicate record detection, there has been little investigation into providing users with a uniform, standard result derived from the duplicates, which we refer to as a canonical result; the process of producing it is referred to as record canonicalization. In this paper, we focus on the situation of imperfect and duplicate documents on the Web, and propose a preprocessing method of graph canonicalization. We first formalize the problem of graph record canonicalization, and then propose three possible solutions in order. Upon this framework, we implement graph selection canonicalization, which aims to construct a canonical graph by selecting the central graph among the records. Experimental results demonstrate its performance in representing real-world entities.
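As a rough illustration of the "central graph" idea in the abstract, the sketch below picks, from a set of duplicate graph records, the one with the highest total similarity to all the others. The abstract does not state which similarity measure the paper uses, so the Jaccard similarity over labeled edges, the edge-triple representation, and the names `jaccard` and `select_central_graph` are all assumptions made for this example, not the authors' method.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a graph record is a set of labeled edges
# (source node, edge label, target node). This is hypothetical.
Edge = Tuple[str, str, str]
Graph = FrozenSet[Edge]


def jaccard(a: Graph, b: Graph) -> float:
    """Jaccard similarity of two edge sets (an assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_central_graph(records: List[Graph]) -> int:
    """Return the index of the record most similar to all the others."""
    if not records:
        raise ValueError("records must not be empty")
    best_index, best_score = 0, -1.0
    for i, g in enumerate(records):
        # Sum the similarity of record i to every other record.
        score = sum(jaccard(g, h) for j, h in enumerate(records) if j != i)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Three made-up duplicate records describing the same entity.
records = [
    frozenset({("p1", "title", "Graph Mining"), ("p1", "year", "2008")}),
    frozenset({("p1", "title", "Graph Mining"), ("p1", "year", "2008"),
               ("p1", "venue", "ICUIMC")}),
    frozenset({("p1", "title", "Graph Minig"), ("p1", "venue", "ICUIMC")}),
]
central = select_central_graph(records)
```

Selecting an existing record, rather than merging records, guarantees that the canonical result is a graph that actually occurred in the data; the trade-off is that it cannot combine information spread across several duplicates.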

Original language: English
Title of host publication: Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, ICUIMC-2008
Pages: 278-283
Number of pages: 6
Publication status: Published - 1 Dec 2008
Externally published: Yes
Event: 2nd International Conference on Ubiquitous Information Management and Communication, ICUIMC-2008 - Suwon, Korea, Republic of
Duration: 31 Jan 2008 → 1 Feb 2008

Publication series

Name: Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, ICUIMC-2008

Conference

Conference: 2nd International Conference on Ubiquitous Information Management and Communication, ICUIMC-2008
Country: Korea, Republic of
City: Suwon
Period: 31/01/08 → 1/02/08

Keywords

  • canonicalization
  • database record
  • deduplication
  • graph mining

ASJC Scopus subject areas

  • Computer Science Applications
