TY - GEN
T1 - Canonicalization of graph database records using similarity measures
AU - Li, Na
AU - Li, Qing
AU - Wang, Liping
PY - 2008/12/1
Y1 - 2008/12/1
N2 - Information extraction and crawling from the Web have become increasingly common, yet raw data are often noisy and redundant due to heterogeneous sources. Although much work has focused on duplicate record detection, there has been little investigation into providing users with a uniform, standard result derived from the duplicates, which we refer to as a canonical result; the process is referred to as record canonicalization. In this paper, we focus on the situation of imperfect and duplicate documents on the Web, and propose a preprocessing method of graph canonicalization. We first formalize the problem of graph record canonicalization, and then propose three possible solutions in order. Based on this framework, we implement graph selection canonicalization, which aims to construct a canonical graph by selecting the central graph among the records. Experimental results demonstrate its performance in representing real-world entities.
AB - Information extraction and crawling from the Web have become increasingly common, yet raw data are often noisy and redundant due to heterogeneous sources. Although much work has focused on duplicate record detection, there has been little investigation into providing users with a uniform, standard result derived from the duplicates, which we refer to as a canonical result; the process is referred to as record canonicalization. In this paper, we focus on the situation of imperfect and duplicate documents on the Web, and propose a preprocessing method of graph canonicalization. We first formalize the problem of graph record canonicalization, and then propose three possible solutions in order. Based on this framework, we implement graph selection canonicalization, which aims to construct a canonical graph by selecting the central graph among the records. Experimental results demonstrate its performance in representing real-world entities.
KW - canonicalization
KW - database record
KW - deduplication
KW - graph mining
UR - http://www.scopus.com/inward/record.url?scp=77952982625&partnerID=8YFLogxK
U2 - 10.1145/1352793.1352853
DO - 10.1145/1352793.1352853
M3 - Conference article published in proceeding or book
AN - SCOPUS:77952982625
SN - 9781595939937
T3 - Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, ICUIMC-2008
SP - 278
EP - 283
BT - Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, ICUIMC-2008
T2 - 2nd International Conference on Ubiquitous Information Management and Communication, ICUIMC-2008
Y2 - 31 January 2008 through 1 February 2008
ER -