An integrated approach to heterogeneous data for information extraction

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review


The paper proposes an integrated framework for extracting personal information from the web, such as biographical information and occupation; this information is needed to construct a social network (a kind of semantic web) for a person. Because web data is heterogeneous in nature, most IE systems, whether named entity recognition (NER) or relation detection and recognition (RDR) systems, fail to produce reliably robust results. We propose a flexible framework that effectively complements state-of-the-art statistical IE systems with rule-based IE systems for web data and achieves substantial improvement over other existing systems. In particular, in our current experiment, both the rule-based IE system, which is designed around web-specific expression patterns, and the statistical IE systems, which were developed for homogeneous corpora, are sensitive only to specific information types. Hence we argue that system performance can be improved incrementally as new and effective IE systems are added to the framework.

M. Lee, and Chu-Ren Huang.
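The idea of routing each information type to whichever extractors are sensitive to it, and of growing the system by adding extractors, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Framework` class, the `register` and `extract` methods, the birth-year pattern, and the lexicon-based stand-in for a statistical system are all hypothetical names and simplifications introduced here for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An extractor maps raw text to a list of extracted string values.
Extractor = Callable[[str], List[str]]


@dataclass
class Framework:
    """Hypothetical registry: routes each information type to the
    extractors registered for it and pools their results."""
    registry: Dict[str, List[Extractor]] = field(default_factory=dict)

    def register(self, info_type: str, extractor: Extractor) -> None:
        # New extractors can be added at any time without touching old ones.
        self.registry.setdefault(info_type, []).append(extractor)

    def extract(self, text: str) -> Dict[str, List[str]]:
        results: Dict[str, List[str]] = {}
        for info_type, extractors in self.registry.items():
            pooled: List[str] = []
            for extractor in extractors:
                for value in extractor(text):
                    if value not in pooled:  # de-duplicate, keep first-seen order
                        pooled.append(value)
            results[info_type] = pooled
        return results


def rule_based_birth_year(text: str) -> List[str]:
    # Illustrative web-style pattern, e.g. "born in 1960" or "born 1960".
    return re.findall(r"\bborn(?:\s+in)?\s+(\d{4})\b", text, flags=re.IGNORECASE)


def stand_in_occupation_tagger(text: str) -> List[str]:
    # Placeholder for a statistical system: a fixed lexicon lookup,
    # NOT a trained model.
    lexicon = ("professor", "engineer", "linguist")
    lowered = text.lower()
    return [word for word in lexicon if word in lowered]


framework = Framework()
framework.register("birth_year", rule_based_birth_year)
framework.register("occupation", stand_in_occupation_tagger)

sample = "Jane Doe, born in 1960, is a professor and linguist."
print(framework.extract(sample))
```

Registering a further extractor for an existing information type only extends that type's pooled results, which is one simple way to read the abstract's claim that performance can improve incrementally as systems are added.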
Original language: English
Title of host publication: PACLIC 23 - Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation
Number of pages: 10
Publication status: Published - 1 Dec 2009
Event: 23rd Pacific Asia Conference on Language, Information and Computation, PACLIC 23 - Hong Kong, Hong Kong
Duration: 3 Dec 2009 → 5 Dec 2009


Conference: 23rd Pacific Asia Conference on Language, Information and Computation, PACLIC 23
Country/Territory: Hong Kong
City: Hong Kong


Keywords

  • Information extraction
  • Relation extraction

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (miscellaneous)
