Hybrid method for automated news content extraction from the Web

Yu Li, Xiaofeng Meng, Qing Li, Liping Wang

Research output: Chapter in book / Conference proceeding; Conference article published in proceeding or book; Academic research; peer-review

8 Citations (Scopus)

Abstract

Web news content extraction is vital for improving news indexing and searching in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid one that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implement a prototype system for Web news content extraction and evaluate it empirically; the results show that our method is highly effective and efficient.

Original language: English
Title of host publication: Web Information Systems - WISE 2006
Subtitle of host publication: 7th International Conference on Web Information Systems Engineering, Proceedings
Publisher: Springer-Verlag
Pages: 327-338
Number of pages: 12
ISBN (Print): 3540481052, 9783540481058
Publication status: Published - 1 Jan 2006
Externally published: Yes
Event: 7th International Conference on Web Information Systems Engineering, WISE 2006 - Wuhan, China
Duration: 23 Oct 2006 - 26 Oct 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4255 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th International Conference on Web Information Systems Engineering, WISE 2006
Country/Territory: China
City: Wuhan
Period: 23/10/06 - 26/10/06

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
