TY - GEN
T1 - Hybrid method for automated news content extraction from the Web
AU - Li, Yu
AU - Meng, Xiaofeng
AU - Li, Qing
AU - Wang, Liping
PY - 2006/1/1
Y1 - 2006/1/1
N2 - Web news content extraction is vital for improving news indexing and searching in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid one that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implemented a prototype system for Web news content extraction and conducted an empirical evaluation; the results show that our method is highly effective and efficient.
AB - Web news content extraction is vital for improving news indexing and searching in today's search engines, especially for news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid one that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implemented a prototype system for Web news content extraction and conducted an empirical evaluation; the results show that our method is highly effective and efficient.
UR - http://www.scopus.com/inward/record.url?scp=33845272152&partnerID=8YFLogxK
M3 - Conference article published in proceeding or book
AN - SCOPUS:33845272152
SN - 3540481052
SN - 9783540481058
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 327
EP - 338
BT - Web Information Systems - WISE 2006
PB - Springer-Verlag
T2 - 7th International Conference on Web Information Systems Engineering, WISE 2006
Y2 - 23 October 2006 through 26 October 2006
ER -