RecipeCrawler: Collecting recipe data from WWW incrementally

Yu Li, Xiaofeng Meng, Liping Wang, Qing Li

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

12 Citations (Scopus)

Abstract

The WWW has established itself as the largest data repository in human history. Using the Internet as a data source seems natural, and many efforts have been made to do so. In this paper we focus on establishing a robust system to collect structured recipe data from the Web incrementally, which we believe is a critical step towards practical, continuous, and reliable web data extraction systems, and therefore towards utilizing the WWW as a data source for various database applications. The reasons for advocating such an incremental approach are two-fold: (1) it is impractical to crawl all the recipe pages from relevant web sites, as the Web is highly dynamic; (2) it is almost impossible to induce a general wrapper for future extraction from the initial batch of recipe web pages. We describe such a system, called RecipeCrawler, which targets the incremental collection of recipe data from the WWW. General issues in establishing an incremental data extraction system are considered, and techniques are applied to recipe data collection from the Web. RecipeCrawler is used as the backend of a fully-fledged multimedia recipe database system being developed jointly by City University of Hong Kong and Renmin University of China.

Original language: English
Title of host publication: Advances in Web-Age Information Management - 7th International Conference, WAIM 2006, Proceedings
Publisher: Springer-Verlag
Pages: 263-274
Number of pages: 12
ISBN (Print): 3540352252, 9783540352259
Publication status: Published - 1 Jan 2006
Externally published: Yes
Event: 7th International Conference on Advances in Web-Age Information Management, WAIM 2006 - Hong Kong, Hong Kong
Duration: 17 Jun 2006 to 19 Jun 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4016 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th International Conference on Advances in Web-Age Information Management, WAIM 2006
Country: Hong Kong
City: Hong Kong
Period: 17/06/06 to 19/06/06

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
