Abstract
Extracting loosely structured data records (LSDRs) has wide applications in many domains, such as forum pattern recognition, Weblog data analysis, and book and news review analysis. Yet existing methods work well only for strongly structured data records (SDRs). In this paper, we propose to address the problem of extracting LSDRs through mining strict patterns. In our method, we utilize both content features and tag tree features to recognize the LSDRs, and propose a new algorithm to extract the data records (DRs) automatically. The experimental results demonstrate that our algorithm is able to effectively extract LSDRs with higher precision and recall. © 2009 Springer Science+Business Media, LLC.
| Original language | English |
|---|---|
| Pages (from-to) | 263-284 |
| Number of pages | 22 |
| Journal | World Wide Web |
| Volume | 12 |
| Issue number | 3 |
| Publication status | Published - 1 Aug 2009 |
| Externally published | Yes |
Keywords
- Content feature
- Data extraction
- Loosely structured data record
- Semi-structured data
- Tree edit distance
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications