Abstract
The distribution of documents across topic classes is typically highly skewed. As a result, classifiers tend to achieve good micro-average performance but less desirable macro-average performance. Viewing topics as clusters in a high-dimensional space, we propose using clustering to identify subtopic clusters within large topic classes, on the assumption that a large topic cluster is generally a mixture of several subtopic clusters. We used the Reuters news articles and support vector machines to evaluate whether using subtopic clusters can lead to better macro-average performance.
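To illustrate why a skewed class distribution can separate the two averages, the following is a minimal sketch in Python, not the paper's evaluation code. It assumes a single-label setting and uses F1 as the performance measure. The data, the function names `f1` and `micro_macro_f1`, and the "predict the large topic for everything" classifier are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 score from raw counts; defined as 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-averaged F1, macro-averaged F1) for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Micro-average: pool the counts over all classes, then compute F1 once.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-average: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical skewed collection: 90 documents in a large topic, 10 in a small one.
y_true = ["large"] * 90 + ["small"] * 10
# Hypothetical classifier that assigns every document to the large topic.
y_pred = ["large"] * 100

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

Because the micro-average pools counts over all documents, the large class dominates it, while the macro-average gives the small class the same weight as the large one, so completely missing the small class halves the score in this toy case.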
Original language | English |
---|---|
Pages (from-to) | 203-214 |
Number of pages | 12 |
Journal | Lecture Notes in Computer Science |
Volume | 3513 |
Publication status | Published - 30 Sept 2005 |
Event | 10th International Conference on Applications of Natural Language to Information Systems, NLDB 2005: Natural Language Processing and Information Systems - Alicante, Spain. Duration: 15 Jun 2005 → 17 Jun 2005 |
ASJC Scopus subject areas
- Theoretical Computer Science
- Computer Science (all)