Abstract
This paper examines the feasibility of discovering "title-like" terms using a decision tree classifier from the document. The premise of discovering title-like terms is that title terms and title-like terms should behave similarly in the document. This behavior is characterized by a set of distributional and linguistic features. By training the classifier to observe the behavior of title terms in a balanced manner using 25,000 titles in Reuters articles, other terms with similar behavior would also be discovered. Based on 5000 unseen titles, the recall of title terms was 83%, similar to the manual identification of title terms. The precision of finding title terms is low (i.e., 32%) because some non-title but title-like terms should have been identified as well. Seven subjects were asked to rate, on a scale of between 1 and 5, whether the identified term is a topical/thematic/title term. If a rating of 2.5 is used to determine whether a term is judged to be a "title-like" term, then the mean precision is increased to 58%, or the headline/title is expanded with twice the average number of terms. Since this precision (i.e., 58%) is similar to the mean precision of manually identified title terms averaged across different subjects, we conclude that the discovery of title-like terms using classifiers is a promising approach.
Original language | English |
---|---|
Pages (from-to) | 789-800 |
Number of pages | 12 |
Journal | Information Processing and Management |
Volume | 41 |
Issue number | 4 |
DOIs | |
Publication status | Published - 1 Jul 2005 |
Keywords
- Classification
- Induction
- Term extraction
ASJC Scopus subject areas
- Information Systems
- Media Technology
- Computer Science Applications
- Management Science and Operations Research
- Library and Information Sciences