Abstract
This paper examines the feasibility of discovering "title-like" terms using a decision tree classifier from the document. The premise of discovering title-like terms is that title terms and title-like terms should behave similarly in the document. This behavior is characterized by a set of distributional and linguistic features. By training the classifier to observe the behavior of title terms in a balanced manner using 25,000 titles in Reuters articles, other terms with similar behavior would also be discovered. Based on 5000 unseen titles, the recall of title terms was 83%, similar to the manual identification of title terms. The precision of finding title terms is low (i.e., 32%) because some non-title but title-like terms should have been identified as well. Seven subjects were asked to rate, on a scale of between 1 and 5, whether the identified term is a topical/thematic/title term. If a rating of 2.5 is used to determine whether a term is judged to be a "title-like" term, then the mean precision is increased to 58%, or the headline/title is expanded with twice the average number of terms. Since this precision (i.e., 58%) is similar to the mean precision of manually identified title terms averaged across different subjects, we conclude that the discovery of title-like terms using classifiers is a promising approach.
| Original language | English |
|---|---|
| Pages (from-to) | 789-800 |
| Number of pages | 12 |
| Journal | Information Processing and Management |
| Volume | 41 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - 1 Jul 2005 |
Keywords
- Classification
- Induction
- Term extraction
ASJC Scopus subject areas
- Information Systems
- Media Technology
- Computer Science Applications
- Management Science and Operations Research
- Library and Information Sciences
Fingerprint
Dive into the research topics of 'Discovering "title-like" terms'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver