Two-stage fine-tuning with ChatGPT data augmentation for learning class-imbalanced data

Taha ValizadehAslani, Yiwen Shi, Jing Wang, Ping Ren, Yi Zhang, Meng Hu, Liang Zhao, Hualou Liang (Corresponding Author)

Research output: Journal article publicationJournal articleAcademic researchpeer-review

12 Citations (Scopus)

Abstract

Classification of long-tailed distributed data is a challenging problem, which suffers from serious class imbalance and hence poor performance on tail classes, which have only a few samples. Owing to this paucity of samples, learning on the tail classes is especially challenging for fine-tuning when transferring a pretrained model to a downstream task. In this work, we present a simple modification of standard fine-tuning to cope with these challenges. Specifically, we propose a two-stage fine-tuning. In Stage 1, we fine-tune the final layer of the pretrained model with class-balanced augmented data, generated using ChatGPT. As a large generative language model, ChatGPT is capable of generating novel and contextually similar responses to a given prompt, which makes it an excellent candidate for data augmentation. In Stage 2, we perform the standard fine-tuning. Our modification has several benefits: (1) it leverages pretrained representations by only fine-tuning a small portion of the model parameters while keeping the rest untouched; (2) it allows the model to learn an initial representation of the specific task; and importantly (3) it protects the learning of tail classes from being at a disadvantage during the model updating. We conduct extensive experiments on synthetic datasets of both two-class and multi-class tasks of text classification as well as a real-world application to ADME (i.e., absorption, distribution, metabolism, and excretion) semantic drug labeling. The experimental results show that the proposed two-stage fine-tuning outperforms vanilla fine-tuning and state-of-the-art methods on the above datasets.

Original languageEnglish
Article number127801
JournalNeurocomputing
Volume592
DOIs
Publication statusPublished - 9 May 2024
Externally publishedYes

Keywords

  • Imbalanced data
  • Machine learning
  • Natural language processing
  • Data augmentation
  • Reweighting
  • BERT

ASJC Scopus subject areas

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'Two-stage fine-tuning with ChatGPT data augmentation for learning class-imbalanced data'. Together they form a unique fingerprint.

Cite this