Releasing the Capacity of GANs in Non-Autoregressive Image Captioning

Da Ren, Qing Li

Research output: Chapter in book / Conference proceeding > Conference article published in proceeding or book > Academic research > peer-review

Abstract

Building non-autoregressive (NAR) models for image captioning can fundamentally address the high inference latency of autoregressive models. However, existing NAR image captioning models are trained with maximum likelihood estimation and suffer from the inherent multi-modality problem. Although constructing NAR models based on GANs can theoretically tackle this problem, existing GAN-based NAR models perform poorly when transferred to image captioning because they cannot model the complicated relations between images and text. To tackle this problem, we propose an Adversarial Non-autoregressive Transformer for Image Captioning (CaptionANT), which improves performance in two ways: 1) modifying the model structure to be compatible with contrastive learning, so that unpaired samples are used effectively; 2) integrating a reconstruction process to make better use of paired samples. Combined with other effective techniques and our proposed lightweight structure, CaptionANT better aligns input images with output text, and thus achieves new state-of-the-art performance for fully NAR models on the challenging MSCOCO dataset. More importantly, CaptionANT achieves a 26.72× speedup over the autoregressive baseline with only 36.3% of the parameters of the existing best fully NAR model for image captioning.

Original language: English
Title of host publication: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: European Language Resources Association (ELRA)
Pages: 13906-13918
Number of pages: 13
ISBN (Electronic): 9782493814104
Publication status: Published - 2024
Event: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy
Duration: 20 May 2024 → 25 May 2024

Publication series

Name: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings

Conference

Conference: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
Country/Territory: Italy
City: Hybrid, Torino
Period: 20/05/24 → 25/05/24

Keywords

  • GANs
  • Image Captioning
  • Non-Autoregressive Models

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computational Theory and Mathematics
  • Computer Science Applications
