Abstract
The goal of unpaired image captioning (UIC) is to describe images without using image-caption pairs in the training phase. Although the task is challenging, we expect that it can be accomplished by leveraging images aligned with visual concepts. Most existing studies use off-the-shelf algorithms to obtain the visual concepts because the bounding box (BBox) labels or relationship-triplet labels used for training are expensive to acquire. To avoid exhaustive annotation, we propose a novel approach to cost-effective UIC. Specifically, we adopt image-level labels to optimize the UIC model in a weakly-supervised manner. For each image, we assume that only the image-level labels are available, without object locations or counts. The image-level labels are used to train a weakly-supervised object recognition model that extracts object information (e.g., instances), and the extracted instances are used to infer the relationships among objects with an enhanced graph neural network (GNN). Without expensive annotations, the proposed approach achieves performance comparable to or even better than that of previous methods. Furthermore, we design an unrecognized object (UnO) loss to improve the alignment of the inferred object and relationship information with the images. This loss effectively alleviates the tendency of existing UIC models to generate sentences containing nonexistent objects. To the best of our knowledge, this is the first attempt to address weakly-supervised visual concept recognition for UIC (WS-UIC) based only on image-level labels. Extensive experiments demonstrate that the proposed method achieves promising results on the COCO dataset while significantly reducing the labeling cost.
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 6702-6716 |
Number of pages | 15 |
Journal | IEEE Transactions on Multimedia |
Volume | 25 |
Publication status | Published - Oct 2022 |
Keywords
- Graph neural network
- Unpaired image captioning
- Weakly-supervised instance segmentation
ASJC Scopus subject areas
- Signal Processing
- Media Technology
- Computer Science Applications
- Electrical and Electronic Engineering