Abstract
In this paper, we propose a method called LSAI (learning semantic alignment from image) to recover corrupted image patches for text-guided image inpainting. First, a multimodal preliminary (MP) module is designed to effectively encode global features of images and textual descriptions, where every local image patch and word is taken into account via multi-head self-attention. Second, non-Euclidean semantic relations between images and textual descriptions are captured with a graph structure by building a semantic relation graph (SRG). By aggregating semantic relations with graph convolution, the constructed SRG highlights meaningful words that describe the image content and alleviates the impact of distracting words. In addition, a text-image matching loss is devised to penalize restored images whose visual semantics diverge from the textual description. Quantitative and qualitative experiments on two public datasets show that the proposed LSAI outperforms existing methods (e.g., the FID value is reduced from 30.87 to 16.73 on the CUB-200-2011 dataset).
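The abstract states that the SRG aggregates semantic relations with graph convolution but does not give the layer definition. As a minimal sketch, the snippet below applies one standard symmetric-normalized graph-convolution step (in the style of Kipf and Welling) to hypothetical word/patch node features; the node count, feature sizes, and adjacency here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    H: (N, F) node features (e.g., word/patch embeddings) -- assumed shapes.
    A: (N, N) symmetric adjacency encoding semantic relations.
    W: (F, F') learnable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))        # degree normalization
    A_norm = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)               # ReLU activation

# Illustrative run: 5 nodes with 8-d features, projected to 4 dimensions.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
A = (rng.random((5, 5)) > 0.5).astype(float)
A = np.maximum(A, A.T)                                   # make relations symmetric
W = rng.standard_normal((8, 4))
out = gcn_layer(H, A, W)
print(out.shape)  # (5, 4)
```

Stacking such layers lets each word node absorb features from semantically related image-patch nodes, which is one plausible way distracting words could be down-weighted relative to words that match the visible content.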
Original language | English |
---|---|
Pages (from-to) | 3149-3161 |
Number of pages | 13 |
Journal | Visual Computer |
Volume | 38 |
Issue number | 9-10 |
DOIs | |
Publication status | Published - Sept 2022 |
Keywords
- Generative adversarial networks
- Graph convolution
- Text-guided image inpainting
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition
- Computer Graphics and Computer-Aided Design