Abstract
Scene Graph Generation (SGG) is a typical computer vision task that detects objects and the corresponding predicates in an image. Existing SGG methods focus on modeling visual contexts to generate scene graphs and are evaluated on well-annotated datasets with high-quality images. However, image quality is not guaranteed in social media posts: some images may be incomplete or occluded, and hence might not provide sufficient visual context for SGG. As a result, previous methods might miss visual relationships or detect false ones because visual context is lacking. To effectively generate scene graphs in social media, we study multimodal scene graph generation (MSG) in this paper. MSG aims to develop visual scene graphs from images in social media posts with the support of the accompanying text sentences. However, leveraging textual content through simple multimodal alignment, such as object-level alignment, neglects the inherent pair-wise mapping between multimodal object pairs. To address this limitation, we propose a method named Deep pair-wise Relation Alignment for Knowledge-Enhanced (DRAKE) multimodal scene graph generation. The model supplements the missing visual contexts with well-aligned textual knowledge. It first encodes the textual information into an object-aware knowledge representation with the help of visual data. Furthermore, DRAKE facilitates information interaction between multimodal pair-wise representations. A multimodal context enhancement layer is devised to help the model generate the scene graph. To evaluate SGG performance on social media images, we propose a social media SGG dataset called MSG. We comprehensively analyze the effectiveness of our proposed method on the MSG dataset. The experimental results on the MSG dataset indicate that our model outperforms previous methods.
To fairly compare our method with other SGG models, we also conduct experiments on the Visual Genome dataset for further analysis. The MSG dataset is released at https://github.com/FuZe4ever/MSG.
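The distinction the abstract draws between object-level and pair-wise alignment can be illustrated with a toy sketch. This is not the DRAKE model: the function names `pair_features` and `align_pairs`, the two-dimensional embeddings, and the use of concatenation plus cosine similarity are all assumptions chosen only to show what matching ordered (subject, object) pairs across modalities might look like.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_features(embeddings):
    """Build one feature per ordered (subject, object) pair.

    Concatenation is an illustrative assumption, not the paper's encoder.
    """
    return {
        (s, o): embeddings[s] + embeddings[o]  # list concatenation
        for s, o in permutations(embeddings, 2)
    }


def align_pairs(visual, textual):
    """For each visual object pair, pick the most similar textual pair."""
    v_pairs = pair_features(visual)
    t_pairs = pair_features(textual)
    return {
        vp: max(t_pairs, key=lambda tp: cosine(vf, t_pairs[tp]))
        for vp, vf in v_pairs.items()
    }


# Made-up embeddings: two detected regions and two entities from the post text.
visual = {"person_box": [1.0, 0.0], "dog_box": [0.0, 1.0]}
textual = {"man": [0.9, 0.1], "puppy": [0.1, 0.9]}

alignment = align_pairs(visual, textual)
for visual_pair, text_pair in alignment.items():
    print(visual_pair, "->", text_pair)
```

Because the pairs are ordered, the sketch keeps the subject and object roles apart, which an alignment of individual objects alone would not capture.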
Original language | English |
---|---|
Pages (from-to) | 3199-3213 |
Number of pages | 15 |
Journal | IEEE Transactions on Circuits and Systems for Video Technology |
Volume | 33 |
Issue number | 7 |
DOIs | |
Publication status | Published - 1 Jul 2023 |
Keywords
- knowledge enhancement
- pair-wise alignment
- scene graph generation
- social media posts
ASJC Scopus subject areas
- Media Technology
- Electrical and Electronic Engineering