Abstract
Image-guided Story Ending Generation aims to generate a reasonable and logical ending given a story context and an ending-related image. Existing models have achieved some success by fusing global image features with the story context through an attention mechanism. However, they ignore the logical relationship between the story context and the image regions, and they do not consider high-level semantic features of the image such as visual sentiment. As a result, the generated ending may be inconsistent with the logic or sentiment of the given information. In this paper, we propose a Multi-Granularity feature Fusion (MGF) model to solve this problem. Concretely, we first employ an image sentiment extractor to capture the sentiment features of the image as part of the global image features. We then design a scene subgraph selector that captures the image features of the key region by picking the scene subgraph most relevant to the context. Finally, we fuse the textual and visual features at the object level, the region level, and the global level. Our model is thereby capable of effectively capturing the key region features and visual sentiment of the image, so as to generate a more logical and sentimental ending. Experimental results show that our MGF model outperforms the state-of-the-art models on most metrics.
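The subgraph selection step described above can be pictured as a nearest-neighbor lookup: embed the story context, embed each candidate scene subgraph, and keep the candidate that scores highest. The following is a minimal sketch of that idea only, not the MGF implementation. The abstract does not state the scoring function, so cosine similarity is an assumption, and the names `cosine` and `select_subgraph` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_subgraph(context_vec, subgraph_vecs):
    """Return (index, score) of the subgraph embedding most similar to the context.

    Assumption: relevance is measured by cosine similarity; the paper may use
    a different, learned scoring function.
    """
    scores = [cosine(context_vec, s) for s in subgraph_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy embeddings, invented purely for illustration.
context = [1.0, 0.0, 1.0]
subgraphs = [
    [0.0, 1.0, 0.0],  # orthogonal to the context
    [1.0, 0.0, 0.9],  # nearly parallel to the context
    [0.5, 0.5, 0.0],  # partially aligned
]

best_index, best_score = select_subgraph(context, subgraphs)
print(best_index)  # prints 1
```

In a real system the vectors would come from learned text and scene-graph encoders, and the selected subgraph's features would feed the region-level fusion stage.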
Original language | English |
---|---|
Pages (from-to) | 3437-3449 |
Number of pages | 13 |
Journal | IEEE/ACM Transactions on Audio, Speech, and Language Processing |
Volume | 32 |
Publication status | Published - 2024 |
Keywords
- Image-guided story ending generation
- image sentiment
- multi-granularity feature fusion
- scene subgraph
- story ending generation
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Acoustics and Ultrasonics
- Computational Mathematics
- Electrical and Electronic Engineering