Abstract
Multimodal machine translation (MMT) aims to improve translation quality by taking the corresponding visual context as an additional input. Many recent studies in neural machine translation have attempted to obtain high-quality multimodal representations in the encoder or decoder via attention mechanisms. However, an attention mechanism does not always accurately identify the decisive input for each prediction, which leads to unsatisfactory multimodal information fusion. To address this, we propose an encoder-decoder (Enc-Dec) calibration method that automatically calibrates the fused image-text representation in the encoder and finds the decisive input for the translation in the decoder. We validate our model on the MMT dataset Multi30K. Experimental results show that our method significantly outperforms several recent baselines on both English-German and English-French translation tasks in terms of BLEU and METEOR.
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 3965-3973 |
Number of pages | 9 |
Journal | IEEE Transactions on Artificial Intelligence |
Volume | 5 |
Issue number | 8 |
Publication status | Published - Jan 2024 |
Keywords
- Encoder-decoder calibration
- multimodal fusion
- multimodal machine translation
- visual encoder
ASJC Scopus subject areas
- Computer Science Applications
- Artificial Intelligence