Abstract
Despite the remarkable success of deep multi-modal learning in practice, it has not been well explained in theory. Recently, it has been observed that the best uni-modal network can outperform the jointly trained multi-modal network, which is counter-intuitive since multiple signals generally bring more information (Wang et al., 2020). This work provides a theoretical explanation for the emergence of such a performance gap in neural networks under the prevalent joint training framework. Based on a simplified data distribution that captures realistic properties of multi-modal data, we prove that for a multi-modal late-fusion network with (smoothed) ReLU activation trained jointly by gradient descent, different modalities compete with each other, and the encoder networks learn only a subset of the modalities. We refer to this phenomenon as modality competition. The losing modalities, whose features fail to be discovered, are the origin of the sub-optimality of joint training. Experimentally, we illustrate that modality competition matches the intrinsic behavior of late-fusion joint training.
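For readers unfamiliar with the setup analysed in the abstract, the sketch below shows what "late-fusion joint training" means in code: each modality gets its own ReLU encoder, the encoder outputs are fused only at the classification head, and both encoders are updated together on a single joint loss. This is a minimal illustrative sketch; the layer widths, class names, and random data are assumptions and not the models used in the paper.

```python
# Minimal sketch of late-fusion joint training for two modalities.
# All sizes and names here are illustrative assumptions, not the paper's models.
import torch
import torch.nn as nn


class LateFusionNet(nn.Module):
    """Each modality has its own encoder; features are fused only at the head."""

    def __init__(self, dim_a: int, dim_b: int, hidden: int = 64, num_classes: int = 10):
        super().__init__()
        # Per-modality encoders with ReLU activations (the paper analyses a smoothed ReLU).
        self.encoder_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.encoder_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        # Late fusion: concatenate encoder outputs, then apply a joint classifier head.
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.encoder_a(x_a), self.encoder_b(x_b)], dim=-1)
        return self.head(z)


def joint_train_step(model, optimizer, x_a, x_b, y):
    """One gradient step on the joint loss; both encoders are updated together,
    which is the regime in which modality competition is shown to arise."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x_a, x_b), y)
    loss.backward()
    optimizer.step()
    return loss.item()


# Illustrative usage with random data.
model = LateFusionNet(dim_a=32, dim_b=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x_a, x_b = torch.randn(8, 32), torch.randn(8, 32)
y = torch.randint(0, 10, (8,))
joint_train_step(model, optimizer, x_a, x_b, y)
```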
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 9226-9259 |
| Number of pages | 34 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 162 |
| Publication status | Published - Mar 2022 |
| Externally published | Yes |
| Event | 39th International Conference on Machine Learning, ICML 2022 - Baltimore, United States |
| Event duration | 17 Jul 2022 → 23 Jul 2022 |
ASJC Scopus subject areas
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability