Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)

  • Yu Huang
  • Junyang Lin
  • Chang Zhou
  • Hongxia Yang
  • Longbo Huang

Research output: Journal article publication; Conference article; Academic research; peer-review

101 Citations (Scopus)

Abstract

Despite the remarkable success of deep multimodal learning in practice, it has not been well explained in theory. Recently, it has been observed that the best uni-modal network outperforms the jointly trained multi-modal network, which is counter-intuitive since multiple signals generally bring more information (Wang et al., 2020). This work provides a theoretical explanation for the emergence of such a performance gap in neural networks under the prevalent joint training framework. Based on a simplified data distribution that captures the realistic property of multimodal data, we prove that for a multi-modal late-fusion network with (smoothed) ReLU activation trained jointly by gradient descent, different modalities compete with each other, and the encoder networks learn only a subset of the modalities. We refer to this phenomenon as modality competition. The losing modalities, which fail to be discovered, are the source of the sub-optimality of joint training. Experimentally, we illustrate that modality competition matches the intrinsic behavior of late-fusion joint training.
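
To make the architecture named in the abstract concrete, the following is a minimal sketch of a late-fusion forward pass with a smoothed ReLU. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the activation's exact form or the fusion rule, so the polynomial smoothing (parameters `q` and `beta`), fusion by summing per-modality outputs, and the function names `smoothed_relu`, `encoder`, and `late_fusion` are all hypothetical choices made here. Training by gradient descent is not shown.

```python
def smoothed_relu(z, q=3, beta=1.0):
    """Assumed smoothing: zero for z <= 0, polynomial on (0, beta], linear above beta.

    The two non-zero pieces both equal beta / q at z == beta, so the function is continuous.
    """
    if z <= 0:
        return 0.0
    if z <= beta:
        return z ** q / (q * beta ** (q - 1))
    return z - (1 - 1 / q) * beta


def encoder(weights, x):
    """One modality's encoder: a sum of smoothed-ReLU neurons applied to that modality's input."""
    return sum(
        smoothed_relu(sum(w_i * x_i for w_i, x_i in zip(w, x)))
        for w in weights
    )


def late_fusion(encoders, inputs):
    """Late fusion (assumed here to be a plain sum): encode each modality separately, then combine."""
    return sum(encoder(w, x) for w, x in zip(encoders, inputs))


# Two modalities, one neuron each (toy values chosen for illustration only).
encoders = [[[1.0, 0.0]], [[0.0, 1.0]]]
inputs = [[2.0, 0.0], [0.0, 0.5]]
output = late_fusion(encoders, inputs)
```

In this toy setting each modality contributes its own term to the output, so the per-modality contributions can be inspected separately; a contribution that stays near zero throughout training would correspond to what the abstract calls a losing modality.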

Original language: English
Pages (from-to): 9226-9259
Number of pages: 34
Journal: Proceedings of Machine Learning Research
Volume: 162
Publication status: Published - Mar 2022
Externally published: Yes
Event: 39th International Conference on Machine Learning, ICML 2022 - Baltimore, United States
Duration: 17 Jul 2022 to 23 Jul 2022

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
