Teaching Masked Autoencoder With Strong Augmentations

  • Rui Zhu
  • Yalong Bai
  • Ting Yao
  • Jingen Liu
  • Zhenglong Sun
  • Tao Mei
  • Chang Wen Chen

Research output: Journal article · Academic research · Peer-reviewed

1 Citation (Scopus)

Abstract

Masked autoencoder (MAE) has been regarded as a capable self-supervised learner for various downstream tasks. Nevertheless, the model still lacks high-level discriminability, which results in poor linear probing performance. Given that strong augmentation plays an essential role in contrastive learning, can we capitalize on strong augmentation in MAE? The difficulty originates from the pixel uncertainty caused by strong augmentation, which may affect the reconstruction; thus, directly introducing strong augmentation into MAE often hurts the performance. In this article, we delve into the potential of strongly augmented views to enhance MAE while maintaining MAE's advantages. To this end, we propose a simple yet effective masked Siamese autoencoder (MSA) model, which consists of a student branch and a teacher branch. The student branch inherits MAE's architecture, and the teacher branch treats the unmasked strong view as an exemplary teacher that imposes high-level discrimination onto the student branch. We demonstrate that our MSA improves the model's spatial perception capability and therefore globally favors inter-image discrimination. Empirical evidence shows that the model pretrained by MSA provides superior performance across different downstream tasks. Notably, linear probing on frozen features extracted from MSA yields a 6.1% gain over MAE on ImageNet-1k. Fine-tuning (FT) the network on the VQAv2 task achieves 67.4% accuracy, outperforming the supervised method DeiT by 1.6% and MAE by 1.2%.
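The abstract describes a student branch that, as in MAE, sees only a random subset of patch tokens, while a teacher branch processes the unmasked strongly augmented view and supplies a high-level target. The paper's code is not part of this record, so the following is only an illustrative NumPy sketch of the two mechanical pieces the abstract names — MAE-style random patch masking and a feature-alignment objective between the two branches. The function names (`random_masking`, `alignment_loss`), the mask ratio, and the use of a simple MSE on normalized features are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking (illustrative): keep a random
    subset of patch tokens and drop the rest.

    patches: (N, D) array of patch embeddings.
    Returns the kept patches and the sorted indices that were kept.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep = np.sort(rng.permutation(n)[:n_keep])
    return patches[keep], keep


def alignment_loss(student_feats, teacher_feats):
    """Toy student-teacher objective (assumed, not the paper's loss):
    MSE between L2-normalized feature vectors, so the student on the
    masked view is pulled toward the teacher on the unmasked strong view.
    """
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))
```

With a 0.75 mask ratio, 16 patch tokens reduce to 4 visible tokens for the student branch, matching the MAE regime in which the encoder only ever sees a small visible subset; identical student and teacher features give zero alignment loss.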

Original language: English
Pages (from-to): 9550-9564
Number of pages: 15
Journal: IEEE Transactions on Neural Networks and Learning Systems
Volume: 36
Issue number: 5
DOIs
Publication status: Published - May 2025

Keywords

  • Contrastive learning
  • data augmentation
  • deep learning
  • masked image modeling
  • self-supervised representation learning

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Artificial Intelligence
