EAPT: Efficient attention pyramid transformer for image processing

Xiao Lin, Shuzhou Sun, Wei Huang, Bin Sheng, Ping Li, David Dagan Feng

Research output: Journal article publication > Journal article > Academic research > peer-review

281 Citations (Scopus)

Abstract

Recent transformer-based models, especially patch-based methods, have shown great potential in vision tasks. However, splitting the input features into fixed-size patches ignores the fact that visual elements often vary, and may therefore destroy semantic information. In addition, the vanilla patch-based transformer cannot guarantee information exchange between patches, which prevents the extraction of attention information with a global view. To circumvent these problems, we propose the Efficient Attention Pyramid Transformer (EAPT). More specifically, we first propose Deformable Attention, which learns an offset for each position in a patch. As a result, even with fixed-size patches, our method can still obtain non-fixed attention information that covers various visual elements. Then, we design the Encode-Decode Communication module (En-DeC module), which obtains communication information among all patches to produce more complete global attention information. Finally, we propose a position encoding designed specifically for vision transformers, which can be used for patches of any dimension and any length. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the effectiveness of the proposed model. Furthermore, we conduct rigorous ablation studies to evaluate the key components of the proposed structure.
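
The abstract says Deformable Attention "learns an offset for each position in patches". The following is a minimal sketch of that offset-sampling idea only, not the authors' implementation. It assumes a single-channel feature grid and bilinear interpolation at the shifted positions. The function names `bilinear_sample` and `deform_patch` are hypothetical, and the offsets are passed in by hand, whereas in a trained model they would presumably be predicted by a learned layer.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly sample a 2-D grid `feat` (list of rows) at fractional (y, x).

    Coordinates are clamped to the grid, so out-of-range offsets reuse border values.
    """
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def deform_patch(feat, top, left, size, offsets):
    """Gather a size x size patch whose sampling points are shifted by `offsets`.

    offsets[i][j] is a (dy, dx) pair for position (i, j) inside the patch.
    Assumption: offsets are given as plain inputs here, standing in for learned values.
    """
    return [
        [
            bilinear_sample(
                feat,
                top + i + offsets[i][j][0],
                left + j + offsets[i][j][1],
            )
            for j in range(size)
        ]
        for i in range(size)
    ]


# Toy 4 x 4 feature grid with value 4 * row + col.
feat = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

# Zero offsets reproduce the ordinary fixed-size patch.
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
fixed_patch = deform_patch(feat, 0, 0, 2, zero)

# Non-zero offsets let the same 2 x 2 patch read from shifted locations.
shift = [[(0.5, 0.5)] * 2 for _ in range(2)]
deformed_patch = deform_patch(feat, 0, 0, 2, shift)
```

With zero offsets the patch is the usual fixed grid cell. With non-zero offsets the same patch reads from positions outside its nominal footprint, which is the sense in which a fixed-size patch can still cover visual elements of varying extent.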

Original language: English
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Publication status: Accepted/In press - Oct 2021

Keywords

  • attention mechanism
  • classification
  • Convolutional neural networks
  • Costs
  • Encoding
  • Feature extraction
  • object detection
  • pyramid
  • semantic segmentation
  • Semantics
  • Task analysis
  • Transformer
  • Transformers

ASJC Scopus subject areas

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering
