
Extracting Building Footprint From Remote Sensing Images by an Enhanced Vision Transformer Network

  • Hua Zhang
  • Hu Dou
  • Zelang Miao
  • Nanshan Zheng
  • Ming Hao
  • Wenzhong Shi

Research output: Journal article publication, Journal article, Academic research, peer-review

Abstract

Automatic extraction of building footprints from images is a key means of obtaining building footprint data. However, because buildings vary in appearance and scale and have intricate structures, the task remains challenging. Recently, the vision transformer (ViT) has shown significant promise in semantic segmentation, thanks to its ability to capture long-range dependencies efficiently. This article employs the ViT for extracting building footprints. However, ViT often has two limitations: high computational cost and insufficient preservation of local details during feature extraction. To address these challenges, a network based on an enhanced ViT (EViT) is proposed. In this network, one convolutional neural network (CNN)-based branch is introduced to extract comprehensive spatial details. Another branch, consisting of several multiscale enhanced ViT (EV) blocks, is developed to capture global dependencies. Subsequently, a multiscale and enhanced boundary feature extraction block is developed to fuse global dependencies with local details and to enhance boundary features, yielding multiscale global-local contextual information with enhanced boundary features. Specifically, we present a window-based cascaded multihead self-attention (W-CMSA) mechanism, characterized by linear complexity in relation to the window size, which not only reduces computational costs but also enhances attention diversity. EViT is evaluated against other state-of-the-art (SOTA) approaches on three benchmark datasets. The results show that EViT performs well in extracting building footprints and surpasses the SOTA approaches. Specifically, it achieved 82.45%, 91.76%, and 77.14% IoU on the SpaceNet, WHU, and Massachusetts datasets, respectively. The implementation of EViT is available at https://github.com/dh609/EViT.
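
The abstract reports results as IoU (intersection over union), the standard overlap measure for segmentation masks: the number of pixels labeled as building in both the prediction and the reference, divided by the number labeled as building in either. The following is a minimal sketch of that measure for a single pair of binary masks, not the evaluation code from the EViT repository. The function name `binary_iou`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; how the reported scores are aggregated over a dataset is not stated in the abstract.

```python
def binary_iou(pred, target):
    """Intersection over union of two binary masks.

    Both masks are equal-shaped nested lists of 0/1 values (an assumed,
    illustrative representation), where 1 marks a building pixel.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    intersection = 0  # pixels marked as building in both masks
    union = 0         # pixels marked as building in at least one mask
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            union += p or t

    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


# Toy 2x3 example: 2 pixels overlap, 4 pixels are marked in at least one mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(binary_iou(pred, target))  # 0.5
```

On this scale, the 82.45% reported for SpaceNet corresponds to a value of 0.8245.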
Original language: English
Number of pages: 14
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 62
Issue number: 5406814
Publication status: Published - 1 Jul 2024
