TY - GEN
T1 - Residual Attention Single-Head Vision Transformer Network for Rolling Bearing Fault Diagnosis in Noisy Environments
AU - Lai, Songjiang
AU - Cheung, Tsun Hin
AU - Zhao, Jiayi
AU - Xue, Kaiwen
AU - Fung, Ka Chun
AU - Lam, Kin Man
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/2/27
Y1 - 2025/2/27
AB - Rolling bearings are critical components in modern industrial machinery, significantly impacting the performance, longevity, and safety of equipment. Due to harsh operating conditions, such as high speeds and temperatures, rolling bearings are prone to malfunctions, leading to equipment downtime, economic losses, and safety risks. In this paper, the Residual Attention Single-Head Vision Transformer Network (RA-SHViT-Net) is proposed for fault diagnosis in rolling bearings. The vibration signal collected from rolling bearings is first transformed from the time domain to the frequency domain using the Fast Fourier Transform (FFT). The RA-SHViT-Net model then leverages the Single-Head Vision Transformer (SHViT), which is adept at capturing local and global features from time-series signals. SHViT also offers a state-of-the-art balance between computational complexity and prediction accuracy, and it has been demonstrated to achieve promising results in the field of computer vision. To enhance feature extraction, we introduce an Adaptive Hybrid Attention Block (AHAB) that combines channel and spatial attention mechanisms. The core building block of the RA-SHViT-Net is the Residual Attention Single-Head Vision Transformer Block, which consists of a Depthwise Convolution (DWConv) layer, a Single-Head Self-Attention (SHSA) layer, a Residual Feed-Forward-Network (Res-FFN) and an Adaptive Hybrid Attention Block (AHAB). This architecture is designed to comprehensively extract vibration signal features by considering the interdependencies among feature channels and spatial information based on the excellent feature extraction capabilities of SHViT. Additionally, each Single-Head Vision Transformer Block incorporates a Residual Feed-Forward-Network (Res-FFN) module, which uses residual connections to mitigate the vanishing gradient problem, enabling stable and efficient training of deep models. This design enhances the model's ability to learn complex representations and improves its generalization capabilities. The proposed RA-SHViT-Net was evaluated using the Case Western Reserve University (CWRU) dataset and the Paderborn University dataset. The results demonstrate that the RA-SHViT-Net outperforms state-of-the-art methods in terms of accuracy and robustness, particularly in scenarios involving complex and noisy environments. In addition, we designed multiple ablation studies to investigate the impact of different modules on the network's prediction performance. Overall, the RA-SHViT-Net provides a powerful tool for the early detection and classification of bearing faults, contributing to more reliable and efficient maintenance strategies in industrial applications.
KW - attention mechanism
KW - Fast Fourier Transform (FFT)
KW - fault diagnosis
KW - noisy environments
KW - rolling bearings
KW - Vision Transformer
UR - https://www.scopus.com/pages/publications/105001566320
U2 - 10.1145/3708568.3708591
DO - 10.1145/3708568.3708591
M3 - Conference article published in proceeding or book
AN - SCOPUS:105001566320
T3 - Proceedings of the 2024 6th International Conference on Video, Signal and Image Processing, VSIP 2024
SP - 136
EP - 150
BT - Proceedings of the 2024 6th International Conference on Video, Signal and Image Processing, VSIP 2024
PB - Association for Computing Machinery, Inc
T2 - 6th International Conference on Video, Signal and Image Processing, VSIP 2024
Y2 - 22 November 2024 through 24 November 2024
ER -