Abstract
In the age of explosive information growth, short-video sharing on social networks has surged. Video streams are rarely presented without accompanying audio, so robust audio-visual quality assessment (AVQA), rather than traditional single-modality assessment, is needed to maintain high quality of service. Existing AVQA methods often extract audio and visual features separately and fuse them afterwards, which overlooks the interdependencies between audio and visual elements and impairs performance. Their effectiveness is further undermined by insufficient consideration of technical and aesthetic perspectives. In this work, we propose a novel AVQA model with a multi-branch architecture designed to capture deeper and more effective audio-visual quality features. Within the model, the audio-visual quality convergence branch incorporates a newly proposed Cross-Dimensional Audio-Visual Fusion (CDAVF) module, which extracts conjoint quality features through a cross-fusion process across multi-dimensional layers. The technical branch assesses technical distortions for precise AVQA, while the aesthetic branch evaluates content awareness and semantic understanding, both crucial to viewer satisfaction and human perception. Building on the multi-branch framework, we further introduce a novel Cross Tri-Fusion Attention (CTFA) module for late-stage fusion. This module fuses and applies attention to audio-visual quality features across branches, enabling the model to capture quality features comprehensively for accurate quality prediction. We conducted experiments on professionally generated content (PGC) and user-generated content (UGC) databases to validate the effectiveness and robustness of our model. The results show that our model consistently outperforms existing state-of-the-art AVQA methods.
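The abstract does not specify how the CTFA module is built, so the following is only a minimal sketch of the general idea of late-stage attention fusion across three branches, not the authors' implementation. It assumes each branch (convergence, technical, aesthetic) has already produced a feature vector of equal length; the function name `tri_branch_attention`, the residual update, and the toy feature values are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def tri_branch_attention(features):
    """Hypothetical late-fusion sketch (not the paper's CTFA module).

    features: dict mapping branch name -> feature vector (equal lengths).
    Each branch acts as a query over the other two branches using scaled
    dot-product attention, then adds the attended context to its own
    features as a residual. Returns (fused, weights).
    """
    dim = len(next(iter(features.values())))
    scale = math.sqrt(dim)
    fused, weights = {}, {}
    for name, query in features.items():
        others = [n for n in features if n != name]
        scores = [dot(query, features[n]) / scale for n in others]
        attn = softmax(scores)
        context = [
            sum(w * features[n][i] for w, n in zip(attn, others))
            for i in range(dim)
        ]
        fused[name] = [q + c for q, c in zip(query, context)]
        weights[name] = dict(zip(others, attn))
    return fused, weights


# Toy branch features (made-up numbers, for illustration only).
branches = {
    "convergence": [0.2, 0.8, 0.1, 0.5],
    "technical": [0.9, 0.1, 0.4, 0.3],
    "aesthetic": [0.3, 0.7, 0.2, 0.6],
}
fused, weights = tri_branch_attention(branches)

# Concatenate the fused branch features into one joint vector that a
# quality regressor (not shown) could map to a predicted score.
joint = [value for name in branches for value in fused[name]]
```

In a trained model the queries, keys, and values would normally pass through learned projections, and the joint vector would feed a regression head; both are omitted here to keep the sketch self-contained.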
| Original language | English |
|---|---|
| Article number | 11435472 |
| Pages (from-to) | 1-16 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Publication status | Published - Mar 2026 |
Keywords
- cross tri-fusion attention
- cross-dimensional fusion
- multi-branch audio-visual quality assessment
ASJC Scopus subject areas
- Media Technology
- Electrical and Electronic Engineering
Title
Multi-Branch Aesthetic and Technical Perspectives with Cross Tri-Fusion Attention for No-Reference Audio-Visual Quality Assessment