TY - GEN
T1 - Multimodal-Semantic Context-Aware Graph Neural Network for Group Activity Recognition
AU - Liu, Tianshan
AU - Zhao, Rui
AU - Lam, Kin-Man
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021/7
Y1 - 2021/7
N2 - Group activities in videos involve visual interaction contexts between actors in multiple modalities, as well as co-occurrence among individual action labels. However, most current group activity recognition methods either model actor-actor relations from the RGB modality alone or neglect the relationships among labels. To capture these rich visual and semantic contexts, we propose a multimodal-semantic context-aware graph neural network (MSCA-GNN). Specifically, we first build two visual sub-graphs based on the appearance cues and motion patterns extracted from the RGB and optical-flow modalities, respectively. Then, two attention-based aggregators are proposed to refine each node by gathering representations from other nodes and heterogeneous modalities. In addition, a semantic graph is constructed from linguistic embeddings to model label relationships. We employ a bi-directional mapping learning strategy to further integrate information from both the multimodal visual and semantic graphs. Experimental results on two group activity benchmarks demonstrate the effectiveness of the proposed method.
AB - Group activities in videos involve visual interaction contexts between actors in multiple modalities, as well as co-occurrence among individual action labels. However, most current group activity recognition methods either model actor-actor relations from the RGB modality alone or neglect the relationships among labels. To capture these rich visual and semantic contexts, we propose a multimodal-semantic context-aware graph neural network (MSCA-GNN). Specifically, we first build two visual sub-graphs based on the appearance cues and motion patterns extracted from the RGB and optical-flow modalities, respectively. Then, two attention-based aggregators are proposed to refine each node by gathering representations from other nodes and heterogeneous modalities. In addition, a semantic graph is constructed from linguistic embeddings to model label relationships. We employ a bi-directional mapping learning strategy to further integrate information from both the multimodal visual and semantic graphs. Experimental results on two group activity benchmarks demonstrate the effectiveness of the proposed method.
KW - Graph neural network
KW - Group activity recognition
KW - Multimodal-semantic context
UR - http://www.scopus.com/inward/record.url?scp=85126452589&partnerID=8YFLogxK
U2 - 10.1109/ICME51207.2021.9428377
DO - 10.1109/ICME51207.2021.9428377
M3 - Conference article published in proceedings or book
AN - SCOPUS:85126452589
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - 1
EP - 6
BT - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
PB - IEEE Computer Society
T2 - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Y2 - 5 July 2021 through 9 July 2021
ER -