TY - GEN
T1 - MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in Conversation
AU - Huang, Zilong
AU - Mak, Man Wai
AU - Lee, Kong Aik
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024/9
Y1 - 2024/9
N2 - Emotion Recognition in Conversation (ERC) has great prospects in human-computer interaction and medical consultation. Existing ERC approaches mainly focus on information in the text and speech modalities and often concatenate multimodal features without considering the richness of emotional information in individual modalities. To address this issue, we propose a multimodal network called MM-NodeFormer for ERC. The network leverages the characteristics of different Transformer encoding stages to fuse the emotional features from the text, audio, and visual modalities according to their emotional richness. It treats text as the main modality and audio and visual as auxiliary modalities, leveraging the complementarity between the main and auxiliary modalities. We conducted extensive experiments on two public benchmark datasets, IEMOCAP and MELD, achieving accuracies of 74.24% and 67.86%, respectively, significantly higher than those of many state-of-the-art approaches.
AB - Emotion Recognition in Conversation (ERC) has great prospects in human-computer interaction and medical consultation. Existing ERC approaches mainly focus on information in the text and speech modalities and often concatenate multimodal features without considering the richness of emotional information in individual modalities. To address this issue, we propose a multimodal network called MM-NodeFormer for ERC. The network leverages the characteristics of different Transformer encoding stages to fuse the emotional features from the text, audio, and visual modalities according to their emotional richness. It treats text as the main modality and audio and visual as auxiliary modalities, leveraging the complementarity between the main and auxiliary modalities. We conducted extensive experiments on two public benchmark datasets, IEMOCAP and MELD, achieving accuracies of 74.24% and 67.86%, respectively, significantly higher than those of many state-of-the-art approaches.
KW - emotion recognition in conversation
KW - feature fusion
KW - multimodal network
UR - https://www.scopus.com/pages/publications/85214797293
U2 - 10.21437/Interspeech.2024-538
DO - 10.21437/Interspeech.2024-538
M3 - Conference article published in proceeding or book
AN - SCOPUS:85214797293
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4069
EP - 4073
LA - English
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -