TY - GEN
T1 - Temporal Sentence Grounding with Temporally Global Textual Knowledge
AU - Cai, Chen
AU - Zhang, Runzhong
AU - Gao, Jianjun
AU - Wu, Kejun
AU - Yap, Kim Hui
AU - Wang, Yi
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/9
Y1 - 2024/9
AB - Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multimodal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
KW - contrastive learning
KW - domain gap
KW - pseudo-query
KW - Temporal sentence grounding
UR - https://www.scopus.com/pages/publications/85206570294
U2 - 10.1109/ICME57554.2024.10687646
DO - 10.1109/ICME57554.2024.10687646
M3 - Conference article published in proceeding or book
AN - SCOPUS:85206570294
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - 1
EP - 6
BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
PB - IEEE Computer Society
T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Y2 - 15 July 2024 through 19 July 2024
ER -