TY - JOUR
T1 - Exploring Duality in Visual Question-Driven Top-Down Saliency
AU - He, Shengfeng
AU - Han, Chu
AU - Han, Guoqiang
AU - Qin, Jing
N1 - Funding Information:
Manuscript received January 29, 2019; revised April 24, 2019; accepted August 2, 2019. Date of publication September 2, 2019; date of current version July 7, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61472145 and Grant 61702194, in part by the Innovation and Technology Fund of Hong Kong under Project ITS/319/17, in part by the Special Fund of Science and Technology Research and Development on Application From Guangdong Province (SF-STRDA-GD) under Grant 2016B010127003, in part by the Guangzhou Key Industrial Technology Research fund under Grant 201802010036, and in part by the Guangdong Natural Science Foundation under Grant 2017A030312008. (Corresponding author: Chu Han.) S. He and G. Han are with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China (e-mail: [email protected]; [email protected]).
Publisher Copyright:
© 2012 IEEE.
PY - 2020/7
Y1 - 2020/7
N2 - Top-down, goal-driven visual saliency exerts a strong influence on the human visual system when performing visual tasks. Text generation tasks, such as visual question answering (VQA) and visual question generation (VQG), have intrinsic connections with top-down saliency, which is usually incorporated into both VQA and VQG in an unsupervised manner. However, the regions that humans choose to look at when answering questions have been shown to differ substantially from those produced by unsupervised attention models. In this brief, we aim to explore the intrinsic relationship between top-down saliency and text generation, and to determine whether an accurate saliency response benefits text generation. To this end, we propose a dual supervised network with dynamic parameter prediction. Dual supervision explicitly exploits the probabilistic correlation between the primal task (top-down saliency detection) and the dual task (text generation), while dynamic parameter prediction encodes the given text (i.e., a question or an answer) into the fully convolutional network. Extensive experiments show that the proposed top-down saliency method achieves the best correlation with human attention among various baselines. In addition, the proposed model can be guided by either questions or answers and outputs the counterpart. Furthermore, we show that incorporating human-like visual question saliency improves the performance of both answer and question generation.
AB - Top-down, goal-driven visual saliency exerts a strong influence on the human visual system when performing visual tasks. Text generation tasks, such as visual question answering (VQA) and visual question generation (VQG), have intrinsic connections with top-down saliency, which is usually incorporated into both VQA and VQG in an unsupervised manner. However, the regions that humans choose to look at when answering questions have been shown to differ substantially from those produced by unsupervised attention models. In this brief, we aim to explore the intrinsic relationship between top-down saliency and text generation, and to determine whether an accurate saliency response benefits text generation. To this end, we propose a dual supervised network with dynamic parameter prediction. Dual supervision explicitly exploits the probabilistic correlation between the primal task (top-down saliency detection) and the dual task (text generation), while dynamic parameter prediction encodes the given text (i.e., a question or an answer) into the fully convolutional network. Extensive experiments show that the proposed top-down saliency method achieves the best correlation with human attention among various baselines. In addition, the proposed model can be guided by either questions or answers and outputs the counterpart. Furthermore, we show that incorporating human-like visual question saliency improves the performance of both answer and question generation.
KW - Dual learning
KW - saliency
KW - visual question answering (VQA)
KW - visual question generation (VQG)
UR - http://www.scopus.com/inward/record.url?scp=85088093958&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2019.2933439
DO - 10.1109/TNNLS.2019.2933439
M3 - Journal article
C2 - 31484137
AN - SCOPUS:85088093958
SN - 2162-237X
VL - 31
SP - 2672
EP - 2679
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 7
M1 - 8822633
ER -