TY - GEN
T1 - DeVLBert
T2 - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2021
AU - Zhang, Shengyu
AU - Jiang, Tan
AU - Wang, Tan
AU - Kuang, Kun
AU - Zhao, Zhou
AU - Zhu, Jianke
AU - Yu, Jin
AU - Yang, Hongxia
AU - Wu, Fei
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/6
Y1 - 2021/6
N2 - In this paper, we propose to investigate out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of the downstream data on which the pretrained model will be fine-tuned. Existing methods for this problem are purely likelihood-based, leading to spurious correlations that hurt the generalization ability when the model is transferred to out-of-domain downstream tasks. By spurious correlation, we mean that the conditional probability of one token (object or word) given another can be high (due to dataset biases) without a robust (causal) relationship between them. To mitigate such dataset biases, we propose a Deconfounded Visio-Linguistic Bert framework, abbreviated as DeVLBert, to perform intervention-based learning. We borrow the idea of the backdoor adjustment from the research field of causality and propose several neural-network-based architectures for Bert-style out-of-domain pretraining. The quantitative results on three downstream tasks, Image Retrieval (IR), Zero-shot IR, and Visual Question Answering, show the effectiveness of DeVLBert in boosting generalization ability.
AB - In this paper, we propose to investigate out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of the downstream data on which the pretrained model will be fine-tuned. Existing methods for this problem are purely likelihood-based, leading to spurious correlations that hurt the generalization ability when the model is transferred to out-of-domain downstream tasks. By spurious correlation, we mean that the conditional probability of one token (object or word) given another can be high (due to dataset biases) without a robust (causal) relationship between them. To mitigate such dataset biases, we propose a Deconfounded Visio-Linguistic Bert framework, abbreviated as DeVLBert, to perform intervention-based learning. We borrow the idea of the backdoor adjustment from the research field of causality and propose several neural-network-based architectures for Bert-style out-of-domain pretraining. The quantitative results on three downstream tasks, Image Retrieval (IR), Zero-shot IR, and Visual Question Answering, show the effectiveness of DeVLBert in boosting generalization ability.
UR - https://www.scopus.com/pages/publications/85116022342
U2 - 10.1109/CVPRW53098.2021.00191
DO - 10.1109/CVPRW53098.2021.00191
M3 - Conference article published in proceeding or book
AN - SCOPUS:85116022342
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 1744
EP - 1747
BT - Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2021
PB - IEEE Computer Society
Y2 - 19 June 2021 through 25 June 2021
ER -