TY - GEN
T1 - Anatomical Structure-Guided Medical Vision-Language Pre-training
AU - Li, Qingqiu
AU - Yan, Xiaohan
AU - Xu, Jilan
AU - Yuan, Runtian
AU - Zhang, Yuejie
AU - Feng, Rui
AU - Shen, Quanli
AU - Zhang, Xiaobo
AU - Wang, Shujun
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024/3/14
Y1 - 2024/3/14
N2 - Learning medical visual representations through vision-language pre-training has achieved remarkable progress. Despite the promising performance, it still faces challenges, i.e., local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets, and fully utilize each element as supervision to enhance representation learning. For anatomical regions, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, considering them as the minimum semantic units to explore fine-grained local alignment. For findings and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks, including five public benchmarks. Experimental results demonstrate that our method outperforms the state-of-the-art methods. Our code is available at https://asgmvlp.github.io.
AB - Learning medical visual representations through vision-language pre-training has achieved remarkable progress. Despite the promising performance, it still faces challenges, i.e., local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets, and fully utilize each element as supervision to enhance representation learning. For anatomical regions, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, considering them as the minimum semantic units to explore fine-grained local alignment. For findings and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks, including five public benchmarks. Experimental results demonstrate that our method outperforms the state-of-the-art methods. Our code is available at https://asgmvlp.github.io.
KW - Anatomical Structure
KW - Contrastive Learning
KW - Medical Vision-Language Pre-training
KW - Representation Learning
UR - https://www.scopus.com/pages/publications/105007676143
U2 - 10.1007/978-3-031-72120-5_8
DO - 10.1007/978-3-031-72120-5_8
M3 - Conference article published in proceeding or book
AN - SCOPUS:105007676143
SN - 9783031721199
T3 - Lecture Notes in Computer Science
SP - 80
EP - 90
BT - Medical Image Computing and Computer Assisted Intervention - MICCAI 2024 - 27th International Conference, Proceedings
A2 - Linguraru, Marius George
A2 - Feragen, Aasa
A2 - Glocker, Ben
A2 - Giannarou, Stamatia
A2 - Schnabel, Julia A.
A2 - Dou, Qi
A2 - Lekadir, Karim
PB - Springer Science and Business Media Deutschland GmbH
T2 - 27th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2024
Y2 - 6 October 2024 through 10 October 2024
ER -