TY - GEN
T1 - Text-guided Visual Prompt Tuning with Masked Images for Facial Expression Recognition
AU - Dong, Rongkang
AU - Yang, Cuixin
AU - Lam, Kin-Man
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/12
Y1 - 2024/12
N2 - Facial expression recognition (FER) has significantly advanced through the application of deep learning techniques for visual content classification. Recent research has explored the use of pre-trained language-image models, such as CLIP, which leverage natural language supervision to enhance image backbone training and facilitate the learning of general visual representations. Concurrently, visual prompt tuning has emerged as a method to minimize tuning overhead for downstream tasks by freezing the pre-trained backbone models and incorporating additional learnable parameters, known as visual prompts, into the model input. This strategy circumvents the need to update the entire neural network, focusing instead on optimizing visual prompts for specific tasks. In this study, we propose a novel tuning scheme, namely Text-guided Visual Prompt Tuning with Masked facial images (T-VPT-M), for both basic and compound FER. Our method utilizes natural language supervision for visual prompt learning and employs a random masking mechanism to adapt visual prompts to diverse informative facial regions. Experimental results on three real-world datasets, encompassing both basic and compound facial expressions, demonstrate the efficacy of the T-VPT-M scheme.
UR - http://www.scopus.com/inward/record.url?scp=85218190621&partnerID=8YFLogxK
U2 - 10.1109/APSIPAASC63619.2025.10849301
DO - 10.1109/APSIPAASC63619.2025.10849301
M3 - Conference article published in proceedings or book
AN - SCOPUS:85218190621
T3 - APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024
BT - APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024
Y2 - 3 December 2024 through 6 December 2024
ER -