TY - GEN
T1 - Image2Audio: Facilitating Semi-supervised Audio Emotion Recognition with Facial Expression Image
AU - He, Gewen
AU - Liu, Xiaofeng
AU - Fan, Fangfang
AU - You, Jia
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/6
Y1 - 2020/6
AB - There are many publicly available labeled datasets for image-based facial expression recognition. How these images can help audio emotion recognition with limited labeled data, given their inherent correlations, is a meaningful and challenging question. In this paper, we propose a semi-supervised adversarial network that allows knowledge transfer from the labeled videos to the heterogeneous labeled audio domain, thereby enhancing audio emotion recognition performance. Specifically, face image samples are translated to spectrograms in a class-wise manner. To harness the translated samples in sparsely distributed areas and construct a tighter decision boundary, we propose to precisely estimate the density in feature space and incorporate the reliable low-density samples with an annealing scheme. Moreover, the unlabeled audio samples are collected along the high-density path in a graph representation. As a possible "recognition via generation" framework, we empirically demonstrate its effectiveness on several audio emotion recognition benchmarks.
UR - http://www.scopus.com/inward/record.url?scp=85090154430&partnerID=8YFLogxK
U2 - 10.1109/CVPRW50498.2020.00464
DO - 10.1109/CVPRW50498.2020.00464
M3 - Conference article published in proceeding or book
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 3978
EP - 3983
BT - Proceedings - 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2020
PB - IEEE Computer Society
T2 - 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Y2 - 14 June 2020 through 19 June 2020
ER -