TY - GEN
T1 - Deep Spectro-temporal Artifacts for Detecting Synthesized Speech
AU - Liu, Xiaohui
AU - Liu, Meng
AU - Zhang, Lin
AU - Zhang, Linjuan
AU - Li, Kai
AU - Li, Nan
AU - Lee, Kong Aik
AU - Wang, Longbiao
AU - Dang, Jianwu
N1 - Publisher Copyright:
© 2022 Association for Computing Machinery.
PY - 2022/10/14
Y1 - 2022/10/14
N2 - The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features. To address track 1, low-quality data augmentation, domain adaptation via fine-tuning, and fusion of various complementary feature information were aggregated in our system. Furthermore, we analyzed the clustering characteristics of subsystems with different features using a visualization method and explained the effectiveness of our proposed greedy fusion strategy. As for track 2, frame transition and smoothing were detected using a self-supervised learning structure to capture the manipulation of partially fake (PF) attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively.
AB - The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features. To address track 1, low-quality data augmentation, domain adaptation via fine-tuning, and fusion of various complementary feature information were aggregated in our system. Furthermore, we analyzed the clustering characteristics of subsystems with different features using a visualization method and explained the effectiveness of our proposed greedy fusion strategy. As for track 2, frame transition and smoothing were detected using a self-supervised learning structure to capture the manipulation of partially fake (PF) attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively.
KW - Audio Deep Synthesis Detection
KW - Domain Adaptation
KW - Frame transition
KW - Greedy Fusion
KW - Self-Supervised Learning
KW - Spectro-temporal
UR - http://www.scopus.com/inward/record.url?scp=85141606437&partnerID=8YFLogxK
U2 - 10.1145/3552466.3556527
DO - 10.1145/3552466.3556527
M3 - Conference article published in proceeding or book
AN - SCOPUS:85141606437
T3 - DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia
SP - 69
EP - 75
BT - DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia
PB - Association for Computing Machinery, Inc
T2 - 1st International Workshop on Deepfake Detection for Audio Multimedia, DDAM 2022
Y2 - 14 October 2022
ER -