TY - GEN
T1 - Dual Parameter-Efficient Fine-Tuning for Speaker Representation Via Speaker Prompt Tuning and Adapters
AU - Li, Zhe
AU - Mak, Man-Wai
AU - Meng, Helen Mei-Ling
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/4
Y1 - 2024/4
N2 - Fine-tuning a pre-trained Transformer model (PTM) for speech applications in a parameter-efficient manner offers the dual benefits of reducing memory and leveraging the rich feature representations in massive unlabeled datasets. However, existing parameter-efficient fine-tuning approaches either adapt the classification head or the whole PTM. The former is unsuitable when the PTM is used as a feature extractor, and the latter does not leverage the different degrees of feature abstraction at different Transformer layers. We propose two solutions to address these limitations. First, we apply speaker prompt tuning to update the task-specific embeddings of a PTM. The tuning enhances speaker feature relevance in the speaker embeddings through the cross-attention between prompt and speaker features. Second, we insert adapter blocks into the Transformer encoders and their outputs. This novel arrangement enables the fine-tuned PTM to determine the most suitable layers to extract relevant information for the downstream task. Extensive speaker verification experiments on VoxCeleb and CU-MARVEL demonstrate that the proposed methods achieve higher parameter efficiency and better model adaptability than existing approaches.
AB - Fine-tuning a pre-trained Transformer model (PTM) for speech applications in a parameter-efficient manner offers the dual benefits of reducing memory and leveraging the rich feature representations in massive unlabeled datasets. However, existing parameter-efficient fine-tuning approaches either adapt the classification head or the whole PTM. The former is unsuitable when the PTM is used as a feature extractor, and the latter does not leverage the different degrees of feature abstraction at different Transformer layers. We propose two solutions to address these limitations. First, we apply speaker prompt tuning to update the task-specific embeddings of a PTM. The tuning enhances speaker feature relevance in the speaker embeddings through the cross-attention between prompt and speaker features. Second, we insert adapter blocks into the Transformer encoders and their outputs. This novel arrangement enables the fine-tuned PTM to determine the most suitable layers to extract relevant information for the downstream task. Extensive speaker verification experiments on VoxCeleb and CU-MARVEL demonstrate that the proposed methods achieve higher parameter efficiency and better model adaptability than existing approaches.
KW - Speaker verification
KW - Transformer adapter
KW - parameter-efficient tuning
KW - pre-trained Transformer
KW - prompt tuning
UR - http://www.scopus.com/inward/record.url?scp=85195367346&partnerID=8YFLogxK
U2 - 10.1109/ICASSP48485.2024.10447795
DO - 10.1109/ICASSP48485.2024.10447795
M3 - Conference article published in proceeding or book
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 10751
EP - 10755
BT - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024 - Proceedings
ER -