Abstract
Automatic speaker recognition is the task of identifying or verifying an individual's identity from samples of his/her voice using machine learning algorithms, without any human intervention. Some of the earliest approaches to using machines to aid in speaker recognition can be traced back to work at Bell Laboratories during World War II using the then newly developed speech spectrograms. Approaches to make speaker recognition fully automatic were first investigated in the 1960s using spectrogram template matching (with the unfortunate use of the term “voiceprint” as an analogy to a fingerprint). Research into new methods for automatic speaker recognition continued thereafter as computer processing and storage increased but, as in other areas of speech processing, published results used private databases and varying experimental methods and metrics, making a reliable comparison of techniques quite challenging. To address this deficiency, in 1996 the U.S. National Institute of Standards and Technology (NIST) began holding regular formal speaker recognition evaluations (SREs). These competitive evaluations provide common experimental frameworks and corpora for exploring promising new ideas in speaker recognition, as well as for measuring the performance of the latest speaker recognition technology. One of the hallmarks of the NIST SREs has been to continually evolve the evaluations to address more challenging data and conditions as the underlying technology improves. Today, with the explosion in compute resources, big data, and the resurrection of data-hungry modeling techniques such as artificial neural networks, NIST and other evaluations are focused on performance in more realistic “speakers in the wild” scenarios.
Two decades into systematic and open competitive evaluations, which have demonstrated speaker recognition as a reliable and testable technology for person authentication, and with its growing use in commercial and security applications, the time is right to assess the state of speaker recognition technology and evaluations.
This special issue brought together researchers from speaker recognition and related fields, with the aim of providing up-to-date papers on recent advances in evaluations, databases, implementation, algorithms, and theoretical perspectives on the state of the art in speaker recognition. It comprises six papers, each addressing various aspects of the same question: Two decades into Speaker Recognition Evaluation, are we there yet? The answer is both yes and no. It is apparent that speaker recognition technology has seen significant advancements in the past two decades. The issues of channel and domain mismatch have been addressed to a great extent, and state-of-the-art systems are increasingly capable of dealing with shorter enrollment and test segments. Moreover, we have witnessed the successful deployment of speaker recognition technology in commercial products, which has been made possible through effective techniques for data augmentation, speaker modeling (e.g., i-vectors and neural-network-based embeddings such as x-vectors), score normalization, as well as score calibration and fusion. Nevertheless, some open problems remain. The interplay between, and factorization of, spoken text and speaker characteristics remain challenging. In addition, topics such as end-to-end modeling for speaker recognition, audio-visual (or multimodal) speaker recognition using found data, as well as voice biometric security and privacy may continue to drive this field forward in the future.
Original language | English |
---|---|
Article number | 101058 |
Journal | Computer Speech and Language |
Volume | 61 |
DOIs | |
Publication status | Published - May 2020 |
ASJC Scopus subject areas
- Theoretical Computer Science
- Software
- Human-Computer Interaction