Abstract
This work explores the use of various Deep Neural Network (DNN) architectures for an end-to-end language identification (LID) task. Deep learning has been shown to significantly improve the state of the art in many domains, including speech recognition, computer vision and genomics. As an end-to-end system, deep learning removes the burden of hand-crafting feature extraction, which is the conventional approach in LID. This versatility is achieved by training a very deep network to learn distributed representations of speech features at multiple levels of abstraction. In this paper, we show that an end-to-end deep learning system can be used to recognize language from speech utterances of varying length. Our results show that a combination of three deep architectures (feed-forward, convolutional and recurrent networks) achieves the best performance compared to other network designs. Additionally, we compare our network's performance to a state-of-the-art BNF-based i-vector system on the NIST 2015 Language Recognition Evaluation corpus. Key to our approach is that we effectively address computational and regularization issues in the network structure, which allows us to build a deeper architecture than previous DNN approaches to the language recognition task.
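The combined architecture described in the abstract (a convolutional front end, a recurrent layer, and a feed-forward classifier) can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the layer sizes, the 40-dimensional input features, and the 8-language output are all assumptions chosen for the example.

```python
# Hypothetical sketch of a CNN + RNN + feed-forward LID stack; all
# hyperparameters (feature dim, channel widths, 8 target languages)
# are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class EndToEndLID(nn.Module):
    def __init__(self, n_features=40, n_languages=8):
        super().__init__()
        # Convolutional front end over the time axis of the feature sequence.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Recurrent layer summarises the utterance regardless of its length.
        self.rnn = nn.LSTM(128, 256, batch_first=True)
        # Feed-forward classifier on the final recurrent state.
        self.classifier = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, n_languages),
        )

    def forward(self, x):
        # x: (batch, time, n_features); Conv1d expects (batch, channels, time).
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        _, (h_n, _) = self.rnn(h)
        # Use the last hidden state as a fixed-length utterance embedding.
        return self.classifier(h_n[-1])

model = EndToEndLID()
# Two utterances of 300 frames each; any time length would work.
logits = model(torch.randn(2, 300, 40))
print(logits.shape)  # torch.Size([2, 8])
```

Because the LSTM collapses the frame sequence into a fixed-size state, the same network handles utterances of varying length, which is the property the abstract emphasises.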
Original language | English
---|---
Pages | 109-116
Number of pages | 8
DOIs | |
Publication status | Published - Jun 2016
Externally published | Yes
Event | Speaker and Language Recognition Workshop, Odyssey 2016 - Bilbao, Spain (21 Jun 2016 → 24 Jun 2016)
Conference
Conference | Speaker and Language Recognition Workshop, Odyssey 2016
---|---
Country/Territory | Spain
City | Bilbao
Period | 21/06/16 → 24/06/16
ASJC Scopus subject areas
- Signal Processing
- Software
- Human-Computer Interaction