Cantonese Natural Language Processing in the Transformers Era: A Survey and Current Challenges

Rong Xiang, Emmanuele Chersoni (Corresponding Author), Yixia Li, Jing Li, Chu-Ren Huang, Yushan Pan, Yushi Li

Research output: Journal article publication › Journal article › Academic research › peer-review

1 Citation (Scopus)

Abstract

Despite being spoken by a large population worldwide, Cantonese is under-resourced in terms of data scale and diversity compared to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures. In this paper, we provide a comprehensive review of the existing resources and methodologies for Cantonese Natural Language Processing, covering recent progress in language understanding, text generation, and the development of language models. Finally, we discuss two aspects of the Cantonese language that could make it challenging even for state-of-the-art architectures: colloquialism and multilinguality.

Original language: English
Journal: Language Resources and Evaluation
Publication status: Published - 8 Jun 2024

Keywords

  • Cantonese
  • Code-switching
  • Evaluation resources
  • Multilingualism
  • NLP for social media

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Language and Linguistics
  • Linguistics and Language
