CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese

Research output: Chapter in book / Conference proceeding; Conference article published in proceeding or book; Academic research; peer-review

Abstract

The prediction of lexical complexity in context is becoming increasingly relevant in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and a limited number of Western languages. In this paper, we introduce CompLex-ZH, a dataset of Chinese words annotated with complexity scores in their sentential contexts. Our data include sentences in Mandarin and Cantonese, selected from a variety of sources and textual genres. We provide a first evaluation with baselines that combine hand-crafted features and features based on language models.

Original language: English
Title of host publication: TSAR 2024 - 3rd Workshop on Text Simplification, Accessibility and Readability, Proceedings of the Workshop
Editors: Matthew Shardlow, Horacio Saggion, Fernando Alva-Manchego, Marcos Zampieri, Kai North, Sanja Stajner, Regina Stodden
Publisher: Association for Computational Linguistics (ACL)
Pages: 20-26
Number of pages: 7
ISBN (Electronic): 9798891761766
Publication status: Published - Nov 2024
Event: 3rd Workshop on Text Simplification, Accessibility and Readability, TSAR 2024 - Miami, United States
Duration: 15 Nov 2024 → …

Publication series

Name: TSAR 2024 - 3rd Workshop on Text Simplification, Accessibility and Readability, Proceedings of the Workshop

Conference

Conference: 3rd Workshop on Text Simplification, Accessibility and Readability, TSAR 2024
Country/Territory: United States
City: Miami
Period: 15/11/24 → …

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
