Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora

Renkui Hou, Chu Ren Huang

Research output: Journal article publication › Journal article › Academic research › peer-review

3 Citations (Scopus)


This paper proposes a robust text classification and correspondence analysis approach to the identification of similar languages. In particular, we propose to use the readily available information on clause and word length distributions to model similar languages. The modeling and classification are based on the hypothesis that languages are self-adaptive complex systems and hence can be classified by dynamic features describing the system, especially in terms of distributional relations among the constituents of a system. For similar languages whose grammatical differences are often subtle, classification based on dynamic system features should be more effective. To test this hypothesis, we considered both regional and genre varieties of Mandarin Chinese for classification. The data are extracted from two comparable balanced corpora to minimize possible confounding factors. The two corpora are the Sinica Corpus from Taiwan and the Lancaster Corpus of Mandarin Chinese from Mainland China, and the two genres are reportage and review. Our text classification and correspondence analysis results show that the linguistically felicitous two-level constituency model combining power functions between words and clauses effectively classifies the two varieties of Chinese for both genres. In addition, we found that genres do have a compounding effect on the classification of regional varieties. In particular, reportage in the two varieties is more readily classified than review, corroborating the complex system view of language variations. That is, language variations and changes typically do not take place evenly across the board for the complete language system. This further supports our hypothesis that dynamic complex system features, such as the power functions captured by the Menzerath-Altmann law, provide effective models for classifying similar languages.
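The power functions mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used truncated form of the Menzerath-Altmann law, y = a * x^b, where x is the size of a construct (for example, clause length in words) and y is the mean size of its constituents (for example, word length in characters). The function name `fit_power_law` and the data are hypothetical and generated for illustration only; the abstract does not specify the authors' fitting procedure or feature set.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log-transformed data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # In log-log space the power function is a straight line:
    # log y = log a + b * log x
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(log_a)), float(b)

# Hypothetical, noise-free data (NOT taken from the paper):
# clause length in words vs. mean word length in characters.
clause_len = np.arange(1, 9)
mean_word_len = 2.0 * clause_len ** -0.15

a, b = fit_power_law(clause_len, mean_word_len)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Under these assumptions, the fitted pair (a, b) for each text could serve as a compact feature vector for a downstream classifier or for a correspondence analysis of texts against features.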

Original language: English
Pages (from-to): 613-640
Number of pages: 28
Journal: Natural Language Engineering
Issue number: 6
Publication status: Published - Nov 2020


Keywords

  • Comparable corpora
  • Complex adaptive system
  • Correspondence analysis
  • Similar languages
  • Text classification

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence

