A study on Chinese register characteristics based on regression analysis and text clustering

Renkui Hou, Chu Ren Huang, Hongchao Liu

Research output: Journal article publication > Journal article > Academic research > peer-review

9 Citations (Scopus)

Abstract

This paper reports an innovative Chinese register study based on regression analysis for sentence length distribution and text clustering. Although end of sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances form sentences (i.e., clauses). Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula $F = aL^{b}c^{L}$, where L is sentence/clause length. Texts from different registers give rise to different fitted values of the parameters, and hence can serve to differentiate these registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful text clustering results further prove that the parameters of the fitted results are reliable linguistic characteristics for different registers. In terms of linguistic theories, our study shows that it is just as effective to model sentence length in Chinese using sociological words (i.e., characters) as it is using linguistic words.
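The fitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the regression was carried out, so the sketch assumes an ordinary least-squares fit on the log-transformed model, ln F = ln a + b ln L + L ln c, which is linear in the three unknowns. The punctuation sets, the use of character counts as length, and all function names (`segment_lengths`, `length_frequencies`, `fit_length_model`) are illustrative assumptions.

```python
import re
from collections import Counter

import numpy as np

# Assumption: sentences end at the full-width period, question mark, or
# exclamation mark; clauses additionally end at common pause marks.
SENTENCE_END = "。？！"
PAUSE_MARKS = SENTENCE_END + "，、；："


def segment_lengths(text, delimiters):
    """Split text at any delimiter; return each segment's length in characters."""
    pieces = re.split("[" + re.escape(delimiters) + "]", text)
    return [len(p.strip()) for p in pieces if p.strip()]


def length_frequencies(lengths):
    """Return the observed lengths L (sorted) and their frequencies F."""
    counts = Counter(lengths)
    L = np.array(sorted(counts), dtype=float)
    F = np.array([counts[int(value)] for value in L], dtype=float)
    return L, F


def fit_length_model(L, F):
    """Fit F = a * L**b * c**L via least squares on the log-linear form.

    Assumption: log-linear least squares stands in for whatever regression
    procedure the paper actually uses. Requires every F > 0.
    """
    X = np.column_stack([np.ones_like(L), np.log(L), L])
    coef, *_ = np.linalg.lstsq(X, np.log(F), rcond=None)
    ln_a, b, ln_c = coef
    return np.exp(ln_a), b, np.exp(ln_c)


# Usage on synthetic frequencies generated from known parameters
# (made-up values, not data from the paper).
L_demo = np.arange(1, 41, dtype=float)
F_demo = 50.0 * L_demo**1.5 * 0.85**L_demo
a_hat, b_hat, c_hat = fit_length_model(L_demo, F_demo)
```

Under these assumptions, each text would be reduced to its fitted triple (a, b, c), once for sentences and once for clauses, and those triples would serve as the feature vector passed to a clustering routine.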

Original language: English
Pages (from-to): 1-37
Number of pages: 37
Journal: Corpus Linguistics and Linguistic Theory
Volume: 15
Issue number: 1
Publication status: Published - 1 May 2019

Keywords

  • Chinese register
  • Regression analysis
  • Sentence length distribution
  • Text clustering

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
