Abstract
This paper reports an innovative study of Chinese registers based on regression analysis of sentence length distributions and on text clustering. Although the end of a sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances form sentences (i.e., clauses). Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula $F = aL^{b}c^{L}$, where L is sentence/clause length and a, b, and c are the fitted parameters. Texts from different registers give rise to different fitted parameter values, which can therefore serve to differentiate these registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful text clustering results further confirm that the fitted parameters are reliable linguistic characteristics of different registers. In terms of linguistic theories, our study shows that it is just as effective to model sentence length in Chinese using sociological words (i.e., characters) as it is using linguistic words.
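The abstract does not say how the regression is carried out, so the following is only a minimal sketch of one possible approach. It assumes the model F = a * L^b * c^L is fitted by ordinary least squares after taking logarithms, which gives the linear form ln F = ln a + b ln L + L ln c. The function names `solve_linear` and `fit_length_distribution` are hypothetical, and the data are synthetic values generated for illustration, not results from the paper.

```python
import math


def solve_linear(A, y):
    """Solve the square system A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    # Build the augmented matrix [A | y] so row operations update both sides.
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_length_distribution(lengths, freqs):
    """Estimate (a, b, c) in F = a * L**b * c**L.

    Assumption: least squares on the log-transformed model
    ln F = ln a + b * ln L + L * ln c. All lengths and frequencies
    must be positive, so lengths with zero counts have to be dropped first.
    """
    rows = [[1.0, math.log(L), float(L)] for L in lengths]
    targets = [math.log(F) for F in freqs]
    # Normal equations (X^T X) beta = X^T y for the three coefficients.
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    ln_a, b, ln_c = solve_linear(XtX, Xty)
    return math.exp(ln_a), b, math.exp(ln_c)


# Synthetic example: frequencies generated from known parameters (not from the paper).
lengths = list(range(1, 41))
freqs = [50.0 * L ** 1.5 * 0.85 ** L for L in lengths]
a, b, c = fit_length_distribution(lengths, freqs)
print(a, b, c)
```

Under these assumptions, each text is summarised by its fitted triple (a, b, c), which could then be passed as a feature vector to a clustering routine. Fitting in log space weights relative rather than absolute errors, so a nonlinear least-squares fit on the raw frequencies may give somewhat different parameter values.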
Original language | English |
---|---|
Pages (from-to) | 1-37 |
Number of pages | 37 |
Journal | Corpus Linguistics and Linguistic Theory |
Volume | 15 |
Issue number | 1 |
Early online date | 30 Mar 2017 |
Publication status | Published - 1 May 2019 |
Keywords
- Chinese register
- Regression analysis
- Sentence length distribution
- Text clustering
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language