Abstract
In this article, we propose an innovative and robust approach to stylometric analysis that requires no annotation and leverages lexical and sub-lexical information. In particular, we leverage phonological information, namely the tones and rimes of Mandarin Chinese, extracted automatically from unannotated texts. Texts from different authors were represented by tones, tone motifs, and word-length motifs, as well as rimes and rime motifs. Support vector machines and random forests were used to build the text classification models for authorship attribution. From the experimental results, we conclude that the combination of bigrams of rimes, word-final rimes, and segment-final rimes discriminates effectively between texts from different authors when random forests are used to build the classification model. This robust approach can in principle be applied to other languages with an established phonological inventory of onsets and rimes.
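To make the notion of rime bigrams concrete, the following is a minimal sketch, not the authors' pipeline. It assumes the text has already been transcribed into tone-numbered pinyin syllables (for example `zhong1`), and it splits each syllable orthographically using a hard-coded list of initials. The names `INITIALS`, `split_syllable`, `rime_bigrams`, and `tone_bigrams` are hypothetical and introduced only for illustration.

```python
from collections import Counter

# Pinyin initials used for a rough orthographic split (an assumption of this
# sketch, not a full phonological analysis). Two-letter initials come first so
# that "zh", "ch", "sh" are matched before "z", "c", "s". Treating "y" and "w"
# as initials is a simplification.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a tone-numbered pinyin syllable into (rime, tone)."""
    # A missing tone digit is treated as the neutral tone, written "5" here.
    tone = syllable[-1] if syllable[-1].isdigit() else "5"
    body = syllable.rstrip("012345")
    for initial in INITIALS:
        if body.startswith(initial):
            return body[len(initial):], tone
    return body, tone  # syllable without an initial, e.g. "ai4"


def rime_bigrams(syllables):
    """Count adjacent rime pairs in a sequence of syllables."""
    rimes = [split_syllable(s)[0] for s in syllables]
    return Counter(zip(rimes, rimes[1:]))


def tone_bigrams(syllables):
    """Count adjacent tone pairs in a sequence of syllables."""
    tones = [split_syllable(s)[1] for s in syllables]
    return Counter(zip(tones, tones[1:]))


if __name__ == "__main__":
    text = "wo3 ai4 bei3 jing1 tian1 an1 men2".split()
    print(rime_bigrams(text))
    print(tone_bigrams(text))
```

Counts of this kind, computed per text, could serve as feature vectors for a classifier such as a random forest; the word-final and segment-final rime features named in the abstract would additionally require word and segment boundaries, which this sketch does not model.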
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 49-71 |
Number of pages | 23 |
Journal | Natural Language Engineering |
Volume | 26 |
Issue number | 1 |
Early online date | 10 Apr 2019 |
Publication status | Published - 1 Jan 2020 |
Keywords
- Author identification
- Quantitative stylistics
- Random forest
- Stylometrics
- SVM
- Tone and rime motifs
ASJC Scopus subject areas
- Software
- Language and Linguistics
- Linguistics and Language
- Artificial Intelligence