Context windows are important for a variety of natural language analysis and processing tasks. A trade-off exists between task performance and the size of the context. Lucassen and Mercer used mutual information to determine the size of the context for English text. We apply the same technique to determine the context window size for Chinese text. In addition, we use the association score proposed by Church, which is directly related to the predictive ability of units in the context. To reduce the effects of spurious associations, the association score value at the N% quantile is used instead of the maximum, and association scores derived from low-frequency occurrences (i.e. <5) are discarded. A window size of 9 characters was found to be large enough to capture most associations between characters, and likewise between words. An alternative approach using the (nonparametric) lambda statistic LB is also examined; it overcomes both the spurious association problem and the averaging effect of mutual information. We conclude that the lambda statistic is more suitable for exhaustive contextual models (e.g. variable N-gram models), whereas the association score is more suitable for non-exhaustive contextual models (e.g. identification of collocations).
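The association-score procedure summarized in the abstract can be illustrated with a minimal sketch. It assumes the score takes the usual form log2(P(x, y) / (P(x) P(y))) and is computed separately for each distance between two units. The function names, the nearest-rank quantile rule, and the parameter `q` (which stands in for the unspecified N%) are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def association_scores(text, distance, min_count=5):
    """Score each pair (x, y) where y occurs `distance` units after x.

    Assumed form: log2(P(x, y) / (P(x) * P(y))).
    Pairs seen fewer than `min_count` times are discarded.
    """
    n_pairs = len(text) - distance
    if n_pairs <= 0:
        return {}
    unit_counts = Counter(text)
    pair_counts = Counter(zip(text, text[distance:]))
    total = len(text)
    scores = {}
    for (x, y), count in pair_counts.items():
        if count < min_count:
            continue  # low-frequency pairs give unreliable scores
        p_xy = count / n_pairs
        p_x = unit_counts[x] / total
        p_y = unit_counts[y] / total
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def quantile(values, q):
    """Nearest-rank style quantile for q in [0, 1] (illustrative rule)."""
    ordered = sorted(values)
    if not ordered:
        return float("nan")
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def score_profile(text, max_distance, q=0.9, min_count=5):
    """Quantile of the association score at each distance 1..max_distance."""
    profile = {}
    for d in range(1, max_distance + 1):
        scores = association_scores(text, d, min_count)
        profile[d] = quantile(list(scores.values()), q)
    return profile
```

Under these assumptions, a window size could be read off as the distance beyond which the profile stays close to zero. `text` may be a string of characters or a list of word tokens, which matches the two kinds of unit the abstract considers.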
|Number of pages||10|
|Journal||International journal of computer processing of languages|
|Publication status||Published - 2001|
- Text Analysis
- Textual Data Mining
- Association Score
- Mutual Information