Determination of context window size

K.Y. Hung, Wing Pong Robert Luk, D. Yeung, Fu Lai Korris Chung, W. Shu

Research output: Journal article publication › Journal article › Academic research › peer-review

Abstract

Context windows are important for a variety of natural language analysis and processing tasks. A trade-off exists between task performance and the size of the context. Lucassen and Mercer used mutual information to determine the size of the context for English text. We apply the same technique to determine the context window size for Chinese text. In addition, we use the association score proposed by Church, which is directly related to the prediction ability of units in the context. To reduce the effects of spurious associations, the association score value at the N% quartile is used instead of the maximum, and association scores derived from low-frequency occurrences (i.e. <5) are discarded. A window size of 9 characters was found to be large enough for most associations between characters themselves, and between words themselves. An alternative approach using the (nonparametric) lambda statistic LB is examined, which overcomes spurious association problems and the averaging effect of mutual information. We conclude that the lambda statistic is more suitable for exhaustive contextual models (e.g. variable N-gram models), whereas the association score is more suitable for non-exhaustive contextual models (e.g. identification of collocations).
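
The association-score procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the score is the Church and Hanks association ratio, log2(P(x,y) / (P(x)P(y))), computed for units exactly `distance` positions apart, and it reads "N% quartile" as a nearest-rank quantile. The function names `association_scores` and `score_at_quantile`, and the parameters `min_count` and `q`, are hypothetical.

```python
import math
from collections import Counter


def association_scores(tokens, distance, min_count=5):
    """Association ratio log2(P(x,y) / (P(x) * P(y))) for each pair (x, y)
    in which y occurs exactly `distance` positions after x.

    Assumption: this is the Church-Hanks association ratio; the paper's
    exact formulation is not given in the abstract.
    Pairs observed fewer than `min_count` times are discarded.
    """
    n = len(tokens)
    n_pairs = n - distance
    if distance < 1 or n_pairs <= 0:
        return {}
    unigram = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[distance:]))
    scores = {}
    for (x, y), count in pairs.items():
        if count < min_count:
            continue  # drop low-frequency pairs (spurious associations)
        p_xy = count / n_pairs
        p_x = unigram[x] / n
        p_y = unigram[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def score_at_quantile(scores, q):
    """Nearest-rank quantile (0 <= q <= 1) of the score values,
    used in place of the maximum. Returns None if there are no scores."""
    if not scores:
        return None
    ordered = sorted(scores.values())
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]
```

One way to use the sketch is to compute `score_at_quantile(association_scores(tokens, d), q)` for d = 1, 2, 3, ... over a character or word sequence and observe the distance beyond which the value stays near zero; that distance suggests a window size under these assumptions.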
Original language: English
Pages (from-to): 71-80
Number of pages: 10
Journal: International journal of computer processing of languages
Volume: 14
Issue number: 1
DOIs
Publication status: Published - 2001

Keywords

  • Text Analysis
  • Textual Data Mining
  • Association Score
  • Mutual Information
