Abstract
This paper presents a system that extracts word-based bi-gram and n-gram collocation information from a 60 MB corpus and then locates bi-gram pairs using Strength and Spread as defined in the Xtract system. For Xtract to work effectively with Chinese, we have re-adjusted its parameters. To obtain a higher recall rate, we have modified the algorithm to identify collocations with a low frequency of occurrence, a method that works particularly well for bi-grams in which one word is high-frequency and the other low-frequency. In preliminary experiments, our system extracts bi-gram collocations with a precision of 61%, an 8% improvement over the direct use of Smadja's Xtract on Chinese. Further, we have improved the recall rate by 4.5% while extracting multi-word collocations with 92% precision.
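For readers unfamiliar with the two statistics named in the abstract, the following is a minimal sketch of Strength and Spread as they are usually stated for Smadja's Xtract: Strength is the z-score of a collocate's total co-occurrence frequency among all collocates of a headword, and Spread is the variance of its counts across the positions of the context window. The function names, the use of the population standard deviation, and the toy counts are assumptions made for illustration. The abstract does not give the parameters re-adjusted for Chinese, so no thresholds are hard-coded here.

```python
from statistics import mean, pstdev


def spread(position_counts):
    """Variance of a collocate's counts across window positions.

    position_counts[j] is how often the collocate appears at the j-th
    offset from the headword (for example, 10 offsets for a -5..+5 window).
    """
    counts = list(position_counts)
    avg = mean(counts)
    return sum((c - avg) ** 2 for c in counts) / len(counts)


def strength(total_freqs):
    """z-score of each collocate's total frequency for one headword.

    total_freqs maps collocate -> total co-occurrence count.
    Uses the population standard deviation (an assumption of this sketch).
    """
    values = list(total_freqs.values())
    avg = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # All collocates are equally frequent, so none stands out.
        return {w: 0.0 for w in total_freqs}
    return {w: (f - avg) / sd for w, f in total_freqs.items()}


def candidate_bigrams(position_table, min_strength, min_spread):
    """Keep collocates whose Strength and Spread both reach the thresholds.

    position_table maps collocate -> list of per-position counts.
    The thresholds are caller-supplied; the paper's tuned values are not
    stated in the abstract.
    """
    totals = {w: sum(c) for w, c in position_table.items()}
    k = strength(totals)
    return [
        w for w, counts in position_table.items()
        if k[w] >= min_strength and spread(counts) >= min_spread
    ]
```

A collocate whose occurrences are concentrated at one offset gets a high Spread, while one scattered evenly across the window gets a Spread near zero, which is why the two measures are used together to filter bi-gram pairs.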
Original language | English |
---|---|
Title of host publication | NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings |
Publisher | IEEE |
Pages | 333-338 |
Number of pages | 6 |
ISBN (Electronic) | 0780379020, 9780780379022 |
Publication status | Published - 1 Jan 2003 |
Event | International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003 - Beijing Media Center, Beijing, China. Duration: 26 Oct 2003 → 29 Oct 2003 |
Conference
Conference | International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003 |
---|---|
Country/Territory | China |
City | Beijing |
Period | 26/10/03 → 29/10/03 |
Keywords
- Chinese collocation
- Information
- Statistical modeling
ASJC Scopus subject areas
- Artificial Intelligence
- Computational Theory and Mathematics
- Software