Improving Xtract for Chinese collocation extraction

Qin Lu, Yin Li, Ruifeng Xu

Research output: Chapter in book / Conference proceeding; Conference article published in proceeding or book; Academic research; peer-review

7 Citations (Scopus)

Abstract

This paper presents a system which extracts word-based bi-gram and n-gram collocation information from a 60MB corpus and then locates bi-gram pairs using Strength and Spread as defined in the Xtract system. In order for Xtract to work effectively with Chinese, we have re-adjusted the parameters. To obtain a higher recall rate, we have modified the algorithm to identify collocations with a low frequency of occurrence, a method which works particularly well in the case of bi-grams in which one word is high-frequency and the other low-frequency. In preliminary experiments, our system extracts bi-gram collocations with a precision of 61%: an 8% improvement over the direct use of Smadja's Xtract on Chinese. Further, we have improved the recall rate by 4.5% while extracting multi-word collocations with 92% precision.
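
As an illustration of the two Xtract measures named in the abstract, the following is a minimal sketch, assuming Smadja's usual definitions: strength as the z-score of a collocate's co-occurrence frequency with a headword, and spread as the variance of its counts over the relative positions in the window. The function names, the toy counts, and the choice of population standard deviation are assumptions made for illustration; the re-adjusted parameters and the low-frequency modification described in the paper are not given here and are not reproduced.

```python
from statistics import mean, pstdev


def strength(freqs):
    """Return the z-score of each collocate's co-occurrence frequency.

    freqs maps collocate -> total co-occurrence count with one headword.
    Population standard deviation is an assumption of this sketch.
    """
    values = list(freqs.values())
    avg = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # All collocates are equally frequent, so none stands out.
        return {word: 0.0 for word in freqs}
    return {word: (count - avg) / sd for word, count in freqs.items()}


def spread(position_counts):
    """Return the variance of counts across relative positions.

    position_counts holds one count per offset in the window
    (for example 10 entries for offsets -5..-1 and +1..+5).
    """
    avg = mean(position_counts)
    return sum((p - avg) ** 2 for p in position_counts) / len(position_counts)


# Hypothetical toy data: three collocates of a single headword.
toy_freqs = {"a": 10, "b": 2, "c": 0}
toy_strength = strength(toy_freqs)

# A collocate seen 10 times, always at the same offset, versus one
# seen once at every offset.
peaked = spread([0, 0, 0, 0, 10, 0, 0, 0, 0, 0])
flat = spread([1] * 10)
```

A high spread means the collocate favours particular positions relative to the headword, which is why a peaked histogram scores higher than a flat one with the same total count. A candidate pair would be kept only when both measures exceed thresholds, and those thresholds are the parameters the paper reports re-adjusting for Chinese.
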
Original language: English
Title of host publication: NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings
Publisher: IEEE
Pages: 333-338
Number of pages: 6
ISBN (Electronic): 0780379020, 9780780379022
Publication status: Published - 1 Jan 2003
Event: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003 - Beijing Media Center, Beijing, China
Duration: 26 Oct 2003 to 29 Oct 2003

Conference

Conference: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003
Country: China
City: Beijing
Period: 26/10/03 to 29/10/03

Keywords

  • Chinese collocation
  • Information
  • Statistical modeling

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Software
