TCtract-A collocation extraction approach for noun phrases using shallow parsing rules and statistic models

Wan Yin Li, Qin Lu, James Liu

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

2 Citations (Scopus)

Abstract

This paper presents a hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge. The algorithm first extracts all noun phrase collocations from a shallow-parsed corpus using syntactic knowledge in the form of phrase rules. It then removes pseudo-collocations by applying a set of statistics-based association measures (AMs) as filters. The hybrid algorithm has two main design goals: (1) to maintain reasonable recall while improving precision, and (2) to investigate the proposed association measures on Chinese noun phrase collocations. Its performance is compared with that of a pure statistical model and a pure rule-based method on a 60 MB PoS-tagged corpus. The experimental results, based on 29 randomly selected noun headwords, show that the proposed hybrid method achieves a precision of 92.65% and a recall of 47%, compared with a precision of 78.87% and a recall of 27.19% for a statistics-based extraction system. The F-score improvement is 55.7%.
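
The two-stage design described in the abstract (rule-based candidate extraction followed by statistical filtering) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the chunk representation, the single phrase rule, the `MODIFIER_TAGS` set, the use of pointwise mutual information as the association measure, the thresholds, and the English placeholder tokens standing in for Chinese words.

```python
import math
from collections import Counter

# Hypothetical PoS tags allowed to modify a noun head (not the paper's rule set).
MODIFIER_TAGS = {"a", "n", "v"}


def extract_candidates(chunked_sentences):
    """Stage 1: rule-based extraction of candidate pairs from NP chunks.

    Each sentence is a list of (chunk_label, [(word, pos), ...]) pairs.
    Toy rule: inside an NP chunk, the last noun is treated as the head, and
    every earlier token tagged with one of MODIFIER_TAGS yields a
    (modifier, head) candidate.
    """
    pairs = Counter()
    for sentence in chunked_sentences:
        for label, tokens in sentence:
            if label != "NP":
                continue
            noun_positions = [i for i, (_, pos) in enumerate(tokens) if pos == "n"]
            if not noun_positions:
                continue
            head_idx = noun_positions[-1]
            head = tokens[head_idx][0]
            for word, pos in tokens[:head_idx]:
                if pos in MODIFIER_TAGS:
                    pairs[(word, head)] += 1
    return pairs


def count_words(chunked_sentences):
    """Count every token in the corpus, regardless of chunk type."""
    counts = Counter()
    for sentence in chunked_sentences:
        for _, tokens in sentence:
            counts.update(word for word, _ in tokens)
    return counts


def filter_by_pmi(pairs, words, min_pmi=1.0, min_freq=2):
    """Stage 2: keep candidates whose frequency and PMI reach the thresholds.

    Pair and word probabilities are both normalised by the total token
    count, which is a simplification chosen for brevity.
    """
    total = sum(words.values())
    kept = {}
    for (mod, head), freq in pairs.items():
        if freq < min_freq:
            continue
        p_pair = freq / total
        p_mod = words[mod] / total
        p_head = words[head] / total
        pmi = math.log2(p_pair / (p_mod * p_head))
        if pmi >= min_pmi:
            kept[(mod, head)] = pmi
    return kept


# Toy shallow-parsed corpus with placeholder tokens.
corpus = [
    [("NP", [("strong", "a"), ("tea", "n")]), ("VP", [("is", "v"), ("served", "v")])],
    [("NP", [("strong", "a"), ("tea", "n")])],
    [("NP", [("the", "d"), ("tea", "n")]), ("VP", [("cooled", "v")])],
    [("NP", [("hot", "a"), ("tea", "n")])],
]

candidates = extract_candidates(corpus)
collocations = filter_by_pmi(candidates, count_words(corpus))
```

In this sketch the rule stage decides which word pairs are syntactically admissible, and the filter stage discards admissible pairs that are too rare or too weakly associated. This mirrors the stated goal of raising precision without giving up the recall supplied by the rules.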
Original language: English
Title of host publication: PACLIC 20 - Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation
Pages: 109-116
Number of pages: 8
Publication status: Published - 1 Dec 2006
Event: 20th Pacific Asia Conference on Language, Information and Computation, PACLIC 20 - Wuhan, China
Duration: 1 Nov 2006 → 3 Nov 2006

Conference

Conference: 20th Pacific Asia Conference on Language, Information and Computation, PACLIC 20
Country/Territory: China
City: Wuhan
Period: 1/11/06 → 3/11/06

Keywords

  • Association measures
  • Collocation extraction
  • Phrase rules
  • Typed collocations

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (miscellaneous)
