Abstract
This paper presents a hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge. The algorithm first extracts all noun phrase collocations from a shallow-parsed corpus using syntactic knowledge in the form of phrase rules. It then removes pseudo-collocations by using a set of statistics-based association measures (AMs) as filters. The hybrid algorithm is designed with two main purposes: (1) to maintain reasonable recall while improving precision, and (2) to investigate the proposed association measures on Chinese noun phrase collocations. Its performance is compared with that of a pure statistical model and a pure rule-based method on a 60 MB PoS-tagged corpus. The experimental results, based on 29 randomly selected noun headwords, show that the proposed hybrid method achieves a precision of 92.65% and a recall of 47%, compared with a precision of 78.87% and a recall of 27.19% for a statistics-based extraction system. The F-score improvement is 55.7%.
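The two-stage design described in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the PoS tag patterns in `PHRASE_RULES`, the function names, the threshold, and the toy English data are hypothetical, and pointwise mutual information (PMI) stands in for the paper's association measures, which the abstract does not specify.

```python
import math
from collections import Counter

# Hypothetical phrase rules: PoS-tag bigram patterns ending in a noun head.
# The paper's actual rules are not given in the abstract.
PHRASE_RULES = {("a", "n"), ("n", "n"), ("v", "n")}


def extract_candidates(chunks):
    """Stage 1 (rule-based): count adjacent word pairs whose tags match a rule."""
    pairs = Counter()
    for chunk in chunks:
        for (w1, t1), (w2, t2) in zip(chunk, chunk[1:]):
            if (t1, t2) in PHRASE_RULES:
                pairs[(w1, w2)] += 1
    return pairs


def filter_candidates(pairs, min_pmi=0.0):
    """Stage 2 (statistical): keep pairs whose PMI reaches the threshold.

    PMI is an assumed stand-in for the paper's association measures.
    """
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (w1, w2), count in pairs.items():
        left[w1] += count
        right[w2] += count
    kept = {}
    for (w1, w2), count in pairs.items():
        score = math.log2(count * total / (left[w1] * right[w2]))
        if score >= min_pmi:
            kept[(w1, w2)] = score
    return kept


# Toy shallow-parsed chunks as lists of (word, tag); English words for readability.
chunks = (
    [[("strong", "a"), ("tea", "n")]] * 4
    + [[("heavy", "a"), ("rain", "n")]] * 4
    + [[("strong", "a"), ("rain", "n")]]   # rule-matching but weakly associated
    + [[("the", "d"), ("tea", "n")]]       # matches no phrase rule
)

candidates = extract_candidates(chunks)
collocations = filter_candidates(candidates, min_pmi=0.0)
print(sorted(collocations))  # [('heavy', 'rain'), ('strong', 'tea')]
```

In this toy run the rule stage admits `("strong", "rain")` because its tags match a pattern, and the statistical filter then discards it as a pseudo-collocation, which mirrors the division of labor the abstract describes.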
Original language | English |
---|---|
Title of host publication | PACLIC 20 - Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation |
Pages | 109-116 |
Number of pages | 8 |
Publication status | Published - 1 Dec 2006 |
Event | 20th Pacific Asia Conference on Language, Information and Computation, PACLIC 20 - Wuhan, China; Duration: 1 Nov 2006 → 3 Nov 2006 |
Conference
Conference | 20th Pacific Asia Conference on Language, Information and Computation, PACLIC 20 |
---|---|
Country/Territory | China |
City | Wuhan |
Period | 1/11/06 → 3/11/06 |
Keywords
- Association measures
- Collocation extraction
- Phrase rules
- Typed collocations
ASJC Scopus subject areas
- Language and Linguistics
- Computer Science (miscellaneous)