Abstract
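Semantic memory is the foundation of human natural language understanding. The process by which humans understand language can be viewed as encoding word meanings, retrieving them from semantic memory, and then decoding them. A sound representation of word meaning is therefore a key step in enabling computers to understand language. This paper summarizes and analyzes the relationship between existing word sense representation methods and the representation of word meaning in the human brain. Addressing ambiguity in Chinese vocabulary, it focuses on how to automatically extract as much sense information as possible about an ambiguous word from its context, and how to integrate this information so that the word's senses are represented by a set of features. Specifically, the content words with clear meanings that occur in the context of the ambiguous word serve as the model input, other features that can represent the sense of the ambiguous word are also extracted from the context, and the two kinds of information are integrated through a Bayesian probability model to jointly perform word sense representation and induction. Experiments show that the proposed method achieves better word sense representation and induction results.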
Semantic memory is the foundation of human language understanding. To understand language, the human brain must encode, retrieve, and decode word meanings. Semantic representation is therefore a key step in developing natural language processing systems. Some studies have shown that concept formation is shaped by the interaction between the human brain and the real world, and that concepts in the human brain carry rich forms of information, including vision, perception, and language. Under the distributional hypothesis, which states that "similar words occur in similar contexts", concepts are represented as vectors computed from the co-occurrence frequencies of each word and its statistical features. In this way, word representation in a computer can be seen as analogous to semantic representation in the human brain. This article focuses on how to represent word senses and perform word sense induction in natural language text. We first investigate the relation between computational models of word representation and semantic representation in the human brain. Through word similarity experiments, we verify that word representations learned by statistical methods can capture the similarity relations between words in the human brain. With a view to Chinese word sense disambiguation, this paper studies methods for automatically finding the semantic features of ambiguous words from context. A Bayesian probability model can learn word representations and perform word sense induction jointly. Specifically, to perform word sense induction, the Bayesian probability model clusters words that share a topic, and the words within a topic can be seen as the representation of that topic. In the word sense induction task, topics are mapped to word senses during evaluation. We therefore use the latent Dirichlet allocation model to learn word sense representations from a large-scale unannotated corpus. On the basis of these word sense representations, we perform word sense induction on the test data.
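The topic-based induction described above can be illustrated with a minimal sketch of standard latent Dirichlet allocation trained by collapsed Gibbs sampling. This is not the authors' implementation: the function name `lda_gibbs`, the hyperparameter values, and the toy English contexts for the ambiguous word "bank" are all assumptions made for illustration. Each context of the ambiguous word is treated as a document, and its dominant topic is taken as the induced sense.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative sketch).

    docs: list of token lists, one per context of the ambiguous word.
    Returns the dominant topic per context and the topic-word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                       # n(k)
    assignments = []

    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # The induced sense of a context is its most frequent topic.
    senses = [
        max(range(num_topics), key=lambda t: doc_topic[d][t])
        for d in range(len(docs))
    ]
    return senses, topic_word

# Toy contexts (hypothetical data, not from the paper's corpus).
contexts = [
    ["river", "water", "shore", "fishing"],
    ["money", "loan", "deposit", "account"],
    ["water", "shore", "boat", "river"],
    ["loan", "interest", "money", "account"],
]
senses, topic_word = lda_gibbs(contexts, num_topics=2)
print(senses)
```

The induced topic indices are arbitrary labels, so they only become comparable with annotated senses after the topic-to-sense mapping step mentioned in the abstract.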
To better capture the meaning of ambiguous words, this article builds a Dual Latent Dirichlet Allocation (Dual-LDA) model with two input channels. Specifically, we propose an approach that extracts the content words with clear meanings in the context, along with the words in the context that can distinguish the ambiguous words. We then combine them as the two inputs of a Bayesian probability model to represent and induce word senses. In the experiments, we use the SogouLab data (sogouCS) as the training corpus and extract 120 thousand sentences that contain the target ambiguous words. The ambiguous words come from the word sense induction task in CLP2010, which provides 50 sentences for each ambiguous word, each annotated with the sense of the ambiguous word. For evaluation, we choose the K-Means clustering model and the latent Dirichlet allocation model as baselines and use accuracy as the evaluation metric. The experimental results show that the proposed Dual-LDA model achieves the best results among the compared models. This indicates that the Dual-LDA model obtains better word representations by integrating two different kinds of information extracted from context. Moreover, the better word representations improve the performance of word sense induction.
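The evaluation step, in which induced topics are mapped to annotated senses and scored by accuracy, can be sketched as follows. The abstract does not specify the mapping rule, so this sketch assumes a simple majority vote: each induced cluster is assigned the gold sense that occurs most often among its sentences. The function names and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def map_clusters_to_senses(induced, gold):
    """Map each induced cluster to its most frequent gold sense (assumed rule)."""
    by_cluster = defaultdict(Counter)
    for cluster, sense in zip(induced, gold):
        by_cluster[cluster][sense] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}

def induction_accuracy(induced, gold):
    """Fraction of sentences whose mapped cluster label equals the gold sense."""
    mapping = map_clusters_to_senses(induced, gold)
    correct = sum(mapping[c] == g for c, g in zip(induced, gold))
    return correct / len(gold)

# Toy labels (hypothetical): six sentences containing one ambiguous word.
induced = [0, 0, 0, 1, 1, 1]
gold = ["finance", "finance", "river", "river", "river", "river"]

mapping = map_clusters_to_senses(induced, gold)
accuracy = induction_accuracy(induced, gold)
print(mapping, round(accuracy, 3))
```

With these toy labels, cluster 0 maps to "finance" and cluster 1 to "river", so five of the six sentences are counted as correct. A many-to-one mapping like this one rewards over-splitting, which is one reason the number of induced senses is usually fixed when systems are compared.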
| Translated title of the contribution | A dual-LDA method on Chinese word sense representation and induction |
|---|---|
| Original language | Chinese (Simplified) |
| Pages (from-to) | 1652-1666 |
| Number of pages | 15 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 39 |
| Issue number | 8 |
| DOIs | |
| Publication status | Published - 1 Aug 2016 |
| Externally published | Yes |
Keywords
- Dual Dirichlet analysis
- Latent Dirichlet allocation
- Semantic representation
- Word sense disambiguation
- Word sense induction
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications
- Computer Graphics and Computer-Aided Design