Many existing clustering algorithms have been used to identify coexpressed genes in gene expression data. These algorithms are used mainly to partition data in the sense that each gene is allowed to belong only to one cluster. Since proteins typically interact with different groups of proteins in order to serve different biological roles, the genes that produce these proteins are therefore expected to coexpress with more than one group of genes. In other words, some genes are expected to belong to more than one cluster. This poses a challenge to gene expression data clustering as there is a need for overlapping clusters to be discovered in a noisy environment. For this task, we propose an effective information theoretical approach, which consists of an initial clustering phase and a second reclustering phase, in this paper. The proposed approach has been tested with both simulated and real expression data. Experimental results show that it can improve the performances of existing clustering algorithms and is able to effectively uncover interesting patterns in noisy gene expression data so that, based on these patterns, overlapping clusters can be discovered.
- Data mining
- Gene expression data clustering
- Information theory
ASJC Scopus subject areas
- Biomedical Engineering