Abstract
This paper proposes a new method, multimodal recurrent neural networks (RNNs), for RGB-D scene semantic segmentation. It is optimized to classify image pixels given two input sources: RGB color channels and depth maps. The method simultaneously trains two RNNs that are cross-connected through information transfer layers, which are learnt to adaptively extract relevant cross-modality features. Each RNN model learns its representations from its own previous hidden states and from patterns transferred from the other RNN's previous hidden states; thus, both model-specific and cross-modality features are retained. We exploit the structure of quad-directional 2D-RNNs to model short- and long-range contextual information in the 2D input image. We carefully design a range of baselines to examine the proposed model structure. We test our multimodal RNN method on popular RGB-D benchmarks and show that it significantly outperforms previous methods and achieves results competitive with other state-of-the-art works. A minimal illustrative sketch of the cross-connected recurrent update described above is given after the abstract.
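The abstract does not give the update equations, so the following is only a minimal sketch of the coupled structure it describes: two modality-specific recurrent streams whose hidden states are exchanged through learned transfer layers. All names, dimensions, and the plain tanh update below are illustrative assumptions, not the paper's actual formulation (which uses quad-directional 2D-RNNs over image lattices); the example collapses one scan direction into a 1D sequence for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not taken from the paper.
d_in, d_hid, seq_len = 8, 16, 5

def init(shape):
    return 0.1 * rng.standard_normal(shape)

# Modality-specific input and recurrent weights (hypothetical names).
W_rgb, U_rgb = init((d_hid, d_in)), init((d_hid, d_hid))
W_dep, U_dep = init((d_hid, d_in)), init((d_hid, d_hid))
# "Information transfer" layers: project the *other* modality's
# previous hidden state into each stream.
T_dep2rgb, T_rgb2dep = init((d_hid, d_hid)), init((d_hid, d_hid))

# Toy per-pixel features along one scan direction.
x_rgb = rng.standard_normal((seq_len, d_in))
x_dep = rng.standard_normal((seq_len, d_in))

h_rgb = np.zeros(d_hid)
h_dep = np.zeros(d_hid)
for t in range(seq_len):
    # Each stream combines its own input and previous hidden state with a
    # transferred pattern from the other stream's previous hidden state,
    # so model-specific and cross-modality features are both retained.
    h_rgb_new = np.tanh(W_rgb @ x_rgb[t] + U_rgb @ h_rgb + T_dep2rgb @ h_dep)
    h_dep_new = np.tanh(W_dep @ x_dep[t] + U_dep @ h_dep + T_rgb2dep @ h_rgb)
    h_rgb, h_dep = h_rgb_new, h_dep_new

print(h_rgb.shape, h_dep.shape)  # (16,) (16,)
```

In the full model this coupled update would run in four scan directions over the image (the quad-directional 2D-RNN structure), and the resulting hidden states would feed a per-pixel classifier; those stages are omitted here.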
Original language | English
---|---
Pages (from-to) | 1656-1671
Number of pages | 16
Journal | IEEE Transactions on Multimedia
Volume | 20
Issue number | 7
DOIs |
Publication status | Published - Jul 2018
Externally published | Yes
Keywords
- CNNs
- Multimodal learning
- RGB-D scene labeling
- RNNs
ASJC Scopus subject areas
- Signal Processing
- Media Technology
- Computer Science Applications
- Electrical and Electronic Engineering