Abstract
Place recognition in LiDAR maps plays a vital role in assisting localization, especially in GPS-denied environments. While many efforts have been devoted to pure LiDAR-based place recognition, these approaches are often hindered by high computational cost and the operational burden they place on the driving agent. To alleviate these limitations, we explore an alternative approach to large-scale cross-modal localization that matches real-time RGB images to pre-existing LiDAR 3D point cloud maps. Specifically, we present a unified cross-modal place-descriptor representation learning method built on a Siamese architecture, which reformulates place recognition as a similarity-based retrieval task. To address the inherent modality gap between visual images and point clouds, we first transform unordered point clouds into a range-view representation, facilitating effective cross-modal metric learning. Subsequently, we introduce a Transformer-Mamba Mixer module that integrates selective scanning and attention mechanisms to capture both intra-context and inter-context embeddings, from which place descriptors are generated. To further enrich the global location descriptors, we propose a semantic-promoted descriptor enhancer grounded in semantic distribution estimation. Finally, a contrastive learning paradigm is employed to perform cross-modal place recognition by identifying the most similar descriptors across modalities. Extensive experiments demonstrate the superiority of the proposed method over state-of-the-art methods. The details are available at https://github.com/emilyemliyM/Cross-PRNet.
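The range-view conversion mentioned above can be illustrated with a minimal sketch (not the paper's implementation): a standard spherical projection of an unordered point cloud onto a 2D range image, assuming a rotating LiDAR with a roughly -25° to +3° vertical field of view. The image size, field-of-view values, and function name are illustrative placeholders, not the released code's settings.

```python
# Minimal sketch of a spherical (range-view) projection of a LiDAR scan.
# H, W, fov_up, fov_down are illustrative values, not the paper's settings.
import numpy as np

def point_cloud_to_range_image(points, H=64, W=900, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud onto an H x W range image of depths."""
    fov_up_rad = np.radians(fov_up)
    fov_down_rad = np.radians(fov_down)
    fov_rad = abs(fov_up_rad) + abs(fov_down_rad)

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1) + 1e-8

    yaw = np.arctan2(y, x)              # azimuth angle in [-pi, pi]
    pitch = np.arcsin(z / depth)        # elevation angle

    # Normalize angles to [0, 1] image coordinates.
    u = 0.5 * (1.0 - yaw / np.pi)                         # column from azimuth
    v = 1.0 - (pitch + abs(fov_down_rad)) / fov_rad       # row from elevation

    cols = np.clip((u * W).astype(np.int32), 0, W - 1)
    rows = np.clip((v * H).astype(np.int32), 0, H - 1)

    # Write far points first so the nearest return per pixel is kept.
    order = np.argsort(depth)[::-1]
    range_image = np.zeros((H, W), dtype=np.float32)
    range_image[rows[order], cols[order]] = depth[order]
    return range_image
```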
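Similarly, the retrieval formulation can be sketched under the assumption that both branches produce global descriptors that are compared by cosine similarity; at inference, cross-modal place recognition then reduces to a nearest-neighbour search of an image query against the LiDAR map database (during training, a contrastive objective would act on the same similarities). Names below are hypothetical and do not reflect the repository's API.

```python
# Illustrative cross-modal retrieval step over global place descriptors.
import numpy as np

def retrieve_top_k(query_desc, map_descs, k=5):
    """Return indices of the k LiDAR submap descriptors most similar to an image query."""
    q = query_desc / (np.linalg.norm(query_desc) + 1e-12)
    db = map_descs / (np.linalg.norm(map_descs, axis=1, keepdims=True) + 1e-12)
    sims = db @ q                       # cosine similarity to every map descriptor
    top_k = np.argsort(-sims)[:k]       # indices of the best-matching submaps
    return top_k, sims[top_k]
```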
| Original language | English |
|---|---|
| Article number | 103351 |
| Journal | Information Fusion |
| Volume | 124 |
| DOIs | |
| Publication status | Published - Dec 2025 |
Keywords
- Contrastive learning
- Cross-modality
- Descriptor representation
- Place recognition
ASJC Scopus subject areas
- Software
- Signal Processing
- Information Systems
- Hardware and Architecture