TY - GEN
T1 - GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting
AU - Yao, Lei
AU - Wang, Yi
AU - Zhang, Yi
AU - Liu, Moyun
AU - Chau, Lap Pui
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.
AB - The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.
KW - 3dgs
KW - representation learning
KW - self-supervised learning
UR - https://www.scopus.com/pages/publications/105024075199
U2 - 10.1145/3746027.3754773
DO - 10.1145/3746027.3754773
M3 - Conference article published in proceeding or book
AN - SCOPUS:105024075199
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 6500
EP - 6509
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
T2 - 33rd ACM International Conference on Multimedia, MM 2025
Y2 - 27 October 2025 through 31 October 2025
ER -