VL-N3RD-Bench

Benchmarking Vision-Language Navigation with 3D Gaussian Reconstruction for Deployment

Yu-Lun Chou^*, Lu-Ching Wang^*, Ying-Sheng Luo^*, Yu-Tang Liu, Jeng-Lin Li

Inventec Corporation
^*Indicates Equal Contribution

Abstract

Recent advances in large language models (LLMs) have expanded robotic capabilities by bridging perception and action. Vision-language navigation (VLN) enables embodied agents to follow high-level semantic instructions in complex and unseen environments. Fine-tuning VLA models requires photorealistic training data, as insufficient visual fidelity may exacerbate sim-to-real gaps and degrade action execution. However, achieving such realism in simulation is challenging, and training is often conducted using synthetic views or egocentric video datasets. Gaussian Splatting (GS) has recently emerged as a promising solution for high-fidelity 3D scene reconstruction, offering a scalable alternative for generating training environments. Nevertheless, the impact of GS reconstruction quality on downstream VLN performance has not been systematically investigated. In this work, we benchmark state-of-the-art GS pipelines using public and customized datasets. We evaluate reconstruction quality via standard image-based metrics and computational efficiency. Finally, we use VL-N3RD-Bench to evaluate a pretrained VLA model on a Unitree A1 robot across simulated and physical environments. Our results demonstrate that overall reconstruction quality can align with stronger indoor VLN performance, but this relationship is inconsistent, and strong image-based metrics alone do not guarantee better outdoor navigation when geometry and traversability become more important.

Framework

VL-N3RD-Bench reconstructs scenes from multi-view images using multiple 3D Gaussian Splatting pipelines, manually aligns them with collision geometry generated from SplaTAM, and evaluates downstream navigation in simulation and the real world.

Representative VLN experiments and reconstruction comparisons

Representative real-world VLN experiments in the Office scene, including trajectory frames, language instructions, command outputs, and render comparisons across reconstruction pipelines.

VLN Performance

Here we present successful and failed cases of the evaluated VLN pipelines across simulated environments reconstructed by 3DGUT and their real-world counterparts. For simulation results from other pipelines, see here.