Vision-Language Navigation (VLN)

Researching VLN for robotic navigation in unseen environments.

This research is currently in progress. I am exploring how quadruped robots can follow natural-language instructions to navigate novel environments using vision-language models.

Introduction

The framework is built on top of NaVILA, proposed by Cheng et al. NaVILA is a universal Vision-Language-Action (VLA) model that enables legged robots to navigate by following natural-language instructions. For more details, please refer to their website embedded above. Below are recent deployments demonstrating vision-language navigation on a Unitree A1 quadruped in indoor environments.

Simple VLN tasks following natural-language navigation instructions
Long-horizon VLN task
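
To make the setup more concrete, the sketch below shows the general two-level control pattern such systems follow: a vision-language policy turns the instruction and recent camera frames into a mid-level command (e.g. "turn left 30 degrees"), and a low-level locomotion controller executes it on the quadruped. This is a minimal illustration under my own naming; VisionLanguagePolicy, MidLevelAction, run_episode, and the camera/locomotion interfaces are hypothetical and not NaVILA's actual API.

```python
# Illustrative two-level VLN loop (hypothetical interfaces, not NaVILA's API):
# a vision-language policy proposes mid-level actions, and a low-level
# locomotion controller executes them on the robot.
from dataclasses import dataclass
from typing import List


@dataclass
class MidLevelAction:
    """A discrete navigation command, e.g. 'turn left 30 degrees'."""
    kind: str      # "move_forward", "turn_left", "turn_right", or "stop"
    amount: float  # meters for moves, degrees for turns


class VisionLanguagePolicy:
    """Stub standing in for VLM inference over (instruction, recent frames)."""

    def select_action(self, instruction: str, frames: list) -> MidLevelAction:
        # A real policy would run vision-language inference here;
        # the stub simply stops so the example stays self-contained.
        return MidLevelAction(kind="stop", amount=0.0)


def run_episode(instruction: str, camera, locomotion,
                policy: VisionLanguagePolicy, max_steps: int = 50) -> None:
    """Closed-loop navigation: perceive, decide, act, repeat until 'stop'."""
    frames: List = []
    for _ in range(max_steps):
        frames.append(camera.capture())                           # latest RGB frame
        action = policy.select_action(instruction, frames[-8:])   # short frame history
        if action.kind == "stop":
            break
        locomotion.execute(action)                                # low-level gait controller
```

A split like this between a slow vision-language planner and a fast locomotion controller is what allows the same high-level model to be paired with different low-level controllers; the concrete interfaces in NaVILA differ from this sketch.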

System Requirements

We install the conda environment for NaVILA and run the VLN model on a server with an RTX 5090 GPU. Since the paper's setup used 40-series GPUs, there are some incompatibilities with the 50-series. For a detailed setup guide, please refer to these instructions written by Richard Wang.
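
One quick way to see whether a given PyTorch build actually supports the 50-series card is to compare the GPU's compute capability against the architectures the build was compiled for. This is a generic PyTorch sanity check rather than part of the official NaVILA setup; it only assumes a working torch install.

```python
# Check that the installed PyTorch build supports the GPU's compute capability
# (the RTX 5090 is a Blackwell card, reported as sm_120).
import torch

if not torch.cuda.is_available():
    raise SystemExit("CUDA is not available in this PyTorch build.")

major, minor = torch.cuda.get_device_capability(0)
device_arch = f"sm_{major}{minor}"
compiled_for = torch.cuda.get_arch_list()  # e.g. ['sm_80', 'sm_86', 'sm_90']

print(f"GPU: {torch.cuda.get_device_name(0)} ({device_arch})")
print(f"PyTorch compiled for: {compiled_for}")

if device_arch not in compiled_for:
    print("Warning: this build was not compiled for your GPU; "
          "kernels may fail or fall back, so a newer CUDA build is likely needed.")
```

If the check fails, the usual fix is installing a PyTorch build compiled against a CUDA version that includes the newer architecture.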

More updates are coming soon. Stay tuned!