Increased Reasoning for Long-Horizon Tasks in a VLA
Investigated the integration of Large Language Models (LLMs) with the GR00T Vision-Language-Action (VLA) model to improve performance on long-horizon robotic manipulation tasks.
This research-focused internship addressed a central challenge in robotics: enabling robots to carry out complex, long-horizon tasks in dynamic environments from natural language commands. While modern VLA models excel at short-horizon behaviors, their performance often degrades over longer, multi-step sequences that require sustained reasoning and contextual awareness.
My work explored the hypothesis that these limitations could be mitigated by integrating an LLM into an existing VLA framework to provide higher-level task planning and reasoning. The project centered on NVIDIA's GR00T, a scalable, multi-task model for robotic manipulation.
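A minimal sketch of this integration pattern, assuming a planner-executor split: the LLM decomposes a long-horizon instruction into short-horizon subtasks, which the VLA policy executes in sequence. The names (`LLMPlanner`, `VLAPolicy`, `Subtask`) are illustrative placeholders, not the actual GR00T interfaces used in the project.

```python
# Hypothetical sketch: an LLM plans subtasks, a VLA policy executes them.
from dataclasses import dataclass


@dataclass
class Subtask:
    instruction: str  # a short-horizon command the VLA handles well, e.g. "pick up the red block"


class LLMPlanner:
    """Wraps an LLM prompted to split a long-horizon task into ordered subtasks."""

    def plan(self, task: str) -> list[Subtask]:
        # In practice this would prompt an LLM and parse its output;
        # only the interface is shown here.
        raise NotImplementedError


class VLAPolicy:
    """Wraps the low-level visuomotor policy (e.g. a GR00T checkpoint)."""

    def execute(self, subtask: Subtask) -> bool:
        # Runs closed-loop control until the subtask succeeds or times out.
        raise NotImplementedError


def run_long_horizon(task: str, planner: LLMPlanner, policy: VLAPolicy) -> bool:
    """Execute a long-horizon task as a sequence of LLM-planned subtasks."""
    for subtask in planner.plan(task):
        if not policy.execute(subtask):
            return False  # abort on the first failed subtask
    return True
```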
The core of my investigation was to augment GR00T with LLM-based reasoning and systematically evaluate its effect on task performance. I ran experiments on a real-world robotic platform, the Franka Emika Research 3 arm, performing a series of long-horizon pick-and-place tasks. By analyzing success rates, partial completions, and failure modes, I aimed to build a deeper understanding of how language-guided reasoning interacts with the perceptual and motor components of a physically grounded system. The project yielded concrete insights into both the benefits and the limitations of integrating LLMs into embodied VLA systems for complex, real-world manipulation.
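As a rough illustration of how those outcomes could be aggregated, the sketch below tallies full successes, partial completions, and failures per trial. The `Trial` record and its fields are assumptions for illustration, not the project's actual logging format.

```python
# Hypothetical aggregation of trial outcomes into summary rates.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trial:
    task: str
    subtasks_completed: int
    subtasks_total: int


def summarize(trials: list[Trial]) -> dict[str, float]:
    """Return full-success, partial-completion, and failure rates over all trials."""
    counts = Counter()
    for t in trials:
        if t.subtasks_completed == t.subtasks_total:
            counts["success"] += 1
        elif t.subtasks_completed > 0:
            counts["partial"] += 1
        else:
            counts["failure"] += 1
    n = len(trials)
    return {k: counts[k] / n for k in ("success", "partial", "failure")} if n else {}
```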