Integrating Large Language Models (LLMs) with Vision-Language-Action (VLA) models such as GR00T to enable reasoning for complex robotic tasks.
Current Vision-Language-Action (VLA) models excel at short, atomic actions (e.g., "pick up the cup"). However, they struggle significantly with long-horizon tasks that require multi-step reasoning, memory, and contextual awareness over time in dynamic environments.
This research hypothesized that integrating a Large Language Model (LLM) as a high-level "reasoner" could guide low-level VLA policies by breaking complex instructions down into manageable sub-goals.
We implemented a layered architecture: an LLM-based high-level reasoner that decomposes the task into sub-goals, and a GR00T-based VLA policy that executes each sub-goal as low-level actions (see the sketch below).
We evaluated the system systematically on a Franka Emika Research 3 arm.
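To make the division of labour concrete, below is a minimal sketch of the planning-and-execution loop this layered design implies. Every name in it (Subgoal, query_llm_planner, run_task, the fake_vla stand-in) is an illustrative placeholder rather than the project's or GR00T's actual API; a real system would prompt an LLM with the task, scene state, and execution history, and hand each resulting sub-goal to the VLA policy driving the arm.

```python
"""Minimal sketch of the layered LLM-reasoner / VLA-policy loop.

All names here are illustrative placeholders, not the actual GR00T or
project interfaces.
"""

from dataclasses import dataclass
from typing import Callable


@dataclass
class Subgoal:
    """One atomic instruction the low-level VLA policy can execute directly."""
    instruction: str   # e.g. "pick up the red cup"
    done_check: str    # natural-language success condition


def query_llm_planner(task: str, history: list[str]) -> list[Subgoal]:
    """Hypothetical high-level reasoner: ask an LLM to decompose the task.

    In a real system this would call an LLM with a prompt containing the
    task, a scene description, and the execution history; here it returns
    a fixed plan so the sketch runs standalone.
    """
    return [
        Subgoal("pick up the cup", "cup is grasped"),
        Subgoal("place the cup on the tray", "cup rests on the tray"),
    ]


def run_task(task: str, execute_subgoal: Callable[[Subgoal], bool]) -> bool:
    """Run a long-horizon task by looping LLM planning over VLA execution."""
    history: list[str] = []
    for subgoal in query_llm_planner(task, history):
        ok = execute_subgoal(subgoal)  # low-level VLA rollout for one sub-goal
        history.append(f"{subgoal.instruction}: {'ok' if ok else 'failed'}")
        if not ok:
            return False               # a real system might replan here instead
    return True


if __name__ == "__main__":
    # Stand-in for the VLA policy (e.g. GR00T) controlling the Franka arm.
    def fake_vla(subgoal: Subgoal) -> bool:
        print(f"[VLA] executing: {subgoal.instruction}")
        return True

    print("success:", run_task("clear the table", fake_vla))
```

The key design choice the sketch illustrates is that the LLM never outputs motor commands; it only emits short, atomic instructions of the kind VLA models already handle well, which is what keeps the two layers decoupled.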
The study provided critical insights into the "reasoning gap" in current VLA models.
We demonstrated that LLM guidance improves success rates on tasks that require logical ordering of actions, but that inference latency and language grounding remain key challenges.
Affiliation: Eurecat – Robotics & Automation
Focus: Embodied AI, Foundation Models
Period: Feb 2025 – Oct 2025