
VLA from Scratch
with PushT & SAC

Building a Vision-Language-Action (VLA) training pipeline that starts from pure reinforcement learning to gather expert demonstrations.

Tags: RL (SAC) · PushT Env · Diffusion Policy

[Figure: the PushT task]

01. The Pipeline

Training a VLA model usually requires massive amounts of human-annotated data. In this project, I explore an autonomous pipeline to generate this data synthetically using Reinforcement Learning.

  • Step 1: Train a Soft Actor-Critic (SAC) agent to solve the PushT task.
  • Step 2: Use the trained "expert" policy to collect thousands of successful demonstrations.
  • Step 3: Train a Diffusion Policy / VLA on this synthetic dataset.
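Step 2 above can be sketched as a filtered rollout loop: run the trained expert and keep only the successful episodes. This is a minimal illustration, not the project's actual collection script; `env` is assumed to follow the Gymnasium step API, and the `is_success` key is a hypothetical name for however the environment reports task completion.

```python
import numpy as np

def collect_demonstrations(env, policy, n_episodes, success_threshold=0.95):
    """Roll out a trained expert and keep only successful episodes.

    `env` is any Gymnasium-style environment; `policy` is a deterministic
    mapping from observation to action (e.g. the SAC actor's mean action).
    """
    dataset = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode, done = [], False
        while not done:
            action = policy(obs)
            next_obs, reward, terminated, truncated, info = env.step(action)
            episode.append((obs, action))
            done = terminated or truncated
            obs = next_obs
        # Keep the episode only if the environment flags success
        # (falling back to a reward threshold when no flag is provided).
        if info.get("is_success", reward >= success_threshold):
            dataset.append(episode)
    return dataset
```

Filtering at collection time keeps the imitation dataset clean: the downstream policy only ever sees state-action pairs from trajectories that actually solved the task.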

02. RL Expert (SAC)

Soft Actor-Critic

I implemented SAC from scratch in PyTorch to master the `gym-pusht` environment. The agent observes the state [agent_pos, block_pos, block_angle] and outputs continuous actions.

  • ~176 — Mean reward over the last 100 episodes
  • 95% — Success criterion
  • 3k — Training episodes
  • 1M — Replay buffer capacity
sac/actor.py
def forward(self, state):
    x = F.relu(self.fc1(state))
    x = F.relu(self.fc2(x))
    mean = self.mean(x)
    log_std = self.log_std(x)
    log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX)
    
    # Reparameterization trick
    std = log_std.exp()
    normal = Normal(mean, std)
    x_t = normal.rsample()
    action = torch.tanh(x_t)

    # Log-probability with tanh-squashing correction
    log_prob = normal.log_prob(x_t) - torch.log(1 - action.pow(2) + 1e-6)
    log_prob = log_prob.sum(dim=-1, keepdim=True)

    return action, log_prob, mean
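The actor above feeds into the critic update, where the "soft" part of Soft Actor-Critic appears: the Bellman target subtracts an entropy bonus scaled by the temperature alpha. A minimal sketch of that target, with argument names that are my own rather than taken from the repo:

```python
import torch

def critic_target(reward, done, next_q1, next_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """Soft Bellman backup used to train both Q-networks.

    Uses the minimum of the two target critics (clipped double-Q) and
    subtracts alpha * log_prob, which rewards high-entropy actions.
    """
    next_q = torch.min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * next_q
```

Both Q-networks are then regressed toward this target with an MSE loss, and the actor is updated to maximize `min(Q1, Q2) - alpha * log_prob` under its own samples.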

03. From RL to VLA

Once the SAC agent converges, I use it to collect a dataset of "expert rollouts". This dataset is then used to train a conditional diffusion policy.
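The diffusion-policy training objective is the standard DDPM denoising loss, conditioned on the observation: noise an expert action at a random timestep, then train a network to predict that noise. This is a sketch under assumptions; the `noise_pred_net(noisy_action, t, obs)` signature is hypothetical, not the project's actual interface.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(noise_pred_net, obs, action, alphas_cumprod):
    """One conditional DDPM training step on an (obs, action) batch.

    `alphas_cumprod` is the cumulative product of the noise schedule's
    alpha_t terms, one entry per diffusion timestep.
    """
    batch_size = action.shape[0]
    num_steps = alphas_cumprod.shape[0]
    # Sample a random diffusion timestep per example.
    t = torch.randint(0, num_steps, (batch_size,))
    noise = torch.randn_like(action)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    # Forward process: interpolate between clean action and pure noise.
    noisy_action = a_bar.sqrt() * action + (1 - a_bar).sqrt() * noise
    # The network predicts the injected noise, conditioned on obs and t.
    pred = noise_pred_net(noisy_action, t, obs)
    return F.mse_loss(pred, noise)
```

At inference time the same network runs in reverse: starting from Gaussian noise, it iteratively denoises toward an action consistent with the current observation.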

The goal is to build the foundation for a Vision-Language-Action model in which the language component guides high-level goals (e.g., "Push the T-block to the target") while the diffusion policy handles low-level motor control consistent with the visual input.