Building a Vision-Language-Action (VLA) model pipeline, starting from pure Reinforcement Learning to gather expert demonstrations.
Training a VLA model usually requires massive amounts of human-annotated data. In this project, I explore an autonomous pipeline to generate this data synthetically using Reinforcement Learning.
I implemented Soft Actor-Critic (SAC) from scratch in PyTorch to master the `gym-pusht` environment. The agent observes the state [agent_pos, block_pos, block_angle] and outputs continuous actions.
import torch
import torch.nn.functional as F
from torch.distributions import Normal

LOG_SIG_MIN, LOG_SIG_MAX = -20, 2  # standard SAC clamp bounds for numerical stability

def forward(self, state):
    x = F.relu(self.fc1(state))
    x = F.relu(self.fc2(x))
    mean = self.mean(x)
    log_std = self.log_std(x)
    log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX)
    # Reparameterization trick: rsample() keeps the sample differentiable
    std = log_std.exp()
    normal = Normal(mean, std)
    x_t = normal.rsample()
    action = torch.tanh(x_t)
    # Tanh-squashing correction to the Gaussian log-probability
    log_prob = normal.log_prob(x_t) - torch.log(1 - action.pow(2) + 1e-6)
    log_prob = log_prob.sum(dim=-1, keepdim=True)
    return action, log_prob, mean
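For context, here is a minimal sketch of how the squashed log-probability returned by the actor feeds into the SAC actor update. The tensor shapes, the critic value, and the fixed temperature `alpha` are illustrative assumptions, not the project's actual code.

```python
import torch

# Hypothetical batch: 32 transitions, 1-D log-probs and Q-values.
log_prob = torch.randn(32, 1)   # would come from policy.forward(state)
q_value = torch.randn(32, 1)    # would be min of the two critic outputs
alpha = 0.2                     # entropy temperature (assumed fixed here)

# SAC actor objective: maximize Q - alpha * log_prob,
# i.e. minimize alpha * log_prob - Q.
actor_loss = (alpha * log_prob - q_value).mean()
```

In practice `alpha` is often learned automatically against a target entropy, but a fixed value keeps the sketch simple.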
Once the SAC agent converges, I use it to collect a dataset of "expert rollouts". This dataset is then used to train a conditional diffusion policy.
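A rollout-collection loop along these lines could produce that dataset; `env` and `policy.act` are stand-ins for the trained SAC agent and the `gym-pusht` environment (Gymnasium-style `reset`/`step` API assumed), not the project's actual code.

```python
import numpy as np

def collect_rollouts(env, policy, num_episodes):
    """Roll out a trained policy and store (state, action) trajectories."""
    episodes = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        states, actions = [], []
        done = False
        while not done:
            action = policy.act(obs)  # e.g. the deterministic mean action
            states.append(obs)
            actions.append(action)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        episodes.append({"states": np.array(states),
                         "actions": np.array(actions)})
    return episodes
```

Storing whole episodes (rather than shuffled transitions) matters here, since the diffusion policy is typically trained on short action sequences sliced from trajectories.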
The goal is to build a foundation for a Vision-Language-Action model where the "Language" component can guide the high-level goals (e.g., "Push the T-block to the target"), while the diffusion policy handles the low-level motor control consistent with the visual input.
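To make the diffusion-policy step concrete, here is a minimal DDPM-style training loss for a state-conditioned action denoiser. The tiny MLP, the linear noise schedule, and the 5-D state / 2-D action shapes (matching [agent_pos, block_pos, block_angle] and a 2-D action) are illustrative assumptions, not the project's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 100  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Toy denoiser: input = noisy action (2) + state (5) + timestep (1).
denoiser = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

def diffusion_loss(state, action):
    """Predict the noise added to an expert action, conditioned on state."""
    t = torch.randint(0, T, (action.shape[0],))
    noise = torch.randn_like(action)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * action + (1 - a_bar).sqrt() * noise
    inp = torch.cat([noisy, state, t.float().unsqueeze(-1) / T], dim=-1)
    return F.mse_loss(denoiser(inp), noise)
```

Real diffusion policies denoise short action *sequences* with a conditional U-Net or transformer; the single-step MLP above only illustrates the noise-prediction objective.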