I placed 2nd in the LeHome Challenge (ICRA 2026), and 1st of 62 teams in the first simulation round. Now I'm open-sourcing the full solution: code, tech report, and final weights.
The task: teach a cheap two-armed robot (SO-ARM101) to fold 4 garment types (long and short tops, long and short pants). The garment category is hidden at eval. Round 1 ran in sim (auto-scored), round 2 on a real robot (jury-scored).
I trained a VLA policy with an RL loop on top. The key ideas:
The policy is its own value function. From the same forward pass that picks the next action chunk, cheap heads predict success probability, task completion %, garment type, and future keypoint distances + a Q-residual. Those become the advantage signal for RL, with no separate critic.
A fully asynchronous RL loop coordinated only through the HF Hub: 1 trainer (H200) ships a fresh checkpoint roughly every 40 min while N rollout workers (and a human doing teleop / DAgger corrections) collect data in parallel. Nobody waits: the setup exploits the off-policy nature of the loop to the fullest.
Binary success is too sparse, so I densify it into a per-frame advantage via GAE, computed from objective keypoint checkpoints, the success-probability value baseline, and completion %.
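As a rough illustration of the densification step, here is a minimal sketch of standard GAE over one episode. It assumes the per-frame reward has already been assembled (for example, a bonus when a keypoint checkpoint is reached plus the final binary success) and that the value baseline is the predicted success probability. The function name, the reward layout, and the `gamma` / `lam` defaults are assumptions for illustration, not the values used in the actual solution.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single episode.

    rewards: per-frame rewards, length T (assumed already shaped).
    values:  per-frame value estimates, length T + 1; the last entry is the
             bootstrap value after the final frame (0.0 for a finished episode).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each frame accumulates discounted future TD errors.
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


# Sparse success reward on the last frame, success probability as the baseline.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
success_prob = [0.2, 0.4, 0.7, 0.9, 0.0]
per_frame_advantage = gae_advantages(episode_rewards, success_prob)
```

With `lam=1` this reduces to return minus baseline, and with `lam=0` to the one-step TD error, so `lam` trades variance against bias in the per-frame signal.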
The RL combines AWR + RECAP. I also tune the inference knobs (execution length, playback speed, inpainting overlap, CFG scale, best-of-N) with a per-parameter Thompson-sampling bandit folded into rollout collection.
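One way to picture a per-parameter Thompson-sampling bandit is a Beta-Bernoulli bandit per knob, each updated with the episode's binary success. The sketch below is only that: the class name, the candidate grids, and the use of binary success as the bandit reward are all assumptions, and the actual solution may differ.

```python
import random


class KnobBandit:
    """Beta-Bernoulli Thompson sampling over the candidate values of one knob."""

    def __init__(self, values, rng=None):
        self.values = list(values)
        self.alpha = [1.0] * len(self.values)  # 1 + observed successes per value
        self.beta = [1.0] * len(self.values)   # 1 + observed failures per value
        self.rng = rng or random.Random()

    def sample(self):
        """Draw a success rate per value from its posterior, return the best index."""
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, index, success):
        if success:
            self.alpha[index] += 1.0
        else:
            self.beta[index] += 1.0


def run_episode(bandits, rollout_fn):
    """Pick one value per knob, run a rollout, and credit every chosen value."""
    picks = {name: bandit.sample() for name, bandit in bandits.items()}
    config = {name: bandits[name].values[i] for name, i in picks.items()}
    success = bool(rollout_fn(config))  # rollout_fn is a hypothetical callable
    for name, i in picks.items():
        bandits[name].update(i, success)
    return config, success


# Hypothetical candidate grids, for illustration only.
knob_bandits = {
    "execution_length": KnobBandit([8, 16, 24]),
    "cfg_scale": KnobBandit([1.0, 1.5, 2.0]),
}
```

Treating each knob independently keeps the number of arms small, at the cost of ignoring interactions between knobs.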
Round 2: I had only ~1 week and no access to the eval robot, so the pipeline was sim → my robot → their robot, leaning on heavy augmentation to make the policy more robust.
BEHAVIOR Challenge 1st Place: Solution Summary
My team recently won 1st place in the BEHAVIOR Challenge at NeurIPS. The competition focused on training a single policy to complete 50 long-horizon household tasks in simulation.
We built an end-to-end policy based on Pi0.5 with a bunch of custom modifications. Everything is open-sourced, and it should be useful for anyone exploring VLAs or adapting them to specific tasks.
Key Architecture Changes:
- Replaced language model with 50 trainable task embeddings (no text at all)
- Correlated noise for Flow Matching: ϵ ∼ N(0, 0.5I + 0.5Σ) using dataset action covariance
- Learnable mixed-layer attention: each action expert layer attends to a trainable mix of all VLM layers
- System 2 stage tracking: model predicts task stage, we smooth it with voting and feed it back as context
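To make the correlated-noise item concrete, here is a minimal NumPy sketch that samples ϵ ∼ N(0, 0.5I + 0.5Σ) through a Cholesky factor. It assumes Σ is the empirical covariance of flattened (and already normalized) action chunks. The function name and the flattening convention are assumptions, not taken from the released code.

```python
import numpy as np


def make_correlated_noise_sampler(flat_actions, beta=0.5):
    """Build a sampler for eps ~ N(0, (1 - beta) * I + beta * Sigma).

    flat_actions: array of shape (num_chunks, D), each row one flattened
                  action chunk (assumed layout). D must be at least 2.
    """
    sigma = np.cov(flat_actions, rowvar=False)         # (D, D) dataset covariance
    dim = sigma.shape[0]
    cov = (1.0 - beta) * np.eye(dim) + beta * sigma    # positive definite for beta < 1
    chol = np.linalg.cholesky(cov)                     # cov = chol @ chol.T

    def sample(rng, num_samples):
        white = rng.standard_normal((num_samples, dim))
        return white @ chol.T                          # rows have covariance cov

    return sample, cov


rng = np.random.default_rng(0)
# Toy "dataset": cumulative sums give strongly correlated action dimensions.
toy_actions = np.cumsum(rng.standard_normal((500, 4)), axis=1)
sample_noise, target_cov = make_correlated_noise_sampler(toy_actions)
eps = sample_noise(rng, 8)
```

The identity term keeps the mixture well conditioned even when the dataset covariance is close to singular.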
Training:
- Multi-sample Flow Matching: 15 FM samples per VLM pass to reduce gradient variance
- Delta action space + per-timestamp normalization
- FAST auxiliary loss and stage prediction loss
- Trained on 224×224 RGB + proprioception only
- We use 4 fine-tuned checkpoints, all derived from a multi-task model trained on all 50 tasks
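The delta action space with per-timestamp normalization could look roughly like the sketch below. It assumes that "delta" means each action in a chunk is expressed relative to the robot state at the start of the chunk, and that every timestep within the chunk gets its own mean and standard deviation. Both readings, and all function names, are assumptions.

```python
import numpy as np


def to_delta(chunks, start_states):
    """chunks: (N, H, D) absolute actions; start_states: (N, D) state at chunk start."""
    return chunks - start_states[:, None, :]


def fit_per_timestep_stats(deltas, eps=1e-6):
    """Separate mean/std for each (timestep, dimension) pair, shape (H, D)."""
    mean = deltas.mean(axis=0)
    std = deltas.std(axis=0) + eps  # eps avoids division by zero
    return mean, std


def normalize(deltas, mean, std):
    return (deltas - mean) / std


def denormalize(normalized, mean, std, start_states):
    """Invert normalization and the delta transform to recover absolute actions."""
    return normalized * std + mean + start_states[:, None, :]


rng = np.random.default_rng(0)
states = rng.standard_normal((64, 7))
# Toy chunks that drift away from the start state over a 30-step horizon.
chunks = states[:, None, :] + np.cumsum(rng.standard_normal((64, 30, 7)), axis=1)
deltas = to_delta(chunks, states)
mean, std = fit_per_timestep_stats(deltas)
normalized = normalize(deltas, mean, std)
recovered = denormalize(normalized, mean, std, states)
```

Later timesteps in a chunk drift further from the start state, so a single global scale would leave early steps tiny and late steps large. Per-timestep statistics even that out.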
Inference Optimizations:
- Soft inpainting: predict 30 actions, execute 26, use 4 as an input for the next chunk
- Correlation-aware guidance of inpainting to keep action chunks smooth
- 1.3× speedup via cubic spline compression
- General correction rule: reopen gripper after failed grasps
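The chunk bookkeeping behind soft inpainting (predict 30, execute 26, carry 4 forward) can be sketched as a small control loop. This shows only the scheduling. The soft guidance itself happens inside the flow-matching sampler and is not modeled here. `predict_chunk` is a hypothetical callable standing in for the policy.

```python
def run_chunked_control(predict_chunk, num_chunks, horizon=30, execute=26):
    """Execute the first `execute` actions of each chunk; pass the rest forward.

    predict_chunk(prefix) must return `horizon` actions. `prefix` is None for
    the first chunk, afterwards the unexecuted tail of the previous chunk,
    which the next prediction is expected to start from.
    """
    executed = []
    prefix = None
    for _ in range(num_chunks):
        chunk = list(predict_chunk(prefix))
        if len(chunk) != horizon:
            raise ValueError("policy must return exactly `horizon` actions")
        executed.extend(chunk[:execute])   # sent to the robot
        prefix = chunk[execute:]           # 4 actions used as input for the next chunk
    return executed


def toy_policy(prefix, horizon=30):
    """Stand-in policy: continues a counter, starting with the given prefix."""
    if prefix is None:
        return list(range(horizon))
    tail_start = prefix[-1] + 1
    return list(prefix) + list(range(tail_start, tail_start + horizon - len(prefix)))


trajectory = run_chunked_control(toy_policy, num_chunks=3)
```

Because the carried-over actions open the next chunk, consecutive chunks overlap in time, which is what lets the next prediction stay consistent with what was already planned.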
I am presenting the Decoder-Only Transformer (DOT) Policy, a simple Behavioral Control policy that outperforms SOTA models on two simple benchmark tasks:
- PushT (pushing an object to a goal): 84% success on keypoints, 74% on images (previous best: 75% / 69%)
- ALOHA Insert (precise bimanual insertion): 30% success (previous best: ~21%)
The best part? DOT is much smaller than previous SOTA models (sometimes with 100 times fewer parameters), trains faster, and avoids complexity:
- No generative models (Diffusion, VAE, GANs)
- No discretization/tokenization of actions
- No reinforcement learning or multi-stage training
- Just learns from human demos, plain and simple
This is still early: more complex real-life tasks need testing, and there are no guarantees it will actually work well there, but I think it's interesting to share. Sometimes, simpler approaches can be just as effective as complex ones, or even better.