LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

LaST-R1 teaser figure
Overview of LaST-R1. (a) Unlike vanilla RL baselines that strictly optimize actions, (b) our approach uses LAPO to jointly optimize an adaptive latent CoT alongside physical execution. By bridging cognitive reasoning and control, LaST-R1 achieves (c) faster convergence and higher success rates in simulation, and (d) stronger generalization in real-world deployment.


Hao Chen1,2* Jiaming Liu2*† Zhonghao Yan2* Nuowei Han2* Renrui Zhang1† Chenyang Gu2 Jialin Gao1 Ziyu Guo1 Siyuan Qian2 Yinxi Wang2 Peng Jia3 Shanghang Zhang2✉ Pheng-Ann Heng1
1The Chinese University of Hong Kong 2State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University 3Simplexity Robotics
*Equal contribution  †Project lead  ✉Corresponding author

Headline Results

99.9%
Average success rate on LIBERO
(new state-of-the-art)
+22.5%
Average real-world improvement over SOTA full-size SFT
1 traj.
Single expert trajectory per task for warm-up
Generalization
Evaluated in both simulation (Sim) and real-world (Real) deployment

LIBERO Benchmark — Comparison with SFT and RL Baselines

Success Rate (%) across four task suites. Our method achieves the best result in every column and ranks #1 on every suite.

Model             Paradigm   Spatial   Object   Goal    Long    Average
OpenVLA           SFT        84.7      88.4     79.2    53.7    76.5
GR00T-N1          SFT        94.4      97.6     93.0    90.6    93.9
π0                SFT        96.8      98.8     95.8    85.2    94.2
π0.5              SFT        98.8      98.2     98.0    92.4    96.9
OpenVLA-OFT       SFT        97.6      98.4     97.9    94.5    97.1
GRAPE             RL         88.5      92.1     83.1    57.2    80.2
TGRPO             RL         90.4      92.2     81.0    59.2    80.7
VLA-RL            RL         90.2      91.8     82.2    59.8    81.0
SimpleVLA-RL      RL         98.2      98.7     98.8    91.7    96.9
RLinf-GRPO        RL         98.9      99.7     98.3    93.6    97.6
πRL               RL         99.6      100.0    99.6    94.0    98.3
LaST-R1 (Ours)    RL         99.8      100.0    100.0   99.8    99.9

Note: some baselines use a full-trajectory warm-up and/or two-camera-view training; the remaining methods use single-trajectory, single-view data, matching our setting.

Online RL learning curves on LIBERO
One-shot online RL learning curves on LIBERO: LaST-R1+LAPO (red) vs. the standard Action-Only+PPO baseline (blue) across all four task suites.
Faster Convergence

LaST-R1 trained with LAPO achieves faster convergence and higher success rates than Action-Only+PPO across the four LIBERO suites. The paper attributes this sample efficiency to latent CoT optimization: environmental rewards shape both reasoning embeddings and action sequences, smoothing online RL instead of updating only physical actions.
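The page describes LAPO only at this level; as a rough illustration of the idea that one reward-derived advantage shapes both the latent reasoning and the action outputs, a minimal PPO-style sketch is given below. The policy interface, the two log-probability heads, and the beta_latent weighting are assumptions for illustration, not the official implementation.

```python
import torch

def lapo_update(policy, optimizer, batch, clip_eps=0.2, beta_latent=0.5):
    """Hypothetical sketch of a joint latent-CoT + action PPO step.

    Assumes `policy` exposes two log-prob heads over a shared backbone:
    one for latent reasoning embeddings, one for action chunks, so the
    same advantage back-propagates the environmental reward into both.
    """
    obs, latents, actions = batch["obs"], batch["latents"], batch["actions"]
    old_logp_z, old_logp_a, adv = batch["old_logp_z"], batch["old_logp_a"], batch["adv"]

    # Re-evaluate log-probabilities of the stored latent CoT and actions.
    logp_z = policy.latent_log_prob(obs, latents)            # latent reasoning head (assumed)
    logp_a = policy.action_log_prob(obs, latents, actions)   # action head, conditioned on latents (assumed)

    def clipped_surrogate(logp, old_logp):
        ratio = torch.exp(logp - old_logp)
        return torch.min(ratio * adv,
                         torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)

    # Joint surrogate: the environmental reward shapes reasoning and execution together.
    loss = -(clipped_surrogate(logp_a, old_logp_a)
             + beta_latent * clipped_surrogate(logp_z, old_logp_z)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point the sketch tries to capture is that the latent CoT is treated as part of the policy's output, so the clipped surrogate updates the reasoning embeddings with the same reward signal that updates the actions, rather than optimizing physical actions alone.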

Higher Final Accuracy

At convergence, LaST-R1+LAPO ranks first across Spatial, Object, Goal, and Long in the LIBERO comparison table, reaching a 99.9% average success rate. The paper highlights the Long suite as the clearest gap, reporting 99.8% for LaST-R1 and 94.0% for πRL, consistent with stronger long-horizon manipulation.

Stronger Generalization

LaST-R1 achieves zero-shot generalization to unseen objects, backgrounds, and lighting conditions after RL post-training. The paper reports that LaST-R1 confines real-world performance drops to within 15% for novel objects and remains robust under background and lighting variation.

These trends follow the paper's result analysis: latent CoT optimization smooths the RL optimization landscape and enables sample-efficient online learning. Appendix analyses further examine adaptive reasoning lengths and optimized execution episode lengths.
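The adaptive reasoning-length behavior is only summarized here; one conceptual way to realize an environment-dependent reasoning horizon is a learned stop/continue head over latent reasoning steps, sketched below. The module structure, the GRU-style step function, and the 0.5 halting threshold are hypothetical choices for illustration, not the paper's mechanism.

```python
import torch
import torch.nn as nn

class AdaptiveLatentCoT(nn.Module):
    """Conceptual sketch: unroll latent reasoning steps until a learned
    halting head decides to stop (capped at max_steps)."""

    def __init__(self, dim: int, max_steps: int = 8):
        super().__init__()
        self.step_fn = nn.GRUCell(dim, dim)   # one latent reasoning step
        self.halt_head = nn.Linear(dim, 1)    # stop/continue decision per state
        self.max_steps = max_steps

    def forward(self, state_embedding: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(state_embedding)
        latents = []
        for _ in range(self.max_steps):
            z = self.step_fn(state_embedding, z)
            latents.append(z)
            # Environment-dependent horizon: easy states can halt early.
            if torch.sigmoid(self.halt_head(z)).mean() > 0.5:
                break
        return torch.stack(latents, dim=1)  # [batch, num_steps, dim]

# Example: different scene embeddings can yield different reasoning lengths.
model = AdaptiveLatentCoT(dim=64)
latent_cot = model(torch.randn(2, 64))
print(latent_cot.shape)
```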

Generalization analysis on LIBERO
Generalization analysis on LIBERO. While Action-Only+PPO stagnates on held-out tasks, LaST-R1 with LAPO demonstrates continuous OOD improvement.

Real-World Experiments

We deploy LaST-R1 on Franka Research 3 hardware across four manipulation tasks: one single-arm (Insert hexagon block) and three dual-arm (Open bag zipper, Wipe vase with sponge, Open bottle cap). LaST-R1 uses a few-shot SFT warm-up on 30 expert trajectories followed by LoRA-based online RL, and is compared with the SOTA VLA model π0.5 trained with full-size SFT on 100 expert trajectories. All policies are evaluated over 20 rollouts at varied tabletop positions under both the original and OOD settings.
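The LoRA-based online RL stage is described only at this level of detail; a minimal sketch of how adapters might be attached with Hugging Face PEFT is shown below, so that only the low-rank matrices are updated while the warm-started weights stay frozen. The backbone (a small GPT-2 stand-in), the rank, and the target modules are assumptions, since the released configuration is not given here.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical stand-in for the warm-started VLA backbone (the real checkpoint is not public here).
vla_backbone = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                       # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection; a VLA would list its own modules
)
policy = get_peft_model(vla_backbone, lora_cfg)
policy.print_trainable_parameters()  # only the adapter weights remain trainable for online RL
```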

Success Rate (%) — SOTA SFT vs. Few-shot SFT→RL

Numbers in parentheses show the drop from each method's Original result under OOD perturbations. LaST-R1 improves average Original success from 52.5% after warm-up to 93.75% after RL, surpassing π0.5 at 71.25%.

Insert hexagon block (single-arm)
Methods                      Original   Unseen Object   Unseen Background   Unseen Lighting
π0.5 (Full-size SFT)         65         35 (-30%)       55 (-10%)           40 (-25%)
LaST-R1 (Few-shot SFT→RL)    45→90      75 (-15%)       85 (-5%)            80 (-10%)

Open bag zipper (dual-arm)
Methods                      Original   Unseen Object   Unseen Background   Unseen Lighting
π0.5 (Full-size SFT)         75         30 (-45%)       70 (-5%)            60 (-15%)
LaST-R1 (Few-shot SFT→RL)    55→95      80 (-15%)       95 (-0%)            90 (-5%)

Wipe vase with sponge (dual-arm)
Methods                      Original   Unseen Object   Unseen Background   Unseen Lighting
π0.5 (Full-size SFT)         75         45 (-30%)       65 (-10%)           50 (-25%)
LaST-R1 (Few-shot SFT→RL)    65→95      80 (-15%)       90 (-5%)            95 (-0%)

Open bottle cap (dual-arm)
Methods                      Original   Unseen Object   Unseen Background   Unseen Lighting
π0.5 (Full-size SFT)         70         50 (-20%)       55 (-15%)           55 (-15%)
LaST-R1 (Few-shot SFT→RL)    45→95      95 (-0%)        80 (-15%)           85 (-10%)

Each cell averages 20 independent rollouts at varied tabletop positions. The standard deviation over three independent runs is 1.25% for LaST-R1 after RL and 4.5% for π0.5.
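As a quick sanity check, the average Original success rates quoted above can be reproduced directly from the per-task entries in the table:

```python
# Per-task "Original" success rates from the table above
# (Insert hexagon block, Open bag zipper, Wipe vase with sponge, Open bottle cap).
pi05     = [65, 75, 75, 70]   # π0.5, full-size SFT
warm_up  = [45, 55, 65, 45]   # LaST-R1 after few-shot SFT warm-up
after_rl = [90, 95, 95, 95]   # LaST-R1 after LAPO online RL

for name, vals in [("π0.5", pi05), ("SFT warm-up", warm_up), ("after RL", after_rl)]:
    print(f"{name}: {sum(vals) / len(vals):.2f}%")
# π0.5: 71.25%   SFT warm-up: 52.50%   after RL: 93.75%   (93.75 - 71.25 = +22.5 points)
```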

Real-world rollout videos for each task show the original setting and generalization to unseen objects, backgrounds, and lighting:

Open Bottle Cap (dual-arm)
Open Bag Zipper (dual-arm)
Wipe Vase with Sponge (dual-arm)
Insert Hexagon Block (single-arm)

Abstract

Robotic foundation models require reasoning over complex visual scenes to execute adaptive actions in dynamic environments. While recent studies on latent-reasoning Vision-Language-Action (VLA) models have demonstrated the capability to capture fine-grained physical dynamics, they remain predominantly confined to static imitation learning, severely limiting their adaptability and generalization.

In this paper, we present LaST-R1, a novel reinforcement learning (RL) post-training framework designed to effectively harness "latent reasoning-before-acting" policies. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a core RL algorithm that jointly optimizes the latent reasoning process and action generation. By explicitly embedding latent Chain-of-Thought (CoT) reasoning directly within the RL optimization loop, LAPO stimulates profound physical world modeling, which in turn drives robust execution in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced, allowing the policy to dynamically modulate its reasoning horizon based on diverse environment states.

Experiments show that LaST-R1 achieves a near-perfect 99.9% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art (SOTA) methods. In real-world deployments, LaST-R1 yields up to a 22.5% average improvement over a SOTA supervised fine-tuning approach across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.

BibTeX

@article{chen2026last,
  title={LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models},
  author={Chen, Hao and Liu, Jiaming and Yan, Zhonghao and Han, Nuowei and Zhang, Renrui and Gu, Chenyang and Gao, Jialin and Guo, Ziyu and Qian, Siyuan and Wang, Yinxi and others},
  journal={arXiv preprint arXiv:2604.28192},
  year={2026}
}