SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration

Yang Jin1, Jun Lv1,3, Han Xue1, Wendi Chen1,2, Chuan Wen1\(\dagger\), Cewu Lu1,2,3\(\dagger\)
1Shanghai Jiao Tong University, 2Shanghai Innovation Institute, 3Noematrix Ltd.
\(\dagger\)Equal Advising


By constraining exploration to the manifold of valid actions, our approach generates diverse yet temporally coherent behaviors, enabling structured and efficient exploration. The collected rollout data is then used to refine the policy, leading to sample-efficient policy self-improvement.
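To make this loop concrete, below is a minimal Python sketch of one self-improvement round in the spirit of the scheme above; base_policy, explorer, env, and finetune are hypothetical placeholders for illustration, not the released implementation.

def self_improvement_round(base_policy, explorer, env, finetune, num_rollouts=50):
    """One round: collect on-manifold exploration rollouts, then refine the policy."""
    successful = []
    for _ in range(num_rollouts):
        obs = env.reset()
        trajectory, done, info = [], False, {}
        while not done:
            # The exploration module samples a latent and decodes a valid,
            # temporally coherent action near the base policy's prediction.
            action = explorer.sample_action(base_policy, obs)
            trajectory.append((obs, action))
            obs, reward, done, info = env.step(action)
        if info.get("success", False):
            successful.append(trajectory)
    # Refine the policy on the successful rollouts (e.g., filtered behavior cloning),
    # yielding the improved policy used in the next round.
    return finetune(base_policy, successful)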

Abstract

Intelligent agents progress by continually refining their capabilities through active exploration of their environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy's performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement.

Core Idea

We perform exploration in a compact latent space that is optimized to preserve only task-essential information in the observations while discarding irrelevant details, ensuring that exploration remains constrained to the task-relevant manifold. Implemented as a plug-in module on top of existing imitation learning policies, SOE enables diverse action generation without compromising the base policy's performance.
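One natural way to realize this idea is sketched below: a small conditional latent head attached to a frozen base policy's observation features. This is an illustrative sketch only; the layer sizes, dimensions, and the VAE-style objective are our own assumptions, not the released SOE architecture.

import torch
import torch.nn as nn

class LatentExplorationHead(nn.Module):
    """Illustrative plug-in head: a compact latent over task-relevant action variation."""
    def __init__(self, obs_dim=64, action_dim=7, latent_dim=4):
        super().__init__()
        self.latent_dim = latent_dim
        # Encoder: compress observation features and the demonstrated action
        # into a compact latent capturing task-relevant variation.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder: map (observation features, latent sample) back to an action.
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, obs_feat, action):
        mean, log_var = self.encoder(torch.cat([obs_feat, action], dim=-1)).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * log_var).exp()  # reparameterization
        recon = self.decoder(torch.cat([obs_feat, z], dim=-1))
        return recon, mean, log_var

    @torch.no_grad()
    def propose(self, obs_feat, num_samples=8):
        # Exploration: sample latents from the prior and decode them into
        # diverse yet valid action proposals on the learned manifold.
        z = torch.randn(num_samples, self.latent_dim)
        obs_rep = obs_feat.expand(num_samples, -1)
        return self.decoder(torch.cat([obs_rep, z], dim=-1))

Under these assumptions, the head would be fit on demonstration data with a reconstruction-plus-KL objective so the latent retains only task-relevant variation; at rollout time, sampling different latents yields diverse action proposals while the base policy itself is left untouched.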

Experiments

We evaluate SOE on four simulated robot manipulation tasks from robomimic and three real-world tasks: mug hanging, toaster loading, and lamp capping. Extensive experiments demonstrate that our method is capable of generating diverse, structured action proposals that support effective exploration.



Compared to prior exploration methods, our approach consistently achieves higher success rates and smoother motions while requiring fewer rollouts, highlighting on-manifold exploration as a principled approach to sample-efficient robot policy self-improvement.


Our experiments also reveal that the learned latent space naturally disentangles task-relevant factors into several distinct dimensions, each corresponding to a specific pattern of variation. This property enables human-guided exploration, where users can steer exploration toward desired directions, further improving sample efficiency and controllability, as in the sketch below.
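As a hypothetical usage example, reusing the illustrative LatentExplorationHead above (this steering interface is ours for illustration, not the paper's API), a user could sweep one disentangled latent dimension while holding the others fixed to bias exploration toward a desired pattern of variation.

import torch

def steer_latent(head, obs_feat, dim_idx, values):
    """Decode actions while sweeping one latent dimension (others held at zero)."""
    z = torch.zeros(len(values), head.latent_dim)
    z[:, dim_idx] = torch.tensor(values)
    obs_rep = obs_feat.expand(len(values), -1)
    with torch.no_grad():
        return head.decoder(torch.cat([obs_rep, z], dim=-1))

# e.g., preview how latent dimension 2 changes the proposed action:
# actions = steer_latent(head, obs_feat, dim_idx=2, values=[-2.0, -1.0, 0.0, 1.0, 2.0])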

For more information, please read our paper.

Citation


@misc{jin2025soe,
      title={SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration}, 
      author={Yang Jin and Jun Lv and Han Xue and Wendi Chen and Chuan Wen and Cewu Lu},
      year={2025},
      eprint={2509.19292},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.19292}, 
}