SIME: Enhancing Policy Self-Improvement with Modal-level Exploration

1Shanghai Jiao Tong University, 2Shanghai Innovation Institution, 3Noematrix Ltd.
*Equal Contribution


With modal-level exploration, the robot can generate more diverse and multi-modal interaction data.
By learning from the most valuable trials and high-quality segments from these interactions,
the robot can effectively refine its capabilities through self-improvement.

Abstract

Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost.

Overview

Starting from a robot policy learned from human-provided demonstrations,
SIME enables the robot to explore multi-modal interaction behaviors in the reasoning space,
collect and select the most valuable trajectories and segments, and refine the policy.
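The explore → select → refine loop described above can be sketched as a toy, self-contained Python example. Everything here (the scalar "environment", the Gaussian action noise standing in for exploration, the success criterion, and helper names like `rollout`, `select_trials`, and `refine`) is our illustrative simplification, not the authors' implementation; in particular, SIME explores in the policy's own sampling space rather than by adding action noise.

```python
import random

# Toy sketch of one self-improvement loop: explore, select, refine.
# All names and numbers are illustrative, not the authors' code.

def rollout(policy_mean, explore_scale, target, rng):
    """Roll out once: sample an action with exploration noise; succeed if near target."""
    action = policy_mean + rng.gauss(0.0, explore_scale)
    return {"action": action, "success": abs(action - target) < 0.2}

def select_trials(trials):
    """Data selection: keep only the most valuable (here, successful) trials."""
    return [t for t in trials if t["success"]]

def refine(policy_mean, selected, lr=0.5):
    """Refine the policy toward the selected self-collected data."""
    if not selected:
        return policy_mean
    mean_action = sum(t["action"] for t in selected) / len(selected)
    return policy_mean + lr * (mean_action - policy_mean)

def self_improve(policy_mean, target, rounds=5, n_rollouts=200, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        trials = [rollout(policy_mean, 0.5, target, rng) for _ in range(n_rollouts)]
        policy_mean = refine(policy_mean, select_trials(trials))
    return policy_mean

# Starting far from the target, the policy drifts toward it over rounds,
# learning only from its own (selected) interaction data.
final = self_improve(policy_mean=0.0, target=1.0)
```

The point of the sketch is the loop structure, not the toy dynamics: exploration produces data the current policy would not otherwise generate, and selection decides which of it is worth imitating.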

Key Findings

Previous works and our own experiments show that naively fine-tuning a policy on self-collected interaction data is often ineffective: imitation-learning policies tend to produce repetitive, near-deterministic behaviors, so the collected data has limited diversity.


Our key finding in this paper is that introducing modal-level exploration during interaction can significantly increase the diversity of the self-collected data and improve self-improvement effectiveness.

For example, in the real-world cup-stacking task, without modal-level exploration the original Diffusion Policy outputs low-diversity action sequences and often fails to complete the task even when given multiple attempts.


With modal-level exploration, by contrast, the same policy model discovers solutions that are valuable for the next round of learning.


We were also surprised to find that once modal-level exploration is introduced, clearly multi-modal behavior emerges: starting from the same initial state, the robot grabs the cup from different sides.
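A toy example (our own construction, not the authors' code) illustrates why exploring in the model's sampling space recovers multiple modes. We mimic a diffusion-style policy whose output mode is decided by its initial latent `z`: the "denoiser" pulls `z` toward the nearest of two valid grasp actions, so reusing a fixed latent collapses every rollout into one mode, while resampling it recovers both.

```python
import random

LEFT, RIGHT = -1.0, 1.0  # two valid ways to grab the cup (toy action space)

def denoise(z, steps=20):
    """Toy deterministic denoiser: converge to the mode nearest the initial latent."""
    target = LEFT if z < 0 else RIGHT
    for _ in range(steps):
        z = z + 0.5 * (target - z)
    return z

def rollout_actions(n, explore, rng):
    actions = []
    for _ in range(n):
        # Without modal-level exploration the latent is effectively fixed,
        # so every rollout lands in the same mode; with it, we resample.
        z0 = rng.gauss(0.0, 1.0) if explore else 0.3
        actions.append(denoise(z0))
    return actions

rng = random.Random(0)
without = {round(a) for a in rollout_actions(100, explore=False, rng=rng)}
with_exp = {round(a) for a in rollout_actions(100, explore=True, rng=rng)}
# `without` collapses to a single grasp mode; `with_exp` covers both.
```

The same policy network (here, the same `denoise` function) produces either one behavior or two, depending only on how its sampling space is explored.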


By training the policy with self-collected data and performing only one round of self-improvement, SIME outperforms the baseline method by a significant margin on both state-based and image-based tasks, in both simulation and the real world.


As the number of iterations increases, SIME's advantage over the baseline continues to grow.


We conduct a thorough ablation study to validate the effectiveness of several key components in SIME. Our studies also highlight the critical role of data selection in self-improvement.


For more information, please read our paper.

Citation


@misc{jin2025sime,
    title={SIME: Enhancing Policy Self-Improvement with Modal-level Exploration}, 
    author={Yang Jin and Jun Lv and Wenye Yu and Hongjie Fang and Yong-Lu Li and Cewu Lu},
    year={2025},
    eprint={2505.01396},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2505.01396}, 
}