Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies.
We present a pipeline that evaluates real-world robot policies in simulation using Gaussian Splatting-based rendering and soft-body digital twins.
To close the color gap, we optimize a color transformation that aligns GS colors to the real camera domain. We show a side-by-side comparison of the renderings before transformation (raw GS colors) and after transformation (aligned with the real cameras).
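The exact form of the color transformation is not detailed here; a minimal sketch, assuming a per-channel affine correction (a 3×3 matrix plus offset) fit by least squares on paired GS/real pixel samples, could look like the following (function names are illustrative, not the paper's implementation):

```python
import numpy as np

def fit_color_transform(gs_pixels: np.ndarray, real_pixels: np.ndarray):
    """Fit an affine color transform (3x3 matrix + offset) mapping GS RGB values
    to the real camera domain via least squares.

    gs_pixels, real_pixels: (N, 3) arrays of corresponding RGB values in [0, 1].
    """
    # Augment GS colors with a constant 1 so the solve includes an offset term.
    A = np.hstack([gs_pixels, np.ones((gs_pixels.shape[0], 1))])  # (N, 4)
    # Solve A @ X ~ real_pixels; rows 0-2 of X form the matrix, row 3 the offset.
    X, *_ = np.linalg.lstsq(A, real_pixels, rcond=None)
    M, b = X[:3].T, X[3]
    return M, b

def apply_color_transform(image: np.ndarray, M: np.ndarray, b: np.ndarray):
    """Apply the fitted transform to an (H, W, 3) rendering and clip to [0, 1]."""
    return np.clip(image @ M.T + b, 0.0, 1.0)
```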
To close the dynamics gap, we optimize the physical parameters of the soft-body digital twin to better match real-world behavior. We show a side-by-side comparison of the simulated results before optimization (w/o physics optimization) and after optimization (w/ physics optimization), along with the real-world ground truth; a sketch of the optimization loop follows the comparisons below. All three videos are generated using the same robot trajectory. Physics optimization significantly improves the alignment between simulation and reality.
Toy packing - w/o physics optimization (left) vs. w/ physics optimization (middle) vs. real world (right)
Rope routing - w/o physics optimization (left) vs. w/ physics optimization (middle) vs. real world (right)
T-block pushing - w/o physics optimization (left) vs. w/ physics optimization (middle) vs. real world (right)
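One way to implement this calibration is black-box optimization over the twin's physical parameters (e.g., stiffness, damping, friction), replaying the same recorded robot trajectory and minimizing the discrepancy between simulated and observed object states. The sketch below uses CMA-ES; the `sim` interface, parameter names, and loss are placeholders rather than the paper's actual implementation.

```python
import numpy as np
import cma  # pip install cma

def rollout_error(params, sim, robot_traj, real_obj_states):
    """Replay the recorded robot trajectory with the given soft-body parameters and
    return the mean distance between simulated and observed object states."""
    stiffness, damping, friction = params
    sim.reset(stiffness=stiffness, damping=damping, friction=friction)  # hypothetical interface
    errors = []
    for action, real_state in zip(robot_traj, real_obj_states):
        sim_state = sim.step(action)  # e.g., tracked 3D points on the deformable object
        errors.append(np.linalg.norm(sim_state - real_state, axis=-1).mean())
    return float(np.mean(errors))

def optimize_physics(sim, robot_traj, real_obj_states,
                     x0=(1e4, 1.0, 0.5), sigma0=0.3):
    """Search for soft-body parameters that best reproduce the real rollout."""
    # Optimize in log space so parameters spanning orders of magnitude stay positive.
    es = cma.CMAEvolutionStrategy(np.log(np.asarray(x0)), sigma0)
    while not es.stop():
        candidates = es.ask()
        losses = [rollout_error(np.exp(c), sim, robot_traj, real_obj_states)
                  for c in candidates]
        es.tell(candidates, losses)
    return np.exp(es.result.xbest)
```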
Finally, we evaluate policy performance in a closed-loop fashion by feeding the policy observations rendered by the simulator. As shown in the videos, when running the same policy from the same initial state, our simulation framework closely replicates the real-world outcomes; a sketch of this closed-loop rollout follows the videos below.
Toy packing - sim vs. real policy rollout
Rope routing - sim vs. real policy rollout
T-block pushing - sim vs. real policy rollout
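Conceptually, closed-loop evaluation renders an observation from the simulator, queries the policy, and steps the digital twin with the predicted action until success or a step budget is reached. A minimal sketch, with `sim`, `policy`, and the success check as hypothetical placeholders:

```python
def closed_loop_rollout(sim, policy, initial_state, max_steps=500):
    """Run a policy against the GS-rendered simulator and report task success.

    sim.render() stands for the photoreal GS observation from the evaluation camera;
    sim.step(action) advances the soft-body digital twin.
    """
    sim.reset(initial_state)
    policy.reset()
    for _ in range(max_steps):
        obs = sim.render()            # GS rendering (+ proprioception), matching the real setup
        action = policy.predict(obs)  # same policy interface as the real-robot deployment
        sim.step(action)
        if sim.task_success():        # task-specific check, e.g., the toy is inside the box
            return True
    return False
```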
In the figure above, we show the correlation between simulation and real-world policy performance. Left: Simulation success rates (y-axis) vs. real-world success rates (x-axis) for toy packing, rope routing, and T-block pushing, across multiple state-of-the-art imitation learning policies and checkpoints. The tight clustering along the diagonal indicates that, even with binary success metrics, our simulator faithfully reproduces real-world behaviors across tasks and policy robustness levels. Right: Compared with IsaacLab, which models the rope routing and T-block pushing tasks, our approach yields substantially stronger sim-to-real correlation, highlighting the benefit of realistic rendering and dynamics.
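One simple way to quantify sim-to-real correlation across checkpoints is the Pearson coefficient between paired success rates; the exact metric used may differ, and the numbers below are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-checkpoint success rates (fraction of successful episodes).
sim_success  = np.array([0.20, 0.45, 0.60, 0.80, 0.90])
real_success = np.array([0.25, 0.40, 0.65, 0.75, 0.85])

r, p_value = pearsonr(sim_success, real_success)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")  # r close to 1 => strong sim-to-real correlation
```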
In the figure above, we show per-policy, per-task performance across training (x-axis: training iterations; y-axis: success rate). Simulation (blue) and real-world (orange) success rates are shown across iterations. Improvements in simulation consistently correspond to improvements in the real world, establishing a positive correlation and demonstrating that our simulator can be a reliable tool for evaluating and selecting policies.
Real-world evaluation of visuomotor policies is known to exhibit high variance, making it difficult to draw reliable conclusions. To reduce variance and ensure rigor in our empirical evaluation, we follow the best practices suggested in [1, 2]. Specifically, we first sample object initial states in simulation and render them from the same camera viewpoint as the physical setup. A real-time visualizer overlays these simulated states onto the live camera stream (yellow-shaded video, left), enabling a human operator to manually adjust the objects to match the simulated configuration. This ensures that the initial states in simulation and reality are closely aligned. We then run the policy in both simulation and reality and evaluate its performance.
[1] TRI LBM Team, A careful examination of large behavior models for multitask dexterous manipulation, arXiv preprint arXiv:2507.05331, 2025.
[2] H. Kress-Gazit et al., Robot learning as an empirical science: Best practices for policy evaluation, arXiv preprint arXiv:2409.09491, 2024.
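The overlay itself can be as simple as alpha-blending the rendered target configuration (from the camera's calibrated pose) onto the live feed; a minimal OpenCV sketch, with the camera index, file name, and blend weight as illustrative choices:

```python
import cv2

def overlay_target_state(live_frame, rendered_frame, alpha=0.4):
    """Blend the simulated target configuration (rendered from the real camera's
    calibrated pose) onto the live camera frame so an operator can match them."""
    return cv2.addWeighted(rendered_frame, alpha, live_frame, 1.0 - alpha, 0.0)

cap = cv2.VideoCapture(0)                        # live evaluation camera
target = cv2.imread("target_initial_state.png")  # GS rendering of the sampled initial state
while True:
    ok, frame = cap.read()
    if not ok:
        break
    target_resized = cv2.resize(target, (frame.shape[1], frame.shape[0]))
    cv2.imshow("align objects to target", overlay_target_state(frame, target_resized))
    if cv2.waitKey(1) & 0xFF == ord("q"):        # quit once the objects match the overlay
        break
cap.release()
cv2.destroyAllWindows()
```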
To better understand the data distribution used for both policy training and evaluation, we visualize the coverage of initial states in each setting. In our tasks, evaluation states are sampled to align with the training distribution, ensuring a fair and consistent basis for comparison between simulation and reality.
Training initial state distribution
Evaluation initial state distribution
A collage of complete training demonstrations can be found in the individual dataset links below.
We release all artifacts required to reproduce our results in a single Hugging Face Collection: shashuo0104/real-to-sim-policy-eval.
The collection aggregates the datasets as well as the policy, PhysTwin, and GS model checkpoints for all three tasks.
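As a convenience, the artifacts can also be fetched programmatically with `huggingface_hub`; note that the full collection slug typically carries an ID suffix, so copy it from the collection page. A minimal sketch:

```python
from huggingface_hub import get_collection, snapshot_download

# The full slug usually ends with an ID suffix; copy it from the collection page.
collection = get_collection("shashuo0104/real-to-sim-policy-eval")

for item in collection.items:
    # Items include datasets and model checkpoints (policies, PhysTwin, GS) for all three tasks.
    if item.item_type not in ("model", "dataset"):
        continue
    local_dir = snapshot_download(repo_id=item.item_id, repo_type=item.item_type)
    print(f"Downloaded {item.item_type} {item.item_id} to {local_dir}")
```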
@article{zhang2025real,
title={Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions},
author={Zhang, Kaifeng and Sha, Shuo and Jiang, Hanxiao and Loper, Matthew and Song, Hyunjong and Cai, Guangyan and Xu, Zhuo and Hu, Xiaochen and Zheng, Changxi and Li, Yunzhu},
journal={arXiv preprint arXiv:2511.04665},
year={2025}
}