Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies.
We present a pipeline that evaluates real-world robot policies in simulation using Gaussian Splatting-based rendering and soft-body digital twins.
To close the color gap, we optimize a color transformation that aligns GS colors to the real camera domain. We show a side-by-side comparison of the renderings before transformation (raw GS colors) and after transformation (aligned with the real cameras).
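The exact form of the color transformation is not detailed here; a minimal sketch, assuming a per-channel affine correction (a 3×3 matrix plus offset) fit by least squares on paired GS/real pixel samples, could look like the following (function names are illustrative, not the paper's implementation):

```python
import numpy as np

def fit_color_transform(gs_pixels: np.ndarray, real_pixels: np.ndarray):
    """Fit an affine color transform (3x3 matrix + offset) mapping GS RGB values
    to the real camera domain via least squares.

    gs_pixels, real_pixels: (N, 3) arrays of corresponding RGB values in [0, 1].
    """
    # Augment GS colors with a constant 1 so the solve includes an offset term.
    A = np.hstack([gs_pixels, np.ones((gs_pixels.shape[0], 1))])  # (N, 4)
    # Solve A @ X ~ real_pixels; rows 0-2 of X form the matrix, row 3 the offset.
    X, *_ = np.linalg.lstsq(A, real_pixels, rcond=None)
    M, b = X[:3].T, X[3]
    return M, b

def apply_color_transform(image: np.ndarray, M: np.ndarray, b: np.ndarray):
    """Apply the fitted transform to an (H, W, 3) rendering and clip to [0, 1]."""
    return np.clip(image @ M.T + b, 0.0, 1.0)
```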
To close the dynamics gap, we optimize the physical parameters of the soft-body digital twin to better match real-world behavior. We show a side-by-side comparison of the simulated results before optimization (w/o physics optimization) and after optimization (w/ physics optimization), along with the real-world ground truth; a sketch of the optimization loop follows the comparisons below. All three videos are generated using the same robot trajectory. Physics optimization significantly improves the alignment between simulation and reality.
Toy packing - w/o physics optimization (left) vs. w/ physics optimization (middle) vs. real world (right)
Rope routing - w/o physics optimization (left) vs. w/ physics optimization (middle) vs. real world (right)
T-block pushing - w/o physics optimization (left) vs. w/ physics optimization (middle) vs. real world (right)
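One way to implement this calibration is black-box optimization over the twin's physical parameters (e.g., stiffness, damping, friction), replaying the same recorded robot trajectory and minimizing the discrepancy between simulated and observed object states. The sketch below uses CMA-ES; the `sim` interface, parameter names, and loss are placeholders rather than the paper's actual implementation.

```python
import numpy as np
import cma  # pip install cma

def rollout_error(params, sim, robot_traj, real_obj_states):
    """Replay the recorded robot trajectory with the given soft-body parameters and
    return the mean distance between simulated and observed object states."""
    stiffness, damping, friction = params
    sim.reset(stiffness=stiffness, damping=damping, friction=friction)  # hypothetical interface
    errors = []
    for action, real_state in zip(robot_traj, real_obj_states):
        sim_state = sim.step(action)  # e.g., tracked 3D points on the deformable object
        errors.append(np.linalg.norm(sim_state - real_state, axis=-1).mean())
    return float(np.mean(errors))

def optimize_physics(sim, robot_traj, real_obj_states,
                     x0=(1e4, 1.0, 0.5), sigma0=0.3):
    """Search for soft-body parameters that best reproduce the real rollout."""
    # Optimize in log space so parameters spanning orders of magnitude stay positive.
    es = cma.CMAEvolutionStrategy(np.log(np.asarray(x0)), sigma0)
    while not es.stop():
        candidates = es.ask()
        losses = [rollout_error(np.exp(c), sim, robot_traj, real_obj_states)
                  for c in candidates]
        es.tell(candidates, losses)
    return np.exp(es.result.xbest)
```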
Finally, we evaluate policy performance in a closed-loop fashion by feeding the policy observations rendered by the simulator. As shown in the videos, when running the same policy from the same initial state, our simulation framework closely replicates the real-world outcomes; a sketch of this closed-loop rollout follows the videos below.
Toy packing - sim vs. real policy rollout
Rope routing - sim vs. real policy rollout
T-block pushing - sim vs. real policy rollout
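Conceptually, closed-loop evaluation renders an observation from the simulator, queries the policy, and steps the digital twin with the predicted action until success or a step budget is reached. A minimal sketch, with `sim`, `policy`, and the success check as hypothetical placeholders:

```python
def closed_loop_rollout(sim, policy, initial_state, max_steps=500):
    """Run a policy against the GS-rendered simulator and report task success.

    sim.render() stands for the photoreal GS observation from the evaluation camera;
    sim.step(action) advances the soft-body digital twin.
    """
    sim.reset(initial_state)
    policy.reset()
    for _ in range(max_steps):
        obs = sim.render()            # GS rendering (+ proprioception), matching the real setup
        action = policy.predict(obs)  # same policy interface as the real-robot deployment
        sim.step(action)
        if sim.task_success():        # task-specific check, e.g., the toy is inside the box
            return True
    return False
```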
In the figure above, we show the correlation between simulation and real-world policy performance. Left: Simulation success rates (y-axis) vs. real-world success rates (x-axis) for toy packing, rope routing, and T-block pushing, across multiple state-of-the-art imitation learning policies and checkpoints. The tight clustering along the diagonal indicates that, even with binary success metrics, our simulator faithfully reproduces real-world behaviors across tasks and policy robustness levels. Right: Compared with IsaacLab, which models the rope routing and T-block pushing tasks, our approach yields substantially stronger sim-to-real correlation, highlighting the benefit of realistic rendering and dynamics.
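One simple way to quantify sim-to-real correlation across checkpoints is the Pearson coefficient between paired success rates; the exact metric used may differ, and the numbers below are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-checkpoint success rates (fraction of successful episodes).
sim_success  = np.array([0.20, 0.45, 0.60, 0.80, 0.90])
real_success = np.array([0.25, 0.40, 0.65, 0.75, 0.85])

r, p_value = pearsonr(sim_success, real_success)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")  # r close to 1 => strong sim-to-real correlation
```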
In the figure above, we show per-policy, per-task performance across training (x-axis: training iterations; y-axis: success rate). Simulation (blue) and real-world (orange) success rates are shown across iterations. Improvements in simulation consistently correspond to improvements in the real world, establishing a positive correlation and demonstrating that our simulator can be a reliable tool for evaluating and selecting policies.
Real-world evaluation of visuomotor policies is known to exhibit high variance, making it difficult to draw reliable conclusions. To reduce variance and ensure rigor in our empirical evaluation, we follow the best practices suggested in [1, 2]. Specifically, we first sample object initial states in simulation and render them from the same camera viewpoint as the physical setup. A real-time visualizer overlays these simulated states onto the live camera stream (yellow-shaded video, left), enabling a human operator to manually adjust the objects to match the simulated configuration. This ensures that the initial states in simulation and reality are closely aligned. We then run the policy in both simulation and reality and evaluate its performance.
[1] TRI LBM Team, A careful examination of large behavior models for multitask dexterous manipulation, arXiv preprint arXiv:2507.05331, 2025.
[2] H. Kress-Gazit et al., Robot learning as an empirical science: Best practices for policy evaluation, arXiv preprint arXiv:2409.09491, 2024.
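The overlay itself can be as simple as alpha-blending the rendered target configuration (from the camera's calibrated pose) onto the live feed; a minimal OpenCV sketch, with the camera index, file name, and blend weight as illustrative choices:

```python
import cv2

def overlay_target_state(live_frame, rendered_frame, alpha=0.4):
    """Blend the simulated target configuration (rendered from the real camera's
    calibrated pose) onto the live camera frame so an operator can match them."""
    return cv2.addWeighted(rendered_frame, alpha, live_frame, 1.0 - alpha, 0.0)

cap = cv2.VideoCapture(0)                        # live evaluation camera
target = cv2.imread("target_initial_state.png")  # GS rendering of the sampled initial state
while True:
    ok, frame = cap.read()
    if not ok:
        break
    target_resized = cv2.resize(target, (frame.shape[1], frame.shape[0]))
    cv2.imshow("align objects to target", overlay_target_state(frame, target_resized))
    if cv2.waitKey(1) & 0xFF == ord("q"):        # quit once the objects match the overlay
        break
cap.release()
cv2.destroyAllWindows()
```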
To better understand the data distribution used for both policy training and evaluation, we visualize the coverage of initial states in each setting. In our tasks, evaluation states are sampled to align with the training distribution, ensuring a fair and consistent basis for comparison between simulation and reality.
Training initial state distribution
Evaluation initial state distribution
A collage of complete training demonstrations can be found in the individual dataset links below.
We release all artifacts required to reproduce our results in a single Hugging Face Collection: shashuo0104/real-to-sim-policy-eval.
The collection aggregates the datasets as well as the policy, PhysTwin, and GS model checkpoints for all three tasks.
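As a convenience, the artifacts can also be fetched programmatically with `huggingface_hub`; note that the full collection slug typically carries an ID suffix, so copy it from the collection page. A minimal sketch:

```python
from huggingface_hub import get_collection, snapshot_download

# The full slug usually ends with an ID suffix; copy it from the collection page.
collection = get_collection("shashuo0104/real-to-sim-policy-eval")

for item in collection.items:
    # Items include datasets and model checkpoints (policies, PhysTwin, GS) for all three tasks.
    if item.item_type not in ("model", "dataset"):
        continue
    local_dir = snapshot_download(repo_id=item.item_id, repo_type=item.item_type)
    print(f"Downloaded {item.item_type} {item.item_id} to {local_dir}")
```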
@article{zhang2025real,
title={Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions},
author={Zhang, Kaifeng and Sha, Shuo and Jiang, Hanxiao and Loper, Matthew and Song, Hyunjong and Cai, Guangyan and Xu, Zhuo and Hu, Xiaochen and Zheng, Changxi and Li, Yunzhu},
journal={arXiv preprint arXiv:2511.04665},
year={2025}
}