CRAFT: Video Diffusion for Bimanual Robot Data Generation

This website is not completed and we plan to add more videos. We uploaded a VERY ROUGH draft so that early reviewers can get a reference to our method. Thank you for spending your time to review our paper. We plan to complete the website by the end of March.

Anonymous Authors

IROS 2026

arXiv

Paper

Video Code

SCROLL

Method Videos

A quick tour of CRAFT generations across six augmentation axes.

Jump to Method

Object pose · Stack bowls

Lighting · Stack bowls

Cross-embodiment · Stack bowls

Object pose · Lift roller

Object pose · Place cans

Canny-edge conditioning

Beyond Franka · Background

Background · Stack bowls

Object color · Stack bowls

Wrist + 3rd · Stack bowls

Object color · Lift roller

Background · Place cans

Object distractors

Different reference image

Key Contributions

Canny-Edge Conditioning

A novel method utilizing Canny edge images as control input to condition video generative models, producing high-quality and diverse robot videos.

Unified Bimanual Pipeline

A new pipeline for bimanual cross-embodiment manipulation that performs six types of image augmentations in a single framework.

Strong Empirical Results

Simulation and real-world experiments demonstrating that policies trained on CRAFT-generated data significantly outperform baselines.

Method

Figure 2: CRAFT pipeline. (1) Trajectory Expansion via Real2Sim digital twin. (2) Video Generation with Canny-edge conditioning. (3) Augmented Dataset Construction across six axes. (4) Generated Dataset for policy training.

Trajectory Expansion

Real-world teleoperation data is collected, and a digital twin pipeline transfers objects and robot into simulation (Real2Sim) for large-scale data generation.

Video Generation

Simulation trajectories are rendered into source videos and converted into Canny-edge controls, then combined with a reference image + language instruction to condition a video diffusion model.

Augmented Dataset Construction

Generated videos cover six variation axes: object pose, lighting, object color, background, cross-embodiment, and wrist + third-person views.

Generated Dataset

Synthesized videos are paired with action labels from simulation trajectories to produce action-consistent demonstrations \(\mathcal{D}^{\text{gen}}\) for downstream ACT policy training.

Why Canny-Edges?

Raw Simulation Images

Retain too much low-level detail, causing the diffusion model to struggle with salient structural features such as gripper-object contact.

Canny-Edge Representations

Discard irrelevant details while preserving robot arm and object structure, giving clear guidance and allowing free variation of backgrounds, object colors, and lighting through prompting.

Stack Two Bowls — Canny-edge conditioned generation

The video above was produced by the video generation model using the language instruction below (together with Canny-edge control).

Language instruction used for this generation

Overhead shot with balanced, even lighting. Neutral background with clear
illumination across the task area. COMPLETELY STATIONARY background curtains and table - absolutely
no movement, no swaying, no wind, frozen and still like a photograph throughout the entire
video. The background fabric and table fabric remains perfectly static while only the robot arms move.

Two white industrial robotic arms with visible structural detail are positioned symmetrically.
Two light-colored shallow gray bowls are clearly visible on the dark fabric
surface with defined edges.

CRITICAL: The robotic grippers make firm, realistic contact with the bowl edges. Fingers
wrap securely around the bowl rims with no gaps or floating. Natural grasping mechanics
with proper finger placement and stable grip throughout the entire motion. Physical contact
remains consistent with no slipping or separation between gripper and bowl surfaces.

The scene has well-balanced lighting that clearly shows all objects - robot arms, bowls,
and surface - with good detail and natural contrast. No harsh shadows or overexposed areas.
Clean, professional appearance with sufficient light to see all action clearly. Sharp focus
on the manipulation task.

ONLY the robot arms and bowls move - all background elements including curtains, fabric,
and surfaces remain completely frozen and motionless throughout the entire sequence.

4K quality.

Experiment Setup

Video Generation

Generate photorealistic, action-consistent robot videos from simulator rollouts using a pre-trained diffusion model.

Wan2.1-Fun-Control 1.3B Canny-edge control 512×512 input

Policy Training & Evaluation

Train ACT on real + generated demonstrations and evaluate robustness under controlled distribution shifts.

ACT Policy RoboTwin Benchmark Success rate (%)

Core setup

Model

Wan2.1-Fun-Control 1.3B

Input

512×512

Policy

ACT

Benchmark

RoboTwin

Tasks

Lift Roller

Coordinated bimanual task where both arms simultaneously grasp and lift.

Place Cans in Plasticbox

Parallel task where both arms independently pick up cans and place them into a container.

Stack Two Bowls

Sequential task where two bowls must be stacked on top of each other in order.

Real-World Augmentation Results

Success rates (%). Each method evaluated under test conditions varying only along that dimension. CRAFT (Ours) uses 1000 generated demos + real-world collected demos. Cross-Embodiment: xArm7 → Franka Panda transfer.

Lighting

Background

Camera View

Object Color

Wrist + 3rd Person

Cross-Embodiment

Real-World Video Rollouts

Policy rollouts on physical hardware for each augmentation type. Policies are trained with CRAFT-generated data and evaluated under the corresponding test condition.

Object Pose — Unseen object positions

Lift Roller

Place Cans

Stack Bowls

Lighting — Unseen ambient illumination

Blue lighting

Green lighting

Object Color — Unseen object colors (trained on red, tested on gray)

Placeholder

Background — Unseen background scenes

Placeholder

Cross-Embodiment — xArm7 demos transferred to Franka

Placeholder

Wrist + 3rd Person — Multi-view policy execution

Placeholder

Currently showing teaser as placeholder. Add your real-world rollout videos to static/videos/rollouts/<augmentation_type>/ and update the source paths above.

Cross-Embodiment Generation

We enable cross-embodiment transfer by retargeting source-robot demonstrations to a target robot using forward and inverse kinematics, mapping end-effector poses to new joint configurations while preserving gripper actions. In our setup the source robot is the xArm7 and the target is the Franka Panda; we generate photorealistic videos for the target robot only, so xArm7 source demonstrations are not shown here. We plan to add videos of the real-world xArm7 demonstrations in a future update.

Sample 1

Sample 2

Sample 1

Sample 2

Sample 1

Sample 2

Stress Testing Video Generation Model

Additional stress tests of our video generation: generation beyond the Franka platform, with object distractors, and with different reference images. Click a tab to view each category.

Generation Beyond Franka

Examples of video generation beyond the default Franka setup: different robot arms (e.g. single-arm xArm7), backgrounds, and object appearances, alongside the original generation for comparison.

Single arm xArm7 with ocean background.

Pink generation object.

Original generation.

Object Distractor

Shows how we can generate random objects on the surface.

Object distractor example

Different Reference Image

Our reference image can be anything as it doesn’t have to be a black curtain. We removed the black curtain and show it's generation results. Below: the reference image given to the model and the generated video.

Reference image (input to model)

Generated video

Abstract

Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for coordinated dual-arm manipulation.

CRAFT: Video Diffusion for Bimanual Robot Data Generation

Method Videos

Key Contributions

Canny-Edge Conditioning

Unified Bimanual Pipeline

Strong Empirical Results

Method

Why Canny-Edges?

Raw Simulation Images

Canny-Edge Representations

Stack Two Bowls — Canny-edge conditioned generation

Experiment Setup

Video Generation

Policy Training & Evaluation

Tasks

Real-World Augmentation Results

Lighting

Background

Camera View

Object Color

Wrist + 3rd Person

Cross-Embodiment

Real-World Video Rollouts

Object Pose — Unseen object positions

Lighting — Unseen ambient illumination

Object Color — Unseen object colors (trained on red, tested on gray)

Background — Unseen background scenes

Cross-Embodiment — xArm7 demos transferred to Franka

Wrist + 3rd Person — Multi-view policy execution

Object Pose Generation

Lighting Generation

Object Color Generation

Background Generation

Cross-Embodiment Generation

Wrist + 3rd Person View Generation

Stress Testing Video Generation Model

Generation Beyond Franka

Object Distractor

Different Reference Image

Abstract

BibTeX