Synthetic Dataset Generation for Industrial Robotics
The setup: Ricoh’s robotics group needed to put an object detector on a Universal Robots UR20, a six-axis industrial arm, in a warehouse-style assembly cell. The objects of interest were specific to the cell, the lighting was specific to the cell, and the labeled training data needed to reflect both. Real-world data collection is the obvious bottleneck for this kind of task: every pose and lighting variation costs hours of capture, and every relabeling cycle costs more.
The two-month brief was to architect and build a synthetic-data pipeline that could replace most of that physical capture. Detector ships on the arm; training data is grown on the GPU.
Pipeline overview
Four stages, end-to-end:
- Scene authoring. A 3D representation of the assembly cell (the robot, the workspace, the objects of interest, the camera mounting positions) built in OpenUSD as the interchange format. OpenUSD pulls double-duty as the asset format and the runtime scene graph; both Omniverse Kit and Isaac Sim consume USD natively, so the same scene description drives both rendering and physics simulation.
- Diffusion seeding. NVIDIA Cosmos, used as a conditioned diffusion foundation model, takes the simulated scene and generates photorealistic variants that retain the scene’s geometric ground truth. Each generated frame inherits the bounding boxes from the source scene without manual labeling.
- Auto-labeling. Because the boxes come from the scene graph rather than from a separate labeling pass, every generated frame is labeled at the source. There’s no human in the loop after the scene is authored.
- Detector training. A YOLOv11 architecture, modified for the specific class set and image geometry, trained on the auto-labeled corpus. The output detector is the artifact that ships back to the arm.
The chain looks linear, but the loop closes back at stage 1: any failure mode the trained detector exhibits in physical evaluation can be addressed by editing the scene authoring (more poses, different lighting envelopes, additional distractor objects) and regenerating.
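The auto-labeling claim rests on a simple geometric fact: if the scene graph knows every object's 3D extent and the camera's intrinsics, a 2D bounding box is just a projection, not an annotation. A minimal sketch of that idea with a pinhole camera model; the function name, focal length, and cube are illustrative stand-ins, not the pipeline's actual API or calibration:

```python
def project_bbox(corners_cam, f, cx, cy):
    """Project a 3D box's corners (camera frame, +z forward, metres)
    through a pinhole model and take the tight 2D axis-aligned box."""
    us = [f * x / z + cx for x, _, z in corners_cam]
    vs = [f * y / z + cy for _, y, z in corners_cam]
    return min(us), min(vs), max(us), max(vs)

# Illustrative: a unit cube centred 5 m in front of a 640x480 camera
# with a 500 px focal length (made-up numbers, not the cell's camera).
cube = [(x, y, z) for x in (-0.5, 0.5)
                  for y in (-0.5, 0.5)
                  for z in (4.5, 5.5)]
print(project_bbox(cube, 500.0, 320.0, 240.0))
# → roughly (264.4, 184.4, 375.6, 295.6)
```

Because the diffusion stage preserves the source scene's geometry, this projection stays valid for every generated variant, which is what makes "labeled at the source" work.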
Tooling choices
A few of the tooling decisions are worth flagging because they were load-bearing for the throughput target.
OpenUSD as the spine. Choosing USD as the scene representation early made the rest of the pipeline tractable. Asset interchange between Isaac Sim and Omniverse Kit just works when both are consuming the same USD. Without that, the scene authoring step would have multiplied with each downstream tool, and synchronizing changes across them would have eaten the iteration budget.
NVIDIA Cosmos for diffusion seeding. Cosmos is purpose-built for physical-world video and image generation, conditioned on simulation output. The alternative — training a domain-specific diffusion model from scratch — was outside the two-month budget. Using a foundation model for the seeding step let the engineering effort concentrate on scene authoring quality and the detector’s training loop, the two parts that actually determined output quality.
Isaac Sim for the simulation layer. Isaac Sim handles the physical-realism simulation of the cell (robot kinematics, object dynamics, sensor placement) and renders frames that get fed to the diffusion stage. Doing this in Isaac instead of a custom renderer means the simulation already understands the URDF for the UR20, so robot pose variations during data generation are physically valid by construction rather than by post-hoc filtering.
YOLOv11. The detector architecture choice was driven by inference latency on the deployment target and by the maturity of the training tooling. The output head was modified for the cell’s specific class set; the rest of the architecture stayed near-stock.
Scale
The pipeline produced over 10,000 auto-labeled synthetic RGB frames in the operational run. The compute backend was Google Cloud VMs with NVIDIA H100 GPUs; the seed-and-render step is the bottleneck, and its throughput scales directly with GPU performance. Storage and orchestration ran in Docker on GCP.
What I learned
A few things I’d carry forward:
Auto-labeling at the scene-authoring layer changes the iteration economics. The traditional synthetic-data flow is “render, then label”, which retains a labeling bottleneck even when the rendering itself is free. Carrying the bounding boxes through from the scene graph removes the bottleneck entirely. The cost per new variant becomes the cost of editing the scene, not the cost of labeling.
Diffusion-foundation seeding closes most of the realism gap that a physics-based renderer struggles with. The texture, lighting, and material variation that a hand-tuned shader has trouble expressing comes nearly free out of a conditioned diffusion model. The tradeoff is some loss of strict physical correctness in the rendered frames, which is fine for object detection but would matter for tasks that depend on photometric measurements (depth from shading, polarimetric inference, anything that treats pixel intensity as a calibrated quantity).
OpenUSD is the right interchange format if you’re touching more than one piece of NVIDIA’s robotics stack. The temptation to roll your own scene format dies the first time you have to round-trip a robot definition between two tools. USD also gave the scene a structure that the team could reason about visually instead of through code diffs.
The detector’s failure modes are debuggable through the scene authoring. Most of the time the detector regressed on a class, the fix lived upstream: a missing pose distribution, a lighting envelope that didn’t match the cell, a distractor object that wasn’t represented. Treating the scene graph as the locus of debugging (rather than the model weights) was the right mental model and saved a lot of training cycles.
The pipeline shipped. The detector, trained on the synthetic corpus, is deployed on the arm and performs the task. The longer-term value is the pipeline itself: each new object class for the cell now costs orders of magnitude less to bring up than a real-world capture campaign would.