VideoCAD: A Dataset and Model for Learning Long‑Horizon 3D CAD UI Interactions from Video

NeurIPS 2025 Datasets and Benchmarks Track

1Massachusetts Institute of Technology

*Equal contribution

VideoCAD — 41K+ Onshape CAD modeling videos with synchronized UI actions and target images.

Abstract

Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools.

In this work, we introduce VideoCAD, the first attempt at UI interaction learning for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated with an automated framework that collects high-fidelity UI action data from human-made CAD designs.

Compared with existing datasets, VideoCAD offers an order of magnitude more complexity in UI interaction learning for real-world engineering tasks, along with a significantly longer temporal horizon. We demonstrate two important downstream applications of VideoCAD: (1) learning UI interactions from a professional, precision 3D CAD tool and (2) a visual question answering (VQA) benchmark designed to evaluate multimodal large language models' (LLMs) spatial reasoning and video understanding abilities.

To learn UI interactions, we propose VideoCADFormer — a state-of-the-art model that learns CAD interactions directly from video and outperforms multiple behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multimodal and spatial reasoning, and long-horizon dependencies.


Video

Comparison with Other UI Interaction Datasets

| Dataset | # Samples | Avg. Length | 3D Reasoning | UI Actions | Visual Context |
|---|---|---|---|---|---|
| OSWorld | 369 | 15* | – | – | Image |
| Mind2Web | 2,350 | 7.3 | – | 1,135 | Image |
| WebArena | 812 | – | – | – | Image |
| AgentStudio | 304 | 30* | – | – | Image/Video |
| GUI-WORLD | 12,379 | 10.97 | – | – | Video |
| **VideoCAD** | **41,005** | **186** | ✓ | **6,740** | Video |

* Max length when average not reported
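Each VideoCAD sample pairs a screen recording with a time-aligned UI action trace and a rendering of the target CAD model. The sketch below illustrates what such a record might look like; the field names and file layout are hypothetical, not the dataset's actual on-disk format.

```python
from dataclasses import dataclass

# Hypothetical schema for one VideoCAD sample; names are illustrative.
@dataclass
class UIAction:
    t: float            # timestamp (seconds) into the video
    kind: str           # e.g. "click", "drag", "type"
    x: float            # normalized cursor x in [0, 1]
    y: float            # normalized cursor y in [0, 1]
    payload: str = ""   # typed text or pressed key, if any

@dataclass
class VideoCADSample:
    video_path: str         # screen recording of the Onshape session
    target_image_path: str  # rendering of the final CAD model
    actions: list           # time-aligned UIAction sequence

    def actions_between(self, t0: float, t1: float):
        """Actions falling inside a clip window [t0, t1)."""
        return [a for a in self.actions if t0 <= a.t < t1]

sample = VideoCADSample(
    video_path="recordings/0001.mp4",
    target_image_path="targets/0001.png",
    actions=[
        UIAction(0.4, "click", 0.12, 0.08),          # open sketch tool
        UIAction(2.1, "drag", 0.30, 0.55),           # draw a line
        UIAction(5.8, "type", 0.50, 0.40, "25 mm"),  # enter a dimension
    ],
)
print(len(sample.actions_between(0.0, 3.0)))  # actions in the first 3 s
```

Keeping actions as a flat, timestamped list makes it easy to slice clips for training windows without re-decoding the video.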

Dataset Generation Pipeline

VideoCAD pipeline

VideoCAD dataset pipeline: human-authored CAD sequences are converted into UI instructions and executed by a rule-based bot to record videos. Quality filtering (DINOv2), keyframe extraction, and action alignment produce structured video–action pairs.
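The quality-filtering step can be sketched as an embedding-similarity check: a recording is kept only if its final frame visually matches the target render. This is a minimal sketch assuming the embeddings come from a frozen DINOv2 image encoder; the threshold value and function names are illustrative, not taken from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_quality_filter(final_frame_emb, target_emb, threshold=0.9):
    """Keep a recording only if its last frame matches the target render.

    Embeddings are assumed to come from a frozen DINOv2 image encoder;
    the 0.9 threshold is illustrative, not the paper's actual value.
    """
    return cosine_similarity(final_frame_emb, target_emb) >= threshold

# Toy check with stand-in embeddings (a real pipeline would embed images).
target = np.array([1.0, 0.0, 0.0])
good_final = np.array([0.98, 0.05, 0.0])  # near-duplicate of the target
bad_final = np.array([0.1, 0.9, 0.2])     # recording drifted off-target
print(passes_quality_filter(good_final, target))  # True
print(passes_quality_filter(bad_final, target))   # False
```

Filtering on the final frame alone is cheap: one embedding per recording plus one per target, rather than per-frame comparisons.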

VideoCADFormer

Using VideoCAD, we develop VideoCADFormer, a causal autoregressive transformer model that predicts low-level CAD UI actions directly from images using behavior cloning.

Conditioned on a target CAD image and a window of past frames, the model outputs the next action and can plan long-horizon CAD sequences.
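The rollout loop described above can be sketched as follows. The model interface here is hypothetical: `predict_next_action` is a stub standing in for the causal transformer, and the window length is illustrative rather than the paper's value.

```python
from collections import deque

WINDOW = 4  # illustrative context-window length, not the paper's value

def predict_next_action(target_image, context):
    """Stub policy: a real model would run a causal transformer here,
    attending over the target image and the (frame, action) window."""
    step = len(context)  # deterministic stand-in behavior
    return {"kind": "click", "x": 0.1 * step, "y": 0.1 * step}

def rollout(target_image, first_frame, horizon=3):
    """Greedy autoregressive rollout over `horizon` steps."""
    context = deque(maxlen=WINDOW)  # sliding window of (frame, action)
    frame = first_frame
    actions = []
    for _ in range(horizon):
        action = predict_next_action(target_image, context)
        actions.append(action)
        context.append((frame, action))
        frame = f"frame_after_{action['kind']}"  # env-step placeholder
    return actions

acts = rollout(target_image="target.png", first_frame="frame0.png")
print(len(acts))        # 3
print(acts[0]["kind"])  # click
```

The fixed-size window keeps per-step cost bounded over long horizons, while the target image supplies the goal conditioning at every step.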

BibTeX

@inproceedings{manvideocad,
  title={VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video},
  author={Man, Brandon and Nehme, Ghadi and Alam, Md Ferdous and Ahmed, Faez},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025}
}