VideoCAD: A Dataset and Model for Learning Long‑Horizon 3D CAD UI Interactions from Video

NeurIPS 2025 Datasets and Benchmarks Track

1Massachusetts Institute of Technology

*Equal contribution

VideoCAD — 41K+ Onshape CAD modeling videos with synchronized UI actions and target images.

Abstract

Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools.

In this work, we introduce VideoCAD, the first attempt at UI interaction learning for precision engineering tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated with an automated framework that collects high-fidelity UI action data from human-made CAD designs.

Compared with existing datasets, VideoCAD offers an order of magnitude more complexity in UI interaction learning for real-world engineering tasks, along with a significantly longer temporal horizon. We demonstrate two important downstream applications of VideoCAD: (1) learning UI interactions from a professional, precision 3D CAD tool and (2) a visual question answering (VQA) benchmark designed to evaluate multimodal large language models' (LLMs) spatial reasoning and video understanding abilities.

To learn UI interactions, we propose VideoCADFormer — a state-of-the-art model that learns CAD interactions directly from video and outperforms multiple behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multimodal and spatial reasoning, and long-horizon dependencies.


Video

Comparison with Other UI Interaction Datasets

| Dataset | # Samples | Avg. Length | 3D Reasoning | UI Actions | Visual Context |
|---|---|---|---|---|---|
| OSWorld | 369 | 15* | – | – | Image |
| Mind2Web | 2,350 | 7.3 | – | 1,135 | Image |
| WebArena | 812 | – | – | – | Image |
| AgentStudio | 304 | 30* | – | – | Image/Video |
| GUI-WORLD | 12,379 | 10.97 | – | – | Video |
| **VideoCAD** | **41,005** | **186** | ✓ | **6,740** | Video |

* Max length when average not reported
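Each VideoCAD sample pairs a screen recording with a time-aligned UI action trace and a rendering of the target CAD model. The sketch below illustrates what such a record might look like; the field names and file layout are hypothetical, not the dataset's actual on-disk format.

```python
from dataclasses import dataclass

# Hypothetical schema for one VideoCAD sample; names are illustrative.
@dataclass
class UIAction:
    t: float            # timestamp (seconds) into the video
    kind: str           # e.g. "click", "drag", "type"
    x: float            # normalized cursor x in [0, 1]
    y: float            # normalized cursor y in [0, 1]
    payload: str = ""   # typed text or pressed key, if any

@dataclass
class VideoCADSample:
    video_path: str         # screen recording of the Onshape session
    target_image_path: str  # rendering of the final CAD model
    actions: list           # time-aligned UIAction sequence

    def actions_between(self, t0: float, t1: float):
        """Actions falling inside a clip window [t0, t1)."""
        return [a for a in self.actions if t0 <= a.t < t1]

sample = VideoCADSample(
    video_path="recordings/0001.mp4",
    target_image_path="targets/0001.png",
    actions=[
        UIAction(0.4, "click", 0.12, 0.08),          # open sketch tool
        UIAction(2.1, "drag", 0.30, 0.55),           # draw a line
        UIAction(5.8, "type", 0.50, 0.40, "25 mm"),  # enter a dimension
    ],
)
print(len(sample.actions_between(0.0, 3.0)))  # actions in the first 3 s
```

Keeping actions as a flat, timestamped list makes it easy to slice clips for training windows without re-decoding the video.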

Dataset Generation Pipeline

VideoCAD pipeline

VideoCAD dataset pipeline: human-authored CAD sequences are converted into UI instructions and executed by a rule-based bot to record videos. Quality filtering (DINOv2), keyframe extraction, and action alignment produce structured video–action pairs.
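The quality-filtering step can be sketched as an embedding-similarity check: a recording is kept only if its final frame visually matches the target render. This is a minimal sketch assuming the embeddings come from a frozen DINOv2 image encoder; the threshold value and function names are illustrative, not taken from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_quality_filter(final_frame_emb, target_emb, threshold=0.9):
    """Keep a recording only if its last frame matches the target render.

    Embeddings are assumed to come from a frozen DINOv2 image encoder;
    the 0.9 threshold is illustrative, not the paper's actual value.
    """
    return cosine_similarity(final_frame_emb, target_emb) >= threshold

# Toy check with stand-in embeddings (a real pipeline would embed images).
target = np.array([1.0, 0.0, 0.0])
good_final = np.array([0.98, 0.05, 0.0])  # near-duplicate of the target
bad_final = np.array([0.1, 0.9, 0.2])     # recording drifted off-target
print(passes_quality_filter(good_final, target))  # True
print(passes_quality_filter(bad_final, target))   # False
```

Filtering on the final frame alone is cheap: one embedding per recording plus one per target, rather than per-frame comparisons.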

VideoCADFormer

Using VideoCAD, we develop VideoCADFormer, a causal autoregressive transformer model that predicts low-level CAD UI actions directly from images using behavior cloning.

Conditioned on a target CAD image and a window of past frames, the model outputs the next action and can plan long-horizon CAD sequences.
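The rollout loop described above can be sketched as follows. The model interface here is hypothetical: `predict_next_action` is a stub standing in for the causal transformer, and the window length is illustrative rather than the paper's value.

```python
from collections import deque

WINDOW = 4  # illustrative context-window length, not the paper's value

def predict_next_action(target_image, context):
    """Stub policy: a real model would run a causal transformer here,
    attending over the target image and the (frame, action) window."""
    step = len(context)  # deterministic stand-in behavior
    return {"kind": "click", "x": 0.1 * step, "y": 0.1 * step}

def rollout(target_image, first_frame, horizon=3):
    """Greedy autoregressive rollout over `horizon` steps."""
    context = deque(maxlen=WINDOW)  # sliding window of (frame, action)
    frame = first_frame
    actions = []
    for _ in range(horizon):
        action = predict_next_action(target_image, context)
        actions.append(action)
        context.append((frame, action))
        frame = f"frame_after_{action['kind']}"  # env-step placeholder
    return actions

acts = rollout(target_image="target.png", first_frame="frame0.png")
print(len(acts))        # 3
print(acts[0]["kind"])  # click
```

The fixed-size window keeps per-step cost bounded over long horizons, while the target image supplies the goal conditioning at every step.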

BibTeX

@inproceedings{manvideocad,
  title={VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video},
  author={Man, Brandon and Nehme, Ghadi and Alam, Md Ferdous and Ahmed, Faez},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025}
}