ComAct — Reframing Professional Software Manipulation via COM-as-Action Paradigm

Abstract

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. We identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action — a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. We introduce ComCADBench, the first benchmark for agents operating real industrial CAD software, and develop ComActor, a self-correcting agent trained through a progressive three-stage framework on ComForge, a scalable platform for training in Windows containers. ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmarks.

Overview

Figure 1: Comparison of GUI, API/MCP, and COM action spaces.

Left: Comparison of existing computer-use paradigms and our proposed ComAct paradigm. GUI-based agents rely on fragile visual grounding and suffer from long-horizon error accumulation, while API-based agents are constrained by fragmented and limited interfaces. In contrast, we leverage COM as a unified semantic programmatic interface, enabling executable program synthesis and cross-application workflows. Right: Overview of our ComAct framework, consisting of three components: a data construction pipeline that synthesizes verified instruction–code pairs; a three-stage progressive training framework (text-to-code SFT, agentic SFT, and GRPO with continuous geometric reward); and a scalable infrastructure supporting 1000+ parallel real Windows environments for training and evaluation.
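To make the third training stage more concrete, the sketch below shows how a continuous reward can feed the group-relative advantage used in GRPO, where each rollout is scored against the other rollouts sampled for the same instruction. The reward here is a bounding-box IoU chosen purely for illustration; the page does not specify how the geometric reward is computed, and `box_iou` and `group_relative_advantages` are hypothetical names rather than part of ComAct.

```python
from statistics import mean, pstdev


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax).

    Assumption: a stand-in for a continuous geometric reward, not the
    reward actually used to train ComActor.
    """
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampling group (the GRPO advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for one instruction, scored against a target part.
target = (0, 0, 0, 2, 2, 2)
rollouts = [
    (0, 0, 0, 2, 2, 2),  # exact match
    (0, 0, 0, 1, 2, 2),  # half of the target volume
    (3, 3, 3, 4, 4, 4),  # disjoint from the target
]
rewards = [box_iou(r, target) for r in rollouts]  # [1.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Because the reward is continuous, a partially correct model still receives a signal between the exact match and the complete miss, which a binary pass/fail reward would not provide.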

Results

Table 1. ComActor (9B) sets a new state of the art across every category on ComCADBench, beating GPT-5 and Claude-Sonnet-4.6 without few-shot prompting or retrieval augmentation — the gap is largest on long-horizon multi-task pipelines, where baseline performance collapses.

Figure 4: An execution trajectory of the agent completing a multi-task modeling and engineering drawing pipeline.

Figure 4. A real execution trajectory: the agent writes a COM script, reads back a traceback from the terminal, diagnoses and rewrites the failing step, then carries on to the next task and signals DONE once the part and dimensioned drawing are exported.
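The write, run, read-traceback, rewrite cycle in this trajectory can be sketched as a small loop. This is a minimal illustration under stated assumptions, not ComActor's implementation: `run_script`, `solve`, and `toy_policy` are hypothetical names, the "policy" is a hard-coded stub standing in for the model, and the scripts are plain Python rather than real COM calls so the example runs without CAD software installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(code, timeout=30.0):
    """Run a script in a fresh interpreter; return (succeeded, stdout + stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "step.py"
        path.write_text(code, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return proc.returncode == 0, proc.stdout + proc.stderr


def solve(task, propose, max_attempts=3):
    """Ask `propose` for a script, feeding back terminal output after each failure."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = propose(task, feedback)
        ok, output = run_script(code)
        if ok:
            return attempt, output
        feedback = output  # the traceback becomes context for the next attempt
    raise RuntimeError(f"no working script after {max_attempts} attempts")


def toy_policy(task, feedback):
    """Stub for the model: the first draft uses an undefined name, the second fixes it."""
    if "NameError" in feedback:
        return "radius = 5\nprint('DONE', radius * 2)\n"
    return "print('DONE', radius * 2)\n"


attempts, output = solve("draw a circle of radius 5", toy_policy)
```

In this toy run the first script fails with a `NameError`, the traceback is passed back to the stub, and the second script succeeds, mirroring the diagnose-and-rewrite step described in the caption.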

BibTeX

@misc{ai2026comactreframingprofessionalsoftware,
    title={ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm},
    author={Jiaxin Ai and Tao Hu and Xuemeng Yang and Shu Zou and Hairong Zhang and Daocheng Fu and Yu Yang and Hongbin Zhou and Nianchen Deng and Pinlong Cai and Zhongyuan Wang and Botian Shi and Kaipeng Zhang and Licheng Wen},
    year={2026},
    eprint={2606.13239},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2606.13239},
}
