AC-DiT: Adaptive Coordination Diffusion Transformer

AC-DiT is the official codebase for our NeurIPS 2025 paper on an end-to-end Vision-Language-Action (VLA) model for mobile manipulation.

Overview

AC-DiT proposes an adaptive coordination diffusion transformer framework for mobile manipulation, featuring:

  • Two-stage action generation: Coarse prediction followed by diffusion-based refinement
  • Multimodal input processing: Jointly processing vision, language, and proprioceptive inputs
  • End-to-end training: From perception to action generation in a unified framework
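The two-stage pipeline can be sketched as follows. This is a minimal NumPy illustration, not the actual model: the linear coarse head and the fixed denoising rule are placeholders standing in for the learned coarse policy and the diffusion transformer, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (the real model consumes vision, language,
# and proprioceptive tokens via a transformer encoder).
OBS_DIM, ACT_DIM, STEPS = 32, 7, 10

# Stage 1: coarse action prediction.
# A random linear head stands in for the learned coarse policy.
W_coarse = rng.normal(scale=0.1, size=(OBS_DIM, ACT_DIM))

def coarse_predict(obs):
    return obs @ W_coarse

# Stage 2: diffusion-style refinement.
# The coarse action is iteratively denoised; the residual "eps"
# here is a fixed placeholder for the diffusion transformer's
# predicted noise at each timestep.
def refine(action, obs, steps=STEPS):
    for t in range(steps, 0, -1):
        eps = 0.1 * action            # placeholder noise prediction
        action = action - eps / steps  # one denoising update
    return action

obs = rng.normal(size=OBS_DIM)
coarse = coarse_predict(obs)
refined = refine(coarse, obs)
```

In the real framework both stages are trained jointly end to end; the sketch only shows the control flow of coarse prediction followed by iterative refinement.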

My Contributions

  • Led core module implementation and engineering
  • Implemented the multimodal input processing pipeline
  • Designed the two-stage action generation scheme (coarse prediction + diffusion refinement)
  • Built evaluation scripts and configurations for simulation and real-robot experiments