# AC-DiT: Adaptive Coordination Diffusion Transformer
Official code for our NeurIPS 2025 paper on an end-to-end Vision-Language-Action (VLA) model for mobile manipulation.
## Overview
AC-DiT proposes an adaptive coordination diffusion transformer framework for mobile manipulation, featuring:
- Two-stage action generation: Coarse prediction followed by diffusion-based refinement
- Multimodal input processing: Jointly processing vision, language, and proprioceptive inputs
- End-to-end training: From perception to action generation in a unified framework
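The two-stage pipeline above can be sketched as follows. This is a hypothetical, minimal illustration with toy shapes and a toy denoiser, not the paper's actual architecture: a coarse regressor first predicts an action chunk, then a diffusion-style loop iteratively refines a noised copy of it.

```python
import numpy as np

def coarse_predict(obs_feat, weight):
    """Stage 1 (illustrative): regress a coarse action from fused features."""
    return obs_feat @ weight  # shape: (action_dim,)

def diffusion_refine(coarse_action, denoise_fn, steps=10, seed=0):
    """Stage 2 (illustrative): perturb the coarse action with noise,
    then iteratively denoise it, conditioning on the coarse prediction."""
    rng = np.random.default_rng(seed)
    x = coarse_action + rng.normal(scale=1.0, size=coarse_action.shape)
    for t in reversed(range(steps)):
        x = x - denoise_fn(x, t, coarse_action)  # subtract predicted residual
    return x

def toy_denoiser(x, t, cond):
    """Stand-in for a learned noise predictor: pulls x toward the coarse action."""
    return 0.3 * (x - cond)

obs = np.ones(4)                       # fused vision/language/proprio feature (toy)
W = np.full((4, 2), 0.25)              # toy regression weights
coarse = coarse_predict(obs, W)        # coarse action chunk
refined = diffusion_refine(coarse, toy_denoiser)
```

In this sketch the refinement loop contracts the noise toward the coarse prediction; in the real model the denoiser would be a learned transformer conditioned on multimodal context.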
## My Contributions
- Led implementation and engineering of the core modules
- Implemented multimodal input processing pipeline
- Designed two-stage action generation (coarse prediction + diffusion refinement)
- Built simulation and real-robot evaluation scripts and configurations