AC-DiT: Adaptive Coordination Diffusion Transformer

AC-DiT is the official codebase for our NeurIPS 2025 paper on an end-to-end Vision-Language-Action (VLA) model for mobile manipulation.

Overview

AC-DiT proposes an adaptive coordination diffusion transformer framework for mobile manipulation, featuring:

  • Two-stage action generation: Coarse prediction followed by diffusion-based refinement
  • Multimodal input processing: Jointly processing vision, language, and proprioceptive inputs
  • End-to-end training: From perception to action generation in a unified framework
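The two-stage pipeline can be sketched as follows. This is a minimal NumPy illustration, not the actual model: the linear coarse head and the fixed denoising rule are placeholders standing in for the learned coarse policy and the diffusion transformer, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (the real model consumes vision, language,
# and proprioceptive tokens via a transformer encoder).
OBS_DIM, ACT_DIM, STEPS = 32, 7, 10

# Stage 1: coarse action prediction.
# A random linear head stands in for the learned coarse policy.
W_coarse = rng.normal(scale=0.1, size=(OBS_DIM, ACT_DIM))

def coarse_predict(obs):
    return obs @ W_coarse

# Stage 2: diffusion-style refinement.
# The coarse action is iteratively denoised; the residual "eps"
# here is a fixed placeholder for the diffusion transformer's
# predicted noise at each timestep.
def refine(action, obs, steps=STEPS):
    for t in range(steps, 0, -1):
        eps = 0.1 * action            # placeholder noise prediction
        action = action - eps / steps  # one denoising update
    return action

obs = rng.normal(size=OBS_DIM)
coarse = coarse_predict(obs)
refined = refine(coarse, obs)
```

In the real framework both stages are trained jointly end to end; the sketch only shows the control flow of coarse prediction followed by iterative refinement.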

My Contributions

  • Led core module implementation and engineering
  • Implemented the multimodal input processing pipeline
  • Designed the two-stage action generation scheme (coarse prediction + diffusion refinement)
  • Built evaluation scripts and configurations for simulation and real-robot experiments