We propose AC-DiT, an end-to-end vision-language-action (VLA) framework for mobile manipulation. It features a two-stage action generation mechanism (coarse prediction followed by diffusion-based refinement) and significantly outperforms existing methods on multiple benchmarks and in real-robot experiments.
@inproceedings{qian2025acdit,
  title={AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation},
  author={Qian, Siyuan and others},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025},
}
RSS
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
We propose RoboMIND, a multi-embodiment robot teleoperation dataset covering 107K demonstration trajectories, 479 tasks, 96 object categories, and 4 robot embodiments, and including failure cases and digital-twin environments. Experiments show that it significantly improves the success rates and generalization of VLA models, making it one of the largest and highest-quality datasets of its kind.
@inproceedings{robomind2025,
  title={RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation},
  author={Qian, Siyuan and others},
  booktitle={Robotics: Science and Systems (RSS)},
  year={2025},
}
2024
VLM
Chain of Thought Prompt Tuning in Vision Language Models
We propose a Chain-of-Thought (CoT) prompt tuning method that introduces CoT into vision-language models by jointly leveraging visual and textual embeddings. It significantly improves generalization and transfer in image classification, and demonstrates stronger reasoning in image-text retrieval and visual question answering, marking the first successful application of CoT prompt tuning to visual tasks.
@article{qian2024cot,
  title={Chain of Thought Prompt Tuning in Vision Language Models},
  author={Qian, Siyuan and others},
  year={2024},
}
2023
Nat. Comput. Sci.
Implicit Neural Image Field for Biological Microscopy Image Compression
We propose an adaptive compression pipeline based on implicit neural representations (INRs) that supports arbitrarily shaped images and pixel-level decompression. It achieves controllable, high compression ratios (up to 512x) and proves effective on a variety of real biological microscopy images, significantly reducing storage and sharing burden while preserving the information critical for analysis.
@article{qian2023inr,
  title={Implicit Neural Image Field for Biological Microscopy Image Compression},
  author={Qian, Siyuan and others},
  journal={Nature Computational Science},
  year={2023},
}