MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration
Publisher
ACM Journal on Autonomous Transportation Systems
Abstract
The core challenge of autonomous driving is real-time decision-making and safety control in complex, dynamic environments, especially in long-tail scenarios where conventional systems often fail because of limited adaptability, poor human-vehicle interaction, and low interpretability. This paper introduces MLLM-CLAD, a closed-loop decision-making framework powered by a multimodal large language model (MLLM), which addresses semantic gaps and feedback limitations through a co-optimization strategy that combines behavior-state alignment with language instruction fusion. MLLM-CLAD comprises three core modules: (1) The Mapping-Enhanced Language-State Alignment Module employs a deterministic pipeline from language instructions to control signals, which maps LLM semantic outputs to executable control signals with high accuracy. (2) The Hierarchical Dynamic Instruction Integration Module robustly categorizes human language instructions into semantic types and dynamically prioritizes them. (3) The Multimodal Fusion Module uses a lightweight spatiotemporal Q-Former to jointly encode images and LiDAR point clouds, aligning them with language in latent space for improved temporal reasoning and cross-modal consistency. Evaluated on the CARLA Town05 Long benchmark, MLLM-CLAD achieves a Driving Score of 78.3% and reduces collision rates by 35% in safety-critical scenarios. On the LangAuto benchmarks, MLLM-CLAD attains a 78.5% long-tail scenario pass rate. Furthermore, end-to-end latency is reduced to 128 milliseconds. This work establishes a scalable path toward safe, low-latency, and interpretable autonomous driving in simulation, with a clear roadmap for real-world extension.
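As a rough illustration of the ideas behind modules (1) and (2), the sketch below shows one way a deterministic lookup could turn a discrete semantic action into a control signal, and how instructions might be ranked by semantic type. It is not the authors' implementation: the action vocabulary, the control values, the category names, and the helpers `to_control` and `select_instruction` are all hypothetical, chosen only to make the concept concrete. The throttle, steer, and brake ranges follow the convention common in driving simulators such as CARLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    throttle: float  # 0.0 (none) to 1.0 (full)
    steer: float     # -1.0 (full left) to 1.0 (full right)
    brake: float     # 0.0 (none) to 1.0 (full)


# Hypothetical action vocabulary and control values (assumed, not from the paper).
ACTION_TABLE = {
    "accelerate": Control(0.6, 0.0, 0.0),
    "keep_lane": Control(0.3, 0.0, 0.0),
    "turn_left": Control(0.3, -0.5, 0.0),
    "turn_right": Control(0.3, 0.5, 0.0),
    "stop": Control(0.0, 0.0, 1.0),
}

# Hypothetical semantic types; a lower number means higher priority.
PRIORITY = {"safety": 0, "navigation": 1, "preference": 2}


def to_control(action: str) -> Control:
    """Map a semantic action label to a control signal via table lookup.

    The same label always yields the same signal. Unknown labels fall
    back to a full stop as a conservative default (an assumption).
    """
    return ACTION_TABLE.get(action.strip().lower(), ACTION_TABLE["stop"])


def select_instruction(instructions):
    """Return the (category, action) pair with the highest priority.

    Unknown categories rank last; ties keep the earliest instruction.
    """
    return min(instructions, key=lambda item: PRIORITY.get(item[0], len(PRIORITY)))


if __name__ == "__main__":
    pending = [
        ("preference", "accelerate"),
        ("safety", "stop"),
        ("navigation", "turn_left"),
    ]
    category, action = select_instruction(pending)
    print(category, action, to_control(action))
```

A table lookup is used here because it makes the mapping reproducible and easy to audit, which is the property a deterministic pipeline is meant to provide; a real system would need a far richer action space and continuous control values.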
Description
This article was originally published in ACM Journal on Autonomous Transportation Systems. The version of record is available at: https://doi.org/10.1145/3811541
This work is licensed under a Creative Commons Attribution 4.0 International License. © 2026 Copyright held by the owner/author(s).
Citation
Yang Wu and Chien-Chung Shen. 2026. MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration. ACM J. Auton. Transport. Syst. Just Accepted (April 2026). https://doi.org/10.1145/3811541
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution 4.0 United States

