MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration

Publisher

ACM Journal on Autonomous Transportation Systems

Abstract

The core challenge of autonomous driving is achieving real-time decision-making and safety control in complex, dynamic environments, especially in long-tail scenarios where conventional systems often fail due to limited adaptability, poor human-vehicle interaction, and low interpretability. This paper introduces MLLM-CLAD, a closed-loop decision-making framework powered by a multimodal large language model (MLLM), which addresses semantic gaps and feedback limitations through a co-optimization strategy that combines behavior-state alignment with language instruction fusion. MLLM-CLAD comprises three core modules: (1) The Mapping-Enhanced Language-State Alignment Module employs a deterministic pipeline from language instructions to control signals, enabling highly accurate mapping of LLM semantic outputs to executable control signals. (2) The Hierarchical Dynamic Instruction Integration Module robustly categorizes human language instructions into semantic types and dynamically prioritizes them. (3) The Multimodal Fusion Module leverages a lightweight spatiotemporal Q-Former to jointly encode images and LiDAR point clouds, aligning them with language in latent space for improved temporal reasoning and cross-modal consistency. Evaluated on the CARLA Town05 Long benchmark, MLLM-CLAD achieves a 78.3% Driving Score and reduces collision rates by 35% in safety-critical scenarios. On the LangAuto benchmarks, MLLM-CLAD attains a 78.5% long-tail scenario pass rate. Furthermore, end-to-end latency is reduced to 128 milliseconds. This work establishes a scalable path toward safe, low-latency, and interpretable autonomous driving in simulation, with a clear roadmap for real-world extension.
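
As a rough illustration of the first two modules described above, the following minimal Python sketch shows one way a deterministic mapping from a semantic maneuver label to control signals, together with priority-based instruction selection, could look. It is not the authors' implementation: the maneuver names, the `Priority` categories, the control values, and the fallback behavior are all hypothetical assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical instruction categories; a lower value is handled first."""
    SAFETY = 0
    NAVIGATION = 1
    PREFERENCE = 2


@dataclass(frozen=True)
class Control:
    throttle: float  # assumed range 0..1
    brake: float     # assumed range 0..1
    steer: float     # assumed range -1 (left) .. 1 (right)


# Hypothetical lookup table: the same label always yields the same control,
# which is what makes the mapping deterministic.
MANEUVER_TO_CONTROL = {
    "stop": Control(0.0, 1.0, 0.0),
    "slow_down": Control(0.0, 0.3, 0.0),
    "keep_lane": Control(0.4, 0.0, 0.0),
    "turn_left": Control(0.3, 0.0, -0.5),
    "turn_right": Control(0.3, 0.0, 0.5),
}

# Assumed conservative fallback for labels the table does not know.
FALLBACK = MANEUVER_TO_CONTROL["slow_down"]


def to_control(maneuver: str) -> Control:
    """Map a semantic maneuver label to a control signal via table lookup."""
    return MANEUVER_TO_CONTROL.get(maneuver.strip().lower(), FALLBACK)


def select_instruction(instructions: list[tuple[Priority, str]]) -> str:
    """Return the maneuver with the highest priority (lowest enum value).

    Ties are resolved in favor of the earliest instruction in the list.
    """
    return min(instructions, key=lambda item: item[0])[1]
```

Under these assumptions, a safety-related instruction such as "stop" would override a simultaneous navigation or preference instruction, and an unrecognized label would fall back to slowing down rather than producing an undefined control signal.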

Description

This article was originally published in ACM Journal on Autonomous Transportation Systems. The version of record is available at: https://doi.org/10.1145/3811541. This work is licensed under a Creative Commons Attribution 4.0 International License. © 2026 Copyright held by the owner/author(s).

Citation

Yang Wu and Chien-Chung Shen. 2026. MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration. ACM J. Auton. Transport. Syst. Just Accepted (April 2026). https://doi.org/10.1145/3811541

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution 4.0 United States