MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration
| dc.contributor.author | Wu, Yang | |
| dc.contributor.author | Shen, Chien-Chung | |
| dc.date.accessioned | 2026-05-01T20:05:02Z | |
| dc.date.issued | 2026-04-16 | |
| dc.description | This article was originally published in ACM Journal on Autonomous Transportation Systems. The version of record is available at: https://doi.org/10.1145/3811541. This work is licensed under a Creative Commons Attribution 4.0 International License. © 2026 Copyright held by the owner/author(s). | |
| dc.description.abstract | The core challenge of autonomous driving is achieving real-time decision-making and safety control in complex, dynamic environments, especially in long-tail scenarios where conventional systems often fail due to limited adaptability, poor human-vehicle interaction, and low interpretability. This paper introduces MLLM-CLAD, a closed-loop decision-making framework powered by a multimodal large language model, which addresses semantic gaps and feedback limitations through a co-optimization strategy that integrates behavior-state alignment and language instruction fusion. MLLM-CLAD integrates three core modules: (1) The Mapping-Enhanced Language-State Alignment Module employs a deterministic mapping pipeline from language instructions to control signals, which enables highly accurate mapping from LLM semantic outputs to executable control signals. (2) The Hierarchical Dynamic Instruction Integration Module robustly categorizes and dynamically prioritizes human language instructions into different semantic types. (3) The Multimodal Fusion Module leverages a lightweight spatiotemporal Q-Former to jointly encode images and LiDAR point clouds, aligned with language in latent space for improved temporal reasoning and cross-modal consistency. Evaluated on the CARLA Town05 Long benchmark, MLLM-CLAD achieves a 78.3% Driving Score and reduces collision rates by 35% in safety-critical scenarios. On the LangAuto benchmarks, MLLM-CLAD attains a 78.5% long-tail scenario pass rate. Furthermore, the end-to-end latency is reduced to 128 milliseconds. This work establishes a scalable path toward safe, low-latency, and interpretable autonomous driving in simulation, with a clear roadmap for real-world extension. | |
| dc.identifier.citation | Yang Wu and Chien-Chung Shen. 2026. MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration. ACM J. Auton. Transport. Syst. Just Accepted (April 2026). https://doi.org/10.1145/3811541 | |
| dc.identifier.issn | 2833-0528 | |
| dc.identifier.uri | https://udspace.udel.edu/handle/19716/37032 | |
| dc.language.iso | en_US | |
| dc.publisher | ACM Journal on Autonomous Transportation Systems | |
| dc.rights | Attribution 4.0 United States | en |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/us/ | |
| dc.subject | Autonomous driving | |
| dc.subject | Multimodal large language models | |
| dc.subject | Closed-loop decision-making | |
| dc.subject | Linguistic fusion | |
| dc.subject | Vehicle safety and intelligent control | |
| dc.title | MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration | |
| dc.type | Article |
Files
Original bundle
- Name: MLLM-Driven Autonomous Driving Closed-Loop Decision-Making.pdf
- Size: 2.85 MB
- Format: Adobe Portable Document Format
- Description: Main article
License bundle
- Name: license.txt
- Size: 2.2 KB
- Format: Item-specific license agreed upon to submission
