MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration

dc.contributor.author: Wu, Yang
dc.contributor.author: Shen, Chien-Chung
dc.date.accessioned: 2026-05-01T20:05:02Z
dc.date.issued: 2026-04-16
dc.description: This article was originally published in ACM Journal on Autonomous Transportation Systems. The version of record is available at: https://doi.org/10.1145/3811541 This work is licensed under a Creative Commons Attribution 4.0 International License. © 2026 Copyright held by the owner/author(s).
dc.description.abstract: The core challenge of autonomous driving is achieving real-time decision-making and safety control in complex, dynamic environments, especially in long-tail scenarios where conventional systems often fail due to limited adaptability, poor human-vehicle interaction, and low interpretability. This paper introduces MLLM-CLAD, a closed-loop decision-making framework powered by a multimodal large language model, which addresses semantic gaps and feedback limitations through a co-optimization strategy that integrates behavior-state alignment and language instruction fusion. MLLM-CLAD integrates three core modules: (1) The Mapping-Enhanced Language-State Alignment Module employs a deterministic mapping pipeline that converts LLM semantic outputs for language instructions into executable control signals with high accuracy. (2) The Hierarchical Dynamic Instruction Integration Module robustly categorizes human language instructions into semantic types and dynamically prioritizes them. (3) The Multimodal Fusion Module leverages a lightweight spatiotemporal Q-Former to jointly encode images and LiDAR point clouds, aligned with language in latent space for improved temporal reasoning and cross-modal consistency. Evaluated on the CARLA Town05 Long benchmark, MLLM-CLAD achieves a 78.3% Driving Score and reduces collision rates by 35% in safety-critical scenarios. On the LangAuto benchmarks, MLLM-CLAD attains a 78.5% long-tail scenario pass rate. Furthermore, the end-to-end latency is reduced to 128 milliseconds. This work establishes a scalable path toward safe, low-latency, and interpretable autonomous driving in simulation, with a clear roadmap for real-world extension.
dc.identifier.citation: Yang Wu and Chien-Chung Shen. 2026. MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration. ACM J. Auton. Transport. Syst. Just Accepted (April 2026). https://doi.org/10.1145/3811541
dc.identifier.issn: 2833-0528
dc.identifier.uri: https://udspace.udel.edu/handle/19716/37032
dc.language.iso: en_US
dc.publisher: ACM Journal on Autonomous Transportation Systems
dc.rights: Attribution 4.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/us/
dc.subject: Autonomous driving
dc.subject: Multimodal large language models
dc.subject: Closed-loop decision-making
dc.subject: Linguistic fusion
dc.subject: Vehicle safety and intelligent control
dc.title: MLLM-Driven Autonomous Driving: Closed-Loop Decision-Making with Language-State Alignment and Human Instruction Integration
dc.type: Article

Files

Original bundle

Name:
MLLM-Driven Autonomous Driving Closed-Loop Decision-Making.pdf
Size:
2.85 MB
Format:
Adobe Portable Document Format
Description:
Main article

License bundle

Name:
license.txt
Size:
2.2 KB
Format:
Item-specific license agreed upon to submission
Description: