mjf-su/PhysicalAI-base-VLA

VISION | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Mar 25, 2026 | License: mit | Architecture: Transformer | Open Weights | Cold

PhysicalAI-reason-VLA by mjf-su is a 4-billion-parameter vision-language driving policy model with a 32,768-token context length, fine-tuned from PhysicalAI-base-VLA (itself based on Qwen3-VL-4B-Thinking). It combines structured chain-of-thought reasoning with discrete driving decisions and was trained on 10,000 Gemini-annotated driving scenes. Given camera images and past vehicle motion, it generates an explicit driving justification, longitudinal and lateral actions, and future trajectory waypoints.


PhysicalAI-reason-VLA: Vision-Language Driving Policy with Reasoning

PhysicalAI-reason-VLA is a 4-billion-parameter vision-language model developed by mjf-su and fine-tuned for autonomous driving policy. It builds on PhysicalAI-base-VLA (itself based on Qwen3-VL-4B-Thinking) and adds structured chain-of-thought reasoning and discrete driving decisions to the model's output.

Key Capabilities

  • Integrated Reasoning and Action: Generates a detailed reasoning trace (<think>), explicit longitudinal and lateral driving decisions (<action>), and future trajectory waypoints (<wp>) in sequence.
  • Discrete Decision Tokens: Represents each driving action (e.g., <stop>, <turn_left>, <lane_keep>) as a single vocabulary token, so the probability of every option in the decision space can be read directly from the next-token distribution.
  • Contextual Understanding: Processes forward-facing camera images and past ego-vehicle waypoints to inform its driving policy.
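
Because each action is a single token, a probability distribution over the decision space can in principle be obtained by restricting the next-token logits to the action tokens and renormalizing. The following is a minimal sketch of that idea in plain Python, not the author's implementation. The token list is incomplete (only the three actions named above are used), and the vocabulary ids and logits are made up for illustration.

```python
import math

# Only these three action tokens are named in the model card; the real
# decision vocabulary is larger. Treat this list as a placeholder.
DECISION_TOKENS = ["<stop>", "<turn_left>", "<lane_keep>"]


def decision_distribution(logits, token_to_id, decision_tokens=DECISION_TOKENS):
    """Softmax restricted to the decision tokens.

    logits: raw next-token scores, indexed by token id.
    token_to_id: mapping from token string to id (assumes one id per action).
    """
    scores = [logits[token_to_id[tok]] for tok in decision_tokens]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(decision_tokens, exps)}


# Toy vocabulary and logits, invented for this example.
toy_vocab = {"the": 0, "<stop>": 1, "car": 2, "<turn_left>": 3, "<lane_keep>": 4, "a": 5}
toy_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0]

probs = decision_distribution(toy_logits, toy_vocab)
best_action = max(probs, key=probs.get)
print(best_action, probs)
```

With a real model, the logits would come from the forward pass at the position where an action token is expected, and the ids from the model's tokenizer.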

Training Details

This model was trained with supervised fine-tuning (SFT) via TRL on 10,000 Gemini-annotated driving scenes from the mjf-su/PhysicalAI-reason-US dataset. Training ran for 2 epochs and taught the model to generate the chain-of-thought labels for real US driving data.
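
To make the training target concrete, the sketch below shows one way an annotated scene could be serialized into a prompt/completion pair for SFT. It is an illustration under assumptions, not the dataset's actual schema: the closing tags (</think>, </action>, </wp>), the waypoint text format, the prompt wording, and the helper names format_waypoints and build_sft_record are all hypothetical, and the camera image is omitted.

```python
def format_waypoints(waypoints):
    """Render (x, y) pairs as text; this format is an assumption."""
    return " ".join(f"({x:.2f},{y:.2f})" for x, y in waypoints)


def build_sft_record(past_waypoints, reasoning, longitudinal, lateral, future_waypoints):
    """Build one hypothetical prompt/completion record for supervised fine-tuning."""
    prompt = (
        "Past waypoints: " + format_waypoints(past_waypoints)
        + "\nDecide the next driving action."
    )
    # Target order follows the model card: reasoning, then actions, then waypoints.
    completion = (
        f"<think>{reasoning}</think>"
        f"<action>{longitudinal}{lateral}</action>"
        f"<wp>{format_waypoints(future_waypoints)}</wp>"
    )
    return {"prompt": prompt, "completion": completion}


record = build_sft_record(
    past_waypoints=[(0.0, 0.0), (1.0, 0.1)],
    reasoning="Red light ahead.",
    longitudinal="<stop>",
    lateral="<lane_keep>",
    future_waypoints=[(1.5, 0.1)],
)
print(record["completion"])
```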

Good for

  • Developing autonomous driving systems that require explicit, interpretable decision-making.
  • Research into explainable AI for robotic control and vision-language navigation.
  • Applications needing a model that can not only predict trajectories but also justify its actions with structured reasoning.
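
For interpretability work, the generated text usually needs to be split back into its reasoning, action, and waypoint parts. The parser below is a minimal sketch of that step. It assumes each section is closed with a matching tag (</think>, </action>, </wp>) and that waypoints appear as "(x,y)" pairs; neither detail is confirmed by the model card, and parse_output is a hypothetical helper name.

```python
import re

# Matches a <think>, <action>, or <wp> section with an assumed closing tag.
SECTION_RE = re.compile(r"<(think|action|wp)>(.*?)</\1>", re.DOTALL)
# Matches single action tokens such as <stop> or <lane_keep>.
TOKEN_RE = re.compile(r"<[a-z_]+>")
# Matches "(x,y)" pairs; this waypoint format is an assumption.
POINT_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_output(text):
    """Split generated text into reasoning, action tokens, and waypoints."""
    sections = {name: body.strip() for name, body in SECTION_RE.findall(text)}
    return {
        "reasoning": sections.get("think", ""),
        "actions": TOKEN_RE.findall(sections.get("action", "")),
        "waypoints": [
            (float(x), float(y)) for x, y in POINT_RE.findall(sections.get("wp", ""))
        ],
    }


# Made-up output string for illustration.
sample = (
    "<think>Red light ahead.</think>"
    "<action><stop><lane_keep></action>"
    "<wp>(1.50,0.10) (3.00,0.20)</wp>"
)
parsed = parse_output(sample)
print(parsed)
```

Missing sections come back as an empty string or empty list rather than raising, which keeps the parser usable on truncated generations.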