shreethar/stage1_unsloth

VISIONConcurrency Cost:1Model Size:4.5BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:May 12, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

shreethar/stage1_unsloth is a 4.5 billion parameter, natively multimodal Vision-Language-Action (VLA) model developed by Shreethar at Universiti Teknikal Malaysia Melaka (UTeM). Based on Qwen3.5-4B, it has been fine-tuned via supervised instruction tuning on eight robot-domain datasets to establish foundational robotic knowledge. This model serves as Stage 1 of the ReasonFlow VLA pipeline, specializing in robot grounding for tasks like trajectory prediction, affordance grounding, and task planning.

Loading preview...

ReasonFlow VLA — Stage 1: Robot Grounding SFT

This model, shreethar/stage1_unsloth, is the initial checkpoint for ReasonFlow VLA, a multi-stage Vision-Language-Action system developed by Shreethar at UTeM. It is a Qwen3.5-4B (natively multimodal) model, fine-tuned using Supervised Fine-Tuning (SFT) via Unsloth.

Key Capabilities & Training

  • Natively Multimodal: Processes both vision and language inputs, with an image resolution of 448 × 448.
  • Robot-Domain Grounding: Fine-tuned on approximately 560,000 samples across eight specialized robot-domain datasets.
  • Diverse Robotic Tasks: Training data covers:
    • 2D end-effector trajectory prediction (MolmoAct Trajectory)
    • Robot visual question answering (RoboVQA, Pixmo Cap-QA, Pixmo AMA)
    • Failure analysis and correction QA (RoboFAC)
    • Affordance bounding box prediction (ShareRobot Affordance)
    • Multi-step task planning QA (ShareRobot Planning)
    • Dense image captioning (Pixmo Cap)
  • Instruction-Tuned: All samples follow a two-turn chat format, enabling the model to output normalized waypoint lists for trajectory tasks and free-form text for QA tasks.
  • Foundational Knowledge: Establishes core robotic understanding before further stages involving RL or distillation.

Project Context

This model represents Stage 1 of the ReasonFlow VLA pipeline, focusing on robot grounding. It is designed to be the shared initialization point for both Teacher and Student models in the subsequent Stage 2 (GRPO Teacher-Student Distillation), which is currently in progress. The full project repository is available on GitHub.