Name: shreethar/stage1_unsloth API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: shreethar

ReasonFlow VLA — Stage 1: Robot Grounding SFT

This model, shreethar/stage1_unsloth, is the initial checkpoint for ReasonFlow VLA, a multi-stage Vision-Language-Action system developed by Shreethar at UTeM. It is a Qwen3.5-4B (natively multimodal) model, fine-tuned using Supervised Fine-Tuning (SFT) via Unsloth.

Key Capabilities & Training

Natively Multimodal: Processes both vision and language inputs, with an image resolution of 448 × 448.
Robot-Domain Grounding: Fine-tuned on approximately 560,000 samples across eight specialized robot-domain datasets.
Diverse Robotic Tasks: Training data covers:
- 2D end-effector trajectory prediction (MolmoAct Trajectory)
- Robot visual question answering (RoboVQA, Pixmo Cap-QA, Pixmo AMA)
- Failure analysis and correction QA (RoboFAC)
- Affordance bounding box prediction (ShareRobot Affordance)
- Multi-step task planning QA (ShareRobot Planning)
- Dense image captioning (Pixmo Cap)
Instruction-Tuned: All samples follow a two-turn chat format, enabling the model to output normalized waypoint lists for trajectory tasks and free-form text for QA tasks.
Foundational Knowledge: Establishes core robotic understanding before further stages involving RL or distillation.

Project Context

This model represents Stage 1 of the ReasonFlow VLA pipeline, focusing on robot grounding. It is designed to be the shared initialization point for both Teacher and Student models in the subsequent Stage 2 (GRPO Teacher-Student Distillation), which is currently in progress. The full project repository is available on GitHub.

Overview

ReasonFlow VLA — Stage 1: Robot Grounding SFT

Key Capabilities & Training

Project Context

Full Model Card (README)