SCAI-JHU/MindZero-gw-tom-Qwen3-VL-8B-Instruct

VISION | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: May 26, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

SCAI-JHU/MindZero-gw-tom-Qwen3-VL-8B-Instruct is an 8-billion-parameter vision-language model, a MindZero checkpoint fine-tuned from Qwen3-VL-8B-Instruct by SCAI-JHU. It was trained with self-supervised reinforcement learning for online Theory-of-Mind (ToM) reasoning in gridworld environments. The model infers mental states to predict actions, was trained without explicit mental-state annotations, and scores 92.3 on Gridworld-QA.


MindZero-gw-tom-Qwen3-VL-8B-Instruct Overview

This 8-billion-parameter vision-language model from SCAI-JHU is a MindZero checkpoint fine-tuned from Qwen3-VL-8B-Instruct for online Theory-of-Mind (ToM) reasoning in gridworld environments. It is trained with self-supervised reinforcement learning, so it learns to reason about mental states without any explicit mental-state annotations during training.

Key Capabilities

  • Online Theory-of-Mind Reasoning: Infers the mental states (e.g., beliefs, intentions) of agents in real time from their observed actions.
  • Self-Supervised Learning: During training, the model is rewarded for generating mental-state hypotheses that maximize the likelihood of the observed actions, as estimated by a planner.
  • Efficient Inference: After training, the model internalizes this reasoning process, allowing for fast, single-pass inference of mental states.
  • Vision-Language Integration: As a VL model, it processes both visual and textual inputs, crucial for understanding gridworld scenarios.
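
The reward idea in the Self-Supervised Learning bullet can be illustrated with a small toy sketch. This is not the MindZero training code: the grid layout, the Boltzmann-style planner, the `beta` temperature, and every function name below are assumptions made up for illustration. The sketch scores each candidate goal (a stand-in for a mental-state hypothesis) by the log-likelihood that a simple planner assigns to the observed actions.

```python
import math

# Hypothetical toy setup: an open 4-connected grid, (x, y) coordinates, no walls.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def action_log_probs(pos, goal, beta=2.0):
    """Assumed planner: softmax over actions, favoring moves that end closer to the goal."""
    scores = {}
    for name, (dx, dy) in ACTIONS.items():
        nxt = (pos[0] + dx, pos[1] + dy)
        scores[name] = -beta * manhattan(nxt, goal)
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {name: s - log_z for name, s in scores.items()}


def hypothesis_reward(start, observed_actions, goal):
    """Reward for a goal hypothesis: summed log-likelihood of the observed actions."""
    pos, total = start, 0.0
    for action in observed_actions:
        total += action_log_probs(pos, goal)[action]
        dx, dy = ACTIONS[action]
        pos = (pos[0] + dx, pos[1] + dy)
    return total


observed = ["right", "right", "down"]
candidate_goals = [(3, 2), (0, 3), (3, 0)]
rewards = {g: hypothesis_reward((0, 0), observed, g) for g in candidate_goals}
best_goal = max(rewards, key=rewards.get)
print(best_goal)  # (3, 2)
```

In this toy setting the hypothesis that best explains the observed trajectory gets the highest reward, and no label for the agent's true goal is needed. In the actual model the hypotheses are generated by the vision-language model itself; here they are a fixed list only to keep the sketch short.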

Performance

On the Gridworld-QA benchmark, this 8B model scores 92.3. For comparison, its 4B counterpart scores 95.0, so the smaller checkpoint is ahead on this benchmark.

Good For

  • Research and development in AI agents requiring advanced Theory-of-Mind capabilities.
  • Applications involving understanding and predicting agent behavior in structured, interactive environments like gridworlds.
  • Exploring self-supervised learning paradigms for complex cognitive tasks.