SCAI-JHU/MindZero-gw-tom-Qwen3-VL-8B-Instruct
SCAI-JHU/MindZero-gw-tom-Qwen3-VL-8B-Instruct is an 8-billion-parameter vision-language model, a MindZero checkpoint fine-tuned from Qwen3-VL-8B-Instruct by SCAI-JHU. It is trained with self-supervised reinforcement learning for online Theory-of-Mind (ToM) reasoning in gridworld environments. The model infers mental states to predict actions without explicit mental-state annotations and scores 92.3 on Gridworld-QA.
MindZero-gw-tom-Qwen3-VL-8B-Instruct Overview
This model is an 8-billion-parameter vision-language model developed by SCAI-JHU, built on the Qwen3-VL-8B-Instruct architecture. It is a MindZero checkpoint trained specifically for online Theory-of-Mind (ToM) reasoning in gridworld environments. Its main contribution is a self-supervised reinforcement learning approach that lets the model learn mental-state reasoning without any explicit mental-state annotations during training.
Key Capabilities
- Online Theory-of-Mind Reasoning: Learns to infer the mental states (e.g., beliefs, intentions) of agents in real time from their observed actions.
- Self-Supervised Learning: Utilizes a novel training mechanism where the model is rewarded for generating mental-state hypotheses that maximize the likelihood of observed actions, as estimated by a planner.
- Efficient Inference: After training, the model internalizes this reasoning process, allowing for fast, single-pass inference of mental states.
- Vision-Language Integration: As a VL model, it processes both visual and textual inputs, crucial for understanding gridworld scenarios.
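The reward signal described under Self-Supervised Learning can be illustrated with a toy example. The sketch below is not the MindZero training code. It assumes a hypothetical four-action gridworld, a hypothetical Boltzmann-rational planner that prefers moves reducing Manhattan distance to a goal, and mental-state hypotheses reduced to a single goal cell. All names (`planner_log_prob`, `hypothesis_reward`, `beta`) are illustrative. The reward for a hypothesis is the log-likelihood of the observed actions under the planner conditioned on that hypothesis.

```python
import math

# Hypothetical action set for a toy gridworld: name -> (dx, dy).
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(pos, action):
    """Apply an action to an (x, y) position (no walls in this toy world)."""
    dx, dy = ACTIONS[action]
    return (pos[0] + dx, pos[1] + dy)


def planner_log_prob(pos, action, goal, beta=2.0):
    """Log-probability of `action` under an assumed Boltzmann-rational planner.

    Each action is scored by the negative Manhattan distance to `goal`
    after taking it; `beta` is an assumed rationality temperature.
    """
    scores = {}
    for a in ACTIONS:
        nxt = step(pos, a)
        scores[a] = -beta * (abs(nxt[0] - goal[0]) + abs(nxt[1] - goal[1]))
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[action] - log_z


def hypothesis_reward(start, observed_actions, goal):
    """Reward for a goal hypothesis: log-likelihood of the observed actions."""
    pos, total = start, 0.0
    for a in observed_actions:
        total += planner_log_prob(pos, a, goal)
        pos = step(pos, a)
    return total


# Observed behavior and two hand-written candidate hypotheses.
# In training, hypotheses would be generated by the model, not hard-coded.
observed = ["right", "right", "down"]
candidate_goals = {"goal_A": (3, 1), "goal_B": (0, 3)}

rewards = {
    name: hypothesis_reward((0, 0), observed, goal)
    for name, goal in candidate_goals.items()
}
best = max(rewards, key=rewards.get)
print(rewards, best)
```

In this toy trajectory the agent moves toward `goal_A`, so that hypothesis receives the higher reward. A reinforcement learning loop would use such a score to reinforce hypotheses that explain the observed actions well, which is the idea the bullet describes. The planner and reward shape actually used for this checkpoint are not specified here.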
Performance
On the Gridworld-QA benchmark, this 8B-parameter model scores 92.3. For comparison, its 4B counterpart scores 95.0, which is higher than this 8B checkpoint.
Good For
- Research and development in AI agents requiring advanced Theory-of-Mind capabilities.
- Applications involving understanding and predicting agent behavior in structured, interactive environments like gridworlds.
- Exploring self-supervised learning paradigms for complex cognitive tasks.