AlphaMaze: Text-Based Visual Reasoning
AlphaMaze, developed by Menlo Research, is a 1.5-billion-parameter model focused on advancing visual reasoning in Large Language Models. Unlike approaches that rely on image generation, AlphaMaze challenges models to solve mazes presented purely as text, assessing their ability to construct an internal "mental map" and plan movements.
Key Capabilities
- Text-based Visual Reasoning: Excels at interpreting and navigating mazes described entirely through text tokens, demonstrating genuine spatial understanding.
- GRPO Enhanced: Uses Group Relative Policy Optimization (GRPO) to refine maze-solving strategies, with measurable performance gains over the course of training.
- Focus on Internal Mapping: Evaluates a model's capacity for spatial reconstruction and planning from textual input, moving beyond simple multiple-choice assessments.
- Open-Source Datasets: Accompanied by released datasets (Maze-Reasoning-v0.1, Maze-Reasoning-Reset-v0.1, Maze-Reasoning-GRPO-v0.1) to support reproducibility and further research.
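The GRPO bullet above can be illustrated with a minimal sketch of its core idea: instead of a learned critic, each sampled completion's reward is normalized against the mean and standard deviation of its own group. The function name and reward values here are illustrative, not taken from the AlphaMaze codebase.

```python
import statistics

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of sampled completions:
    each reward is normalized by the group's mean and (population) std,
    so no separate value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: binary maze-solving rewards for a group of 4 attempts.
# Successful attempts get positive advantage, failures negative.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In a full GRPO setup these advantages would weight the policy-gradient update for each completion's tokens; the sketch only shows the group normalization step.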
Training Insights
Initial Supervised Fine-Tuning (SFT) experiments revealed that adding maze-specific tokens did not improve performance; surprisingly, the model performed strongly with pure text descriptions, indicating an inherent ability to learn spatial relationships from text alone. AlphaMaze-v0.2-1.5B is built on the DeepSeek-R1-Distill-Qwen-1.5B base model.
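To make "pure text descriptions" concrete, here is a hypothetical ASCII encoding of a small maze together with a checker that replays a model's predicted move plan. The actual token format used in the Maze-Reasoning datasets may differ; this is only an illustrative sketch.

```python
# Hypothetical ASCII maze encoding: '#' wall, '.' open cell,
# 'O' origin, 'T' target. (Not the official dataset format.)
MAZE = """\
#####
#O..#
###.#
#T..#
#####"""

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def follows_plan(maze_text, plan):
    """Replay a move plan (e.g. a model's answer) on the text maze;
    return True only if it reaches the target without hitting a wall."""
    grid = maze_text.splitlines()
    r, c = next((i, row.index("O")) for i, row in enumerate(grid) if "O" in row)
    for move in plan:
        dr, dc = MOVES[move]
        r, c = r + dr, c + dc
        if grid[r][c] == "#":
            return False  # Walked into a wall: plan is invalid.
    return grid[r][c] == "T"
```

A checker like this is also the natural reward signal for GRPO-style training: a plan that reaches the target earns reward 1, anything else 0.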
Use Cases
This model is particularly well-suited for research and applications requiring LLMs to demonstrate:
- Advanced spatial understanding from textual descriptions.
- Planning and navigation based on abstract representations.
- Evaluation of reasoning capabilities without multimodal inputs.