spiral-rl/Spiral-Qwen3-4B

Hosted on Hugging Face. Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Jun 29, 2025 · License: apache-2.0 · Architecture: Transformer · Open weights

Spiral-Qwen3-4B is a 4-billion-parameter language model developed by spiral-rl, based on the Qwen3 architecture and trained with the SPIRAL self-play framework. The model learns reasoning strategies by playing multi-turn, zero-sum games against continuously improving versions of itself, removing the need for human supervision. The reasoning skills it acquires transfer beyond the games themselves, yielding substantial gains on math and general reasoning benchmarks. The model supports a 40,960-token context length and is optimized for autonomous reasoning development through competitive self-play.


Model Overview

Spiral-Qwen3-4B is a 4 billion parameter language model developed by spiral-rl, built upon the Qwen3 base architecture. Its core innovation lies in its training methodology: the SPIRAL framework, which utilizes self-play on multi-turn, zero-sum games (such as TicTacToe, Kuhn Poker, and Simple Negotiation). This approach allows the model to learn and develop sophisticated reasoning strategies without relying on expert-curated problem-answer pairs or domain-specific reward engineering.
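To make the self-play setup concrete, here is a minimal sketch of a zero-sum self-play episode on TicTacToe, one of the games named above. This is an illustrative toy, not the SPIRAL framework's actual API: the `play_episode` and `random_policy` names are assumptions, and a random policy stands in for the model being trained.

```python
import random

# All eight winning lines on a 3x3 board indexed 0..8.
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a player has three in a row, else None."""
    for a, b, c in WINS:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_episode(policy):
    """Play one TicTacToe game between two copies of the same policy.

    Returns zero-sum rewards for ('X', 'O'): the winner gets +1, the
    loser -1, and a draw gives both 0 -- the reward structure that
    drives SPIRAL-style competitive self-play.
    """
    board = [""] * 9
    player = "X"
    while True:
        moves = [i for i, cell in enumerate(board) if not cell]
        if not moves:
            return (0.0, 0.0)  # draw
        board[policy(board, moves)] = player
        w = winner(board)
        if w:
            return (1.0, -1.0) if w == "X" else (-1.0, 1.0)
        player = "O" if player == "X" else "X"

def random_policy(board, moves):
    return random.choice(moves)

reward_x, reward_o = play_episode(random_policy)
```

Because both roles are played by the same (improving) policy, every win for one side is a loss for the other, which is what generates the self-balancing curriculum described above.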

Key Capabilities & Training

  • Autonomous Reasoning Development: SPIRAL enables models to learn by playing against continuously improving versions of themselves, generating an infinite curriculum of progressively challenging problems.
  • Transferable Reasoning: Through zero-sum self-play, the model develops advanced reasoning strategies that lead to substantial gains on a range of math and general reasoning benchmarks.
  • Actor-Learner Architecture: Employs a scalable actor-learner architecture where parallel actors sample trajectories from diverse games, and a centralized learner processes these using Role-conditioned Advantage Estimation (RAE) for on-policy reinforcement learning updates.
  • High Context Length: Supports a context length of 40,960 tokens, allowing the model to process longer inputs and maintain coherence across extended multi-turn interactions.
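The Role-conditioned Advantage Estimation (RAE) mentioned above can be sketched in a few lines: the idea is to keep a separate baseline per game role, so that advantages are centered within each role rather than across both sides of an asymmetric zero-sum game. The class name and the exponential-moving-average baseline below are illustrative assumptions; the paper's exact estimator may differ.

```python
class RoleConditionedAdvantage:
    """Toy sketch of role-conditioned advantage estimation.

    Maintains one running baseline per role (e.g. 'player_0' vs
    'player_1'), so the advantage of a trajectory is its return minus
    the average return *for that role*, reducing variance caused by
    role asymmetry in zero-sum games.
    """

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baselines = {}  # role -> EMA of returns seen for that role

    def advantage(self, role: str, ret: float) -> float:
        # Initialize a role's baseline with its first observed return.
        b = self.baselines.get(role, ret)
        adv = ret - b
        # EMA update of the role-specific baseline.
        self.baselines[role] = self.decay * b + (1 - self.decay) * ret
        return adv

rae = RoleConditionedAdvantage()
# In a zero-sum game the two roles see opposite returns (+1 / -1),
# but each advantage is measured against that role's own baseline.
adv_first = rae.advantage("player_0", 1.0)   # 0.0: baseline starts at first return
adv_loss = rae.advantage("player_0", -1.0)   # negative: worse than the running baseline
```

In the actor-learner setup described above, parallel actors would compute returns from sampled game trajectories, and the centralized learner would apply advantages like these in its on-policy policy-gradient updates.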

Ideal Use Cases

  • Research in AI Reasoning: Excellent for exploring autonomous reasoning development and self-play reinforcement learning.
  • Complex Problem Solving: Suitable for tasks requiring advanced logical deduction and strategic thinking, particularly in game-like or adversarial scenarios.
  • Benchmarking Reasoning Abilities: Can be used to evaluate and compare reasoning capabilities against other models, especially in mathematical and general reasoning domains.