spiral-rl/Spiral-Qwen3-4B

Parameters: 4B · Precision: BF16 · Context length: 40960
License: apache-2.0

Model Overview

Spiral-Qwen3-4B is a 4-billion-parameter language model developed by spiral-rl, built on the Qwen3 base architecture. Its core innovation is its training methodology: the SPIRAL framework, which uses self-play on multi-turn, zero-sum games (such as TicTacToe, Kuhn Poker, and Simple Negotiation). This approach lets the model develop sophisticated reasoning strategies without relying on expert-curated problem-answer pairs or domain-specific reward engineering.
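The zero-sum self-play setup can be illustrated with a minimal sketch: two copies of the same policy (here a uniformly random one, standing in for the model) alternate moves in TicTacToe, and the two roles receive opposite rewards. This is illustrative only; in SPIRAL the policy is the language model itself and the environments include the games listed above.

```python
import random

# The eight winning lines on a 3x3 board (rows, columns, diagonals).
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has a winning line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_episode(rng):
    """One zero-sum TicTacToe game where both roles are driven by the
    same (here: uniformly random) policy; returns the reward for X.
    Zero-sum means O's reward is always the negative of this value."""
    board = [" "] * 9
    player = "X"
    while True:
        moves = [i for i, cell in enumerate(board) if cell == " "]
        if not moves:
            return 0.0                      # draw: zero reward for both
        board[rng.choice(moves)] = player
        w = winner(board)
        if w is not None:
            return 1.0 if w == "X" else -1.0
        player = "O" if player == "X" else "X"

rng = random.Random(0)
rewards = [self_play_episode(rng) for _ in range(1000)]
```

Because both sides share the same policy, any improvement to the policy simultaneously strengthens the opponent, which is what produces the self-generated curriculum of progressively harder games.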

Key Capabilities & Training

  • Autonomous Reasoning Development: SPIRAL enables models to learn by playing against continuously improving versions of themselves, generating an infinite curriculum of progressively challenging problems.
  • Transferable Reasoning: Through zero-sum self-play, the model develops advanced reasoning strategies that lead to substantial gains on a range of math and general reasoning benchmarks.
  • Actor-Learner Architecture: Employs a scalable actor-learner architecture where parallel actors sample trajectories from diverse games, and a centralized learner processes these using Role-conditioned Advantage Estimation (RAE) for on-policy reinforcement learning updates.
  • High Context Length: Supports a context length of 40960 tokens, allowing the model to process longer inputs and maintain conversational coherence.
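The role-conditioned baseline idea behind RAE can be sketched in a few lines. The sketch below assumes RAE keeps a running mean return per (game, role) pair and subtracts it from each trajectory's return; the class name, decay rule, and keys are illustrative, not the exact formulation used by SPIRAL.

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """Minimal sketch of a role-conditioned baseline in the spirit of
    Role-conditioned Advantage Estimation (RAE): the advantage of a
    trajectory is its return minus a baseline tracked separately for
    each (game, role) pair, so that asymmetric roles (e.g. first vs.
    second player) are not penalized by a shared baseline."""

    def __init__(self, decay=0.9):
        self.decay = decay
        # Running (exponential moving average) baseline per (game, role).
        self.baseline = defaultdict(float)

    def advantage(self, game, role, ret):
        key = (game, role)
        adv = ret - self.baseline[key]
        # EMA update of the role-conditioned baseline.
        self.baseline[key] = (self.decay * self.baseline[key]
                              + (1 - self.decay) * ret)
        return adv
```

In an actor-learner setup, parallel actors would report `(game, role, return)` tuples for their sampled trajectories, and the centralized learner would use these advantages in its on-policy policy-gradient update.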

Ideal Use Cases

  • Research in AI Reasoning: Excellent for exploring autonomous reasoning development and self-play reinforcement learning.
  • Complex Problem Solving: Suitable for tasks requiring advanced logical deduction and strategic thinking, particularly in game-like or adversarial scenarios.
  • Benchmarking Reasoning Abilities: Can be used to evaluate and compare reasoning capabilities against other models, especially in mathematical and general reasoning domains.