sfewf/qwen3-4b-math-RL

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jun 1, 2026License:mitArchitecture:Transformer Open Weights Cold

The sfewf/qwen3-4b-math-RL model is a Qwen3-4b variant that has undergone additional Reinforcement Learning (RL) training specifically on mathematical datasets. This model is optimized for reasoning and mathematical tasks, featuring a 'Max-Thinking' mode for thorough problem decomposition and a default mode for concise, accurate responses. It demonstrates improved efficiency in reasoning processes across both modes, making it suitable for applications requiring robust mathematical problem-solving.

Loading preview...

Qwen3-4b-math-RL: Enhanced for Mathematical Reasoning

The sfewf/qwen3-4b-math-RL model is a specialized version of the Qwen3-4b architecture, distinguished by its post-RL training on extensive mathematical datasets. This optimization aims to significantly improve its performance in complex reasoning and mathematical problem-solving.

Key Capabilities & Features

  • Max-Thinking Mode: Inspired by DeepSeek V4, this mode encourages an "absolute maximum" reasoning effort. It prompts the model to thoroughly decompose problems, rigorously test logic, and explicitly document its entire deliberation process, including intermediate steps and rejected hypotheses. This ensures comprehensive and verifiable reasoning.
  • Default Mode with Length-Penalty: When not in Max-Thinking mode, the model is trained with a length-penalty, enabling it to produce shorter, more concise responses while striving to maintain high accuracy.
  • Improved Reasoning Efficiency: Observations indicate that the RL training has led to more efficient reasoning processes in both default and Max-Thinking modes.

Performance Highlights

Evaluations demonstrate strong performance, particularly with the Max-Thinking mode:

  • GSM8K: Achieves 0.9172 (standard) and 0.9327 (max-effort) accuracy.
  • MATH-lighteval: Scores 0.8019 (standard) and 0.8505 (max-effort) accuracy.
  • BBH: Reaches 0.7963 (standard) and 0.8709 (max-effort) accuracy.
  • GPQA: Shows 0.2667 (standard) and 0.3125 (max-effort) accuracy.

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Advanced Mathematical Problem Solving: Excels in tasks demanding detailed, step-by-step mathematical reasoning.
  • Complex Reasoning Tasks: Benefits from the Max-Thinking mode for problems requiring deep logical analysis and comprehensive deliberation.
  • Educational Tools: Can be used to generate detailed explanations for mathematical solutions.
  • Automated Code Generation (Math-related): Potentially useful for generating code snippets for mathematical algorithms or proofs.