Kudod/NuminaMath-Qwen2.5-1.5B-GRPO-test-v1

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quant: BF16 · Context length: 32k · Published: Jan 27, 2026 · Architecture: Transformer

Kudod/NuminaMath-Qwen2.5-1.5B-GRPO-test-v1 is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement-learning method introduced in the DeepSeekMath paper, to strengthen its mathematical reasoning. With a context length of 131072 tokens, it is aimed at tasks requiring advanced mathematical problem-solving and logical deduction.


Model Overview

Kudod/NuminaMath-Qwen2.5-1.5B-GRPO-test-v1 is a specialized language model fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct base model. Its primary distinction is its training methodology, which uses GRPO (Group Relative Policy Optimization). This technique, detailed in the DeepSeekMath paper, is designed to improve a model's proficiency in mathematical reasoning by comparing groups of sampled answers to the same problem rather than relying on a learned value model.
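The core idea of GRPO can be sketched in a few lines: sample a group of completions for each prompt, score each one with a reward (e.g. whether the final answer is correct), and normalize rewards within the group to obtain per-completion advantages. The helper below is a minimal illustration of that group-relative normalization; the function name and reward values are illustrative, not taken from this model's training code.

```python
# Minimal sketch of GRPO's group-relative advantage computation
# (the idea from the DeepSeekMath paper). Rewards within one group
# of sampled completions are normalized to zero mean / unit scale,
# so each completion is judged relative to its siblings.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize a group of rewards; eps avoids division by zero."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math problem, rewarded 1.0
# if the final answer is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, which is what pushes the policy toward completions that outperform their own group.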

Key Capabilities

  • Enhanced Mathematical Reasoning: The model's training with GRPO specifically targets and improves its ability to understand and solve complex mathematical problems.
  • Qwen2.5 Architecture: Built upon the Qwen2.5-1.5B-Instruct foundation, it inherits the general language understanding and generation capabilities of the Qwen family.
  • Extended Context Length: Features a substantial context window of 131072 tokens, allowing it to process and reason over lengthy problem descriptions or complex mathematical proofs.

Training Details

The model was fine-tuned using the TRL (Transformer Reinforcement Learning) framework, with TRL 0.25.1, Transformers 4.57.1, and PyTorch 2.9.1. The GRPO method, central to its mathematical performance, is a key innovation from the DeepSeekMath research.
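For readers who want to reproduce a similar setup, TRL ships a `GRPOTrainer`. The fragment below is only a sketch of how such a fine-tune might be configured; the dataset, reward function, and hyperparameters are placeholders and do not reflect this model's actual training recipe.

```python
# Hypothetical GRPO fine-tuning setup with TRL (config sketch only;
# hyperparameters and reward function are illustrative).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# A reward function scores each sampled completion; here, a toy
# check for a boxed final answer stands in for a real verifier.
def reward_has_boxed_answer(completions, **kwargs):
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")  # assumed dataset

training_args = GRPOConfig(
    output_dir="NuminaMath-Qwen2.5-1.5B-GRPO",
    num_generations=8,          # group size per prompt
    max_completion_length=1024,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=reward_has_boxed_answer,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

In GRPO the reward function replaces a learned reward model wherever the reward is programmatically checkable, which is why the method pairs well with math problems that have verifiable final answers.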

Ideal Use Cases

This model is particularly well-suited for applications requiring robust mathematical problem-solving, logical deduction, and handling extensive textual context in technical or scientific domains. Its GRPO-enhanced training makes it a strong candidate for tasks where precise mathematical understanding is critical.
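As a usage note, Qwen instruct models expect ChatML-formatted prompts. In practice you would load the model's tokenizer and call `tokenizer.apply_chat_template`; the hand-rolled helper below only sketches what that wire format looks like, with an example math query (the system prompt is illustrative).

```python
# Illustrative helper that builds a ChatML-style prompt, the format
# used by Qwen instruct models. Prefer tokenizer.apply_chat_template
# in real code; this is just a sketch of the underlying format.
def build_chatml_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a helpful math assistant. Reason step by step.",
    "What is the sum of the first 10 positive integers?",
)
```

The trailing `<|im_start|>assistant\n` leaves the turn open so the model generates the assistant's answer next.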