Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_1p0_0p5_1p0_grpo_42_rule
Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_1p0_0p5_1p0_grpo_42_rule is a 1.7-billion-parameter language model, fine-tuned from Qwen/Qwen3-1.7B-Base. This model was trained with GRPO (Group Relative Policy Optimization), the method introduced in the DeepSeekMath paper, to enhance its capabilities. It is specifically optimized for tasks requiring advanced reasoning, particularly in mathematical contexts, and supports a 32K-token context length.
Model Overview
This model, Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_1p0_0p5_1p0_grpo_42_rule, is a fine-tuned variant of the Qwen3-1.7B-Base architecture, with approximately 1.7 billion parameters and a 32,768-token context length. It was developed by Kazuki1450 and trained using the TRL (Transformer Reinforcement Learning) library.
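The `_rule` suffix in the model name suggests a rule-based (programmatically verifiable) reward signal, which is the typical setup for GRPO training in TRL: the trainer accepts plain Python functions that take a batch of completions and return one scalar reward per completion. The exact reward rule used for this model is not documented, so the sketch below is hypothetical; it shows the general shape of such a function, assuming a rule that checks for a `\boxed{...}` final answer.

```python
import re

def rule_based_reward(completions, **kwargs):
    """Hypothetical rule-based reward of the kind TRL's GRPOTrainer accepts.

    `completions` is a list of generated strings; the function returns one
    float per completion. Here the (assumed) rule grants +1.0 when the
    completion wraps its final answer in \\boxed{...}, else 0.0.
    """
    rewards = []
    for text in completions:
        rewards.append(1.0 if re.search(r"\\boxed\{.+?\}", text) else 0.0)
    return rewards

# A function like this would be passed to the trainer, e.g.
# GRPOTrainer(model=..., reward_funcs=rule_based_reward, ...)
scores = rule_based_reward(["The answer is \\boxed{42}.", "I am not sure."])
```

Verifiable rewards like this avoid training a separate reward model, which is one reason GRPO pairs well with math tasks where answers can be checked mechanically.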
Key Differentiator: GRPO Training
The primary distinction of this model lies in its training methodology. It employs GRPO (Group Relative Policy Optimization), a technique detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO replaces PPO's learned value baseline with statistics computed over a group of sampled completions per prompt, and is designed to significantly improve a model's reasoning abilities, particularly in complex domains like mathematics.
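The core idea can be illustrated in a few lines: for each prompt, GRPO samples a group of completions, scores them with the reward function, and normalizes each reward against the group's own mean and standard deviation to obtain per-completion advantages, with no critic network required. A minimal sketch of that group-relative normalization:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Compute GRPO-style advantages for one group of sampled completions.

    Each completion's advantage is its reward normalized by the group's
    mean and standard deviation; `eps` guards against division by zero
    when all rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for one prompt, scored 1.0 (correct) or 0.0 (wrong):
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions above the group mean get positive advantages and are reinforced; those below get negative advantages, so the policy learns from relative quality within each group.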
Capabilities
- Enhanced Reasoning: Leverages GRPO for improved logical and mathematical reasoning. While the specific benchmarks are not provided, the underlying method targets these areas.
- Base Model Foundation: Built upon the robust Qwen3-1.7B-Base, inheriting its general language understanding and generation capabilities.
- Extended Context: Supports a substantial context window of 32,768 tokens, allowing for processing longer inputs and maintaining coherence over extended interactions.
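Since this is a standard causal LM checkpoint, it can presumably be loaded with the Hugging Face `transformers` library like any other Qwen3 model. The snippet below is a minimal usage sketch, not an official example from the model author; the prompt and generation settings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_1p0_0p5_1p0_grpo_42_rule"

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Complete `prompt` with the model (weights download on first call)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example call (uncomment to run; requires the weights to be available):
# print(generate("Solve step by step: what is 17 * 23?"))
```

Note that this is a fine-tune of a *base* model, so it has no chat template; plain-text prompting as above is the expected interface.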
Good For
- Mathematical Problem Solving: Ideal for applications requiring advanced mathematical reasoning or logical deduction, given its GRPO-based training.
- Complex Query Handling: Suitable for tasks where understanding and generating responses based on intricate logical structures are crucial.
- Research and Experimentation: Provides a foundation for further research into GRPO-enhanced models and their application in reasoning tasks.