Kazuki1450/Qwen3-1.7B-Base_csum_6_10_1p0_0p0_1p0_grpo_42_rule
Kazuki1450/Qwen3-1.7B-Base_csum_6_10_1p0_0p0_1p0_grpo_42_rule is a fine-tuned version of the Qwen/Qwen3-1.7B-Base model, developed by Kazuki1450. It was trained with the TRL framework using the GRPO method, a reinforcement learning technique introduced to strengthen mathematical reasoning, and is optimized for advanced mathematical problem-solving on top of the Qwen3-1.7B-Base architecture.
Model Overview
This model, developed by Kazuki1450, is a specialized fine-tuned variant of the Qwen/Qwen3-1.7B-Base language model. It leverages the TRL (Transformers Reinforcement Learning) framework for its training process.
Key Differentiator: GRPO Method
The primary distinction of this model lies in its use of the GRPO (Group Relative Policy Optimization) method during training. GRPO is a reinforcement learning technique introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), and its use here indicates a strong focus on improving the model's proficiency in complex mathematical reasoning tasks.
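The core idea of GRPO, as described in the DeepSeekMath paper, is to score each sampled completion relative to the other completions drawn for the same prompt, normalizing rewards within the group instead of using a learned value function. A minimal stdlib-only sketch of that normalization step (the function name and binary 0/1 rewards are illustrative, not part of this model's released code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Turn one group's per-completion rewards into advantages.

    GRPO normalizes each reward against the group's mean and
    standard deviation: advantage = (r - mean) / std.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one math problem, scored 1.0 if correct, else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative advantages, and the advantages within each group sum to zero by construction.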
Training Details
- Base Model: Qwen/Qwen3-1.7B-Base
- Training Framework: TRL (version 0.29.0)
- Optimization Method: GRPO, as detailed in the DeepSeekMath paper.
- Framework Versions:
  - TRL: 0.29.0
  - Transformers: 4.57.3
  - PyTorch: 2.9.0
  - Datasets: 4.0.0
  - Tokenizers: 0.22.1
Potential Use Cases
Given its training with the GRPO method, this model is likely well-suited for applications requiring:
- Mathematical problem-solving
- Logical reasoning in quantitative contexts
- Tasks that benefit from enhanced numerical understanding