Kazuki1450/Qwen3-1.7B-Base_csum_3_10_tok_accuracy_1p0_0p0_1p0_grpo_42_rule is a language model with roughly 2 billion parameters, fine-tuned from Qwen/Qwen3-1.7B-Base. It was trained with the GRPO method, which is designed to enhance mathematical reasoning, and supports a context length of 32768 tokens. It is particularly suited to applications that demand improved accuracy on complex reasoning problems and robust logical and mathematical processing.
Model Overview
This model, Kazuki1450/Qwen3-1.7B-Base_csum_3_10_tok_accuracy_1p0_0p0_1p0_grpo_42_rule, is a fine-tuned variant of Qwen3-1.7B-Base developed by Kazuki1450. It keeps the base model's roughly 2-billion-parameter architecture and supports a context length of 32768 tokens.
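The snippet below is a minimal usage sketch: it loads the checkpoint with the standard Hugging Face `transformers` auto classes and runs a short generation. The prompt text and generation settings are illustrative and not taken from the model card.

```python
# Minimal usage sketch (assumes transformers, torch, and accelerate are installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kazuki1450/Qwen3-1.7B-Base_csum_3_10_tok_accuracy_1p0_0p0_1p0_grpo_42_rule"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative math prompt; as a base-model variant it is used in plain completion mode.
prompt = "Question: If a train travels 60 km in 45 minutes, what is its average speed in km/h?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```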
Key Capabilities
- Enhanced Mathematical Reasoning: The model was trained with GRPO (Group Relative Policy Optimization), a method introduced in the DeepSeekMath paper to significantly improve a model's ability to handle mathematical and logical reasoning tasks.
- Fine-tuned with TRL: Training used the TRL (Transformer Reinforcement Learning) library, Hugging Face's framework for post-training transformer models with reinforcement-learning techniques such as RLHF and GRPO; a minimal training sketch follows this list.
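As referenced above, the sketch below shows roughly how a GRPO run is wired up with TRL's `GRPOTrainer`. The dataset, reward function, and hyperparameters here are placeholders; the actual data and reward used for this checkpoint (hinted at by the `csum` and `tok_accuracy` parts of the name) are not documented in the card.

```python
# Hedged sketch of a GRPO fine-tuning run with TRL; dataset and reward are placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any dataset with a "prompt" column works; this public one is only an example.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_brevity(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    return [-abs(200 - len(c)) / 200.0 for c in completions]

args = GRPOConfig(output_dir="qwen3-1.7b-grpo", num_generations=8, max_completion_length=256)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B-Base",
    reward_funcs=reward_brevity,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```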
Training Details
Training runs for this model are logged publicly on Weights & Biases, offering transparency into its development. GRPO is the core component of the training recipe, aimed at pushing the limits of mathematical reasoning in open language models.
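The `accuracy` and `rule` parts of the repository name suggest a rule-based correctness reward, but the exact reward is not stated in the card. The function below is therefore only a hypothetical example of the kind of verifiable reward commonly paired with GRPO for math tasks: it extracts a boxed final answer from each completion and compares it to a reference `answer` column assumed to exist in the training dataset.

```python
import re

def accuracy_reward(completions, answer, **kwargs):
    """Hypothetical rule-based reward: 1.0 when the extracted \\boxed{...} answer
    matches the reference string, 0.0 otherwise. TRL passes extra dataset
    columns (here the assumed "answer" column) to reward functions as kwargs."""
    rewards = []
    for completion, ref in zip(completions, answer):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        predicted = match.group(1).strip() if match else ""
        rewards.append(1.0 if predicted == str(ref).strip() else 0.0)
    return rewards
```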
Good For
- Applications requiring improved accuracy in mathematical problem-solving.
- Tasks that benefit from logical reasoning and structured thinking.
- Developers looking for a Qwen3-1.7B-Base variant with specialized reasoning enhancements.