Overview
This model, Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_1p0_0p5_1p0_grpo_dr_grpo_42_rule, is a fine-tuned variant of Qwen3-1.7B-Base, developed by Kazuki1450. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300).
Key Capabilities
- Enhanced Mathematical Reasoning: The primary differentiator of this model is its fine-tuning with GRPO, which aims to improve its ability to handle complex mathematical and logical reasoning tasks.
- Base Model Architecture: Built upon the Qwen3-1.7B-Base, it inherits the foundational language understanding and generation capabilities of the Qwen family.
- TRL Framework: Training was conducted with the TRL (Transformer Reinforcement Learning) library, indicating reinforcement-learning-based post-training rather than plain supervised fine-tuning.
Training Details
The model was trained with the GRPO method, which is designed to push the boundaries of mathematical reasoning in language models. Training used the following framework versions: TRL 0.29.0, Transformers 4.57.3, PyTorch 2.9.0, Datasets 4.0.0, and Tokenizers 0.22.1.
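The core idea behind GRPO is that, instead of learning a separate value model, each sampled completion's reward is normalized against the other completions drawn for the same prompt. A minimal sketch of that group-relative advantage computation (a simplified illustration, not the training code used for this model):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each completion's reward
    against the mean and std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical example: 4 completions sampled for one prompt,
# scored 1.0 (correct) or 0.0 (incorrect) by a rule-based reward.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # → [1.0, -1.0, 1.0, -1.0]
```

Correct completions receive positive advantages and incorrect ones negative, so the policy is pushed toward the better answers within each group. Note that the "dr_grpo" tag in the model name suggests a Dr. GRPO-style variant, which modifies this normalization; the sketch above shows the original formulation.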
Good For
- Applications requiring improved mathematical problem-solving.
- Tasks that benefit from enhanced logical reasoning.
- Developers looking for a compact model (1.7B parameters) with specialized reasoning capabilities.