Kazuki1450/Qwen3-1.7B-Base_csum_3_10_1p0_0p0_1p0_grpo_42_rule

Text Generation | Concurrency Cost: 1 | Model Size: 2B | Quant: BF16 | Ctx Length: 32k | Published: Mar 18, 2026 | Architecture: Transformer | Cold

Kazuki1450/Qwen3-1.7B-Base_csum_3_10_1p0_0p0_1p0_grpo_42_rule is a language model with approximately 2 billion parameters, fine-tuned from Qwen/Qwen3-1.7B-Base. It was trained with the TRL framework using GRPO, a reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning. It is intended for tasks that involve logical and mathematical problem-solving.

Overview

Kazuki1450/Qwen3-1.7B-Base_csum_3_10_1p0_0p0_1p0_grpo_42_rule is a fine-tuned variant of Qwen/Qwen3-1.7B-Base with approximately 2 billion parameters and a context length of 32768 tokens. It was developed by Kazuki1450 and trained using the TRL (Transformer Reinforcement Learning) library.
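As a rough illustration of how a checkpoint like this is typically used, the sketch below loads it with the Hugging Face transformers library in BF16 and generates a continuation. It assumes the weights are available on the Hugging Face Hub under the repository name given above; the helper names `fits_in_context` and `generate` are hypothetical and not part of the model card.

```python
MODEL_ID = "Kazuki1450/Qwen3-1.7B-Base_csum_3_10_1p0_0p0_1p0_grpo_42_rule"
MAX_CONTEXT_TOKENS = 32768  # context length stated in the overview


def fits_in_context(prompt_tokens: int, max_new_tokens: int) -> bool:
    """Hypothetical helper: check that prompt plus generation fits the context window."""
    return prompt_tokens + max_new_tokens <= MAX_CONTEXT_TOKENS


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint (assumed to be on the Hub) and return only the new text."""
    # Imported lazily so the constants above can be used without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    if not fits_in_context(prompt_len, max_new_tokens):
        raise ValueError("prompt plus requested generation exceeds the context window")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the generated continuation is decoded.
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Because the checkpoint derives from a base (non-chat) model, a plain text prompt such as `generate("Question: What is 3 + 10 + 4?\nAnswer:")` is probably a more natural fit than a chat template, though the prompt format used during training is not stated here.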

Key Capabilities

  • Enhanced Mathematical Reasoning: The model's main differentiator is its training with GRPO (Group Relative Policy Optimization). GRPO was introduced in the DeepSeekMath paper as a reinforcement learning method for improving a model's performance on mathematical reasoning tasks.
  • Base Model Foundation: Built on Qwen3-1.7B-Base, it inherits the general language understanding and generation capabilities of that model.
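The core idea of GRPO in the DeepSeekMath paper is to sample a group of completions for each prompt and score every completion relative to the rest of its group, rather than against a learned value function. The sketch below shows only that normalization step in plain Python. The function name and the `eps` stabilizer are illustrative assumptions, and the use of the population standard deviation is a choice made here; implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's completions to zero mean and unit scale.

    `rewards` holds one scalar reward per sampled completion of the same prompt.
    `eps` (an assumption of this sketch) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation within the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, two judged correct (reward 1) and two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average receive a positive advantage and are reinforced, while those below average are pushed down. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.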

Training Details

The model was trained with the TRL library (version 0.29.0) using the GRPO method described in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". This training approach targets tasks that benefit from structured reasoning and problem-solving.
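For orientation, a GRPO run in TRL is usually set up with `GRPOConfig` and `GRPOTrainer` plus one or more reward functions. The sketch below is not the author's training script: the reward function, the `answer` dataset column, the output directory, and the default hyperparameters are all assumptions made for illustration, since the actual reward, data, and settings are not stated here.

```python
import re


def last_integer(text):
    """Return the last integer that appears in `text`, or None if there is none."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


def exact_answer_reward(completions, answer, **kwargs):
    """Hypothetical rule-based reward: 1.0 if the last integer matches the reference.

    TRL calls reward functions with the generated completions and passes extra
    dataset columns (here an assumed `answer` column) as keyword arguments.
    """
    return [
        1.0 if last_integer(completion) == int(reference) else 0.0
        for completion, reference in zip(completions, answer)
    ]


def build_trainer(train_dataset, output_dir="grpo-sketch"):
    """Assemble a GRPOTrainer; `train_dataset` needs `prompt` and `answer` columns."""
    # Imported lazily so the reward function can be tested without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(output_dir=output_dir)  # library defaults, not the author's settings
    return GRPOTrainer(
        model="Qwen/Qwen3-1.7B-Base",
        reward_funcs=exact_answer_reward,
        args=config,
        train_dataset=train_dataset,
    )
```

Calling `build_trainer(dataset).train()` would start training. A rule-based reward like this one needs no separate reward model, which keeps small-scale experiments cheap.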

Good For

  • Applications requiring improved mathematical and logical reasoning.
  • Developers looking for a compact model (2B parameters) with specialized reasoning enhancements.
  • Experimentation with models trained using advanced reinforcement learning techniques like GRPO for specific task improvements.