Kazuki1450/Qwen3-1.7B-Base_csum_6_10_tok_aligned_1p0_0p0_1p0_grpo_42_rule
Text Generation · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Jan 12, 2026 · Architecture: Transformer

Kazuki1450/Qwen3-1.7B-Base_csum_6_10_tok_aligned_1p0_0p0_1p0_grpo_42_rule is a language model with roughly 2 billion (1.7B) parameters, fine-tuned from Qwen/Qwen3-1.7B-Base. It was trained with the GRPO method introduced in the DeepSeekMath paper. With a context length of 40960 tokens, it targets tasks that require advanced mathematical reasoning, and its training methodology suggests a focus on numerical and logical problem-solving in open language models.


Model Overview

This model, developed by Kazuki1450, is a fine-tuned variant of the Qwen3-1.7B-Base architecture, featuring approximately 2 billion parameters and a substantial context length of 40960 tokens. Its development leveraged the TRL framework for training.

Key Capabilities

  • Mathematical Reasoning: The model's training incorporates GRPO (Group Relative Policy Optimization), a technique introduced in the "DeepSeekMath" paper for pushing the limits of mathematical reasoning in open language models. This suggests an enhanced ability to solve complex mathematical problems.
  • Base Model Enhancement: It builds upon the foundational capabilities of the Qwen3-1.7B-Base model, implying a strong general language understanding and generation base, now specialized for mathematical tasks.
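The core idea of GRPO can be sketched in a few lines: for each prompt, a group of responses is sampled, and each response's reward is normalized against the group's mean and standard deviation to form its advantage, removing the need for a learned value function. A minimal illustration (not the model's actual training code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    This is the group-relative advantage at the heart of GRPO: only a
    group of sampled responses to the same prompt is needed, no critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math prompt, scored 0/1 by a rule.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantage and incorrect ones with negative advantage, so the policy gradient pushes probability mass toward the better responses within each group.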

Training Details

The model was trained with the TRL (Transformer Reinforcement Learning) framework. The development environment used TRL 0.23.0, Transformers 4.57.1, PyTorch 2.7.1+cu128, Datasets 4.4.1, and Tokenizers 0.22.1. GRPO is the central element of the fine-tuning process, aimed at improving performance on mathematical and logical problem-solving.
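The "_rule" suffix in the model name suggests a rule-based, verifiable reward was used during GRPO training, as is common for mathematical reasoning tasks. The sketch below is a hypothetical example of such a reward function; the extraction pattern and scoring are assumptions, not the author's actual implementation:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Score 1.0 if the completion's final answer matches the reference,
    else 0.0 (a common verifiable-reward rule for math tasks)."""
    # Prefer an explicitly boxed answer; otherwise fall back to the
    # last number that appears in the completion.
    boxed = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if boxed:
        candidate = boxed[-1].strip()
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        candidate = numbers[-1] if numbers else ""
    return 1.0 if candidate == gold_answer.strip() else 0.0
```

A binary reward like this pairs naturally with GRPO's group-relative advantages: within a sampled group, correct responses are rewarded relative to incorrect ones without any learned reward model.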