Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_0p5_0p0_1p0_grpo_sapo_42_rule

Text Generation · Model size: 2B · Quantization: BF16 · Context length: 32k · Published: Mar 26, 2026 · Architecture: Transformer

Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_0p5_0p0_1p0_grpo_sapo_42_rule is a 1.7 billion parameter language model fine-tuned from Qwen/Qwen3-1.7B-Base. It was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper, to strengthen mathematical reasoning. With a context length of 32,768 tokens, it targets tasks that demand robust logical and mathematical processing, and it suits applications where a small model with strong reasoning skills is preferable to a larger one.
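The following is a minimal usage sketch, assuming the checkpoint resolves on the Hugging Face Hub under the repository id above and exposes the standard transformers causal-LM interface; the prompt is illustrative only.

```python
# Minimal sketch: load the model with transformers and generate from a prompt.
# Assumes the repo id resolves on the Hugging Face Hub and that the standard
# causal-LM interface applies; the prompt below is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_0p5_0p0_1p0_grpo_sapo_42_rule"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the listing reports BF16 weights
    device_map="auto",
)

prompt = "Question: What is 17 * 24? Answer step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```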


Model Overview

Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_0p5_0p0_1p0_grpo_sapo_42_rule is a 1.7 billion parameter language model fine-tuned from the Qwen/Qwen3-1.7B-Base architecture. It is distinguished by its training procedure, which uses GRPO (Group Relative Policy Optimization).

Key Capabilities

  • Enhanced Mathematical Reasoning: Training with GRPO, a technique detailed in the DeepSeekMath paper, targets tasks that require logical and mathematical problem-solving.
  • Base Model Foundation: Built on Qwen3-1.7B-Base, the model inherits the foundational language understanding and generation capabilities of the Qwen family.
  • Fine-tuned with TRL: The model was fine-tuned with the TRL library, indicating a reinforcement-learning approach to aligning its outputs; a minimal training sketch follows this list.
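
To make the training setup concrete, here is a rough GRPO fine-tuning sketch using TRL's GRPOTrainer. The dataset and rule-based reward function are hypothetical stand-ins; the actual data and reward used for this checkpoint are not documented here.

```python
# Hedged sketch of GRPO fine-tuning with TRL's GRPOTrainer.
# The dataset and reward function are hypothetical placeholders; the actual
# recipe behind this checkpoint is not documented on the card.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical math dataset, renamed so it has the "prompt" column
# that GRPOTrainer expects for plain-text (non-conversational) data.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def rule_based_reward(completions, **kwargs):
    """Toy rule-based reward: favor completions that state a final answer."""
    return [1.0 if "answer" in c.lower() else 0.0 for c in completions]

training_args = GRPOConfig(output_dir="qwen3-1.7b-grpo", num_generations=6)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B-Base",
    reward_funcs=rule_based_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```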

When to Use This Model

This model is particularly well-suited for:

  • Mathematical or Logical Tasks: Applications that benefit from improved reasoning, especially in mathematical contexts, can leverage the model's GRPO-enhanced training (see the example after this list).
  • Resource-Constrained Environments: At 1.7 billion parameters, the model balances capability against computational cost relative to larger models.
  • Exploration of GRPO Benefits: Developers who want to experiment with models trained using advanced policy-optimization techniques for reasoning tasks.
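
For the mathematical use case above, a short generation sketch via the transformers pipeline; the prompt wording and sampling settings are assumptions, since the card documents no recommended template.

```python
# Sketch: prompting the model on a math word problem via the text-generation
# pipeline. Prompt wording and sampling settings are assumptions; the card
# does not document a recommended template.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_0p5_0p0_1p0_grpo_sapo_42_rule",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

problem = (
    "A train travels 120 km in 1.5 hours, then 80 km in 1 hour. "
    "What is its average speed for the whole trip? Think step by step."
)
result = generator(problem, max_new_tokens=256, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```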