Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_fnr_eng_1p0_0p0_1p0_grpo_42_rule

Task: Text Generation | Concurrency Cost: 1 | Model Size: 2B | Quant: BF16 | Ctx Length: 32k | Published: Mar 26, 2026 | Architecture: Transformer

Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_fnr_eng_1p0_0p0_1p0_grpo_42_rule is a language model with approximately 2 billion parameters, fine-tuned from Qwen/Qwen3-1.7B-Base. It was trained with GRPO, a reinforcement learning method introduced to improve mathematical reasoning. It targets tasks that need stronger reasoning, particularly in mathematical contexts, and keeps the base Qwen3 architecture.


Model Overview

This model, developed by Kazuki1450, is a fine-tuned version of Qwen/Qwen3-1.7B-Base, with approximately 2 billion parameters and a 32,768-token context length. It was trained with GRPO (Group Relative Policy Optimization), the method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models".
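To make the "group relative" part of GRPO concrete, the sketch below normalizes rewards within one group of completions sampled for the same prompt, following the advantage definition in the DeepSeekMath paper. The function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not details taken from this model's training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is credited only for being better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions receive positive advantages and incorrect ones negative advantages. A group in which every completion gets the same reward yields all-zero advantages and therefore contributes no learning signal.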

Key Capabilities

  • Enhanced Mathematical Reasoning: The application of the GRPO training method suggests an optimization for tasks that involve mathematical reasoning, building on the foundational capabilities of the Qwen3-1.7B-Base model.
  • Fine-tuned Performance: Fine-tuned with the TRL (Transformer Reinforcement Learning) library, indicating a focus on improving performance on specific tasks.
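For readers who want to try a comparable setup, here is a minimal sketch of wiring a rule-based reward into TRL's `GRPOTrainer`. The card does not document the actual reward, dataset, or hyperparameters, so `exact_answer_reward`, the `answer` dataset column, and `build_trainer` are hypothetical names chosen for illustration.

```python
import re


def exact_answer_reward(completions, answer, **kwargs):
    """Hypothetical rule-based reward, not the reward used for this model.

    Returns 1.0 when the last integer in a completion equals the reference
    answer for that prompt, else 0.0.
    """
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards


def build_trainer(train_dataset, output_dir="grpo-sketch"):
    """Assumes train_dataset has a "prompt" column and an "answer" column."""
    # Imported lazily so the reward function can be used without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    return GRPOTrainer(
        model="Qwen/Qwen3-1.7B-Base",
        reward_funcs=exact_answer_reward,
        args=GRPOConfig(output_dir=output_dir),
        train_dataset=train_dataset,
    )
```

Calling `build_trainer(dataset).train()` would start training. TRL passes extra dataset columns such as `answer` to the reward function as keyword arguments, which is why the reference answers arrive as a list aligned with `completions`.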

Training Details

The model was trained with the GRPO method described in the DeepSeekMath paper, an approach that aims to improve the model's ability to solve complex mathematical problems. Training used TRL 0.29.0, Transformers 4.57.3, PyTorch 2.9.0, Datasets 4.0.0, and Tokenizers 0.22.1.
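When reproducing or debugging against this model, it can help to check a local environment against the reported framework versions. The following sketch does that with the standard library. The pip distribution names, in particular `torch` for PyTorch, and the helper `compare_versions` are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in the training details; keys are assumed pip names.
REPORTED_VERSIONS = {
    "trl": "0.29.0",
    "transformers": "4.57.3",
    "torch": "2.9.0",
    "datasets": "4.0.0",
    "tokenizers": "0.22.1",
}


def compare_versions(reported=REPORTED_VERSIONS):
    """Return {package: (reported, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in reported.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        # Local build tags such as "2.9.0+cu128" still count as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in compare_versions().items():
    print(f"{package}: reported {wanted}, installed {installed or 'not installed'}")
```

A mismatch does not necessarily prevent the model from loading. The check only flags where local behavior may differ from the training environment.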

Good for

  • Applications requiring a compact model with improved mathematical reasoning abilities.
  • Experimentation with models fine-tuned using advanced reinforcement learning techniques like GRPO.
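As a starting point for either use, the sketch below loads the checkpoint with the Transformers library and generates a greedy completion in BF16, the precision listed for this model. The prompt template in `build_prompt` is hypothetical, since the card does not document a prompt format, and `generate_answer` is an illustrative helper rather than an official API.

```python
MODEL_ID = "Kazuki1450/Qwen3-1.7B-Base_dsum_3_6_fnr_eng_1p0_0p0_1p0_grpo_42_rule"


def build_prompt(question):
    """Hypothetical plain-text prompt; the card documents no prompt format."""
    return f"Question: {question}\nAnswer:"


def generate_answer(question, max_new_tokens=256):
    # Imported lazily so build_prompt works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("What is 17 + 25?")` downloads the weights on first use. Because the model is derived from a base checkpoint rather than a chat model, a plain completion-style prompt is assumed here instead of a chat template.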