Kazuki1450/Qwen3-1.7B-Base_csum_3_10_tok_dollars_1p0_0p0_1p0_grpo_42_rule
Kazuki1450/Qwen3-1.7B-Base_csum_3_10_tok_dollars_1p0_0p0_1p0_grpo_42_rule is a 2 billion parameter language model, fine-tuned from Qwen/Qwen3-1.7B-Base, with a 32768-token context length. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced to improve mathematical reasoning in language models. The fine-tuning targets reasoning-heavy tasks while keeping the base Qwen3 architecture unchanged.
Model Overview
This model, Kazuki1450/Qwen3-1.7B-Base_csum_3_10_tok_dollars_1p0_0p0_1p0_grpo_42_rule, is a fine-tuned variant of Qwen/Qwen3-1.7B-Base with approximately 2 billion parameters and a 32768-token context window. It was developed by Kazuki1450 and fine-tuned using the TRL library.
Key Differentiator: GRPO Fine-tuning
A core aspect of this model is its training methodology, GRPO (Group Relative Policy Optimization). This technique, introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," samples a group of completions for each prompt, scores them with a reward, and updates the policy using each completion's reward relative to the rest of its group, which removes the need for a separate value model. By applying GRPO, this Qwen3-based model aims to improve performance on tasks that demand logical and mathematical reasoning.
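The group-relative scoring at the heart of GRPO can be illustrated with a minimal sketch. It normalizes the rewards of one prompt's sampled completions by the group mean and standard deviation, as described in the DeepSeekMath paper. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions for illustration, not details taken from this model's training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one group's rewards into advantages.

    `rewards` holds the scalar reward of each completion sampled for a
    single prompt. `eps` is a hypothetical stabilizer that avoids division
    by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions for one prompt, scored 1.0 if correct else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Completions that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.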
Training Details
The model was trained using TRL (Transformer Reinforcement Learning) with the following framework versions: TRL 0.29.0, Transformers 4.57.3, PyTorch 2.9.0, Datasets 4.0.0, and Tokenizers 0.22.1. The training run is publicly viewable on Weights & Biases.
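To compare a local environment against the versions listed above, a small check along these lines may help. The package names in `EXPECTED` are assumed to be the usual PyPI distribution names (for example `torch` for PyTorch), and `compare_versions` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the model card; distribution names are assumptions.
EXPECTED = {
    "trl": "0.29.0",
    "transformers": "4.57.3",
    "torch": "2.9.0",
    "datasets": "4.0.0",
    "tokenizers": "0.22.1",
}


def compare_versions(expected):
    """Return {package: (expected, installed, matches)} for each package."""
    report = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # package is not installed
        # Ignore local build suffixes such as "+cu128" when comparing.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (want, have, matches)
    return report


for name, (want, have, ok) in compare_versions(EXPECTED).items():
    print(f"{name}: expected {want}, installed {have}, match={ok}")
```

Mismatched versions do not necessarily prevent inference; the check is only meant to flag differences from the training environment.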
Use Cases
Given its GRPO-based training, this model is intended for applications requiring:
- Mathematical problem-solving
- Logical reasoning tasks
- Complex analytical queries
Developers can quickly integrate the model using the provided transformers pipeline for text generation tasks.
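A minimal sketch of that integration follows, using the Transformers `pipeline` API for text generation. The prompt format in `build_prompt` and the default `max_new_tokens` value are assumptions for illustration; the model card does not specify a prompt template, and since this is a fine-tune of a base model, a plain-text prompt is used rather than a chat template.

```python
MODEL_ID = "Kazuki1450/Qwen3-1.7B-Base_csum_3_10_tok_dollars_1p0_0p0_1p0_grpo_42_rule"


def build_prompt(question):
    """Hypothetical plain-text prompt; the card does not define a format."""
    return f"Question: {question}\nAnswer:"


def generate(question, max_new_tokens=256):
    """Generate a completion for `question` with the fine-tuned model."""
    # Imported lazily so the helpers above work without Transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        build_prompt(question),
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the newly generated text
    )
    return outputs[0]["generated_text"]
```

Calling `generate("What is 17 + 25?")` downloads the model weights on first use and returns the generated continuation as a string. For repeated calls, building the pipeline once and reusing it would avoid reloading the model each time.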