ahme0599/Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4

Hosted on Hugging Face · Text generation · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Dec 15, 2025 · Architecture: Transformer

ahme0599/Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4 is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with GRPO (Group Relative Policy Optimization), a method originally introduced for mathematical reasoning in DeepSeekMath, using the TRL library. The model is optimized for instruction-following tasks.


Model Overview

This model, ahme0599/Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4, is a 1.5 billion parameter instruction-tuned language model. It is a fine-tuned variant of the base Qwen/Qwen2.5-1.5B-Instruct model, developed by Qwen.
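Since the checkpoint is a standard Qwen2.5-style causal LM, it can presumably be loaded with the Transformers library like any other chat model. The sketch below is illustrative only (the repo name is taken from this card; running `generate` requires network access and enough memory for a 1.5B model):

```python
# Hypothetical inference sketch for this checkpoint using Transformers.
MODEL_ID = "ahme0599/Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4"

def build_chat(user_prompt: str) -> list[dict]:
    """Wrap a user prompt in the chat-message format Qwen2.5 models expect."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt},
    ]

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # Heavy imports kept inside the function so build_chat stays importable
    # without transformers/torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    text = tokenizer.apply_chat_template(
        build_chat(prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

if __name__ == "__main__":
    print(generate("Explain GRPO in one sentence."))
```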

Key Differentiator: GRPO Training

The primary distinction of this model lies in its training methodology. It has been fine-tuned using GRPO (Group Relative Policy Optimization), a method first detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." GRPO is known for strengthening a model's ability to follow complex instructions and to reason through structured or logical tasks, although its application here is general instruction following rather than mathematics specifically.
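The core idea of GRPO is to sample a group of G completions per prompt and normalize each completion's reward against the group's mean and standard deviation, replacing the learned value/critic model used in PPO. A minimal sketch of that advantage computation (the rewards below are made-up numbers, and the reading of the "_G_4" suffix as group size G = 4 is an assumption, not stated on the card):

```python
# Sketch of GRPO's group-relative advantage estimate: each completion's
# reward is standardized against its own sampling group, so no separate
# value model is needed.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative rewards for G = 4 completions of one prompt (assuming the
# "_G_4" suffix in the model name denotes group size 4).
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)  # best completion gets a positive advantage, worst negative
```

Completions scoring above the group mean are reinforced and those below are penalized, which is what makes the method "group relative."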

Training Framework

The model's fine-tuning process leveraged the TRL (Transformer Reinforcement Learning) library, indicating a reinforcement learning approach to align the model with human preferences or specific task objectives. The training utilized TRL version 0.25.1, along with Transformers 4.57.3, PyTorch 2.9.1, Datasets 4.4.1, and Tokenizers 0.22.1.
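TRL ships a `GRPOTrainer` for this workflow. The card does not document the actual dataset or reward function, so the script below is a hypothetical reconstruction: the dataset is the example set from TRL's GRPO docs, the reward function is a placeholder, and `num_generations=4` reflects the assumption that "_G_4" denotes the group size.

```python
# Hypothetical GRPO fine-tuning setup with TRL. Dataset and reward function
# are placeholders, NOT the ones used to produce this checkpoint.
def reward_num_unique_chars(completions, **kwargs):
    """Placeholder reward: favors completions with more distinct characters."""
    return [float(len(set(c))) for c in completions]

def main():
    # Imported lazily so the reward function above stays importable
    # without trl/datasets installed.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("trl-lib/tldr", split="train")  # example dataset
    args = GRPOConfig(
        output_dir="Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4",
        num_generations=4,  # group size G per prompt (assumed from "_G_4")
    )
    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-1.5B-Instruct",  # base model named on the card
        reward_funcs=reward_num_unique_chars,
        args=args,
        train_dataset=dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()
```

In practice the reward function is where most of the design effort goes; GRPO accepts any callable mapping completions to scalar rewards.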

Potential Use Cases

Given its instruction-tuned nature and GRPO training, this model is suitable for:

  • General instruction following: Responding to user prompts and queries.
  • Conversational AI: Engaging in dialogue based on given instructions.
  • Reasoning tasks: Potentially performing well on tasks requiring structured thought, influenced by its GRPO heritage.