thangvip/qwen2.5-1.5b-grpo-sgd-linear

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Context Length: 32k · Published: Feb 17, 2026 · Architecture: Transformer

The thangvip/qwen2.5-1.5b-grpo-sgd-linear model is a 1.5 billion parameter language model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement-learning method introduced in the DeepSeekMath paper for enhancing mathematical reasoning. With a context length of 32768 tokens, it is suited to applications that benefit from long-context reasoning and mathematical problem-solving.


Model Overview

This model, thangvip/qwen2.5-1.5b-grpo-sgd-linear, is a specialized 1.5 billion parameter language model derived from the Qwen2.5-1.5B-Instruct base. It was fine-tuned using the TRL (Transformer Reinforcement Learning) library.

Key Differentiator: GRPO Training

The most significant aspect of this model is its training methodology. It leverages GRPO (Group Relative Policy Optimization), a technique detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO dispenses with PPO's learned value function and instead scores each sampled completion relative to the other completions in its sampling group, which makes it well suited to reward signals such as answer correctness in mathematics.
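The core idea can be illustrated numerically: GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation to obtain its advantage. A minimal plain-Python sketch of that computation (an illustration of the idea, not the TRL implementation):

```python
from statistics import mean, stdev

def grpo_advantages(rewards):
    """Group-relative advantages: each completion's reward is normalized
    against the mean and standard deviation of its sampling group,
    replacing the learned critic used in PPO."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Example: four completions sampled for one math prompt, scored by a
# rule-based correctness reward (1.0 = correct answer, 0.0 = incorrect).
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative advantages, so the policy gradient pushes probability mass toward the better responses within each group.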

Potential Use Cases

Given its GRPO-based training, this model is likely well-suited for:

  • Mathematical problem-solving: Tasks requiring logical deduction and numerical reasoning.
  • Complex reasoning tasks: Scenarios where understanding intricate relationships and drawing conclusions is crucial.
  • Instruction following: As it's fine-tuned from an instruct model, it should maintain strong instruction adherence, now augmented with improved reasoning.

Technical Details

  • Base Model: Qwen/Qwen2.5-1.5B-Instruct
  • Training Framework: TRL (Transformer Reinforcement Learning)
  • Context Length: 32768 tokens

Developers can quickly integrate this model using the transformers library for standard text-generation workflows.
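A minimal quick-start sketch with the transformers library, assuming the model is available on the Hugging Face Hub under this repo id (downloading weights requires network access; `device_map="auto"` additionally requires the accelerate package):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thangvip/qwen2.5-1.5b-grpo-sgd-linear"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# The base model is instruction-tuned, so apply its chat template.
messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The prompt here is illustrative; any chat-formatted input works, though the GRPO fine-tuning suggests the model will perform best on reasoning-style queries.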