AbbottYang/Qwen2-0.5B-GRPO-test

  • Task: Text generation
  • Model size: 0.5B
  • Quantization: BF16
  • Context length: 32k
  • Published: Mar 18, 2026
  • Architecture: Transformer

AbbottYang/Qwen2-0.5B-GRPO-test is a 0.5-billion-parameter causal language model fine-tuned from Qwen/Qwen2-0.5B-Instruct. It was trained with the GRPO method introduced in the DeepSeekMath paper, which focuses on enhancing mathematical reasoning. The model supports a 32,768-token context length and targets tasks that benefit from improved reasoning capabilities.


Model Overview

AbbottYang/Qwen2-0.5B-GRPO-test is a 0.5-billion-parameter language model fine-tuned from the base Qwen/Qwen2-0.5B-Instruct model, with a 32,768-token context window.
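A minimal inference sketch using the standard Transformers `AutoModelForCausalLM` API. The prompt and generation settings are illustrative assumptions, not values from this model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AbbottYang/Qwen2-0.5B-GRPO-test"


def generate(question: str, max_new_tokens: int = 256) -> str:
    """Download the checkpoint and answer a single-turn question."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    # Qwen2-Instruct checkpoints ship a chat template; use it to format the turn.
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


if __name__ == "__main__":
    print(generate("What is 17 * 24?"))
```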

Key Differentiator: GRPO Training

This model's primary distinction is its training methodology. It was fine-tuned using GRPO (Group Relative Policy Optimization), a reinforcement-learning method detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO estimates advantages relative to a group of sampled completions rather than with a separate learned value model, and is designed to enhance the model's reasoning capabilities, particularly in mathematical contexts.
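The core of GRPO is scoring each sampled completion against its own group's reward statistics instead of a learned critic. A minimal sketch of that group-relative advantage computation (illustrative, not the DeepSeekMath implementation):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each completion's reward against its group's mean and std.

    GRPO samples a group of completions per prompt; completions scoring above
    the group average receive positive advantages, those below it negative.
    `eps` guards against division by zero when all rewards in a group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The advantages in each group sum to zero, so the policy update pushes probability mass from below-average completions toward above-average ones.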

Training Framework

The fine-tuning process was conducted using the TRL library, with the following versions:

  • TRL: 0.29.0
  • Transformers: 5.3.0
  • PyTorch: 2.7.0+cu128
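A hedged sketch of how a fine-tune like this can be set up with TRL's `GRPOTrainer`. The dataset, reward function, and hyperparameters below are illustrative assumptions, not the settings used to produce this checkpoint:

```python
import re


def contains_number_reward(completions, **kwargs):
    """Toy reward: 1.0 if a completion contains a digit, else 0.0.

    Real GRPO runs for math reasoning would instead score the correctness
    of the final answer extracted from each completion.
    """
    return [1.0 if re.search(r"\d", c) else 0.0 for c in completions]


def main():
    # Imported lazily so the reward function above is usable standalone.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    # Hypothetical prompt dataset; swap in your own data with a "prompt" column.
    dataset = load_dataset("trl-lib/tldr", split="train")

    config = GRPOConfig(output_dir="Qwen2-0.5B-GRPO-test", num_generations=4)
    trainer = GRPOTrainer(
        model="Qwen/Qwen2-0.5B-Instruct",  # the base model named on this card
        reward_funcs=contains_number_reward,
        args=config,
        train_dataset=dataset,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```

Per group, `GRPOTrainer` samples `num_generations` completions for each prompt, applies the reward function, and normalizes rewards within the group to form advantages.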

Potential Use Cases

Given its GRPO-based training, this model is potentially suitable for:

  • Tasks requiring improved logical reasoning.
  • Applications where mathematical problem-solving is a component.
  • As a base for further fine-tuning on specific reasoning-intensive datasets.