SantiagoC/palindrome-grpo-v5

Text Generation · Concurrency Cost: 1 · Model Size: 0.5B · Quant: BF16 · Ctx Length: 32k · Published: May 5, 2026 · Architecture: Transformer

SantiagoC/palindrome-grpo-v5 is a 0.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-0.5B-Instruct. It was trained with GRPO, a reinforcement learning method designed to enhance mathematical reasoning. With a context length of 32,768 tokens, it is suited to tasks that benefit from improved reasoning, particularly in mathematical contexts.


Model Overview

SantiagoC/palindrome-grpo-v5 is a 0.5 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-0.5B-Instruct base model. It was trained with the TRL (Transformer Reinforcement Learning) framework.

Key Differentiator: GRPO Training

What sets this model apart is its training methodology. It was fine-tuned using GRPO (Group Relative Policy Optimization), a method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO dispenses with a separate value (critic) model: for each prompt it samples a group of completions and estimates each completion's advantage relative to the group's rewards. This approach aims to significantly improve the model's capabilities in mathematical reasoning tasks.
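The group-relative scoring at the heart of GRPO can be sketched without any framework. The sketch below normalizes per-completion rewards against the group mean and standard deviation; the palindrome reward function is purely hypothetical (chosen to echo the model's name), not the reward actually used in training:

```python
import statistics

def palindrome_reward(completion: str) -> float:
    # Hypothetical reward: 1.0 if the completion reads the same forwards
    # and backwards (ignoring case and non-alphanumerics), else 0.0.
    s = "".join(ch.lower() for ch in completion if ch.isalnum())
    return 1.0 if s == s[::-1] else 0.0

def group_relative_advantages(completions: list[str]) -> list[float]:
    # GRPO scores each sampled completion against its own group:
    #   advantage_i = (r_i - mean(r)) / (std(r) + eps)
    # so no learned value model is needed as a baseline.
    rewards = [palindrome_reward(c) for c in completions]
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    eps = 1e-8
    return [(r - mean_r) / (std_r + eps) for r in rewards]

group = ["step on no pets", "hello world", "racecar", "not a palindrome"]
print(group_relative_advantages(group))
```

Completions scoring above the group mean receive positive advantages and are reinforced; those below receive negative advantages.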

Usage

Developers can quickly integrate this model using the transformers library for text generation tasks. An example Python snippet is provided for immediate use, demonstrating how to load the model and tokenizer for inference.
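A minimal inference sketch with the transformers chat-template API follows; the prompt and sampling settings are illustrative, not recommended defaults for this model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SantiagoC/palindrome-grpo-v5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt for the instruction-tuned model.
messages = [
    {"role": "user", "content": "If x + 3 = 7, what is x? Show your reasoning."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```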

Training Details

The model was trained with specific versions of key frameworks:

  • TRL: 1.3.0
  • Transformers: 5.8.0
  • PyTorch: 2.11.0
  • Datasets: 4.8.5
  • Tokenizers: 0.22.2

This model is particularly suited for applications where enhanced mathematical reasoning and instruction following are critical, building upon the robust foundation of the Qwen2.5 architecture.