abhi14/test-grpo-delete-me

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Apr 24, 2026 · Architecture: Transformer

The abhi14/test-grpo-delete-me model is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained using the TRL framework and incorporates the GRPO method, which is designed to enhance mathematical reasoning capabilities. This model is optimized for tasks requiring improved mathematical problem-solving and logical deduction.


Overview

This model, abhi14/test-grpo-delete-me, is a 1.5 billion parameter language model fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct base model. Its development utilized the TRL framework for training.

Key Differentiator: GRPO Training

A significant aspect of this model is its training methodology, which incorporates GRPO (Group Relative Policy Optimization). GRPO is a reinforcement learning method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". Its use suggests an optimization focus on enhancing the model's capabilities in mathematical reasoning and problem-solving.

Training Frameworks

  • TRL: Version 1.2.0
  • Transformers: Version 5.6.2
  • PyTorch: Version 2.11.0
  • Datasets: Version 4.8.4
  • Tokenizers: Version 0.22.2

Potential Use Cases

Given its fine-tuning from an instruction-following model and the application of GRPO, this model is likely well-suited for:

  • Tasks requiring mathematical reasoning.
  • Instruction-following in contexts that benefit from logical deduction.
  • Applications where a smaller, specialized model for numerical or logical problems is preferred.
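For instance, a math word problem from the first use case could be posed through the `transformers` text-generation pipeline. This is a minimal sketch: the system prompt, question, and generation settings are illustrative, not documented defaults for this model.

```python
# Hypothetical inference example for abhi14/test-grpo-delete-me.
# Prompt wording and generation settings are illustrative assumptions.
import torch
from transformers import pipeline


def build_messages(question):
    """Wrap a math question in the chat format used by Qwen-style instruct models."""
    return [
        {"role": "system", "content": "You are a helpful math assistant."},
        {"role": "user", "content": question},
    ]


def main():
    generator = pipeline(
        "text-generation",
        model="abhi14/test-grpo-delete-me",
        torch_dtype=torch.bfloat16,  # matches the BF16 precision listed above
    )
    messages = build_messages(
        "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
    )
    output = generator(messages, max_new_tokens=256)
    print(output[0]["generated_text"][-1]["content"])


# main()  # commented out: downloads the 1.5B checkpoint
```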