shawntzx/Qwen2.5-3B-GRPO-3_5_8_6k

TEXT GENERATIONConcurrency Cost:1Model Size:3.1BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Mar 5, 2025Architecture:Transformer Cold

shawntzx/Qwen2.5-3B-GRPO-3_5_8_6k is a 3.1 billion parameter causal language model, fine-tuned from Qwen/Qwen2.5-3B-Instruct. This model utilizes the GRPO (Gradient-based Reward Policy Optimization) method, originally introduced for mathematical reasoning, to enhance its capabilities. With a context length of 32768 tokens, it is optimized for tasks requiring advanced reasoning, particularly in areas where GRPO's training methodology provides an advantage.

Loading preview...

Model Overview

shawntzx/Qwen2.5-3B-GRPO-3_5_8_6k is a 3.1 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-3B-Instruct base model. This model distinguishes itself by incorporating the GRPO (Gradient-based Reward Policy Optimization) training method. GRPO, detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), is designed to improve reasoning capabilities, particularly in complex domains.

Key Capabilities

  • Enhanced Reasoning: Leverages the GRPO training methodology to potentially improve performance on tasks requiring structured thought and problem-solving.
  • Qwen2.5 Base: Benefits from the strong foundational capabilities of the Qwen2.5-3B-Instruct model.
  • Extended Context: Supports a substantial context length of 32768 tokens, allowing for processing longer inputs and maintaining coherence over extended dialogues or documents.

Training Details

The model was fine-tuned using the TRL library, with specific framework versions including TRL 0.15.0.dev0, Transformers 4.49.0.dev0, Pytorch 2.5.1, Datasets 3.2.0, and Tokenizers 0.21.0.

Good For

  • Applications requiring improved reasoning, especially in areas where GRPO's benefits are applicable.
  • Tasks that can leverage a 3.1 billion parameter model with a large context window for detailed understanding and generation.
  • Developers looking to experiment with models trained using advanced optimization techniques like GRPO.