narcolepticchicken/occ-grpo-baseline

TEXT GENERATIONConcurrency Cost:1Model Size:3.1BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jun 30, 2026Architecture:Transformer Cold

The narcolepticchicken/occ-grpo-baseline model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct, developed by narcolepticchicken. This model was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. It is optimized for tasks requiring advanced reasoning, particularly in mathematical contexts, leveraging its Qwen2.5-3B architecture.

Loading preview...

Model Overview

The narcolepticchicken/occ-grpo-baseline is an instruction-tuned language model based on the Qwen/Qwen2.5-3B-Instruct architecture. It has been fine-tuned using the TRL (Transformers Reinforcement Learning) framework.

Key Capabilities

  • Enhanced Mathematical Reasoning: This model's primary differentiator is its training with the GRPO (Gradient-based Reward Policy Optimization) method. GRPO, introduced in the "DeepSeekMath" paper, aims to push the limits of mathematical reasoning in open language models.
  • Instruction Following: As a fine-tuned instruction model, it is designed to follow user prompts effectively, similar to its base model, Qwen2.5-3B-Instruct.

Training Details

The model was trained using specific versions of popular frameworks:

  • TRL: 1.7.0
  • Transformers: 5.12.1
  • Pytorch: 2.12.1
  • Datasets: 5.0.0
  • Tokenizers: 0.22.2

Use Cases

This model is particularly well-suited for applications requiring robust mathematical problem-solving and reasoning tasks, benefiting from its specialized GRPO training. Developers can integrate it using the Hugging Face pipeline for text generation.