narcolepticchicken/occ-grpo-baseline
The narcolepticchicken/occ-grpo-baseline model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct, developed by narcolepticchicken. This model was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. It is optimized for tasks requiring advanced reasoning, particularly in mathematical contexts, leveraging its Qwen2.5-3B architecture.
Loading preview...
Model Overview
The narcolepticchicken/occ-grpo-baseline is an instruction-tuned language model based on the Qwen/Qwen2.5-3B-Instruct architecture. It has been fine-tuned using the TRL (Transformers Reinforcement Learning) framework.
Key Capabilities
- Enhanced Mathematical Reasoning: This model's primary differentiator is its training with the GRPO (Gradient-based Reward Policy Optimization) method. GRPO, introduced in the "DeepSeekMath" paper, aims to push the limits of mathematical reasoning in open language models.
- Instruction Following: As a fine-tuned instruction model, it is designed to follow user prompts effectively, similar to its base model, Qwen2.5-3B-Instruct.
Training Details
The model was trained using specific versions of popular frameworks:
- TRL: 1.7.0
- Transformers: 5.12.1
- Pytorch: 2.12.1
- Datasets: 5.0.0
- Tokenizers: 0.22.2
Use Cases
This model is particularly well-suited for applications requiring robust mathematical problem-solving and reasoning tasks, benefiting from its specialized GRPO training. Developers can integrate it using the Hugging Face pipeline for text generation.