narcolepticchicken/occ-grpo-occ
The narcolepticchicken/occ-grpo-occ model is a 3.1 billion parameter instruction-tuned language model, fine-tuned from Qwen/Qwen2.5-3B-Instruct. It was trained using the GRPO (Gradient-based Reward Policy Optimization) method, as introduced in the DeepSeekMath paper, which focuses on enhancing mathematical reasoning. This model is optimized for tasks requiring improved reasoning capabilities, leveraging its Qwen2.5 base and specialized training approach.
Loading preview...
Overview
narcolepticchicken/occ-grpo-occ is a 3.1 billion parameter instruction-tuned language model, built upon the robust Qwen/Qwen2.5-3B-Instruct architecture. This model distinguishes itself through its specialized training methodology, employing GRPO (Gradient-based Reward Policy Optimization). The GRPO method, detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), is designed to enhance reasoning capabilities, particularly in complex domains.
Key Capabilities
- Enhanced Reasoning: Fine-tuned with GRPO, suggesting an optimization for tasks that benefit from improved logical and analytical processing.
- Instruction Following: As an instruction-tuned model, it is designed to accurately interpret and execute user prompts.
- Qwen2.5 Base: Benefits from the strong foundational capabilities of the Qwen2.5-3B-Instruct model, including a 32768-token context length.
Training Details
The model was trained using the TRL (Transformers Reinforcement Learning) framework, specifically version 1.7.0. The application of GRPO indicates a focus on refining the model's policy based on reward signals, a technique often used to improve performance in specific, challenging tasks like mathematical reasoning.
Good For
- Applications requiring a compact yet capable model for reasoning-intensive tasks.
- Scenarios where the base Qwen2.5-3B-Instruct model's performance needs a boost in logical coherence or problem-solving.
- Developers interested in exploring models fine-tuned with advanced reinforcement learning techniques like GRPO.