cheongmyeong17/Qwen2.5-MATH-1.5B-GRPO-Best
cheongmyeong17/Qwen2.5-MATH-1.5B-GRPO-Best is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B-Instruct. Developed by cheongmyeong17, this model specializes in mathematical reasoning tasks. It was trained using the GRPO method on the hendrycks-math-with-answers dataset, making it optimized for solving complex mathematical problems.
Loading preview...
Model Overview
cheongmyeong17/Qwen2.5-MATH-1.5B-GRPO-Best is a 1.5 billion parameter language model derived from Qwen/Qwen2.5-Math-1.5B-Instruct. This model has been specifically fine-tuned for enhanced mathematical reasoning capabilities.
Key Capabilities
- Mathematical Reasoning: Optimized for solving mathematical problems, leveraging training on the hendrycks-math-with-answers dataset.
- GRPO Training: Utilizes the GRPO (Gradient-based Reward Policy Optimization) method, as introduced in the DeepSeekMath paper, to improve performance in mathematical contexts.
- Instruction Following: Inherits instruction-following capabilities from its base model, Qwen2.5-Math-1.5B-Instruct.
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) framework. The training procedure involved fine-tuning with GRPO, a technique designed to push the limits of mathematical reasoning in language models. This specialized training makes it particularly adept at handling mathematical queries and problem-solving.