Model Overview
SoheylM/DeepSeek-R1-Distill-Qwen-14B-GRPO is a 14-billion-parameter language model derived from the deepseek-ai/DeepSeek-R1-Distill-Qwen-14B base model. It has been fine-tuned using GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the research paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
Key Capabilities
- Enhanced Mathematical Reasoning: Training on the IDEALLab/OpenR1-EPS-5k dataset with the GRPO method specifically improves the model's ability to handle complex mathematical problems and logical reasoning tasks.
- Instruction Following: Fine-tuning with TRL (Transformer Reinforcement Learning) helps the model follow instructions effectively for text generation tasks.
- Large Context Window: Supports a context length of 32,768 tokens, enabling it to process and generate longer, more complex inputs and outputs (see the inference sketch below).
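A minimal inference sketch with the Hugging Face transformers library is shown below. The model id comes from this card; the dtype, sampling settings, and example prompt are illustrative assumptions rather than recommended values.

```python
# Minimal inference sketch (assumes transformers and torch are installed).
# The model id is taken from this card; generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SoheylM/DeepSeek-R1-Distill-Qwen-14B-GRPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a 14B model on a single large GPU
    device_map="auto",
)

# Chat-style prompt; the DeepSeek-R1 distill models ship with a chat template.
messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```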
Training Details
The model was trained with the TRL library (version 0.17.0.dev0) using the GRPO technique. This approach is designed to push the boundaries of mathematical reasoning in open language models, making the model suitable for applications where precise and accurate mathematical problem-solving is crucial. A sketch of a comparable training setup follows.
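The sketch below shows the general shape of a GRPO run with TRL's GRPOTrainer. The base model and dataset ids come from this card; the reward function, dataset column name, and hyperparameters are illustrative assumptions, not the exact recipe used to produce this checkpoint.

```python
# Sketch of a GRPO fine-tuning run with TRL. Base model and dataset ids are from this card;
# the reward function and hyperparameters are placeholders, not the original training recipe.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Assumes the dataset exposes a "prompt" column, as GRPOTrainer expects.
dataset = load_dataset("IDEALLab/OpenR1-EPS-5k", split="train")

def reward_len(completions, **kwargs):
    # Hypothetical reward: prefer completions near a target length. A real math run
    # would instead score correctness, e.g. by checking the final boxed answer.
    return [-abs(len(c) - 200) / 200.0 for c in completions]

training_args = GRPOConfig(
    output_dir="DeepSeek-R1-Distill-Qwen-14B-GRPO",
    num_generations=8,            # completions sampled per prompt for the group-relative baseline
    per_device_train_batch_size=8,
    max_completion_length=1024,
    learning_rate=1e-6,
    bf16=True,
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

In practice, GRPO on math data typically rewards whether the model's final answer matches a reference solution; the length-based reward above is only a stand-in to keep the sketch self-contained.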
Good For
- Applications requiring strong mathematical reasoning.
- Tasks involving complex logical deduction.
- Generating responses to intricate questions that benefit from a deep understanding of mathematical concepts.