kangdawei/DRA-GRPO-8B
The kangdawei/DRA-GRPO-8B model is an 8 billion parameter language model fine-tuned from deepseek-ai/DeepSeek-R1-Distill-Llama-8B. It was trained using the GRPO method on the knoveleng/open-rs dataset, specializing in mathematical reasoning tasks. With a 32768 token context length, this model is optimized for complex problem-solving and advanced reasoning capabilities.
Model Overview
The kangdawei/DRA-GRPO-8B is an 8 billion parameter language model derived from the deepseek-ai/DeepSeek-R1-Distill-Llama-8B architecture. It has been specifically fine-tuned using the GRPO (Group Relative Policy Optimization) method, introduced in the DeepSeekMath paper, to enhance its mathematical reasoning capabilities.
Key Capabilities
- Mathematical Reasoning: Optimized for complex mathematical problem-solving through the GRPO training approach.
- Large Context Window: Supports a substantial context length of 32768 tokens, enabling processing of extensive inputs for reasoning tasks.
- Fine-tuned Performance: Leverages the knoveleng/open-rs dataset for specialized training, improving performance in its target domain.
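The model card does not include usage code, but as a standard causal LM on the Hub it should load with the Hugging Face `transformers` library. The sketch below is illustrative: the generation parameters and the example question are assumptions, not values from the card, and the chat template is inherited from the DeepSeek-R1 distill base model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kangdawei/DRA-GRPO-8B"


def build_prompt(tokenizer, question: str) -> str:
    # Use the tokenizer's chat template so the prompt matches the
    # format the DeepSeek-R1-Distill-Llama-8B base was trained on.
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    prompt = build_prompt(tokenizer, "What is 17 * 24? Show your reasoning.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # max_new_tokens is an illustrative choice; reasoning traces can be long.
    outputs = model.generate(**inputs, max_new_tokens=1024)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```

Because the model emits extended reasoning traces, leaving generous room in `max_new_tokens` (within the 32768-token context) is advisable.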
Training Details
The model's training procedure utilized the TRL framework (version 0.16.0.dev0) and incorporated the GRPO method, which is designed to push the limits of mathematical reasoning in open language models. This makes DRA-GRPO-8B a strong candidate for applications requiring advanced analytical and problem-solving skills.