zhaohq/PureRL-1.5B-v6b3-bare-fmt03
zhaohq/PureRL-1.5B-v6b3-bare-fmt03 is a 1.5-billion-parameter language model developed by zhaohq and fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained with the TRL framework using GRPO, the reinforcement learning method introduced in the DeepSeekMath paper. The model targets mathematical reasoning, building on its Qwen2.5-Math base, and has a context length of 32768 tokens.
Model Overview
zhaohq/PureRL-1.5B-v6b3-bare-fmt03 is a 1.5-billion-parameter language model fine-tuned by zhaohq from the base model Qwen/Qwen2.5-Math-1.5B. It is designed for mathematical reasoning tasks and builds on the capabilities of its math-focused base model.
Key Training Details
This model was trained using the TRL (Transformer Reinforcement Learning) framework. Its training procedure applies GRPO (Group Relative Policy Optimization), a method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO is a variant of PPO that drops the learned value model and instead scores each sampled completion relative to the other completions drawn for the same prompt, an approach aimed here at improving mathematical problem-solving.
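The group-relative scoring at the core of GRPO can be illustrated with a minimal sketch. It follows the outcome-supervision form in the DeepSeekMath paper, where each completion's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; this card does not describe the exact settings used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions sampled for a prompt.

    `eps` is an illustrative stabilizer (an assumption, not taken from this
    card) that avoids division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, so no separate value model is needed to provide a baseline.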
Capabilities and Use Cases
Given its foundation and specialized training, this model is particularly suited for:
- Mathematical Reasoning: Solving complex mathematical problems and generating logical steps.
- Instruction Following: Responding to user prompts in a structured and coherent manner, especially for analytical questions.
- Research and Development: Serving as a base for further experimentation in reinforcement learning for mathematical domains.
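For further reinforcement learning experiments, one possible starting point is a custom reward function passed to TRL's `GRPOTrainer`. The sketch below is hypothetical: the `\boxed{...}` answer convention, the `answer` dataset column, the function names, and the output directory are all assumptions for illustration and are not the reward or configuration used to train this model.

```python
import re

# Matches \boxed{...} without nested braces (so \boxed{\frac{1}{2}} is not handled).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def boxed_answer_reward(completions, answer, **kwargs):
    # Reward 1.0 when the last boxed value equals the reference answer, else 0.0.
    # `answer` is assumed to be a dataset column forwarded by the trainer.
    rewards = []
    for completion, reference in zip(completions, answer):
        found = BOXED.findall(completion)
        matched = bool(found) and found[-1].strip() == str(reference).strip()
        rewards.append(1.0 if matched else 0.0)
    return rewards


def build_trainer(train_dataset, output_dir="grpo-out"):
    # Imported lazily so the reward function can be tested without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    # `train_dataset` is assumed to provide "prompt" and "answer" columns.
    return GRPOTrainer(
        model="zhaohq/PureRL-1.5B-v6b3-bare-fmt03",
        reward_funcs=boxed_answer_reward,
        args=GRPOConfig(output_dir=output_dir),
        train_dataset=train_dataset,
    )
```

Exact string matching is deliberately strict; a real experiment would likely need answer normalization or a symbolic equivalence check.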
With a context length of 32768 tokens, it can process and generate relatively long and detailed responses, which is beneficial for multi-step mathematical derivations or explanations.
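A minimal loading sketch with the Hugging Face `transformers` library is shown below, assuming the checkpoint is published under the ID given in this card and loads through the standard causal language model classes. The helper names and the default of 512 new tokens are illustrative, and the prompt is passed as plain text because this card does not document a prompt template.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v6b3-bare-fmt03"
CONTEXT_LENGTH = 32768  # context length as stated in this card


def remaining_budget(prompt_tokens, context_length=CONTEXT_LENGTH):
    # Number of tokens left for generation once the prompt is counted.
    return max(context_length - prompt_tokens, 0)


def solve(problem, max_new_tokens=512):
    # Imported lazily so the budget helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(problem, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    budget = min(max_new_tokens, remaining_budget(prompt_len))
    if budget <= 0:
        raise ValueError("Prompt already fills the context window.")

    output_ids = model.generate(**inputs, max_new_tokens=budget)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output_ids[0, prompt_len:], skip_special_tokens=True)
```

Calling `solve("What is 12 * 7?")` downloads the weights on first use, so it needs network access and enough memory for a 1.5B-parameter model.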