zhaohq/PureRL-1.5B-v12A-lam002
zhaohq/PureRL-1.5B-v12A-lam002 is a 1.5-billion-parameter language model developed by zhaohq and fine-tuned from Qwen/Qwen2.5-Math-1.5B. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper. With a context length of 32768 tokens, the model is primarily optimized for mathematical reasoning and complex problem-solving tasks.
PureRL-1.5B-v12A-lam002 Overview
This model, developed by zhaohq, is a 1.5-billion-parameter language model fine-tuned from the Qwen/Qwen2.5-Math-1.5B base. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the DeepSeekMath paper that scores each sampled response relative to the other responses drawn for the same prompt. The model supports a context length of 32768 tokens, making it suitable for processing longer inputs.
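GRPO, as described in the DeepSeekMath paper, avoids a learned value baseline by comparing each response against a group of responses sampled for the same prompt. The following is a minimal sketch of that group-relative advantage step only, not the training code used for this model. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rewards against its own sampling group.

    rewards: one scalar reward per sampled response to the same prompt.
    eps: small constant (assumed here) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one problem: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, and a group in which every answer earns the same reward contributes no learning signal because all advantages are zero.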
Key Capabilities
- Enhanced Mathematical Reasoning: Trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning in open language models.
- Long Context Understanding: Capable of handling inputs up to 32768 tokens, useful for complex problems requiring extensive context.
- Fine-tuned from Qwen2.5-Math-1.5B: Builds upon a strong mathematical foundation.
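In a decoder-only model the prompt and the generated tokens share the same context window, so a long input leaves less room for the answer. The sketch below shows that arithmetic for the 32768-token limit stated above. The helper name `generation_budget` is hypothetical, and the prompt token count would come from whatever tokenizer is used with the model.

```python
CONTEXT_LENGTH = 32768  # context length stated on this model card


def generation_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after the prompt.

    Returns 0 when the prompt alone already fills or exceeds the window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_length - prompt_tokens, 0)


# A 30000-token prompt leaves 2768 tokens for the model's answer.
remaining = generation_budget(30000)
```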
Good for
- Mathematical Problem Solving: Suited to tasks that require multi-step mathematical reasoning and computation.
- Research and Development: Useful for exploring and applying reinforcement learning techniques in language model fine-tuning.
- Complex Query Handling: The 32768-token context window accommodates long, detailed questions or scenarios.
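For readers who want to try the model on a math problem, here is a minimal sketch using the Hugging Face `transformers` library. It assumes the repository loads with the standard `AutoTokenizer` and `AutoModelForCausalLM` classes, which this card does not confirm. The prompt format is also an assumption: the card does not specify one, so the instruction text and the `build_prompt` and `solve` helpers are illustrative only.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v12A-lam002"

# Assumed instruction; the model card does not specify a prompt format.
INSTRUCTION = "Please reason step by step, and put your final answer within \\boxed{}."


def build_prompt(problem: str) -> str:
    """Combine the assumed instruction with the problem statement."""
    return f"{INSTRUCTION}\n\nProblem: {problem}\n\nSolution:"


def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a solution; calling this downloads the model weights."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is the sum of the first 100 positive integers?")` would fetch the weights on first use and return the generated text. If the tokenizer ships a chat template, formatting the prompt through it may work better than this plain-text layout.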