zhaohq/PureRL-7B-v6e-B-lam03-sigmoid-maskon-acc05
PureRL-7B-v6e-B-lam03-sigmoid-maskon-acc05 is a 7.6 billion parameter language model fine-tuned from Qwen/Qwen2.5-Math-7B by zhaohq. This model was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. It leverages a 32768 token context length, making it suitable for tasks requiring deep contextual understanding, particularly in mathematical domains.
Loading preview...
Overview
This model, PureRL-7B-v6e-B-lam03-sigmoid-maskon-acc05, is a 7.6 billion parameter language model developed by zhaohq. It is a fine-tuned version of the Qwen/Qwen2.5-Math-7B base model, specifically optimized for mathematical reasoning tasks. The model was trained using the Transformer Reinforcement Learning (TRL) framework, incorporating the GRPO (Gradient-based Reward Policy Optimization) method.
Key Training Details
- Base Model: Qwen/Qwen2.5-Math-7B
- Training Method: GRPO, as introduced in the DeepSeekMath paper, which focuses on pushing the limits of mathematical reasoning in open language models.
- Framework: TRL (Transformer Reinforcement Learning)
- Context Length: Supports a context length of 32768 tokens.
Use Cases
This model is particularly well-suited for applications requiring advanced mathematical problem-solving and reasoning. Its fine-tuning with the GRPO method suggests improved performance on complex mathematical queries and tasks compared to general-purpose language models.