zhaohq/PureRL-1.5B-v7-s2-async-l2-maskoff-afew
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 20, 2026Architecture:Transformer Warm
PureRL-1.5B-v7-s2-async-l2-maskoff-afew by zhaohq is a 1.5 billion parameter language model, fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-A-fewshot using the TRL framework. This model was specifically trained with GRPO, a method detailed in the DeepSeekMath paper, indicating an optimization for mathematical reasoning and complex problem-solving. Its primary use case is in applications requiring advanced reasoning capabilities, particularly in mathematical contexts.
Loading preview...
PureRL-1.5B-v7-s2-async-l2-maskoff-afew Overview
This model, developed by zhaohq, is a 1.5 billion parameter language model fine-tuned from its predecessor, zhaohq/PureRL-1.5B-v7-stage1-A-fewshot. It leverages the TRL (Transformer Reinforcement Learning) framework for its training process.
Key Capabilities
- Enhanced Mathematical Reasoning: The model was trained using GRPO (Generalized Reinforcement Learning with Policy Optimization), a method introduced in the "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" paper. This training approach suggests a strong focus on improving the model's ability to handle complex mathematical problems and reasoning tasks.
- Fine-tuned Performance: As a fine-tuned version, it builds upon the base model's capabilities, likely offering improved performance in specific domains targeted by the GRPO training.
Good for
- Mathematical Problem Solving: Ideal for applications requiring robust mathematical reasoning, such as solving equations, proofs, or complex logical problems.
- Research and Development: Useful for researchers exploring advanced reinforcement learning techniques in language models, particularly those interested in the GRPO method.
- Specialized AI Tasks: Suitable for scenarios where a model with a strong foundation in logical and mathematical processing is beneficial.