zhaohq/PureRL-1.5B-v7-s2-async-l2-maskoff-afew

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 20, 2026 | Architecture: Transformer | Warm

PureRL-1.5B-v7-s2-async-l2-maskoff-afew by zhaohq is a 1.5 billion parameter language model, fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-A-fewshot using the TRL framework. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper, which suggests a focus on mathematical reasoning and multi-step problem solving. It is aimed at applications that need this kind of reasoning, particularly in mathematical contexts.


PureRL-1.5B-v7-s2-async-l2-maskoff-afew Overview

This model, developed by zhaohq, is a 1.5 billion parameter language model fine-tuned from its predecessor, zhaohq/PureRL-1.5B-v7-stage1-A-fewshot. It was trained with the TRL (Transformer Reinforcement Learning) library.

Key Capabilities

  • Enhanced Mathematical Reasoning: The model was trained using GRPO (Group Relative Policy Optimization), a method introduced in the "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" paper. GRPO samples a group of completions for each prompt and scores each one relative to the rest of its group, so it needs no separate value model. This training approach suggests a focus on improving the model's ability to handle complex mathematical problems and reasoning tasks.
  • Fine-tuned Performance: The model builds on zhaohq/PureRL-1.5B-v7-stage1-A-fewshot and likely improves on it in the domains targeted by the GRPO training.
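
The group-relative scoring at the core of GRPO can be sketched in a few lines. This is a minimal illustration of the advantage computation described in the DeepSeekMath paper (each reward normalized by the mean and standard deviation of its group), not the training code used for this model. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for the example.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (illustrative helper).

    `rewards` holds one scalar reward per completion sampled for the SAME prompt.
    Assumption: population standard deviation plus a small `eps` to avoid
    division by zero when every completion receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one math prompt: 1.0 = correct final answer, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions receive a positive advantage and incorrect ones a negative advantage. When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.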

Good for

  • Mathematical Problem Solving: Intended for applications that need mathematical reasoning, such as solving equations, writing proofs, or working through complex logical problems.
  • Research and Development: Useful for researchers exploring advanced reinforcement learning techniques in language models, particularly those interested in the GRPO method.
  • Specialized AI Tasks: Suitable for scenarios where a model with a strong foundation in logical and mathematical processing is beneficial.
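
For readers who want to try the model on a math problem, the following is a hedged sketch using the Hugging Face `transformers` text-generation pipeline. It assumes the repository loads as a standard causal language model. The prompt wording, the `build_prompt` and `solve` helpers, and the `max_new_tokens` default are all hypothetical choices for illustration, since the prompt format the model expects is not documented here.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-async-l2-maskoff-afew"


def build_prompt(question: str) -> str:
    """Hypothetical plain-text prompt; the model's expected format is an assumption."""
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {question}\n\nSolution:"
    )


def solve(question: str, max_new_tokens: int = 512) -> str:
    """Generate a solution; downloads the model weights on first use."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        build_prompt(question),
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the newly generated continuation
    )
    return outputs[0]["generated_text"]


# Example call (requires transformers, a backend such as PyTorch, and network access):
# print(solve("What is the sum of the first 10 positive integers?"))
```

If the repository ships a chat template, passing a list of chat messages instead of a raw string may match the training format more closely; that is worth checking before relying on the plain prompt above.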