zhaohq/PureRL-1.5B-v7-s2-margin-maskon

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 20, 2026 | Architecture: Transformer | Warm

zhaohq/PureRL-1.5B-v7-s2-margin-maskon is a 1.5 billion parameter language model fine-tuned by zhaohq using the TRL library. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the DeepSeekMath paper, which suggests a focus on mathematical reasoning and multi-step problem solving. With a context length of 32768 tokens, it can take long, intricate prompts and generate text that stays relevant to them.


Model Overview

zhaohq/PureRL-1.5B-v7-s2-margin-maskon is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned model, trained with the TRL (Transformer Reinforcement Learning) library. Training used GRPO (Group Relative Policy Optimization), a method detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models".
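
To make the GRPO idea concrete, the sketch below shows its core step as described in the DeepSeekMath paper: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. This is an illustration only, not this model's training code; the function name, the `eps` term, the use of the population standard deviation, and the example rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    # eps (hypothetical value) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.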

Key Characteristics

  • Parameter Count: 1.5 billion parameters, offering a balance between performance and computational efficiency.
  • Context Length: Supports a context window of 32768 tokens, so the prompt and the generated text together can span long, multi-part sequences.
  • Training Method: Utilizes GRPO, indicating a focus on enhancing reasoning capabilities, potentially in areas like mathematics or logical problem-solving, as suggested by its origin in the DeepSeekMath research.
  • Frameworks: Trained with TRL (version 0.16.0.dev0), Transformers (4.48.3), PyTorch (2.5.1), Datasets (4.0.0), and Tokenizers (0.21.1).
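
Because the context window covers the prompt and the generated tokens together, it can help to budget generation length up front. The following is a minimal sketch of that arithmetic, assuming the 32768-token figure above; the helper name `max_new_tokens_for` is hypothetical.

```python
CONTEXT_LENGTH = 32768  # context window stated on this card


def max_new_tokens_for(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after a prompt of the given length."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # A prompt that already fills the window leaves no room for generation.
    return max(context_length - prompt_tokens, 0)


remaining = max_new_tokens_for(30000)
```

With a 30000-token prompt, for example, 2768 tokens remain for the response.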

Potential Use Cases

  • Complex Question Answering: Its training with GRPO suggests an aptitude for handling questions requiring deeper reasoning.
  • Content Generation: Capable of generating detailed and contextually rich responses to long prompts.
  • Research and Development: Serves as a base for further experimentation with reinforcement learning techniques in language models, particularly for tasks benefiting from improved reasoning.
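
For trying the use cases above, a minimal loading sketch with the Transformers library might look like the following. It assumes the checkpoint can be downloaded from the Hugging Face Hub under the ID on this card and that plain-text prompting is acceptable; the `generate` helper and its default `max_new_tokens` are illustrative and not taken from the card.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-margin-maskon"


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the model in BF16 and return the completion for a single prompt."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("What is the sum of the first 10 positive integers?")` downloads the weights on first use and reloads the model on every call; a longer-running script would load the tokenizer and model once and reuse them.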