zhaohq/PureRL-1.5B-v14B-k4

TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:May 20, 2026Architecture:Transformer Cold

The zhaohq/PureRL-1.5B-v14B-k4 model is a 1.5 billion parameter language model fine-tuned by zhaohq using TRL. It was trained with GRPO, a method detailed in the DeepSeekMath paper, which focuses on mathematical reasoning. This model is designed for general text generation tasks, leveraging its specialized training for potentially improved reasoning capabilities.

Loading preview...

Model Overview

zhaohq/PureRL-1.5B-v14B-k4 is a 1.5 billion parameter language model developed by zhaohq. It has been fine-tuned using the Transformer Reinforcement Learning (TRL) framework, specifically incorporating the GRPO (Generalized Reinforcement Learning with Policy Optimization) training method.

Key Training Details

  • Training Method: Utilizes GRPO, a technique introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests a focus on enhancing the model's reasoning abilities, particularly in mathematical contexts.
  • Frameworks: Trained with TRL 0.16.0.dev0, Transformers 4.48.3, Pytorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1.

Potential Use Cases

This model is suitable for general text generation tasks. Given its training with GRPO, it may exhibit enhanced performance in scenarios requiring logical inference or structured reasoning, making it a candidate for applications beyond simple conversational AI.