zhaohq/RLCR-1.5B-hotpot-rac-lr5e6-accW1

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: May 15, 2026 · Architecture: Transformer · Warm

zhaohq/RLCR-1.5B-hotpot-rac-lr5e6-accW1 is a 1.5-billion-parameter causal language model fine-tuned by zhaohq from Qwen/Qwen2.5-1.5B. It was trained with GRPO, the reinforcement learning method introduced in the DeepSeekMath paper, with the aim of improving reasoning. It targets reasoning-heavy tasks rather than general-purpose use.

Overview

zhaohq/RLCR-1.5B-hotpot-rac-lr5e6-accW1 is a 1.5-billion-parameter language model fine-tuned from the Qwen/Qwen2.5-1.5B base model. It was developed by zhaohq and trained with the TRL library using GRPO (Group Relative Policy Optimization). GRPO was introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". For each prompt it samples a group of completions and scores each one relative to the rest of its group, which removes the need for a separate value model.
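The core idea behind GRPO can be illustrated with a few lines of plain Python. This is a minimal sketch of the group-relative advantage described in the DeepSeekMath paper, not the training code used for this model; the function name, the `eps` stabilizer, and the example rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scalar rewards of all completions sampled for ONE prompt.
    `eps` is an assumed small constant that avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group's average receive a positive advantage and are reinforced; those below average receive a negative one. If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.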

Key Capabilities

  • Enhanced Reasoning: Fine-tuned with GRPO, a method introduced to improve mathematical reasoning in language models and also applied to other reasoning tasks.
  • Qwen2.5-1.5B Base: Builds on the architecture and pre-training of the Qwen2.5-1.5B model.
  • TRL Framework: Trained with the TRL (Transformer Reinforcement Learning) library, meaning the fine-tuning used reinforcement learning rather than supervised learning alone.
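For readers unfamiliar with TRL, the sketch below shows roughly how a GRPO run is wired up with its `GRPOTrainer`. It is not the author's training script: the reward function `exact_match_reward`, the helper `build_trainer`, the `answer` dataset column, and every hyperparameter are assumptions for illustration (the learning rate is only a guess loosely read from the "lr5e6" suffix in the model name). The dataset is assumed to provide a `prompt` column, as TRL expects, plus an `answer` column that TRL forwards to the reward function as a keyword argument.

```python
def exact_match_reward(completions, answer, **kwargs):
    """Hypothetical reward: 1.0 if the reference answer appears in the completion.

    TRL calls reward functions with keyword arguments; extra dataset columns
    (here the assumed `answer` column) arrive as lists aligned with `completions`.
    """
    return [
        1.0 if ref.strip().lower() in text.lower() else 0.0
        for text, ref in zip(completions, answer)
    ]


def build_trainer(train_dataset, output_dir="grpo-sketch"):
    """Assemble a GRPOTrainer; call .train() on the result to start training."""
    # Imports are deferred so the reward function can be used without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(
        output_dir=output_dir,
        learning_rate=5e-6,  # assumption, not a confirmed setting
        num_generations=8,   # completions sampled per prompt (the "group")
    )
    return GRPOTrainer(
        model="Qwen/Qwen2.5-1.5B",        # base model named in this card
        reward_funcs=exact_match_reward,  # a list of functions is also accepted
        args=config,
        train_dataset=train_dataset,
    )
```

The reward function is the only task-specific part; swapping it changes what the model is optimized for while the trainer setup stays the same.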

Good For

  • Applications requiring improved reasoning abilities, potentially in areas like complex question answering or logical inference.
  • Researchers and developers interested in exploring models fine-tuned with advanced reinforcement learning techniques like GRPO.
  • Use cases where a smaller, efficient model (1.5B parameters) with specialized reasoning capabilities is preferred over larger, more general-purpose LLMs.
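To try the model on a question-answering prompt, something like the following sketch should work with the Hugging Face `transformers` library. The prompt layout in `build_prompt` is hypothetical, since the format used during training is not documented here, and `generate_answer` and its `max_new_tokens` default are likewise illustrative. BF16 is used because it is the precision listed for this model.

```python
MODEL_ID = "zhaohq/RLCR-1.5B-hotpot-rac-lr5e6-accW1"


def build_prompt(question, context=""):
    """Hypothetical plain-text prompt; the training prompt format is an assumption."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


def generate_answer(question, context="", max_new_tokens=256):
    """Download the model (network access required) and generate a completion."""
    # Imports are deferred so build_prompt can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question, context), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("Which city hosted the 1900 Summer Olympics?")` would download the weights on first use. Because the base model is not instruction-tuned, output quality may depend heavily on how closely the prompt matches the format seen in training.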