cjiao/goldengoose-gumbel_combined_grpoc_tau1.00-25grp

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 28, 2026Architecture:Transformer Warm

The cjiao/goldengoose-gumbel_combined_grpoc_tau1.00-25grp model is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. Developed by cjiao, it utilizes the GRPO (Gumbel-softmax Reinforcement Learning with Policy Optimization) method, originally introduced for mathematical reasoning, to enhance its capabilities. This model is designed for general text generation tasks, leveraging its fine-tuning to produce coherent and contextually relevant responses.

Loading preview...

Model Overview

This model, goldengoose-gumbel_combined_grpoc_tau1.00-25grp, is a 1.5 billion parameter instruction-tuned language model. It is a fine-tuned version of the Qwen/Qwen2.5-1.5B-Instruct base model, developed by cjiao.

Key Differentiator: GRPO Training

The model was trained using GRPO (Gumbel-softmax Reinforcement Learning with Policy Optimization), a method first introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". This training approach aims to improve the model's reasoning and generation capabilities, potentially extending beyond its original mathematical focus to general instruction following.

Training Details

  • Base Model: Qwen/Qwen2.5-1.5B-Instruct
  • Training Framework: TRL (Transformer Reinforcement Learning) version 0.19.1
  • Methodology: GRPO, as detailed in the DeepSeekMath paper.

Use Cases

This model is suitable for various text generation tasks, particularly those requiring instruction following and coherent responses. Its fine-tuning with GRPO suggests potential strengths in tasks that benefit from enhanced reasoning, making it a candidate for applications where robust and logical outputs are desired.