cjiao/goldengoose-gumbel_combined_gradsim_tau0.50-25grp
The cjiao/goldengoose-gumbel_combined_gradsim_tau0.50-25grp is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained using the GRPO method, as introduced in the DeepSeekMath paper, which focuses on enhancing mathematical reasoning. This model is optimized for tasks requiring improved reasoning capabilities, leveraging its specialized training approach.
Loading preview...
Overview
This model, cjiao/goldengoose-gumbel_combined_gradsim_tau0.50-25grp, is a 1.5 billion parameter language model derived from the Qwen/Qwen2.5-1.5B-Instruct architecture. It has been fine-tuned using the TRL (Transformer Reinforcement Learning) framework.
Key Capabilities
- Enhanced Reasoning: The model's primary differentiator is its training with the GRPO (Gumbel-softmax Reinforcement Learning with Policy Optimization) method. This technique, detailed in the DeepSeekMath paper, is designed to push the limits of mathematical reasoning in language models.
- Instruction Following: As a fine-tuned version of an instruct model, it is capable of following user instructions for various text generation tasks.
Training Details
The model was trained using TRL version 0.19.1, with Transformers 4.57.6 and PyTorch 2.5.1. The GRPO method, central to its training, aims to improve reasoning abilities, particularly in complex domains like mathematics. This specialized training distinguishes it from general-purpose instruction-tuned models by focusing on a more robust reasoning process.