cjiao/goldengoose-gumbel_combined_grpoc_tau0.10-25grp
The cjiao/goldengoose-gumbel_combined_grpoc_tau0.10-25grp model is a 1.5 billion parameter instruction-tuned language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained using the TRL library and incorporates the GRPO (Gumbel-softmax Reinforcement Learning with Policy Optimization) method, which is designed to enhance mathematical reasoning capabilities. This model is particularly suited for tasks requiring improved logical and mathematical problem-solving, building upon its Qwen2.5 base.
Loading preview...
Model Overview
cjiao/goldengoose-gumbel_combined_grpoc_tau0.10-25grp is a 1.5 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct architecture. This model leverages the TRL (Transformer Reinforcement Learning) library for its training process.
Key Capabilities
- Enhanced Reasoning: The model was specifically trained using the GRPO (Gumbel-softmax Reinforcement Learning with Policy Optimization) method. This technique, introduced in the context of improving mathematical reasoning in large language models, suggests an emphasis on logical and problem-solving tasks.
- Instruction Following: As a fine-tuned version of an instruction-tuned base model (Qwen2.5-1.5B-Instruct), it is designed to follow user instructions effectively.
Training Details
The training procedure utilized GRPO, a method detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The training was conducted using TRL version 0.19.1, Transformers 4.57.6, Pytorch 2.5.1, Datasets 4.8.4, and Tokenizers 0.22.2.
Good For
- Applications requiring improved mathematical or logical reasoning.
- Tasks where robust instruction following is crucial.
- Developers looking for a compact 1.5B parameter model with specialized reasoning enhancements.