SohamK18/data-cleaning-grpo
The SohamK18/data-cleaning-grpo model is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. This model is particularly suited for tasks requiring improved logical and mathematical problem-solving, building upon its base Qwen2.5 architecture.
Loading preview...
Overview
SohamK18/data-cleaning-grpo is a 1.5 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct base model. It leverages the GRPO (Gradient-based Reward Policy Optimization) method, as introduced in the DeepSeekMath paper, to enhance its reasoning abilities.
Key Capabilities
- Enhanced Reasoning: Specifically trained with GRPO to improve performance on tasks requiring logical and mathematical reasoning.
- Instruction Following: Inherits strong instruction-following capabilities from its Qwen2.5-Instruct base.
- Context Length: Supports a substantial context window of 32768 tokens, allowing for processing longer inputs.
Training Details
The model was fine-tuned using the TRL library, with a focus on applying the GRPO method. This training approach aims to push the limits of mathematical reasoning in open language models, as detailed in the DeepSeekMath research.
Good For
- Applications requiring improved mathematical problem-solving.
- Tasks benefiting from enhanced logical reasoning.
- Use cases where a smaller, efficient model with strong reasoning is preferred.