SohamK18/data-cleaning-grpo

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:Apr 6, 2026Architecture:Transformer Warm

The SohamK18/data-cleaning-grpo model is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities. This model is particularly suited for tasks requiring improved logical and mathematical problem-solving, building upon its base Qwen2.5 architecture.

Loading preview...

Overview

SohamK18/data-cleaning-grpo is a 1.5 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct base model. It leverages the GRPO (Gradient-based Reward Policy Optimization) method, as introduced in the DeepSeekMath paper, to enhance its reasoning abilities.

Key Capabilities

  • Enhanced Reasoning: Specifically trained with GRPO to improve performance on tasks requiring logical and mathematical reasoning.
  • Instruction Following: Inherits strong instruction-following capabilities from its Qwen2.5-Instruct base.
  • Context Length: Supports a substantial context window of 32768 tokens, allowing for processing longer inputs.

Training Details

The model was fine-tuned using the TRL library, with a focus on applying the GRPO method. This training approach aims to push the limits of mathematical reasoning in open language models, as detailed in the DeepSeekMath research.

Good For

  • Applications requiring improved mathematical problem-solving.
  • Tasks benefiting from enhanced logical reasoning.
  • Use cases where a smaller, efficient model with strong reasoning is preferred.