harsha070/exp2-qwen-mbpp-s42-lambda-0p25

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 3.1B | Quant: BF16 | Ctx Length: 32k | Published: May 4, 2026 | Architecture: Transformer

harsha070/exp2-qwen-mbpp-s42-lambda-0p25 is a 3.1-billion-parameter language model fine-tuned from harsha070/sft-warmup-qwen-v1. It was trained with GRPO, a reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning. The model targets mathematical problem-solving and logical deduction and supports a context length of 32,768 tokens.


Model Overview

harsha070/exp2-qwen-mbpp-s42-lambda-0p25 is a 3.1-billion-parameter language model fine-tuned from harsha070/sft-warmup-qwen-v1. Its 32,768-token context length makes it suitable for processing longer inputs and maintaining context over extended interactions.
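As a rough illustration of how the checkpoint and its context window might be used, the sketch below loads the model with the Transformers library and checks that a prompt plus its generation budget fits within 32,768 tokens. It assumes the repository can be loaded through `AutoModelForCausalLM` and `AutoTokenizer`, which this card does not confirm; `fits_in_context` and `generate` are hypothetical helper names introduced here for illustration.

```python
MODEL_ID = "harsha070/exp2-qwen-mbpp-s42-lambda-0p25"
CONTEXT_LENGTH = 32768  # context length stated on this card


def fits_in_context(prompt_tokens: int, max_new_tokens: int) -> bool:
    """Return True if the prompt plus the generated tokens fit in the window."""
    return prompt_tokens + max_new_tokens <= CONTEXT_LENGTH


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint and generate a completion (downloads the weights)."""
    # Imported lazily so fits_in_context works without these packages installed.
    # Assumption: the repository works with the generic Auto* classes.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_tokens = inputs["input_ids"].shape[1]
    if not fits_in_context(prompt_tokens, max_new_tokens):
        raise ValueError("prompt plus max_new_tokens exceeds the context length")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output_ids[0][prompt_tokens:], skip_special_tokens=True)
```

Calling `generate("What is 17 * 23? Show your reasoning.")` would download the weights on first use. The budget check runs before generation so that an oversized request fails early instead of being silently truncated.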

Key Capabilities

  • Enhanced Mathematical Reasoning: The model was trained with GRPO (Group Relative Policy Optimization), the method introduced in the DeepSeekMath paper. GRPO samples a group of completions for each prompt and scores every completion relative to the rest of its group, with the aim of improving performance on mathematical reasoning tasks.
  • Fine-tuned with TRL: Fine-tuning used the TRL (Transformer Reinforcement Learning) library, so the model's behavior was optimized with reinforcement learning.
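The central idea of GRPO, as described in the DeepSeekMath paper, is that each sampled completion is scored relative to the other completions drawn for the same prompt, which removes the need for a separate value model. The following is a minimal sketch of that group-relative normalization only, not the training code used for this model. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's completions within their group.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # approximately [1.0, -1.0, -1.0, 1.0]
```

Completions that beat their group's average receive a positive advantage and are reinforced, while those below it are discouraged. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.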

Training Details

The model was trained with GRPO, a method detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests a specialized focus on developing robust mathematical problem-solving abilities. The model was developed using TRL 1.3.0, Transformers 5.7.0, PyTorch 2.11.0, Datasets 4.8.5, and Tokenizers 0.22.2.
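To compare a local environment against the framework versions listed above, a small check along the following lines may help. It is a sketch under stated assumptions: the distribution names (`trl`, `transformers`, `torch`, `datasets`, `tokenizers`) are the usual PyPI names and are not given on this card, and the helper functions are hypothetical.

```python
from importlib import metadata

# Versions listed on this card, keyed by the assumed PyPI distribution name.
EXPECTED = {
    "trl": "1.3.0",
    "transformers": "5.7.0",
    "torch": "2.11.0",
    "datasets": "4.8.5",
    "tokenizers": "0.22.2",
}


def installed_versions(packages):
    """Return {name: installed version, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def _base(version):
    """Drop a local build suffix such as '+cu128' before comparing."""
    return None if version is None else version.split("+", 1)[0]


def version_mismatches(installed, expected=EXPECTED):
    """Return {name: (installed, expected)} for every package that differs."""
    return {
        name: (installed.get(name), want)
        for name, want in expected.items()
        if _base(installed.get(name)) != want
    }


if __name__ == "__main__":
    print(version_mismatches(installed_versions(EXPECTED)))
```

An empty dictionary means the environment matches the listed versions. A mismatch does not necessarily prevent the model from loading, since the card lists these versions without stating them as requirements.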

Good For

  • Applications requiring strong mathematical reasoning.
  • Tasks benefiting from a model fine-tuned with advanced reinforcement learning techniques like GRPO.
  • Scenarios where a 3.1-billion-parameter model with a large context window (32,768 tokens) offers a useful balance between performance and computational cost.