harsha070/exp2-qwen-mbpp-s123-lambda-0p25

Text generation · Model size: 3.1B · Quant: BF16 · Context length: 32k · Published: May 4, 2026 · Architecture: Transformer

The harsha070/exp2-qwen-mbpp-s123-lambda-0p25 model is a 3.1-billion-parameter language model fine-tuned from harsha070/sft-warmup-qwen-v2 using the TRL framework. It was trained with GRPO (Group Relative Policy Optimization), a method introduced in the DeepSeekMath paper to strengthen mathematical reasoning. With its 32768-token context length, the model targets tasks that require advanced mathematical problem-solving and logical deduction.


Model Overview

harsha070/exp2-qwen-mbpp-s123-lambda-0p25 is a 3.1-billion-parameter language model fine-tuned from harsha070/sft-warmup-qwen-v2. Its 32768-token context window makes it suitable for processing long inputs.

Key Training Details

  • Fine-tuning Method: The model was trained using the TRL library.
  • Optimization Technique: It was trained with GRPO (Group Relative Policy Optimization), a reinforcement-learning method introduced in the DeepSeekMath paper to improve mathematical reasoning.
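The core idea behind GRPO, as described in the DeepSeekMath paper, is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. A minimal standard-library sketch of that group-relative advantage computation (an illustration of the technique, not this repository's actual training code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize a group of completion rewards into GRPO-style advantages.

    Each sampled completion's advantage is its reward minus the group
    mean, divided by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All completions scored the same: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: two of four sampled completions pass a correctness check.
advantages = group_relative_advantages([1, 0, 1, 0])  # → [1.0, -1.0, 1.0, -1.0]
```

In training, these per-completion advantages weight the policy-gradient update, so completions that beat their group average are reinforced and the rest are suppressed.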

Intended Use Cases

This model is particularly well-suited for applications that demand strong mathematical reasoning and problem-solving. Its training with GRPO suggests an emphasis on tasks where logical deduction and numerical accuracy are critical.
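A hedged sketch of how the model could be used for such a task with the Hugging Face transformers library; the repo id and the 32768-token context length come from this card, while the helper name, device choice, and generation settings are illustrative assumptions:

```python
MODEL_ID = "harsha070/exp2-qwen-mbpp-s123-lambda-0p25"
MAX_CONTEXT = 32768  # token context length listed on this card

def generate(prompt: str, max_new_tokens: int = 512, device: str = "cpu") -> str:
    """Hypothetical inference helper; standard transformers usage, not an
    official example from the model authors."""
    # Imports are local so the sketch can be read/loaded without transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16  # BF16 matches the card's quant field
    ).to(device)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = inputs["input_ids"].shape[1]
    # Prompt plus generation budget must fit inside the 32768-token window.
    assert prompt_len + max_new_tokens <= MAX_CONTEXT

    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation, not the echoed prompt.
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
```

For example, `generate("Prove that the sum of two even integers is even.")` would exercise the kind of step-by-step deduction the GRPO training emphasizes.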