harsha070/exp2-qwen-mbpp-s123-lambda-0p25
The harsha070/exp2-qwen-mbpp-s123-lambda-0p25 model is a 3.1 billion parameter language model, fine-tuned from harsha070/sft-warmup-qwen-v2 using the TRL framework. It was trained with GRPO, a reinforcement learning method introduced in the DeepSeekMath paper to enhance mathematical reasoning. The model targets tasks requiring mathematical problem-solving and logical deduction, and supports a 32,768-token context length.
Model Overview
harsha070/exp2-qwen-mbpp-s123-lambda-0p25 is a 3.1 billion parameter language model, fine-tuned from harsha070/sft-warmup-qwen-v2. Its 32,768-token context window makes it suitable for processing longer inputs.
Key Training Details
- Fine-tuning Method: The model was trained using the TRL library.
- Optimization Technique: It incorporates GRPO (Group Relative Policy Optimization), a method introduced in the DeepSeekMath paper, which improves mathematical reasoning by scoring each sampled completion relative to the other completions drawn for the same prompt, removing the need for a separate value model.
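The core idea behind GRPO can be illustrated with a short sketch. This is not the model's actual training code (which uses TRL's trainer); it only shows, under the assumption of a simple scalar reward per completion, how GRPO normalizes each reward against the group of completions sampled for the same prompt:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: normalize each completion's reward
    against the mean and std of its own sampling group, instead of
    using a learned value function as a baseline."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # assumes > 1 completions with non-identical rewards
    return [(r - mu) / sigma for r in rewards]

# Example: 4 completions sampled for one prompt, each scored by a reward model
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions scoring above their group's mean receive positive advantages and are reinforced; those below the mean are penalized, so the advantages within a group always sum to zero.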
Intended Use Cases
This model is particularly well-suited for applications that demand strong mathematical reasoning and problem-solving. Its training with GRPO suggests an emphasis on tasks where logical deduction and numerical accuracy are critical.