harsha070/expfinal-phi-mbpp-s42-lambda-0p25
harsha070/expfinal-phi-mbpp-s42-lambda-0p25 is a 4-billion-parameter language model fine-tuned from harsha070/sft-warmup-phi-v1 using the TRL framework. It was trained with the GRPO method introduced in the DeepSeekMath paper to strengthen its mathematical reasoning, and is optimized for tasks that require robust logical and mathematical problem-solving.
Model Overview
This model, harsha070/expfinal-phi-mbpp-s42-lambda-0p25, is a 4 billion parameter language model derived from harsha070/sft-warmup-phi-v1. It has been fine-tuned using the TRL (Transformers Reinforcement Learning) framework, specifically version 1.3.0.
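Loading the checkpoint should follow the standard transformers pattern. The sketch below is a minimal, untested example that assumes the repository is a regular causal-LM checkpoint on the Hugging Face Hub; only the repo id comes from this card, and the actual download is left commented out because it pulls roughly 4B parameters.

```python
# Minimal loading sketch (assumes a standard transformers causal-LM
# checkpoint on the Hugging Face Hub; not the card author's own code).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "harsha070/expfinal-phi-mbpp-s42-lambda-0p25"

def load_model(model_id: str = MODEL_ID):
    """Download and return (tokenizer, model) for the checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model

# Usage (downloads the full checkpoint, so not executed here):
# tokenizer, model = load_model()
```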
Key Training Details
A significant aspect of this model's development is the application of the GRPO (Group Relative Policy Optimization) training method. This technique, introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), aims to improve the model's proficiency in mathematical reasoning tasks. Training used Transformers 5.8.0, PyTorch 2.11.0, Datasets 4.8.5, and Tokenizers 0.22.2.
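A GRPO run of this kind can be sketched with TRL's GRPOTrainer. The snippet below is an illustrative reconstruction, not the author's actual training script: the reward function, dataset, and hyperparameters are all assumptions. GRPO samples a group of completions per prompt, scores each with a reward function, and optimizes using within-group relative advantages.

```python
# Illustrative GRPO fine-tuning sketch using TRL's GRPOTrainer.
# The reward function, dataset, and hyperparameters are assumptions,
# not the recipe actually used for this checkpoint.
import re

def numeric_answer_reward(completions, answer=None, **kwargs):
    """Toy reward: 1.0 if the last number in a completion matches the
    reference answer, else 0.0. GRPO compares rewards across the group
    of completions sampled for each prompt."""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards

RUN_TRAINING = False  # enable on a machine with the model and a GPU
if RUN_TRAINING:
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(output_dir="grpo-out", num_generations=8)
    trainer = GRPOTrainer(
        model="harsha070/sft-warmup-phi-v1",  # base checkpoint per this card
        reward_funcs=numeric_answer_reward,
        args=config,
        train_dataset=load_dataset("gsm8k", "main", split="train"),  # assumed
    )
    trainer.train()
```

The reward function is deliberately simple; real runs typically combine format checks, answer verification, and length penalties.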
Use Cases
Given its fine-tuning with the GRPO method, this model is particularly well-suited for:
- Mathematical problem-solving: Tasks that require logical deduction and numerical computation.
- Reasoning-intensive applications: Scenarios where robust analytical capabilities are crucial.
- Code generation for mathematical or logical functions: potentially useful for producing code that solves well-specified mathematical problems, although code generation is not explicitly stated as a primary focus (the "mbpp" in the model name may hint at the MBPP code benchmark).
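The use cases above can be exercised with ordinary prompting. The snippet below is a hypothetical usage sketch: the prompts and decoding settings are illustrative choices, not recommendations from the model card, and the generation call is commented out because it requires the downloaded checkpoint.

```python
# Hypothetical prompts for the use cases above; nothing here comes
# from the model card itself.
PROMPTS = {
    "math": "Solve step by step: If 3x + 7 = 22, what is x?",
    "code": "Write a Python function that returns the n-th Fibonacci number.",
}

def build_generation_kwargs(max_new_tokens: int = 256):
    """Greedy decoding settings, a conservative (assumed) default for
    reasoning tasks where reproducible answers matter."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False}

# With a loaded tokenizer/model (requires the ~4B checkpoint, not run here):
# inputs = tokenizer(PROMPTS["math"], return_tensors="pt")
# output = model.generate(**inputs, **build_generation_kwargs())
# print(tokenizer.decode(output[0], skip_special_tokens=True))
```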