harsha070/expfinal-qwen-island-s42-lambda-0p0
harsha070/expfinal-qwen-island-s42-lambda-0p0 is a 3.1 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-3B-Instruct. The model was trained with GRPO, a reinforcement-learning method designed to enhance mathematical reasoning. It is optimized for tasks requiring advanced logical and mathematical problem-solving, building on the Qwen2.5 architecture with a 32K context length.
Model Overview
This model, harsha070/expfinal-qwen-island-s42-lambda-0p0, is a fine-tuned variant of the Qwen/Qwen2.5-3B-Instruct base model, with 3.1 billion parameters and a 32K token context length. It was trained using GRPO (Group Relative Policy Optimization), the method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". This training approach aims to significantly improve the model's proficiency in mathematical reasoning tasks.
Key Capabilities
- Enhanced Mathematical Reasoning: Leverages the GRPO training method to improve performance on complex mathematical problems.
- Instruction Following: Builds upon the instruction-tuned capabilities of the Qwen2.5-3B-Instruct base model.
- Efficient Inference: With 3.1 billion parameters, it offers a balance between performance and computational efficiency.
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) library, version 1.3.0, with Transformers 5.7.0 and PyTorch 2.11.0. The GRPO method central to its training is detailed in the DeepSeekMath paper.
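GRPO's core idea is to replace a learned value baseline with a group-relative one: for each prompt, several completions are sampled, and each completion's reward is normalized against the mean and standard deviation of its group. The sketch below illustrates only that normalization step (it is not the actual training code for this model, and the example rewards are made up):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled completions.

    GRPO computes A_i = (r_i - mean(r)) / (std(r) + eps), so completions
    better than the group average get positive advantage and worse ones
    get negative advantage, without a separate value model.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for one math prompt: two scored correct (1.0),
# two scored incorrect (0.0) by a rule-based reward.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

These per-completion advantages then weight the policy-gradient update in place of a critic's value estimate, which is what makes the method practical for reward signals like answer correctness.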
Use Cases
This model is particularly well-suited for applications requiring strong mathematical problem-solving and logical reasoning, making it a valuable tool for educational platforms, scientific research, or any domain where precise numerical and logical understanding is critical.
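For such applications, the model can be loaded with the standard Transformers chat workflow. A minimal sketch follows; the system prompt is illustrative (the card does not specify one), and `main()` is left uncalled here because it downloads the full model weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_messages(question: str) -> list[dict]:
    """Wrap a math question in the chat-format message list the tokenizer expects."""
    return [
        {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
        {"role": "user", "content": question},
    ]

def main() -> None:
    model_id = "harsha070/expfinal-qwen-island-s42-lambda-0p0"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages("What is 17 * 24?"),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# Call main() to run generation (downloads ~3B parameters of weights).
```

Adjust `torch_dtype` and `device_map` for your hardware; `device_map="auto"` requires the `accelerate` package.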