Model Overview
This model, sleeepeer/llama3-warm_up-dolly_new_1200_0113-42-202601130042, is an 8 billion parameter language model derived from sleeepeer/meta-llama-Llama-3.1-8B-Instruct-sanitization-clean-OPI_SEP-42-202601102333. It has been fine-tuned using the TRL (Transformer Reinforcement Learning) framework.
Key Capabilities
- Mathematical Reasoning: The model's primary differentiator is its training with GRPO (Guided Reasoning Policy Optimization), a method detailed in the DeepSeekMath paper. This technique is designed to significantly enhance mathematical reasoning abilities in large language models.
- Instruction Following: As a fine-tuned instruction model, it is capable of understanding and executing user prompts effectively.
- Llama 3.1 Base: Built upon the Llama 3.1 architecture, it inherits the strong foundational capabilities of this family of models.
Training Details
The model was trained using the TRL library, with specific framework versions including TRL 0.26.2, Transformers 4.56.2, Pytorch 2.9.0, Datasets 4.4.2, and Tokenizers 0.22.1. The GRPO method, central to its mathematical optimization, was introduced in the 2024 DeepSeekMath research.
Recommended Use Cases
This model is particularly well-suited for applications requiring advanced mathematical problem-solving, logical deduction, and general instruction-following where numerical accuracy and reasoning are critical.