Llama3.1-8B-Thinking-R1: A Deep Reasoning Model
Jackrong/Llama3.1-8B-Thinking-R1 is an 8-billion-parameter model based on Llama-3.1-8B-Instruct, engineered for complex reasoning tasks in logic, mathematics, and programming. Its core innovation is a "Think-and-Answer" paradigm: the model uses <think> tags for self-correction, logical decomposition, and multi-path exploration before generating its final response.
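The tag format described above implies a simple post-processing step on the model's completions. A minimal sketch of such a parser (the helper name and exact tag handling are assumptions for illustration, not part of this model card):

```python
import re

def split_think_answer(output: str) -> tuple[str, str]:
    """Split a completion into its reasoning trace and final answer.

    Assumes the Think-and-Answer format: a <think>...</think> block
    followed by the visible response.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

completion = "<think>2 + 2 is 4. Double-check: yes.</think>The answer is 4."
reasoning, answer = split_think_answer(completion)
```

In a chat application, only `answer` would typically be shown to the user, while `reasoning` can be logged or hidden behind a collapsible panel.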
Key Training Methodology
The model undergoes a unique three-stage training pipeline:
- Cold-start SFT: Initial fine-tuning on high-quality mathematical reasoning data to establish basic reasoning formats and the use of <think> tags.
- GRPO Reinforcement Learning: Large-scale reinforcement training using Group Relative Policy Optimization, guided by accuracy and format rewards to optimize thought processes and reduce redundancy.
- Final CoT Distillation SFT: Instruction fine-tuning with high-quality Chain-of-Thought data distilled from ultra-large models like GPT-OSS-120B and Qwen3-235B, enhancing logical rigor and expressiveness, particularly in Chinese logic and multi-turn dialogues.
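To make the GRPO stage above concrete, here is a toy sketch of how accuracy rewards, format rewards, and group-relative normalization could fit together. The reward definitions and matching rules are illustrative assumptions, not the model's actual training code:

```python
import re
import statistics

def format_reward(completion: str) -> float:
    """1.0 if the completion is a well-formed <think>...</think> block
    followed by a non-empty answer, else 0.0."""
    m = re.fullmatch(r"\s*<think>.+?</think>\s*(.+)", completion, flags=re.DOTALL)
    return 1.0 if m else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the text after </think> matches the reference answer."""
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == gold else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled completion's
    reward against the mean and std of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# One sampled group for the prompt "What is 6 * 7?"
group = [
    "<think>6 * 7 = 42</think>42",  # correct and well-formatted
    "<think>6 * 7 = 41</think>41",  # well-formatted but wrong
    "42",                           # correct but missing the think block
]
rewards = [format_reward(c) + accuracy_reward(c, "42") for c in group]
advantages = grpo_advantages(rewards)
```

Because advantages are computed within each sampled group rather than by a learned value model, GRPO avoids training a separate critic, which is part of its appeal for reasoning-focused RL.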
Notable Features & Capabilities
- Reinforcement Learning: Employs the GRPO algorithm for autonomous learning of logical decomposition.
- Multi-stage Distillation: Incorporates reasoning logic from 120B+ scale models, significantly boosting performance in complex contexts.
- Long Context Support: Capable of handling complex, long-chain reasoning tasks with a context length of up to 65,536 tokens.
- Efficient Fine-Tuning: Fine-tuned with LoRA via the Unsloth framework, adding reasoning behaviors while preventing catastrophic forgetting of base-model capabilities.
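The LoRA approach mentioned above freezes the base weights and trains only small low-rank factors. A self-contained numeric sketch of the underlying math (illustrative only; these are not the model's actual adapter shapes or Unsloth's API):

```python
# LoRA adapts a frozen weight W with a low-rank product B @ A scaled by
# alpha / r, so only A and B (a small fraction of parameters) are trained.

def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def lora_forward(W, A, B, alpha, r, x):
    """Compute (W + (alpha / r) * B @ A) @ x without modifying W."""
    delta = matmul(B, A)  # low-rank update with the same shape as W
    scale = alpha / r
    W_adapted = [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]
    return matmul(W_adapted, x)

# 2x2 frozen weight with a rank-1 adapter (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # shape r x d_in
B = [[0.5], [0.5]]      # shape d_out x r
x = [[2.0], [3.0]]
y = lora_forward(W, A, B, alpha=1.0, r=1, x=x)
```

Because only `A` and `B` receive gradients, the base model's weights stay intact, which is how LoRA-style tuning limits catastrophic forgetting.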
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Solving intricate mathematical problems.
- Executing complex logical deductions.
- Handling multi-turn dialogue scenarios that demand deep reasoning.
- Tasks benefiting from structured, self-correcting thought processes.