Jackrong/Llama3.1-8B-Thinking-R1

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Dec 20, 2025 · License: llama3.1 · Architecture: Transformer

Jackrong/Llama3.1-8B-Thinking-R1 is an 8-billion-parameter deep reasoning model built upon Llama-3.1-8B-Instruct, designed to solve complex logic, mathematics, and programming problems. It features a refined Chain-of-Thought (CoT) capability, performing self-correction and multi-path exploration within <think> tags before answering. The model excels at structured reasoning tasks and supports a long context of up to 65,536 tokens.


Llama3.1-8B-Thinking-R1: A Deep Reasoning Model

Jackrong/Llama3.1-8B-Thinking-R1 is an 8 billion parameter model based on Llama-3.1-8B-Instruct, specifically engineered for complex reasoning tasks in logic, mathematics, and programming. Its core innovation lies in a sophisticated "Think-and-Answer" paradigm, where the model utilizes <think> tags for self-correction, logical decomposition, and multi-path exploration before generating a final response.
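Because the model emits its reasoning trace inside <think> tags before the final response, a caller typically wants to separate the two. The sketch below shows one way to do that in plain Python; the sample string is a hypothetical model output, not taken from the model card.

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning trace from the final answer.

    Returns (reasoning, answer); reasoning is empty if no tags are present.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

# Hypothetical completion illustrating the Think-and-Answer format:
sample = "<think>Decompose the problem, check each step.</think>\nThe answer is 4."
reasoning, answer = split_thinking(sample)
```

In a chat UI, the `reasoning` part would typically be hidden or shown in a collapsible panel, while only `answer` is surfaced to the user.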

Key Training Methodology

The model is trained with a three-stage pipeline:

  • Cold-start SFT: Initial fine-tuning on high-quality mathematical reasoning data to establish basic reasoning formats and the use of <think> tags.
  • GRPO Reinforcement Learning: Large-scale reinforcement training using Group Relative Policy Optimization, guided by Accuracy and Format Rewards to optimize thought processes and reduce redundancy.
  • Final CoT Distillation SFT: Instruction fine-tuning with high-quality Chain-of-Thought data distilled from ultra-large models like GPT-OSS-120B and Qwen3-235B, enhancing logical rigor and expressiveness, particularly in Chinese logic and multi-turn dialogues.
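The GRPO stage scores each sampled completion with accuracy and format rewards, then normalizes rewards within a group of samples for the same prompt to form relative advantages. A minimal sketch of that scoring logic, assuming exact-match accuracy and the <think>-then-answer format described above (the exact reward functions used in training are not published):

```python
import re
import statistics

def format_reward(completion: str) -> float:
    """1.0 if the completion uses <think>...</think> followed by an answer."""
    return 1.0 if re.match(r"(?s)\s*<think>.*?</think>.+", completion) else 0.0

def accuracy_reward(answer: str, reference: str) -> float:
    """1.0 on an exact match with the reference answer (illustrative only)."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core idea: standardize rewards within a sampled group,
    so each completion is judged relative to its siblings."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions scoring above the group mean receive positive advantages and are reinforced; those below are suppressed, which is what pushes the model toward correct, well-formatted thought processes without a separate value network.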

Notable Features & Capabilities

  • Reinforcement Learning: Employs the GRPO algorithm for autonomous learning of logical decomposition.
  • Multi-stage Distillation: Incorporates reasoning logic from 120B+ scale models, significantly boosting performance in complex contexts.
  • Long Context Support: Capable of handling complex, long-chain reasoning tasks with a context length of up to 65,536 tokens.
  • Efficient Fine-Tuning: Built on the Unsloth framework using LoRA to maintain reasoning capabilities while preventing catastrophic forgetting.
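LoRA keeps fine-tuning cheap by training two small low-rank factors per weight matrix instead of the matrix itself. The arithmetic below illustrates the saving for a single 4096x4096 projection at rank 16; both numbers are illustrative assumptions, since the card does not state the LoRA rank or target modules used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA replaces a frozen d_in x d_out update with two factors:
    A (d_in x rank) and B (rank x d_out)."""
    return rank * (d_in + d_out)

full_params = 4096 * 4096                                 # 16,777,216 weights
adapter_params = lora_trainable_params(4096, 4096, rank=16)  # 131,072 weights
reduction = full_params / adapter_params                  # 128x fewer trainables
```

Because the base weights stay frozen, the original instruction-following behavior of Llama-3.1-8B-Instruct is largely preserved, which is the mechanism behind the "preventing catastrophic forgetting" claim above.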

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Solving intricate mathematical problems.
  • Executing complex logical deductions.
  • Handling multi-turn dialogue scenarios that demand deep reasoning.
  • Tasks benefiting from structured, self-correcting thought processes.