## Model Overview
Jackrong/GPT-Distill-Qwen3-8B-Thinking is an 8-billion-parameter instruction-tuned, reasoning-enhanced language model built on the Qwen3-8B base. It features a 16,384-token context window and supports both English and Chinese. The model was developed with Supervised Fine-Tuning (SFT) using Unsloth and incorporates knowledge distillation from large-scale reasoning models (120B/235B class).
## Key Differentiators
- "Thinking" Capability: Explicitly trained to generate an internal reasoning chain, wrapped in `<think>...</think>` tags, before providing a final answer. This significantly improves performance on complex math, logic, and scientific tasks.
- Distilled Intelligence: Inherits advanced reasoning patterns from high-intelligence teacher models (GPT-OSS-120B and Qwen3-235B), allowing an 8B model to mimic the problem-solving approaches of much larger architectures.
- Long Context: Processes extensive documents and conversations up to 16K tokens, making it suitable for tasks requiring broad contextual understanding.
- Efficient Size: Offers high performance in an 8B parameter footprint, optimized for lower VRAM usage.
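Because the reasoning chain is delimited by `<think>...</think>` tags (per this card), downstream code typically needs to separate it from the final answer. A minimal sketch, assuming the tag format above; the helper name is illustrative:

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer) using <think> tags.

    Returns an empty reasoning string if no <think> block is present.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = output[match.end():].strip()
        return reasoning, answer
    return "", output.strip()

# Example: a toy response in the card's format
reasoning, answer = split_thinking("<think>2 + 2 = 4.</think>The answer is 4.")
# reasoning == "2 + 2 = 4."; answer == "The answer is 4."
```

The non-greedy `.*?` with `re.DOTALL` keeps the match to the first closing tag even when the reasoning spans multiple lines.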
## Recommended Use Cases
- Complex Reasoning: Ideal for math problems, logical puzzles, and scientific derivations, leveraging its chain-of-thought (CoT) mechanism.
- Long-Context Tasks: Processing and understanding information from lengthy texts or dialogues.
- Instruction Following: Adheres well to intricate user instructions and constraints.
- Multilingual NLP: Fluent generation and understanding in both Chinese and English.
## Training Details
The model was fine-tuned on approximately 88,000 high-quality examples, including specialized datasets for reasoning and CoT, ShareGPT for conversational flow, and instruction following. Training focused specifically on modeling assistant behavior: the loss was computed only on assistant responses, not on prompt tokens.
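Response-only training is usually implemented by masking prompt tokens out of the loss. A minimal sketch under that assumption (the function name and token IDs are illustrative; the card does not specify the exact mechanism):

```python
# -100 is the conventional PyTorch ignore index for cross-entropy loss:
# positions labeled -100 contribute nothing to the gradient.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Build training labels from input_ids, masking the prompt span so
    that only the assistant response tokens are supervised."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy sequence: 3 prompt tokens followed by 2 response tokens
labels = mask_prompt_labels([11, 22, 33, 44, 55], prompt_len=3)
# → [-100, -100, -100, 44, 55]
```

With labels built this way, a standard causal-LM cross-entropy loss learns only from the response span, which matches the card's stated focus on assistant behavior.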