Jackrong/GPT-Distill-Qwen3-8B-Thinking

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

Jackrong/GPT-Distill-Qwen3-8B-Thinking is an 8 billion parameter instruction-tuned language model based on Qwen3-8B, featuring a 16K token context length. It is designed for complex reasoning and instruction following, having been distilled from 120B+ parameter teacher models. Its key differentiator is a "Thinking" capability: it generates an explicit chain-of-thought inside `<think>` tags before answering, which improves performance on math, logic, and scientific tasks. The result is an 8B model that mimics the reasoning patterns of much larger architectures at a fraction of the cost.


Model Overview

Jackrong/GPT-Distill-Qwen3-8B-Thinking is an 8 billion parameter instruction-tuned and reasoning-enhanced language model built upon the Qwen3-8B base. It features a 16,384 token context window and supports both English and Chinese. This model was developed using Supervised Fine-Tuning (SFT) with Unsloth and incorporates knowledge distillation from large-scale reasoning models (120B/235B class).

Key Differentiators

  • "Thinking" Capability: Explicitly trained to generate internal reasoning chains, wrapped in <think>...</think> tags, before providing a final answer. This significantly improves performance on complex math, logic, and scientific tasks.
  • Distilled Intelligence: Inherits advanced reasoning patterns from high-intelligence teacher models (GPT-OSS-120B and Qwen3-235B), allowing an 8B model to mimic the problem-solving approaches of much larger architectures.
  • Long Context: Processes extensive documents and conversations up to 16K tokens, making it suitable for tasks requiring broad contextual understanding.
  • Efficient Size: Offers high performance in an 8B parameter footprint, optimized for lower VRAM usage.
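Because the reasoning chain is emitted inline inside `<think>...</think>` tags, applications typically strip it before showing the reply to users. A minimal sketch of that post-processing step (the helper name `split_thinking` is illustrative, not part of any official API):

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer.

    The model emits its reasoning inside <think>...</think> before
    the visible reply; callers usually log the reasoning and show
    only the answer.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No thinking block found; treat the whole output as the answer.
        return "", output.strip()
    thinking = match.group(1).strip()
    answer = output[match.end():].strip()
    return thinking, answer

raw = "<think>2 + 2 equals 4 by basic arithmetic.</think>The answer is 4."
cot, answer = split_thinking(raw)
```

Here `cot` holds the hidden reasoning and `answer` the user-facing text, so the two can be routed (or priced) separately.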

Recommended Use Cases

  • Complex Reasoning: Ideal for math problems, logical puzzles, and scientific derivations, leveraging its CoT mechanism.
  • Long-Context Tasks: Processing and understanding information from lengthy texts or dialogues.
  • Instruction Following: Adheres well to intricate user instructions and constraints.
  • Multilingual NLP: Fluent generation and understanding in both Chinese and English.

Training Details

The model was fine-tuned on approximately 88,000 high-quality examples, including specialized reasoning and CoT datasets, ShareGPT-style data for conversational flow, and instruction-following examples. Training focused on modeling assistant behavior by computing the loss only on response tokens.
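"Training only on responses" conventionally means masking the prompt tokens out of the cross-entropy loss, so gradients come only from the assistant's reply. A minimal sketch under that assumption (using the `-100` ignore-index convention from PyTorch/Hugging Face; the function name is illustrative):

```python
def mask_prompt_labels(input_ids, prompt_len, ignore_index=-100):
    """Build SFT labels that skip loss on the prompt portion.

    Tokens with label -100 are ignored by the standard cross-entropy
    loss, so only the assistant's response tokens drive the gradient.
    """
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = ignore_index  # prompt token: excluded from the loss
    return labels

# Example: a 4-token sequence where the first 2 tokens are the prompt.
labels = mask_prompt_labels([101, 102, 201, 202], prompt_len=2)
```

Frameworks such as Unsloth expose helpers for this pattern, but the masking logic reduces to the few lines above.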