Viesar/gemma-3-4b-opus-reasoning-distill

Hugging Face
VISIONConcurrency Cost:1Model Size:4.3BQuant:BF16Ctx Length:32kPublished:May 6, 2026License:gemmaArchitecture:Transformer0.0K Warm

Viesar/gemma-3-4b-opus-reasoning-distill is a 4.3 billion parameter Gemma 3-4B instruction-tuned model, fine-tuned by Viesar using QLoRA. This model is specifically distilled for reasoning tasks, leveraging the Opus-4.6-Reasoning dataset. It is designed for chat-style reasoning applications, despite showing reduced performance on traditional completion-style math benchmarks compared to its base model.

Loading preview...

Viesar/gemma-3-4b-opus-reasoning-distill Overview

This model is a QLoRA fine-tune of the google/gemma-3-4b-it base, developed by Viesar. It features 4.3 billion parameters and a 32768 token context length. The fine-tuning process utilized the Crownelius/Opus-4.6-Reasoning-3300x dataset, specifically filtered to approximately 1,900 examples, with a strong focus on mathematical reasoning (94% math content).

Key Characteristics & Performance

  • Base Model: google/gemma-3-4b-it
  • Fine-tuning Method: QLoRA (r=16, alpha=16)
  • Training Data: Subset of Opus-4.6-Reasoning-3300x, emphasizing math problems.
  • Transparent Benchmarking: The model's performance on MATH-500 and GSM8K benchmarks shows a decrease compared to the base Gemma 3 4B model. For instance, on MATH-500, it achieved 24.6% exact match compared to the base's 29.6%, and on GSM8K, 53.7% exact match versus 68.7%.
  • Reasoning Style: This performance drop is attributed to a format mismatch between the benchmarks (few-shot completion prompts) and the model's trained chat-template style (<think>...</think>), as well as potential capability narrowing from specialized training. Qualitatively, the model retains its intended chat-style reasoning capabilities.
  • Inference Cost: Identical to the base model, with peak VRAM around 6.2 GB on an RTX 3050 8GB for 4-bit inference.

Use Cases

  • Chat-style Reasoning Tasks: Best suited for interactive reasoning challenges where the model can utilize its trained thought process and chat template.
  • Further Fine-tuning: The provided transformers format is suitable for additional fine-tuning or evaluation.
  • Local Inference: A GGUF version is available for efficient local inference, including multimodal readiness.