nightmedia/Qwen3.6-35B-A3B-Fable-Holo3-Qwopus

Text Generation | Concurrency Cost: 3 | Model Size: 35.1B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 30, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

The nightmedia/Qwen3.6-35B-A3B-Fable-Holo3-Qwopus model is a 35.1-billion-parameter language model based on the Qwen3.6 architecture. nightmedia created it by merging two specialized Qwen3.6-35B variants. The model targets stronger reasoning, particularly on multi-step abstraction tasks, and the author attributes these gains to a selective mixed-precision quantization scheme. The scheme uses 6-bit precision for attention heads and embeddings to preserve critical routing fidelity and 4-bit precision for the remaining layers. It is tuned for reasoning benchmarks such as ARC while keeping memory use and throughput efficient.
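The following sketch illustrates the selective-precision idea. It assigns bit widths by tensor name and then computes a parameter-weighted average bit width. The name patterns (`embed_tokens`, `self_attn`), the helper names, and the toy parameter counts are all hypothetical and assume Hugging Face-style Qwen tensor naming. This is not the actual Deckard(qx) recipe.

```python
# Hypothetical selective-precision policy: 6 bits for attention and
# embedding tensors, 4 bits for everything else. The substring patterns
# assume Hugging Face-style Qwen tensor names and are illustrative only.
HIGH_PRECISION_PATTERNS = ("embed_tokens", "self_attn")
HIGH_BITS = 6
LOW_BITS = 4


def bits_for(tensor_name: str) -> int:
    """Return the bit width this hypothetical policy assigns to a tensor."""
    if any(pattern in tensor_name for pattern in HIGH_PRECISION_PATTERNS):
        return HIGH_BITS
    return LOW_BITS


def average_bits(param_counts: dict[str, int]) -> float:
    """Parameter-weighted average bit width (ignores per-group metadata)."""
    total = sum(param_counts.values())
    weighted = sum(bits_for(name) * count for name, count in param_counts.items())
    return weighted / total


if __name__ == "__main__":
    # Toy inventory with made-up parameter counts, for illustration only.
    toy = {
        "model.embed_tokens.weight": 1_000,
        "model.layers.0.self_attn.q_proj.weight": 500,
        "model.layers.0.mlp.experts.0.up_proj.weight": 8_500,
    }
    for name in toy:
        print(name, bits_for(name))
    print(f"average bits: {average_bits(toy):.2f}")  # prints "average bits: 4.30"
```

In the toy inventory, 85% of the parameters sit in the 4-bit MLP tensor, so the average stays close to 4 bits. The figure ignores the per-group scale and offset storage that group-wise quantization adds on top of the nominal bit width.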


Model Overview

The nightmedia/Qwen3.6-35B-A3B-Fable-Holo3-Qwopus model is a 35.1-billion-parameter language model derived from the Qwen3.6 architecture. It merges armand0e/Qwen3.6-35B-A3B-Fable-5-Distill with nightmedia/Qwen3.6-35B-A3B-MTP-Holo3-Qwopus-BF16. The merge is paired with a mixed-precision quantization approach intended to preserve reasoning capability.

Key Capabilities & Design Philosophy

  • Enhanced Reasoning: The model card reports improved scores on reasoning benchmarks such as ARC and attributes them to the "Deckard(qx)" mixed-precision quantization scheme.
  • Selective Precision Quantization: High-sensitivity pathways such as attention heads and embeddings receive 6-bit precision, which helps preserve the fidelity of context selection and multi-step abstraction. General layers use 4-bit precision for efficiency.
  • Optimized Memory & Throughput: The qx64-hi quantization variant achieves a perplexity of 4.438 with 36.91 GB peak memory and approximately 1466 tokens/second, balancing reasoning performance with resource efficiency.
  • Group Size 32: Using the same group size (32) in every layer simplifies kernel fusion and keeps quantization noise consistent across layers. The author credits this consistency with stabilizing Multi-Token Prediction (MTP) distillation.
  • Functional Parallel to Holodeck Architecture: The quantization strategy mirrors a "selective precision over uniform compression" philosophy, allocating fidelity where routing is critical and compressing where redundancy exists.
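The group-size trade-off can be made concrete with a minimal NumPy sketch of affine (min/max) group-wise quantization. The function names are hypothetical. The overhead calculation assumes that each group stores one 16-bit scale and one 16-bit offset. The card does not specify the actual storage format or rounding scheme that qx64-hi uses.

```python
import numpy as np

GROUP_SIZE = 32  # the uniform group size described in the card


def quantize_groups(w: np.ndarray, bits: int, group_size: int = GROUP_SIZE):
    """Affine min/max quantization of a 1-D array in fixed-size groups.

    Returns integer codes plus one scale and one offset (the group minimum)
    per group. uint8 codes are sufficient for bit widths up to 8.
    """
    assert w.size % group_size == 0, "length must be a multiple of group_size"
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    # Constant groups have zero range; use scale 1 so every code is 0.
    scale = np.where(scale == 0, 1.0, scale)
    codes = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo


def dequantize_groups(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Map integer codes back to approximate real values."""
    return (codes * scale + lo).reshape(-1)


def effective_bits(bits: int, group_size: int = GROUP_SIZE, meta_bits: int = 32) -> float:
    """Bits per weight including per-group metadata.

    Assumes one 16-bit scale plus one 16-bit offset per group (32 bits).
    The sketch above keeps scales in full precision for simplicity.
    """
    return bits + meta_bits / group_size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096).astype(np.float32)
    for bits in (4, 6):
        codes, scale, lo = quantize_groups(w, bits)
        err = np.abs(w - dequantize_groups(codes, scale, lo)).max()
        print(f"{bits}-bit: max abs error {err:.4f}, ~{effective_bits(bits):.1f} bits/weight")
```

Under the stated metadata assumption, a group size of 32 adds 1 bit per weight (5.0 effective bits at 4-bit, 7.0 at 6-bit). Smaller groups track local weight ranges more tightly at the cost of that extra metadata. For a given group range, the 6-bit step size is 15/63 of the 4-bit step, roughly a quarter.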

Good For

  • Applications requiring strong multi-step abstraction and reasoning capabilities.
  • Scenarios where optimizing for reasoning benchmarks is a priority.
  • Environments that need a balance of reasoning performance, memory efficiency, and throughput in a 35B-parameter model.