nightmedia/Qwen3.6-35B-A3B-Fable-Holo3-Qwopus
The nightmedia/Qwen3.6-35B-A3B-Fable-Holo3-Qwopus model is a 35.1-billion-parameter language model based on the Qwen3.6 architecture, created by nightmedia by merging two specialized Qwen3.6-35B variants. It targets reasoning performance, particularly multi-step abstraction, through a selective mixed-precision quantization scheme: attention heads and embeddings are kept at 6-bit precision to preserve routing fidelity, while general layers use 4-bit precision. The scheme is tuned for reasoning benchmarks such as ARC while keeping memory usage and throughput efficient.
Model Overview
nightmedia/Qwen3.6-35B-A3B-Fable-Holo3-Qwopus is a 35.1-billion-parameter language model derived from the Qwen3.6 architecture. It is a merge of armand0e/Qwen3.6-35B-A3B-Fable-5-Distill and nightmedia/Qwen3.6-35B-A3B-MTP-Holo3-Qwopus-BF16, quantized with a mixed-precision scheme intended to preserve reasoning capability.
Key Capabilities & Design Philosophy
- Enhanced Reasoning: The model demonstrates improved performance on reasoning benchmarks like ARC, attributed to its "Deckard(qx)" mixed-precision quantization scheme.
- Selective Precision Quantization: It allocates 6-bit precision to high-sensitivity pathways such as attention heads and embeddings, crucial for maintaining the fidelity of context selection and multi-step abstraction. General layers utilize 4-bit precision for efficiency.
- Optimized Memory & Throughput: The qx64-hi quantization variant achieves a perplexity of 4.438 with 36.91 GB peak memory and approximately 1466 tokens/second, balancing reasoning performance with resource efficiency.
- Group Size 32: A uniform group size across all layers simplifies kernel fusion and ensures consistent quantization noise, aiding stable Multi-Token Prediction (MTP) distillation.
- Functional Parallel to Holodeck Architecture: The quantization strategy mirrors a "selective precision over uniform compression" philosophy, allocating fidelity where routing is critical and compressing where redundancy exists.
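
The trade-off between 4-bit and 6-bit precision described in the list above can be illustrated with a small sketch. It assumes a plain min/max affine quantizer applied per group of 32 weights; the actual "Deckard(qx)" implementation is not described here, so the helper names (`quantize_group`, `quantize`, `rms_error`) and the synthetic Gaussian weights are hypothetical and serve only to show how extra bits reduce reconstruction error.

```python
import random

def quantize_group(values, bits):
    """Affine-quantize one group to `bits` bits and return the dequantized values.

    Assumption: simple min/max affine quantization, not the model's actual kernel.
    """
    levels = (1 << bits) - 1          # 15 steps for 4-bit, 63 steps for 6-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]

def quantize(weights, bits, group_size=32):
    """Quantize a flat list of weights group by group (group size 32 by default)."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size], bits))
    return out

def rms_error(original, restored):
    """Root-mean-square reconstruction error between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(original, restored)) / len(original)) ** 0.5

# Synthetic stand-in for a weight tensor (hypothetical data, fixed seed).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err4 = rms_error(weights, quantize(weights, bits=4))
err6 = rms_error(weights, quantize(weights, bits=6))
print(f"4-bit RMS error: {err4:.6f}")
print(f"6-bit RMS error: {err6:.6f}")
```

Under these assumptions the 6-bit path reconstructs the weights more closely than the 4-bit path, which is the intuition behind reserving the higher precision for sensitive pathways.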
Good For
- Applications requiring strong multi-step abstraction and reasoning capabilities.
- Scenarios where optimizing for reasoning benchmarks is a priority.
- Environments that need to balance performance, memory efficiency, and throughput in a 35B-parameter model.
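
When judging whether the model fits a given memory budget, a rough back-of-the-envelope estimate of the weight footprint may help. The sketch below assumes each group of 32 weights stores one 16-bit scale and one 16-bit bias (32 bits of overhead per group), and it treats the fraction of parameters held at 6-bit as an unknown input; both are assumptions, and the function names are hypothetical. Peak memory also includes activations and caches, so this is not expected to reproduce the reported 36.91 GB figure.

```python
def bits_per_weight(bits, group_size=32, overhead_bits=32):
    """Effective storage cost per weight, including per-group metadata.

    Assumption: one 16-bit scale plus one 16-bit bias per group.
    """
    return bits + overhead_bits / group_size

def weight_footprint_gb(n_params, frac_6bit):
    """Estimated weight storage in GB for a 6-bit/4-bit mix.

    `frac_6bit` is the (hypothetical) share of parameters kept at 6-bit.
    """
    avg_bits = frac_6bit * bits_per_weight(6) + (1 - frac_6bit) * bits_per_weight(4)
    return n_params * avg_bits / 8 / 1e9

N_PARAMS = 35.1e9  # parameter count stated in the model description

# The 6-bit fractions below are illustrative guesses, not measured values.
for frac in (0.0, 0.1, 0.25):
    print(f"{frac:.0%} at 6-bit: {weight_footprint_gb(N_PARAMS, frac):.2f} GB")
```

With a group size of 32, the metadata adds one bit per weight under these assumptions, so 4-bit layers cost about 5 bits per weight and 6-bit layers about 7.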