exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm

Text Generation | Concurrency Cost: 3 | Model Size: 35.1B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 26, 2026 | License: other | Architecture: Transformer | Cold

The exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm model is a 35.1 billion parameter Qwen3.6-A3B variant, dequantized from a Q4_K_M GGUF checkpoint to BF16 precision. It is prepared for deployment with vLLM: the Qwen3.5/3.6 Gated DeltaNet tensors have been transformed from the llama.cpp layout to the HF/vLLM layout. It supports a context length of 32768 tokens.
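As a rough sizing aid, the weight footprint implied by the parameter count and the BF16 dtype can be estimated with simple arithmetic. The sketch below assumes exactly 35.1 billion parameters (the card's rounded figure) and counts weights only; KV cache, activations, and runtime overhead are not included, and the helper name `weight_bytes` is purely illustrative.

```python
# Rough weight-only footprint estimate. NUM_PARAMS is the card's rounded
# figure (35.1B), so the results are approximations, not measured sizes.
NUM_PARAMS = 35_100_000_000

def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Bytes needed to store num_params weights at a fixed width."""
    return num_params * bytes_per_param

def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gibibytes (2**30 bytes)."""
    return num_bytes / 2**30

bf16_bytes = weight_bytes(NUM_PARAMS, 2)  # BF16: 2 bytes per weight
fp32_bytes = weight_bytes(NUM_PARAMS, 4)  # FP32: 4 bytes per weight

print(f"BF16 weights: {bf16_bytes / 1e9:.1f} GB ({to_gib(bf16_bytes):.1f} GiB)")
print(f"FP32 weights: {fp32_bytes / 1e9:.1f} GB ({to_gib(fp32_bytes):.1f} GiB)")
```

Under these assumptions the BF16 weights come to about 70.2 GB, which is half of what an FP32 copy of the same weights would need.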


exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm Overview

This model is a 35.1 billion parameter variant of the Qwen3.6-A3B architecture, prepared for inference with vLLM. It originates from a private Exolabs checkpoint that was converted from the Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf file in bartowski/Qwen_Qwen3.6-35B-A3B-GGUF.

Key Technical Details

  • Parameter Count: 35.1 billion parameters.
  • Dequantization: The model's weights were dequantized from Q4_K_M GGUF to FP32, then transformed to the Hugging Face/vLLM layout, and finally cast to BF16 precision.
  • vLLM Compatibility: The Qwen3.5/3.6 Gated DeltaNet tensors, which llama.cpp stores in a different layout, are converted to the layout vLLM expects.
  • Context Length: Supports a context length of 32768 tokens.
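The final step of the dequantization pipeline, casting FP32 to BF16, can be illustrated on a single value. This is a minimal sketch and not the conversion script used for this checkpoint: it assumes round-to-nearest-even, which is the usual behavior of framework casts, and the function names are hypothetical. BF16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of an FP32 value.

```python
import struct

def fp32_to_bf16_bits(value: float) -> int:
    """Round an FP32 value to BF16 and return the 16-bit pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # NaN: exponent all ones and a non-zero mantissa. Return a quiet NaN
    # so that rounding cannot carry into the exponent or sign bits.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return 0x7FC0
    # Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit,
    # then drop the low 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(bits16: int) -> float:
    """Widen a BF16 bit pattern back to FP32 (exact, no rounding)."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value

if __name__ == "__main__":
    for x in (1.0, -2.5, 1.01171875, 0.1):
        b = fp32_to_bf16_bits(x)
        print(f"{x!r:>12} -> 0x{b:04X} -> {bf16_bits_to_fp32(b)!r}")
```

Because BF16 shares the FP32 exponent range, the cast loses mantissa precision but does not shrink the representable range, which is why an FP32 intermediate followed by a BF16 cast is a common path.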

Deployment and Usage

This model is validated with vLLM 0.23.0, configured with the bfloat16 dtype, Triton-based MoE and attention backends, and Triton for GDN (Gated DeltaNet) prefill. It is intended for developers who want a dequantized Qwen3.6-A3B model for vLLM-based serving, particularly where BF16 precision and the 32768-token context window are useful.
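A minimal sketch of how a serving command might be assembled from the settings named above. It uses only the `vllm serve` flags `--dtype` and `--max-model-len`; the backend settings for MoE, attention, and GDN prefill are deliberately left out because the card does not state how they were set. The helper `build_serve_command` is hypothetical, and the command is only printed, not executed.

```python
import shlex

MODEL_ID = "exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm"

def build_serve_command(model_id: str = MODEL_ID,
                        max_model_len: int = 32768) -> list[str]:
    """Build an argv list for `vllm serve` with BF16 weights.

    Backend selection (MoE, attention, GDN prefill) is not included:
    the source does not give those settings, so they are left to the reader.
    """
    return [
        "vllm", "serve", model_id,
        "--dtype", "bfloat16",                  # matches the BF16 checkpoint
        "--max-model-len", str(max_model_len),  # 32768-token context
    ]

if __name__ == "__main__":
    # Print a shell-safe command line for copy and paste.
    print(shlex.join(build_serve_command()))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` unchanged if launching from Python is preferred.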