exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm
The exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm model is a 35.1 billion parameter Qwen3.6-A3B variant, dequantized from a Q4_K_M GGUF checkpoint to BF16 precision. It is prepared for deployment with vLLM: its tensor layout has been transformed to match the HF/vLLM conventions for Qwen3.5/3.6 Gated DeltaNet tensors. It supports a context length of 32768 tokens.
exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm Overview
This model is a 35.1 billion parameter variant of the Qwen3.6-A3B architecture, prepared for inference with vLLM. It originates from a private Exolabs checkpoint, converted from the Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf file in bartowski/Qwen_Qwen3.6-35B-A3B-GGUF.
Key Technical Details
- Parameter Count: 35.1 billion parameters.
- Dequantization: The model's weights were dequantized from Q4_K_M GGUF to FP32, then transformed to the Hugging Face/vLLM layout, and finally cast to BF16 precision.
- vLLM Optimization: It includes specific adjustments for Qwen3.5/3.6 Gated DeltaNet tensors, which are stored differently in llama.cpp, ensuring compatibility and optimal performance within vLLM.
- Context Length: Supports a substantial context length of 32768 tokens.
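The final step of the dequantization path above, the cast from FP32 to BF16, can be illustrated with a small standard-library sketch. This is not the conversion script used for this checkpoint; the function names `fp32_to_bf16_bits` and `bf16_bits_to_fp32` are hypothetical, and round-to-nearest-even is an assumption about the rounding mode (it is the behavior of common framework casts). BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an FP32 value.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x, rounding to nearest even.

    Illustrative helper only; not taken from the model's conversion code.
    """
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))

    # NaN: truncate and force a mantissa bit so the result stays NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF

    # Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit,
    # then drop the 16 low-order mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float (exact)."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x
```

For example, `fp32_to_bf16_bits(1.0)` yields `0x3F80`, and round-tripping `3.14159` through both helpers gives `3.140625`, which shows the precision that BF16 retains after the cast.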
Deployment and Usage
This model is validated with vLLM 0.23.0, configured with the bfloat16 dtype, Triton-based MoE and attention backends, and Triton for Gated DeltaNet (GDN) prefill. It is intended for developers who want a dequantized Qwen3.6-A3B model for vLLM-based serving, particularly where BF16 precision and a 32768-token context window are beneficial.
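A minimal sketch of loading the model through vLLM's offline Python API follows, assuming vLLM is installed and enough GPU memory is available. It sets only the dtype and context length stated above; the Triton MoE, attention, and GDN prefill backend settings are not shown because their exact configuration is not given here. The `ENGINE_KWARGS` name, the `generate` helper, and the sampling values are illustrative assumptions.

```python
MODEL_ID = "exolabs/Qwen3.6-35B-A3B-Q4KM-dequant-bf16-vllm"

# Engine arguments taken from the model description: BF16 weights and a
# 32768-token context window. Backend selection is deliberately omitted.
ENGINE_KWARGS = {
    "model": MODEL_ID,
    "dtype": "bfloat16",
    "max_model_len": 32768,
}


def generate(prompt: str, max_tokens: int = 64) -> str:
    """Load the model with vLLM and return one completion for `prompt`.

    Hypothetical helper; sampling values are placeholders, not tuned settings.
    """
    # Imported lazily so this module can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**ENGINE_KWARGS)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Calling `generate("Explain BF16 in one sentence.")` loads the engine and returns the generated text. At 2 bytes per parameter, the 35.1 billion BF16 weights occupy roughly 70 GB, so a multi-GPU setup may also need vLLM's `tensor_parallel_size` argument.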