Entrit/Qwen2.5-7B-qat-d2-6k
Entrit/Qwen2.5-7B-qat-d2-6k is a 7.6 billion parameter language model based on the Qwen2.5-7B architecture, developed by Entrit. This model features balanced ternary quantization at a depth of d=2, achieving 3.47 bits per weight through block-wise Quantization-Aware Training (QAT). It is specifically optimized for efficient inference on hardware that can directly consume packed trit formats, offering significant memory and computational savings.
Overview
Entrit/Qwen2.5-7B-qat-d2-6k is a 7.6 billion parameter model derived from the Qwen/Qwen2.5-7B base model. Its core innovation lies in its balanced ternary quantization, which reduces the model's weight representation to 3.47 bits per weight. This quantization was achieved using a block-wise Quantization-Aware Training (QAT) method, trained on 6,000 WikiText samples over 500 steps per block.
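As a rough illustration of what depth-2 balanced ternary quantization means, the sketch below maps each weight in a block to one of the 3^2 = 9 levels {-4, ..., 4} with a single per-block scale. The function names and the max-abs scaling rule are illustrative assumptions, not the card's actual QAT procedure, which trains through the quantizer rather than applying it post hoc.

```python
import numpy as np

def quantize_block_d2(block):
    # Map each weight to one of 3**2 = 9 balanced-ternary levels
    # {-4, ..., 4}, using a single per-block scale (an illustrative
    # choice; the actual QAT recipe may derive scales differently).
    levels = 4                                # (3**2 - 1) // 2
    amax = float(np.max(np.abs(block)))
    scale = amax / levels if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    # Recover an FP approximation of the block from integer levels.
    return q.astype(np.float32) * scale

w = np.array([0.9, -0.31, 0.02, -1.2], dtype=np.float32)
q, s = quantize_block_d2(w)       # q holds integer levels in [-4, 4]
w_hat = dequantize_block(q, s)    # FP16/FP32 approximation of the block
```

During QAT, gradients would flow through a straight-through estimator around the rounding step; the snippet only shows the forward mapping.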
Key Quantization Details
- Source Model: Qwen/Qwen2.5-7B
- Quantization Depth: d=2 (9 levels per weight)
- Bits per Weight: 3.47
- Method: Block-wise QAT using the `tritllm-codec` from Entrit.
- Quantized Layers: All 2D linear matrices are ternary-quantized.
- FP16 Layers: `lm_head`, token embeddings, and all `*_norm` layers remain in FP16 to preserve accuracy.
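The 9 levels at d=2 can be decomposed into two balanced trits t1, t0 in {-1, 0, 1} with q = 3*t1 + t0, which is the per-weight representation a packed-trit consumer would operate on. A minimal sketch of that decomposition (the helper name is hypothetical):

```python
def to_trits(q, d=2):
    # Decompose a balanced-ternary level q (in [-4, 4] for d=2) into
    # d trits t_i in {-1, 0, 1} such that q == sum(t_i * 3**i).
    trits = []
    for _ in range(d):
        r = ((q + 1) % 3) - 1      # balanced remainder in {-1, 0, 1}
        trits.append(r)
        q = (q - r) // 3
    return trits                    # least-significant trit first

# Every d=2 level round-trips: q == t0 + 3 * t1
for q in range(-4, 5):
    t0, t1 = to_trits(q)
    assert q == t0 + 3 * t1
```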
Performance and Use Cases
While the model weights are dequantized to FP16 for compatibility with the standard transformers library, the model's full efficiency is realized on hardware that can process the packed trit format directly. This makes it particularly suitable for applications requiring a reduced memory footprint and faster inference, as detailed in the associated research paper, "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026). The 3.47-bpw figure reflects the information content of the quantized matrices, consistent with how other quantization schemes report bits per weight.
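As a sketch of why a packed trit format saves memory, one common density trick stores 5 trits per byte, since 3^5 = 243 <= 256 (1.6 bits per trit, so 3.2 bits per d=2 weight before scale overhead). Whether the `tritllm-codec` packs this way is an assumption; the functions below are illustrative only.

```python
def pack5(trits):
    # Pack 5 balanced trits {-1, 0, 1} into one byte: 3**5 = 243 <= 256.
    assert len(trits) == 5
    val = 0
    for t in reversed(trits):
        val = val * 3 + (t + 1)    # map {-1, 0, 1} -> {0, 1, 2}
    return val

def unpack5(byte):
    # Inverse of pack5: recover 5 balanced trits from one byte.
    out = []
    for _ in range(5):
        out.append(byte % 3 - 1)
        byte //= 3
    return out

example = [1, 0, -1, 1, 1]
assert unpack5(pack5(example)) == example   # lossless round trip
```

A dedicated kernel would unpack trits on the fly and fuse the multiply by the block scale into the matmul, which is where the inference speedup on trit-aware hardware comes from.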