Entrit/Qwen2.5-3B-trit-uniform-d4

Text Generation · Model Size: 3.1B · Quant: BF16 · Context Length: 32k · Published: May 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Entrit/Qwen2.5-3B-trit-uniform-d4 is a 3.1 billion parameter Qwen2.5-3B model developed by Entrit Systems, featuring balanced ternary post-training quantization at a depth of d=4, for an effective 6.64 bits per weight. The model is optimized for inference on hardware that can consume its packed trit format directly, which carries far less information per weight than the FP16 source. It is suited to deployments where a small memory footprint and fast inference matter most.


Entrit/Qwen2.5-3B-trit-uniform-d4: Quantized Qwen2.5-3B Model

This model is a balanced ternary post-training quantized version of the original Qwen/Qwen2.5-3B model, developed by Entrit Systems. It utilizes a depth of d=4, which translates to 81 levels per weight and an effective 6.64 bits per weight. This quantization significantly reduces the information content of the model's weights, making it highly efficient for specialized inference hardware.
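As a quick sanity check on those numbers: depth d yields 3^d representable levels, and each trit carries log2(3) ≈ 1.58 bits of raw information. A minimal sketch follows; note that the 6.64 bpw quoted above is the card's effective rate, and reading the gap over the raw ~6.34 bits as packing and scale-metadata overhead is our assumption, not something the card states.

```python
import math

d = 4                        # trit depth used by this model
levels = 3 ** d              # 81 representable levels per weight
raw_bits = d * math.log2(3)  # ~6.34 bits of raw trit information

print(f"{levels} levels, {raw_bits:.2f} raw bits/weight")
# The card's effective 6.64 bpw sits above this raw figure; we assume
# the difference covers packing and per-group scale metadata.
```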

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-3B
  • Quantization Method: Uniform Post-Training Quantization (PTQ); see the sketch after this list
  • Depth: d=4 (81 levels)
  • Bits per Weight: 6.64
  • Quantized Layers: All 2D linear matrices
  • FP16 Kept: lm_head, token embeddings, and all *_norm layers
  • Codec: tritllm v2, as detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" by Stentzel (2026).
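To make the uniform-PTQ and balanced-ternary pieces of the list above concrete, here is a minimal sketch: weights are snapped to the nearest of the 81 zero-symmetric levels, and each level index is encoded as four balanced trits. This is an illustration only, not the tritllm v2 codec; the function names and the per-tensor scale are our assumptions.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, depth: int = 4):
    """Snap FP weights to the nearest of 3**depth uniformly spaced,
    zero-symmetric levels. Per-tensor max scaling is an assumption;
    the real codec may use finer-grained scales."""
    half = (3 ** depth - 1) // 2                  # 40 for depth=4
    scale = np.abs(w).max() / half
    idx = np.clip(np.round(w / scale), -half, half).astype(np.int8)
    return idx, scale                             # dequantize: idx * scale

def to_balanced_trits(n: int, depth: int = 4) -> list[int]:
    """Encode a level index in [-(3**depth - 1)//2, +(3**depth - 1)//2]
    as `depth` balanced trits in {-1, 0, +1}, least significant first."""
    trits = []
    for _ in range(depth):
        r = n % 3          # Python's % returns 0, 1, or 2
        if r == 2:         # remap 2 -> -1 and carry into the next trit
            r = -1
        n = (n - r) // 3
        trits.append(r)
    return trits

def from_balanced_trits(trits: list[int]) -> int:
    """Decode least-significant-first balanced trits back to an integer."""
    return sum(t * 3 ** i for i, t in enumerate(trits))

# Round-trip check over the full depth-4 range [-40, 40].
assert all(from_balanced_trits(to_balanced_trits(n)) == n for n in range(-40, 41))
```

The four trits per weight can then be bit-packed for hardware that consumes the trit stream directly; one common layout stores five trits per byte, since 3^5 = 243 fits in 256 values, though whether tritllm v2 uses that layout is not stated here.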

Performance and Use Cases

While the on-disk size remains similar to the FP16 source, because the weights are stored dequantized for stock transformers compatibility, the 6.64-bpw figure is what matters for inference on hardware designed to consume the packed trit format directly. The model targets scenarios where memory footprint and inference speed are critical, especially when paired with compatible kernels such as Entrit/tritllm-kernel. It offers a path to more efficient deployment of large language models without a significant loss in quality.
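
Because the published checkpoint stores dequantized weights, it should load like any stock Qwen2.5 model through transformers. A minimal sketch follows; the prompt and generation settings are placeholders, and packed-trit inference through Entrit/tritllm-kernel is not shown since its API is not documented here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-3B-trit-uniform-d4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint is stored at full precision
    device_map="auto",
)

inputs = tokenizer("Balanced ternary quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```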