Entrit/Qwen2.5-0.5B-trit-uniform-d3
Entrit/Qwen2.5-0.5B-trit-uniform-d3 is a 0.5-billion-parameter Qwen2.5 model that has undergone balanced ternary post-training quantization (PTQ) at depth d=3, yielding 5.05 bits per weight. Developed by Entrit Systems, it is optimized for efficient inference on hardware that consumes the packed trit format directly, offering a highly compressed representation of the original Qwen2.5-0.5B for scenarios where a reduced memory footprint and fast processing of quantized weights are critical.
Model Overview
Entrit/Qwen2.5-0.5B-trit-uniform-d3 is a quantized version of the Qwen/Qwen2.5-0.5B language model, developed by Entrit Systems. It uses balanced ternary post-training quantization (PTQ) at a depth of d=3, which gives 27 levels per weight (log2(27) ≈ 4.75 bits); together with the per-group scale index, this works out to 5.05 bits per weight. The quantization method is uniform PTQ, applied to all 2D linear matrices in the model.
Key Quantization Details
- Source Model: Qwen/Qwen2.5-0.5B
- Quantization Depth: d=3 (27 levels)
- Bits per Weight: 5.05 bpw (derivation sketched after this list)
- Group Size: 16
- Scale Codebook: 27-entry log-spaced (scale_depth=3)
- Quantized Layers: All 2D linear matrices
- FP16 Kept: lm_head, token embeddings, and all `*_norm` layers remain in FP16.
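The listed parameters pin down the 5.05 bpw figure: each weight carries log2(27) ≈ 4.75 bits of trit payload, and each group of 16 weights shares one index into the 27-entry scale codebook, adding log2(27)/16 ≈ 0.30 bits per weight. The sketch below illustrates this arithmetic together with a minimal group quantizer; the codebook range and the round-to-nearest rule are illustrative assumptions, not the exact codec from the paper.

```python
import numpy as np

DEPTH = 3                       # balanced-ternary digits per weight
LEVELS = 3 ** DEPTH             # 27 levels: integers in [-13, +13]
QMAX = (LEVELS - 1) // 2        # 13
GROUP_SIZE = 16                 # weights sharing one scale index
SCALE_LEVELS = 27               # 27-entry scale codebook (scale_depth=3)

# Hypothetical log-spaced codebook; the real range and spacing are set by
# the Entrit codec and are assumed here purely for illustration.
scale_codebook = np.logspace(-4, 0, SCALE_LEVELS)

def quantize_group(w):
    """Round-to-nearest quantization of one 16-weight group."""
    ideal = np.abs(w).max() / QMAX                   # ideal per-group scale
    s_idx = np.abs(scale_codebook - ideal).argmin()  # snap to codebook entry
    q = np.clip(np.round(w / scale_codebook[s_idx]), -QMAX, QMAX).astype(int)
    return q, s_idx

def dequantize_group(q, s_idx):
    return q * scale_codebook[s_idx]

# Information content: trit payload plus amortized scale index.
bpw = np.log2(LEVELS) + np.log2(SCALE_LEVELS) / GROUP_SIZE
print(f"{bpw:.2f} bpw")  # -> 5.05

# Round-trip one group to see the reconstruction error.
w = 0.02 * np.random.randn(GROUP_SIZE).astype(np.float32)
q, s_idx = quantize_group(w)
print(np.abs(dequantize_group(q, s_idx) - w).max())
```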
Performance and Use Cases
While the on-disk size remains similar to the FP16 source (weights are stored dequantized for stock-transformers compatibility), the 5.05-bpw figure reflects the model's reduced information content. This makes it particularly suitable for inference on specialized hardware that consumes the packed trit format directly, as supported by Entrit/tritllm-kernel. The quantization follows the codec described in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
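Because the checkpoint ships dequantized FP16 weights, it should load like any other Qwen2.5 model with stock transformers. The snippet below is a standard loading example under that assumption; the packed-trit path through Entrit/tritllm-kernel would use that kernel's own API instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-0.5B-trit-uniform-d3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Quick generation check with the dequantized FP16 weights.
inputs = tokenizer("Balanced ternary quantization", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```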
This model is ideal for applications requiring highly efficient and memory-optimized LLM inference, especially where custom hardware or specialized kernels can leverage the balanced ternary quantization for faster computation.