Entrit/Qwen2.5-32B-trit-uniform-d2
Entrit/Qwen2.5-32B-trit-uniform-d2 is a 32.8 billion parameter language model from Entrit, based on Qwen/Qwen2.5-32B, featuring balanced ternary post-training quantization. It uses a uniform quantization method at depth d=2, resulting in 3.47 bits per weight for its 2D linear matrices. This model is optimized for efficient inference on hardware that can directly consume its packed trit format, offering a compact representation of the original Qwen2.5-32B.
Overview
Entrit/Qwen2.5-32B-trit-uniform-d2 is a 32.8 billion parameter large language model derived from Qwen/Qwen2.5-32B. Developed by Entrit, this model implements a balanced ternary post-training quantization (PTQ) scheme, as detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
Key Quantization Details
- Source Model: Qwen/Qwen2.5-32B
- Quantization Method: Uniform PTQ with a depth of d=2, yielding 9 levels per weight.
- Bits per Weight: Achieves an information content of 3.47 bits per weight for quantized matrices.
- Quantized Components: All 2D linear matrices are ternary-quantized.
- FP16 Components: `lm_head`, token embeddings, and all `*_norm` layers remain in FP16 for compatibility and performance.
- Codec: Uses the `tritllm` v2 codec, available in the Entrit/tritllm-codec repository.
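At depth d=2, each weight maps to one of the 9 levels formed by two balanced trits, i.e. the integers -4 to 4 times a scale. The paper's actual scale and rounding procedure is not reproduced here; the following is a minimal sketch assuming a per-tensor absmax scale and round-to-nearest.

```python
import numpy as np

def quantize_bt_d2(w: np.ndarray):
    """Round weights to the nearest of the 9 depth-2 levels {-4..4} * scale.

    The per-tensor absmax scale below is an illustrative assumption; the
    paper's actual scale selection may differ.
    """
    scale = float(np.abs(w).max()) / 4.0
    q = np.clip(np.rint(w / scale), -4, 4).astype(np.int8)
    return q, scale

def dequantize_bt_d2(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct FP weights from integer levels.
    return q.astype(np.float32) * scale

# Round-to-nearest keeps the per-weight error within half a step (scale / 2).
w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, s = quantize_bt_d2(w)
err = float(np.abs(w - dequantize_bt_d2(q, s)).max())
```

With this scale choice the quantization error is bounded by `scale / 2` per weight; production PTQ schemes typically tune the scale (or grid) per channel to tighten that bound.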
Performance and Usage
Because the checkpoint is dequantized for compatibility with standard transformers loading, its on-disk size matches the FP16 source; the 3.47-bpw figure applies to inference on specialized hardware that consumes the packed trit format directly. The model targets scenarios where a reduced memory footprint and potentially faster inference (on compatible hardware) matter, with minimal loss in capability relative to the FP16 base.
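The exact packed trit layout used by the `tritllm` v2 codec is not documented here. As a purely hypothetical illustration of how a trit stream can be stored compactly, the sketch below splits each depth-2 level into two balanced trits and packs five trits per byte (valid since 3^5 = 243 ≤ 256).

```python
def level_to_trits(v: int) -> tuple[int, int]:
    # Split a depth-2 level in -4..4 into balanced trits (t1, t0) with v = 3*t1 + t0.
    t0 = ((v + 1) % 3) - 1          # balanced remainder in {-1, 0, 1}
    t1 = (v - t0) // 3
    return t1, t0

def pack_trits(trits: list[int]) -> bytes:
    # Pack five balanced trits per byte as a little-endian base-3 integer.
    # (Illustrative only; the real tritllm v2 layout may differ.)
    padded = trits + [0] * (-len(trits) % 5)
    return bytes(
        sum((t + 1) * 3**i for i, t in enumerate(padded[j:j + 5]))
        for j in range(0, len(padded), 5)
    )

def unpack_trits(data: bytes, n: int) -> list[int]:
    # Recover the first n balanced trits from the packed bytes.
    out = []
    for b in data:
        for _ in range(5):
            out.append(b % 3 - 1)
            b //= 3
    return out[:n]

levels = [-4, -1, 0, 2, 4]
trits = [t for v in levels for t in level_to_trits(v)]
packed = pack_trits(trits)  # 10 trits fit in 2 bytes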