Entrit/Qwen2.5-32B-trit-uniform-d3

Text Generation · Concurrency Cost: 2 · Model Size: 32.8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 26, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Entrit/Qwen2.5-32B-trit-uniform-d3 is a 32.8 billion parameter Qwen2.5 model from Entrit Systems, featuring balanced ternary post-training quantization (PTQ) at depth d=3. This scheme achieves an effective 5.05 bits per weight, substantially reducing the information content of the model's weights. It targets efficient inference on hardware that consumes packed trit formats, while remaining compatible with standard transformers by dequantizing to FP16 for general use. The model suits applications where memory footprint and inference speed are critical, with minimal performance degradation.
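The 5.05 BPW figure can be sanity-checked from the depth: at d=3 each weight carries three balanced-ternary digits, so the trit payload alone is log2(3³) ≈ 4.75 bits, and the remainder is presumably per-group scale overhead. The decomposition below is an assumption for illustration, not a documented breakdown of the format:

```python
import math

# Depth d = 3 balanced ternary: 3 trits per weight, 3**3 = 27 levels.
levels = 3 ** 3
payload_bits = math.log2(levels)          # trit payload per weight, ~4.755 bits

# The card reports 5.05 BPW overall; the difference is presumably per-group
# metadata (scales) amortized over the group size of 16.  This split is an
# assumption here, not a documented property of the codec.
overhead_per_weight = 5.05 - payload_bits
print(f"payload: {payload_bits:.3f} BPW, overhead: {overhead_per_weight:.3f} BPW")
```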


Entrit/Qwen2.5-32B-trit-uniform-d3: Balanced Ternary Quantization

This model is a 32.8 billion parameter variant of the Qwen2.5-32B architecture, developed by Entrit Systems. It implements a novel balanced ternary post-training quantization (PTQ) scheme, as detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).

Key Quantization Details

  • Quantization Method: Uniform PTQ applied to all 2D linear weight matrices.
  • Bits Per Weight (BPW): An effective 5.05 BPW, reflecting the information content of the quantized weights.
  • Depth: d=3, i.e., 3 trits per weight, giving 3³ = 27 levels.
  • Group Size: 16 weights per scaling group.
  • Codec: Utilizes the tritllm v2 codec, available in the Entrit/tritllm-codec repository.
  • FP16 Preservation: Key components like lm_head, token embeddings, and all *_norm layers remain in FP16 to preserve model integrity.

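As a concrete illustration of the scheme described above, here is a minimal NumPy sketch of uniform group quantization to 27 symmetric levels with one shared scale per group. The function names and the max-abs scale rule are illustrative assumptions; the actual tritllm v2 codec may differ:

```python
import numpy as np

def quantize_group(w, d=3):
    # Map a group of weights to integer codes in [-13, 13] (3**d = 27
    # levels for d = 3) with one shared scale per group.  Sketch only;
    # the real codec's scale format is not specified on this card.
    levels = (3 ** d - 1) // 2            # 13 for d = 3
    scale = float(np.abs(w).max()) / levels
    if scale == 0.0:
        scale = 1.0                       # all-zero group: any scale works
    codes = np.clip(np.round(w / scale), -levels, levels).astype(np.int8)
    return codes, np.float16(scale)

def dequantize_group(codes, scale):
    # Reconstruct floating-point weights, as done for standard
    # transformers use of this checkpoint.
    return codes.astype(np.float32) * np.float32(scale)
```

With group size 16, each group stores 16 three-trit codes plus one scale, and the rounding error per weight is bounded by half the group scale.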
Performance and Use Cases

The on-disk checkpoint matches the FP16 source in size, because the weights are stored dequantized for transformers compatibility. The 5.05 BPW figure matters for hardware designed to consume packed trit formats directly, where it translates into a smaller memory footprint and faster inference; on standard hardware the model runs like an ordinary FP16 checkpoint, with no specialized support required. For full evaluation results and technical specifics, refer to the associated paper.
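A packed trit format of the kind alluded to above can be sketched as follows: five trits fit in one byte (3⁵ = 243 ≤ 256), i.e. 1.6 bits per trit, so three trits per weight cost 4.8 bits before scale overhead. This little-endian base-3 layout is an illustrative assumption, not the actual tritllm v2 wire format:

```python
def pack_trits(trits):
    # Pack trits in {-1, 0, 1} five-per-byte as base-3 digits
    # (3**5 = 243 fits in one byte).  Illustrative layout only.
    out = bytearray()
    for i in range(0, len(trits), 5):
        val = 0
        for t in reversed(trits[i:i + 5]):
            val = val * 3 + (t + 1)       # shift to unsigned digit {0, 1, 2}
        out.append(val)
    return bytes(out)

def unpack_trits(data, n):
    # Inverse of pack_trits: recover the first n trits.
    trits = []
    for b in data:
        for _ in range(5):
            trits.append(b % 3 - 1)
            b //= 3
    return trits[:n]
```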