Entrit/Qwen2.5-72B-trit-uniform-d3

Text Generation | Concurrency Cost: 4 | Model Size: 72.7B | Quant: FP8 | Ctx Length: 32k | Published: Apr 26, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

Entrit/Qwen2.5-72B-trit-uniform-d3 is a 72.7 billion parameter language model: a balanced ternary, post-training quantized version of Qwen/Qwen2.5-72B, developed by Entrit Systems. It uses uniform quantization at depth d=3 and reaches 5.05 bits per weight for its 2D linear matrices. The packed trit format is intended for hardware that can consume it directly, while the released weights remain compatible with the standard transformers library for dequantized FP16 use.


Entrit/Qwen2.5-72B-trit-uniform-d3: Balanced Ternary Quantization

This model is a 72.7 billion parameter variant of the Qwen/Qwen2.5-72B base model, developed by Entrit Systems. It applies balanced ternary post-training quantization (PTQ), in which each quantized weight is stored as a short sequence of trits (digits in {-1, 0, +1}) rather than as a floating-point value. The goal is a smaller weight representation and more efficient inference, particularly on specialized hardware. The quantization process is described in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-72B.
  • Quantization Method: Uniform PTQ with a depth of d=3 (three trits per weight), giving 3^3 = 27 levels per weight.
  • Bits per Weight (BPW): 5.05 BPW of information content for the quantized matrices.
  • Quantized Layers: All 2D linear weight matrices are ternary-quantized.
  • FP16 Layers: lm_head, token embeddings, and all *_norm layers are kept in FP16 to preserve critical model components.
  • Codec: Utilizes the tritllm v2 codec, available in the Entrit/tritllm-codec repository.
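
The details above can be made concrete with a small sketch of depth-3 balanced ternary coding. It is not the tritllm v2 codec: the helper names (`int_to_trits`, `trits_to_int`, `quantize_uniform`, `dequantize`) and the single absmax scale per group are assumptions made for illustration only. Three trits carry log2(27) ≈ 4.75 bits, so the reported 5.05 BPW is somewhat higher; that gap would be consistent with per-group metadata such as scales, although the card does not say where it comes from.

```python
import math

DEPTH = 3                     # trits per weight (d=3, from the model card)
LEVELS = 3 ** DEPTH           # 27 representable levels
MAX_CODE = (LEVELS - 1) // 2  # integer codes span -13 .. +13


def int_to_trits(code: int, depth: int = DEPTH) -> list[int]:
    """Balanced ternary digits of `code`, least significant first, each in {-1, 0, 1}."""
    limit = (3 ** depth - 1) // 2
    if not -limit <= code <= limit:
        raise ValueError(f"code {code} does not fit in {depth} trits")
    trits = []
    for _ in range(depth):
        r = code % 3          # Python's % returns 0, 1 or 2, also for negatives
        if r == 2:
            r = -1            # a digit of 2 becomes -1 with a carry into the next trit
        trits.append(r)
        code = (code - r) // 3
    return trits


def trits_to_int(trits: list[int]) -> int:
    """Inverse of int_to_trits."""
    return sum(t * 3 ** i for i, t in enumerate(trits))


def quantize_uniform(weights: list[float], depth: int = DEPTH):
    """Hypothetical uniform quantizer: one absmax scale shared by the whole group."""
    limit = (3 ** depth - 1) // 2
    absmax = max(abs(w) for w in weights)
    scale = absmax / limit if absmax > 0 else 1.0
    codes = [max(-limit, min(limit, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Raw information content of one 3-trit weight, before any metadata overhead.
bits_per_code = math.log2(LEVELS)
```

With this scheme, the reconstruction error of each weight is bounded by half a quantization step (`scale / 2`), which is the usual property of a uniform grid.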

Usage and Performance

The released checkpoint is stored dequantized so that it loads with the standard transformers library, which means its on-disk size is similar to that of the FP16 source. The 5.05 BPW figure describes the packed trit format instead: on hardware that can process that format directly, the quantized matrices need roughly a third of the memory of FP16 (5.05 versus 16 bits per weight). This makes the model a candidate for deployments that need a smaller memory footprint and faster computation, with what the card describes as minimal accuracy loss compared to its full-precision counterpart.
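
As a rough illustration of the two usage modes, the sketch below estimates the weight footprint at FP16 and at 5.05 BPW, and shows how the dequantized checkpoint might be loaded. The footprint numbers are an upper-bound simplification that treats all 72.7B parameters as quantized, whereas the card keeps lm_head, embeddings, and norm layers in FP16. The `load_dequantized` helper is hypothetical and assumes the repository loads through the usual `AutoModelForCausalLM` path with `torch`, `transformers`, and `accelerate` installed.

```python
PARAMS = 72.7e9   # total parameter count, from the model card
BPW_FP16 = 16.0   # bits per weight for the dequantized checkpoint
BPW_TRIT = 5.05   # reported bits per weight for the packed trit format


def size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = size_gb(PARAMS, BPW_FP16)   # about 145 GB
trit_gb = size_gb(PARAMS, BPW_TRIT)   # about 46 GB if every weight were packed


def load_dequantized(repo_id: str = "Entrit/Qwen2.5-72B-trit-uniform-d3"):
    """Load the dequantized FP16 weights; needs enough memory for a 72B model."""
    # Imports are local so the size estimate above runs without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.float16,
        device_map="auto",  # requires the accelerate package
    )
    return tokenizer, model
```

Loaded this way, the model behaves like any FP16 causal language model; the memory savings only materialize on a runtime that keeps the weights in packed trit form.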