Entrit/Qwen2.5-72B-trit-uniform-d2

Text generation · 72.7B parameters · 32k context · Transformer · Open weights · Apache-2.0 · Published: Apr 26, 2026

Entrit/Qwen2.5-72B-trit-uniform-d2 is a 72.7-billion-parameter causal language model: a balanced-ternary quantized version of Qwen/Qwen2.5-72B. It uses uniform post-training quantization (PTQ) at depth d=2, achieving 3.47 bits per weight (bpw). The model targets efficient inference on hardware that can consume packed trit formats directly, offering significant memory and compute savings for large language model deployment.


Model Overview

Entrit/Qwen2.5-72B-trit-uniform-d2 is a 72.7-billion-parameter language model derived from Qwen/Qwen2.5-72B, featuring balanced-ternary post-training quantization (PTQ). This quantization scheme, based on research by Stentzel (2026), reduces the information content of the model's 2D linear matrices to 3.47 bits per weight, using 9 levels per weight (depth d=2, i.e. 3² = 9 levels, encodable as two balanced trits per weight).
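The mapping onto the 9-level grid is plain uniform quantization. The sketch below illustrates the idea in PyTorch; the function names, the `group_size` parameter, and the per-group max-abs scale selection are assumptions for illustration, not the actual Entrit/tritllm-codec implementation.

```python
import torch

def quantize_uniform_d2(w: torch.Tensor, group_size: int = 128):
    """Uniformly quantize weights to the 9 balanced levels {-4, ..., 4}.

    Illustrative sketch only: the real codec's grouping and scale
    selection may differ. Assumes w.numel() is divisible by group_size.
    """
    groups = w.reshape(-1, group_size)
    # One scale per group, chosen so the largest magnitude maps to level 4.
    scale = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 4.0
    q = torch.clamp(torch.round(groups / scale), -4, 4)
    return q.to(torch.int8).reshape(w.shape), scale

def dequantize_uniform_d2(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate FP16 weights from the quantized levels."""
    groups = q.reshape(scale.shape[0], -1).to(torch.float32)
    return (groups * scale).reshape(q.shape).to(torch.float16)
```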

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-72B
  • Quantization Method: Uniform PTQ
  • Bits per Weight: 3.47 bpw (information content)
  • Quantized Layers: All 2D linear matrices
  • FP16 Layers: lm_head, token embeddings, and *_norm layers remain in FP16 for compatibility and performance; see the layer-selection sketch below.
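A sketch of the layer selection implied by the list above, using standard PyTorch module introspection. The filter function and its name are hypothetical; consult the tritllm-codec repository for the actual selection logic.

```python
from torch import nn

def is_quantization_target(name: str, module: nn.Module) -> bool:
    """True for 2D linear matrices; lm_head, token embeddings, and
    *_norm layers stay in FP16 per the model card. Hypothetical filter."""
    if "lm_head" in name:
        return False
    # Embeddings and norm layers are not nn.Linear, so they fall through here.
    return isinstance(module, nn.Linear) and module.weight.ndim == 2

# Usage (model: a loaded Qwen2.5 checkpoint):
# targets = [n for n, m in model.named_modules() if is_quantization_target(n, m)]
```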

Unique Characteristics

While the on-disk size remains similar to the FP16 source (the published weights are stored dequantized for standard transformers compatibility), the 3.47-bpw figure is what matters for hardware acceleration: the model is designed for specialized hardware that can process the packed trit format directly, enabling more efficient inference than with traditional FP16 weights. The quantization follows the standard convention of targeting the 2D linear matrices, which account for the bulk of the model's memory footprint.
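The savings come from the packed representation: each 9-level value q splits into two balanced trits via q = 3·t1 + t0 with t1, t0 ∈ {-1, 0, 1}, and trits pack densely into bytes. The scheme below (five trits per byte, since 3^5 = 243 ≤ 256) is one common choice, assumed here purely for illustration; the codec's actual on-disk layout may differ.

```python
import numpy as np

def to_trits(q: np.ndarray) -> np.ndarray:
    """Split 9-level values in {-4, ..., 4} into two balanced trits each:
    q = 3*t1 + t0, with t1, t0 in {-1, 0, 1}."""
    t0 = ((q + 1) % 3) - 1
    t1 = (q - t0) // 3
    return np.stack([t1, t0], axis=-1).reshape(-1)

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack balanced trits five-per-byte (3**5 = 243 fits in a uint8).
    Illustrative layout only, not necessarily the codec's format."""
    u = (trits + 1).astype(np.uint16)              # {-1,0,1} -> {0,1,2}
    u = np.pad(u, (0, (-len(u)) % 5))              # pad to a multiple of 5
    digits = np.array([81, 27, 9, 3, 1], dtype=np.uint16)
    return (u.reshape(-1, 5) * digits).sum(axis=1).astype(np.uint8)

def unpack_trits(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_trits; returns the first n trits."""
    vals = packed.astype(np.int16)
    out = np.empty((len(vals), 5), dtype=np.int8)
    for i, d in enumerate([81, 27, 9, 3, 1]):
        out[:, i] = vals // d - 1                  # back to {-1,0,1}
        vals = vals % d
    return out.reshape(-1)[:n]
```

Five-per-byte packing is nearly optimal: it spends 8/5 = 1.6 bits per trit against the theoretical log2(3) ≈ 1.585, wasting only about 0.07 bits per byte.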

Reproducibility

The quantization process is fully reproducible using the Entrit/tritllm-codec repository, allowing users to verify the methodology and results.
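Because the published tensors are dequantized back to a standard transformers-compatible layout (see above), loading should work like any other Qwen2.5 checkpoint. A hedged example using the stock transformers API, assuming the repo ships standard safetensors and that accelerate is installed for device_map:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-72B-trit-uniform-d2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # weights ship dequantized to FP16 per the card
    device_map="auto",          # shard the 72.7B parameters across available GPUs
)

inputs = tokenizer("Balanced ternary is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```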