Entrit/Qwen2.5-7B-trit-uniform-d1

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: May 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Warm

Entrit/Qwen2.5-7B-trit-uniform-d1 is a 7.6 billion parameter language model produced by Entrit Systems through balanced ternary post-training quantization of Qwen/Qwen2.5-7B. It uses a depth-1 quantization scheme with an information content of 1.88 bits per weight, far below the 16 bits per weight of the FP16 source. The model targets efficient inference on hardware that can consume packed trit formats. To stay compatible with the standard transformers library, the weights are stored on disk dequantized to FP16. The reduced memory footprint and faster processing therefore apply to deployments on specialized ternary hardware, not to the FP16 checkpoint itself.

Entrit/Qwen2.5-7B-trit-uniform-d1: Balanced Ternary Quantization

This model is a 7.6 billion parameter variant of Qwen2.5-7B, developed by Entrit Systems. It applies balanced ternary post-training quantization (PTQ) at depth 1, so each quantized weight takes one of 3 levels, giving an information content of 1.88 bits per weight. The quantization is based on research presented in "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
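As a rough illustration of what these figures imply, the arithmetic below compares an FP16 footprint with a hypothetical fully packed footprint at 1.88 bits per weight. It assumes every one of the 7.6 billion parameters is stored at the quoted rate, which overstates the saving because some layers stay in FP16. The card does not say how 1.88 is derived; for reference, a single three-level symbol carries at most log2(3) ≈ 1.585 bits, so the quoted figure presumably includes some overhead (an assumption, not something the card states).

```python
import math

PARAMS = 7.6e9        # parameter count quoted in the card
BPW_TERNARY = 1.88    # bits per weight quoted in the card
BPW_FP16 = 16.0       # bits per weight of the FP16 source


def size_gb(params, bits_per_weight):
    """Size in gigabytes (1e9 bytes) if every parameter used this many bits."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = size_gb(PARAMS, BPW_FP16)
trit_gb = size_gb(PARAMS, BPW_TERNARY)
ratio = BPW_FP16 / BPW_TERNARY
trit_entropy_bound = math.log2(3)  # max information in one 3-level symbol

print(f"FP16 footprint:      {fp16_gb:.2f} GB")
print(f"1.88 bpw footprint:  {trit_gb:.2f} GB (idealized, all weights packed)")
print(f"Reduction factor:    {ratio:.2f}x")
print(f"log2(3):             {trit_entropy_bound:.3f} bits per trit")
```

Under these assumptions the packed weights would occupy roughly an eighth of the FP16 size; the real saving depends on the packed format and on the layers left unquantized.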

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-7B
  • Quantization Depth: d=1 (3 levels per weight)
  • Bits per Weight: 1.88
  • Quantization Method: Uniform PTQ, applied to all 2D linear matrices.
  • Exclusions: lm_head, token embeddings, and all *_norm layers remain in FP16.
  • Codec: tritllm v2, used to perform the quantization.
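
To make the idea of a three-level weight concrete, here is a minimal NumPy sketch of a generic threshold-based ternarizer with one scale per row, plus a simple five-trits-per-byte packing. This is not the tritllm v2 codec and is not taken from the cited paper: the function names, the `threshold_ratio` parameter, the per-row scale, and the packing layout are all hypothetical choices made for illustration.

```python
import numpy as np

# Place values for base-3 packing: five trits fit in one byte (3**5 = 243 <= 256).
_POWERS = np.array([1, 3, 9, 27, 81], dtype=np.uint16)


def ternarize(w, threshold_ratio=0.7):
    """Approximate each row of w as scale * t with t in {-1, 0, +1}.

    Hypothetical rule: weights whose magnitude is below
    threshold_ratio * mean(|row|) become 0; the rest keep their sign.
    """
    w = np.asarray(w, dtype=np.float32)
    delta = threshold_ratio * np.abs(w).mean(axis=1, keepdims=True)
    trits = np.where(w > delta, 1, np.where(w < -delta, -1, 0)).astype(np.int8)
    # Per-row scale: mean magnitude of the weights that were not zeroed.
    mask = trits != 0
    count = np.maximum(mask.sum(axis=1, keepdims=True), 1)
    scale = (np.abs(w) * mask).sum(axis=1, keepdims=True) / count
    return trits, scale.astype(np.float32)


def dequantize(trits, scale):
    """Expand trits back to FP16, as a transformers-compatible checkpoint would store them."""
    return (trits.astype(np.float32) * scale).astype(np.float16)


def pack_trits(trits):
    """Pack a trit array into bytes, five trits per byte."""
    digits = (trits.reshape(-1) + 1).astype(np.uint16)  # {-1,0,1} -> {0,1,2}
    pad = (-digits.size) % 5
    digits = np.concatenate([digits, np.zeros(pad, dtype=np.uint16)])
    return (digits.reshape(-1, 5) * _POWERS).sum(axis=1).astype(np.uint8)


def unpack_trits(packed, n):
    """Recover the first n trits from bytes produced by pack_trits."""
    digits = (packed.astype(np.uint16)[:, None] // _POWERS) % 3
    return digits.reshape(-1)[:n].astype(np.int8) - 1


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 13)).astype(np.float32)
trits, scale = ternarize(w)
packed = pack_trits(trits)
restored = unpack_trits(packed, trits.size).reshape(trits.shape)
w_fp16 = dequantize(restored, scale)
```

Packing five trits into a byte costs 1.6 bits per weight before any scale overhead; the card's own packed format may differ.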

Performance and Compatibility

Because the weights are stored dequantized to FP16 for transformers compatibility, the checkpoint's on-disk size matches that of its FP16 source; the 1.88-bpw figure describes information content, not file size. The potential memory and compute savings apply to inference on specialized hardware that can process packed trit formats directly. With the standard transformers library, the model loads and runs like any other FP16 checkpoint.
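
A minimal loading sketch with the transformers library follows. It assumes the repository ID matches the model name above and that the checkpoint loads through the generic `AutoModelForCausalLM` and `AutoTokenizer` classes, as the card's compatibility claim suggests; the helper names `load_model` and `generate` are hypothetical. The heavy imports sit inside the function so that defining the helpers does not require downloading anything.

```python
MODEL_ID = "Entrit/Qwen2.5-7B-trit-uniform-d1"  # assumed Hub repository ID


def load_model(model_id=MODEL_ID):
    """Load tokenizer and model as an ordinary FP16 checkpoint."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    # Use a GPU when one is available; FP16 inference on CPU can be slow.
    if torch.cuda.is_available():
        model = model.to("cuda")
    return tokenizer, model


def generate(prompt, max_new_tokens=32):
    """Generate a short continuation for prompt."""
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Balanced ternary is")` downloads the full FP16-sized checkpoint on first use, so it needs roughly the same disk space and memory as the original Qwen2.5-7B weights.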