Entrit/Qwen2.5-32B-trit-uniform-d2

Text Generation · Concurrency Cost: 2 · Model Size: 32.8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 26, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Entrit/Qwen2.5-32B-trit-uniform-d2 is a 32.8 billion parameter language model from Entrit, based on Qwen/Qwen2.5-32B, featuring balanced ternary post-training quantization. It uses a uniform quantization method at depth d=2, resulting in 3.47 bits per weight for its 2D linear matrices. This model is optimized for efficient inference on hardware that can directly consume its packed trit format, offering a compact representation of the original Qwen2.5-32B.
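As a quick sanity check on the level count (a sketch, not part of the model card's tooling): depth d=2 means two trits per weight, giving 3² = 9 representable levels. Note that the raw information content of two trits is d·log₂(3) ≈ 3.17 bits; the quoted 3.47 bpw is the model card's figure and presumably also accounts for scales or other metadata in the packed format.

```python
import math

d = 2                        # balanced-ternary depth: two trits per weight
levels = 3 ** d              # 9 representable values per weight: -4 ... +4
raw_bits = d * math.log2(3)  # raw information content of two trits

print(levels)                # 9
print(round(raw_bits, 2))    # 3.17
```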


Overview

Entrit/Qwen2.5-32B-trit-uniform-d2 is a 32.8 billion parameter large language model derived from Qwen/Qwen2.5-32B. Developed by Entrit, this model implements a balanced ternary post-training quantization (PTQ) scheme, as detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-32B
  • Quantization Method: Uniform balanced-ternary PTQ at depth d=2 (two trits per weight), yielding 3² = 9 levels per weight.
  • Bits per Weight: Achieves an information content of 3.47 bits per weight for quantized matrices.
  • Quantized Components: All 2D linear matrices are ternary-quantized.
  • FP16 Components: lm_head, token embeddings, and all *_norm layers remain in FP16 for compatibility and performance.
  • Codec: Utilizes the tritllm v2 codec, available in the Entrit/tritllm-codec repository.
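The depth-2 scheme above can be sketched as follows. This is a hypothetical illustration, not the tritllm v2 codec: each weight is rounded to the nearest of 9 uniform levels (an integer q in [-4, 4] times a per-tensor scale), and q is then split into two balanced-ternary digits t1, t0 ∈ {-1, 0, 1} with q = 3·t1 + t0.

```python
def quantize_trit_d2(weights):
    """Sketch of uniform balanced-ternary PTQ at depth d=2 (assumed scheme).

    Maps each weight to an integer q in [-4, 4] via a per-tensor scale,
    then decomposes q into two trits (t1, t0) with q = 3*t1 + t0.
    """
    scale = max(abs(w) for w in weights) / 4 or 1.0
    trits, dequant = [], []
    for w in weights:
        q = max(-4, min(4, round(w / scale)))
        t0 = ((q + 1) % 3) - 1          # least-significant trit in {-1, 0, 1}
        t1 = (q - t0) // 3              # most-significant trit in {-1, 0, 1}
        trits.append((t1, t0))
        dequant.append((3 * t1 + t0) * scale)
    return scale, trits, dequant

scale, trits, deq = quantize_trit_d2([0.9, -0.31, 0.05, -0.88])
print(trits)  # [(1, 1), (0, -1), (0, 0), (-1, -1)]
```

The per-tensor max-abs scale here is a simplifying assumption; real PTQ pipelines typically use per-group or per-channel scales chosen to minimize quantization error.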

Performance and Usage

Because the weights are stored dequantized for compatibility with standard transformers tooling, the on-disk size matches the FP16 source; the 3.47-bpw figure applies to inference on specialized hardware that consumes the packed trit format directly. The model targets scenarios where a reduced memory footprint and potentially faster inference (on compatible hardware) are critical, without significant loss in capability relative to its FP16 base.
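The packed trit layout itself is not documented here. One plausible packing, shown purely for illustration (the actual tritllm v2 layout may differ), stores five trits per byte, since 3⁵ = 243 ≤ 256:

```python
def pack_trits(trits):
    """Pack trits (values in {-1, 0, 1}) five to a byte: 3**5 = 243 <= 256.

    Illustrative sketch only; the real tritllm v2 packed format may differ.
    """
    out = bytearray()
    for i in range(0, len(trits), 5):
        val = 0
        for t in reversed(trits[i:i + 5]):  # base-3, first trit ends up lowest
            val = val * 3 + (t + 1)         # shift {-1, 0, 1} to {0, 1, 2}
        out.append(val)
    return bytes(out)

def unpack_trits(data, n):
    """Inverse of pack_trits; n is the original trit count."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```

At five trits per byte this costs 1.6 bits per trit, close to the log₂(3) ≈ 1.585-bit information content of a trit.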