Entrit/Qwen2.5-3B-trit-uniform-d2

Text Generation · Model Size: 3.1B · Quant: BF16 · Ctx Length: 32k · Published: May 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Entrit/Qwen2.5-3B-trit-uniform-d2 is a 3.1 billion parameter Qwen2.5-3B model that has undergone balanced ternary post-training quantization (PTQ) by Entrit Systems. The quantization reduces the model's weights to an information content of 3.47 bits per weight, using 9 levels per weight at a depth of d=2. The model is optimized for efficient inference on hardware that can consume the packed trit format, making it suitable for resource-constrained environments.

Entrit/Qwen2.5-3B-trit-uniform-d2: Balanced Ternary Quantization

This model is a 3.1 billion parameter variant of the Qwen2.5-3B architecture, developed by Entrit Systems. It features a balanced ternary post-training quantization (PTQ) scheme, reducing its weights to an information content of 3.47 bits per weight with 9 levels per weight (depth d=2). This quantization is based on the codec described in "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
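
For intuition, the following is a minimal sketch of what a uniform balanced ternary codec at depth d=2 does; it is illustrative only and not the tritllm-codec implementation, and the function names are hypothetical. Each weight is snapped to one of 9 integer levels in {-4, ..., 4}, and each level decomposes exactly into two balanced ternary digits (trits) in {-1, 0, 1}. The per-tensor scale used here is an assumption; the actual codec may use finer-grained scaling.

```python
import torch

def quantize_uniform_d2(w: torch.Tensor, eps: float = 1e-8):
    """Sketch of uniform balanced ternary PTQ at depth d=2 (9 levels).

    Levels are integers q in {-4, ..., 4}; each q decomposes into two
    trits (t1, t0) with q = 3*t1 + t0 and t1, t0 in {-1, 0, 1}.
    """
    # Per-tensor scale mapping the largest weight magnitude to level 4
    # (assumed here for simplicity; real codecs often scale per group).
    scale = w.abs().max().clamp_min(eps) / 4.0
    q = torch.round(w / scale).clamp(-4, 4)

    # Balanced ternary decomposition: high trit, then low trit.
    t1 = torch.round(q / 3.0).to(torch.int8)  # values in {-1, 0, 1}
    t0 = q.to(torch.int8) - 3 * t1            # values in {-1, 0, 1}
    return t1, t0, scale

def dequantize_uniform_d2(t1, t0, scale):
    """Reconstruct dense weights from the two trit planes."""
    return (3 * t1 + t0).float() * scale
```

Note that two trits carry log2(9) ≈ 3.17 bits of information per weight; the 3.47-bpw figure quoted above presumably also accounts for packing and scale metadata in the stored trit format.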

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-3B
  • Quantization Method: Uniform PTQ
  • Bits per Weight: 3.47
  • Depth: d=2 (9 levels)
  • Quantized Layers: All 2D linear matrices
  • FP16 Kept: lm_head, token embeddings, and all *_norm layers (see the selection sketch after this list)
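
As a rough illustration of the split above (a sketch, not the Entrit tooling; the helper name is hypothetical), the selection rule can be expressed over a Hugging Face module tree: quantize every 2-D nn.Linear weight, and leave lm_head, the embeddings, and the norm layers untouched.

```python
import torch.nn as nn

def select_quantizable(model: nn.Module):
    """Yield (name, module) pairs targeted by the ternary PTQ scheme."""
    for name, module in model.named_modules():
        # Only 2-D linear weight matrices are quantized.
        if isinstance(module, nn.Linear) and module.weight.dim() == 2:
            # lm_head is an nn.Linear but stays in FP16 per the card.
            if name.endswith("lm_head"):
                continue
            yield name, module
    # Token embeddings and *_norm modules are not nn.Linear instances,
    # so they fall outside this rule and remain in FP16 automatically.
```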

While the on-disk size remains similar to the FP16 source (the weights are stored dequantized for stock transformers compatibility), the 3.47-bpw figure reflects the efficiency achievable on specialized hardware that can consume the packed trit format directly. The tritllm-codec and tritllm-kernel projects provide the codec and inference kernels underlying this quantization.
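
Since the checkpoint ships dequantized weights, it should load like any other Qwen2.5 checkpoint with stock transformers. A minimal, untested usage sketch under that assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-3B-trit-uniform-d2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Balanced ternary quantization is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```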

Good for

  • Deploying Qwen2.5-3B on specialized hardware in environments that require a reduced memory footprint and faster inference.
  • Research and development in efficient LLM quantization techniques, particularly balanced ternary methods.
  • Applications where minor quantization-induced performance degradation is an acceptable trade for reduced model size and faster inference.