Entrit/Qwen2.5-3B-trit-uniform-d4

Text Generation · Model Size: 3.1B · Quant: BF16 · Context Length: 32k · Published: May 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Entrit/Qwen2.5-3B-trit-uniform-d4 is a 3.1 billion parameter Qwen2.5-3B model developed by Entrit Systems, featuring balanced ternary post-training quantization at a depth of d=4, for an effective 6.64 bits per weight. The model is optimized for inference on hardware that can consume its packed trit format directly, which carries far less information per weight than the FP16 source. It is suited to deployments where a small memory footprint and fast inference matter most.


Entrit/Qwen2.5-3B-trit-uniform-d4: Quantized Qwen2.5-3B Model

This model is a balanced ternary post-training quantized version of the original Qwen/Qwen2.5-3B model, developed by Entrit Systems. It utilizes a depth of d=4, which translates to 81 levels per weight and an effective 6.64 bits per weight. This quantization significantly reduces the information content of the model's weights, making it highly efficient for specialized inference hardware.
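As a quick sanity check on those numbers: depth d yields 3^d representable levels, and each trit carries log2(3) ≈ 1.58 bits of raw information. A minimal sketch follows; note that the 6.64 bpw quoted above is the card's effective rate, and reading the gap over the raw ~6.34 bits as packing and scale-metadata overhead is our assumption, not something the card states.

```python
import math

d = 4                        # trit depth used by this model
levels = 3 ** d              # 81 representable levels per weight
raw_bits = d * math.log2(3)  # ~6.34 bits of raw trit information

print(f"{levels} levels, {raw_bits:.2f} raw bits/weight")
# The card's effective 6.64 bpw sits above this raw figure; we assume
# the difference covers packing and per-group scale metadata.
```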

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-3B
  • Quantization Method: Uniform Post-Training Quantization (PTQ); see the sketch after this list
  • Depth: d=4 (81 levels)
  • Bits per Weight: 6.64
  • Quantized Layers: All 2D linear matrices
  • FP16 Kept: lm_head, token embeddings, and all *_norm layers
  • Codec: tritllm v2, as detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" by Stentzel (2026).
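To make the uniform-PTQ and balanced-ternary pieces of the list above concrete, here is a minimal sketch: weights are snapped to the nearest of the 81 zero-symmetric levels, and each level index is encoded as four balanced trits. This is an illustration only, not the tritllm v2 codec; the function names and the per-tensor scale are our assumptions.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, depth: int = 4):
    """Snap FP weights to the nearest of 3**depth uniformly spaced,
    zero-symmetric levels. Per-tensor max scaling is an assumption;
    the real codec may use finer-grained scales."""
    half = (3 ** depth - 1) // 2                  # 40 for depth=4
    scale = np.abs(w).max() / half
    idx = np.clip(np.round(w / scale), -half, half).astype(np.int8)
    return idx, scale                             # dequantize: idx * scale

def to_balanced_trits(n: int, depth: int = 4) -> list[int]:
    """Encode a level index in [-(3**depth - 1)//2, +(3**depth - 1)//2]
    as `depth` balanced trits in {-1, 0, +1}, least significant first."""
    trits = []
    for _ in range(depth):
        r = n % 3          # Python's % returns 0, 1, or 2
        if r == 2:         # remap 2 -> -1 and carry into the next trit
            r = -1
        n = (n - r) // 3
        trits.append(r)
    return trits

def from_balanced_trits(trits: list[int]) -> int:
    """Decode least-significant-first balanced trits back to an integer."""
    return sum(t * 3 ** i for i, t in enumerate(trits))

# Round-trip check over the full depth-4 range [-40, 40].
assert all(from_balanced_trits(to_balanced_trits(n)) == n for n in range(-40, 41))
```

The four trits per weight can then be bit-packed for hardware that consumes the trit stream directly; one common layout stores five trits per byte, since 3^5 = 243 fits in 256 values, though whether tritllm v2 uses that layout is not stated here.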

Performance and Use Cases

While the on-disk size remains similar to the FP16 source, because the weights are stored dequantized for stock transformers compatibility, the 6.64-bpw figure is what matters for inference on hardware designed to consume the packed trit format directly. The model targets scenarios where memory footprint and inference speed are critical, especially when paired with compatible kernels such as Entrit/tritllm-kernel. It offers a path to more efficient deployment of large language models without a significant loss in quality.
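
Because the published checkpoint stores dequantized weights, it should load like any stock Qwen2.5 model through transformers. A minimal sketch follows; the prompt and generation settings are placeholders, and packed-trit inference through Entrit/tritllm-kernel is not shown since its API is not documented here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-3B-trit-uniform-d4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint is stored at full precision
    device_map="auto",
)

inputs = tokenizer("Balanced ternary quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```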