Entrit/Qwen2.5-3B-trit-uniform-d3

Text generation · Model size: 3.1B · Quant: BF16 · Context length: 32k · Published: May 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Entrit/Qwen2.5-3B-trit-uniform-d3 is a 3.1 billion parameter language model based on the Qwen2.5-3B architecture, featuring balanced ternary post-training quantization. The model uses a quantization depth of d=3 (three trits per weight), giving 27 levels per weight and an information content of 5.05 bits per weight. Developed by Entrit Systems, this quantization scheme targets efficient inference on hardware that can consume packed trit formats directly, offering a compact representation of the original Qwen2.5-3B model for resource-constrained environments.


Model Overview

Entrit/Qwen2.5-3B-trit-uniform-d3 is a quantized version of the Qwen/Qwen2.5-3B large language model, developed by Entrit Systems. This model implements a balanced ternary post-training quantization (PTQ) scheme, as detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" by Eric Stentzel (2026).

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-3B
  • Quantization Method: Uniform PTQ with a depth of d=3, yielding 27 levels per weight.
  • Bits per Weight: Achieves an information content of 5.05 bits per weight.
  • Codec: Utilizes tritllm v2 for quantization, with the source available at Entrit/tritllm-codec.
  • Quantized Layers: All 2D linear matrices are quantized.
  • FP16 Layers: lm_head, token embeddings, and all *_norm layers are kept in FP16 precision.
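To make the scheme above concrete, here is a minimal NumPy sketch of uniform balanced-ternary PTQ at depth d=3: weights are scaled per tensor, rounded to the 27 integer levels -13..13, and each level is decomposed into three balanced trits. This is an illustrative re-implementation under stated assumptions, not the tritllm v2 codec; all function names are hypothetical.

```python
import numpy as np

def quantize_balanced_ternary(w, d=3):
    """Map a weight tensor to 3**d balanced-ternary levels (27 for d=3).

    Levels are the integers -(3**d - 1)//2 .. +(3**d - 1)//2, scaled
    per-tensor by the max absolute weight.
    """
    half_range = (3**d - 1) // 2                      # 13 for d=3
    scale = float(np.max(np.abs(w))) / half_range
    q = np.clip(np.round(w / scale), -half_range, half_range).astype(np.int32)
    return q, scale

def to_trits(q, d=3):
    """Decompose each quantized integer into d balanced trits in {-1, 0, 1}."""
    trits = np.empty(q.shape + (d,), dtype=np.int8)
    v = q.astype(np.int64)
    for i in range(d):
        r = ((v + 1) % 3) - 1                         # balanced remainder
        trits[..., i] = r
        v = (v - r) // 3
    return trits                                      # q == sum(trits[i] * 3**i)

def dequantize(q, scale):
    """Recover approximate FP weights from quantized levels."""
    return q.astype(np.float32) * scale
```

The rounding step bounds the per-weight reconstruction error by half a level, i.e. `scale / 2` for the per-tensor scale used here; the actual codec may use finer-grained (e.g. per-channel) scales.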

Performance and Use Cases

While the on-disk size of this model is equivalent to the FP16 source due to dequantization for standard transformers compatibility, its 5.05-bpw figure is crucial for inference on specialized hardware that can directly process the packed trit format. This makes the model particularly relevant for scenarios requiring reduced memory footprint and potentially faster inference when deployed with compatible hardware and kernels (e.g., Entrit/tritllm-kernel). It is ideal for developers exploring efficient deployment of LLMs in resource-constrained environments or those interested in advanced quantization techniques.
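As a rough illustration of what a packed trit format can look like (an assumption for exposition, not the tritllm v2 wire format), five balanced trits fit in a single byte, since 3**5 = 243 ≤ 256:

```python
import numpy as np

def pack_trits(trits):
    """Pack balanced trits {-1, 0, 1} into bytes, 5 trits per byte.

    Illustrative packing only; the real tritllm v2 format may differ.
    """
    t = np.asarray(trits, dtype=np.int64).ravel() + 1  # shift to {0, 1, 2}
    pad = (-t.size) % 5                                # pad to a multiple of 5
    t = np.concatenate([t, np.zeros(pad, dtype=np.int64)])
    powers = 3 ** np.arange(5)                         # base-3 place values
    return (t.reshape(-1, 5) @ powers).astype(np.uint8)

def unpack_trits(packed, n):
    """Recover the first n balanced trits from a packed byte array."""
    digits = (packed.astype(np.int64)[:, None] // (3 ** np.arange(5))) % 3
    return digits.ravel()[:n] - 1                      # shift back to {-1, 0, 1}
```

Hardware or kernels that decode this kind of representation on the fly avoid ever materializing the FP16 weights, which is where the memory-footprint benefit comes from.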