Entrit/Qwen2.5-0.5B-trit-uniform-d4

Text generation · 0.5B parameters · BF16 checkpoint · 32k context length · Published: May 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Entrit/Qwen2.5-0.5B-trit-uniform-d4 is a 0.5 billion parameter Qwen2.5-based causal language model developed by Entrit. It is a quantized version of the original Qwen2.5-0.5B, using balanced ternary post-training quantization at a depth of 4, resulting in 6.64 bits per weight, and it is optimized for efficient inference on hardware and kernels that can directly consume its packed trit format.
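
The packed trit layout used by tritllm is not documented here; purely as an illustration of why packing trits is compact, the sketch below packs five balanced trits into a single byte (3^5 = 243 ≤ 256, about 1.6 bits per trit), compared with 16 bits per weight in the BF16 checkpoint. The function names and the five-trits-per-byte grouping are illustrative assumptions, not the actual tritllm format.

```python
import numpy as np

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack balanced trits (-1/0/+1) into bytes, 5 trits per byte (3**5 = 243 <= 256).

    Illustrative only; the actual tritllm layout may differ.
    """
    digits = trits.astype(np.int64) + 1                     # map {-1, 0, +1} -> {0, 1, 2}
    digits = np.concatenate([digits, np.zeros((-len(digits)) % 5, dtype=np.int64)])
    place = 3 ** np.arange(5)                               # base-3 place values
    return (digits.reshape(-1, 5) * place).sum(axis=1).astype(np.uint8)

def unpack_trits(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_trits; returns the first n trits as values in {-1, 0, +1}."""
    digits = packed.astype(np.int64)[:, None] // (3 ** np.arange(5)) % 3
    return digits.reshape(-1)[:n] - 1

trits = np.random.choice([-1, 0, 1], size=1000)
packed = pack_trits(trits)
assert np.array_equal(unpack_trits(packed, trits.size), trits)
print(f"{trits.size} trits -> {packed.nbytes} bytes ({8 * packed.nbytes / trits.size:.2f} bits/trit)")
```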


Overview

Entrit/Qwen2.5-0.5B-trit-uniform-d4 is a quantized version of the Qwen/Qwen2.5-0.5B model, developed by Entrit. This model utilizes balanced ternary post-training quantization (PTQ) with a depth of 4, achieving an information content of 6.64 bits per weight. This quantization method is based on research presented in "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
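
The paper's exact PTQ procedure is not reproduced here; as a minimal sketch of what "uniform, depth 4" means, the snippet below quantizes a weight tensor to 3^4 = 81 evenly spaced levels, i.e., signed integer codes in [-40, +40] (each such code is exactly representable with four balanced trits) times a symmetric scale. The per-tensor max-abs scale is an assumption for illustration; the real method may use per-group or per-channel scales and calibrated clipping.

```python
import torch

DEPTH = 4                     # trits per weight
LEVELS = 3 ** DEPTH           # 81 quantization levels
QMAX = (LEVELS - 1) // 2      # integer codes span [-40, +40]

def quantize_uniform_trit(w: torch.Tensor, qmax: int = QMAX):
    """Uniform symmetric quantization to 2*qmax + 1 levels (illustrative sketch only)."""
    scale = w.abs().max() / qmax                      # per-tensor scale; real PTQ may use per-group scales
    codes = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return codes, scale

def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.to(torch.float32) * scale

w = torch.randn(256, 256)
codes, scale = quantize_uniform_trit(w)
w_hat = dequantize(codes, scale)
print(f"{LEVELS} levels, max |w - w_hat| = {(w - w_hat).abs().max().item():.4f}")
```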

Key Quantization Details

  • Source Model: Qwen/Qwen2.5-0.5B
  • Quantization Method: Uniform PTQ with a depth of 4 (81 levels per weight).
  • Bits per Weight: 6.64, versus 16 bits for FP16 (roughly a 2.4× reduction).
  • Quantized Layers: All 2D linear weight matrices are quantized (see the layer-selection sketch after this list).
  • FP16 Layers: lm_head, token embeddings, and all *_norm layers remain in FP16 for compatibility and performance.
  • Codec: Uses tritllm v2 for the quantization process.
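
As a rough sketch of the layer split above (all 2D linear weights quantized; lm_head, token embeddings, and norm layers left in FP16), the snippet below selects the corresponding modules from the source model with standard transformers/PyTorch APIs. The name-matching heuristic is an illustrative assumption, and the tritllm v2 codec itself is not shown.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", torch_dtype=torch.float16)

# Modules kept in FP16 per the model card: lm_head, token embeddings, *_norm layers.
SKIP_KEYS = ("lm_head", "embed_tokens", "norm")

to_quantize = [
    name
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)               # 2D weight matrices only
    and not any(key in name for key in SKIP_KEYS)
]
print(f"{len(to_quantize)} linear layers selected for ternary PTQ")
```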

Usage and Performance Considerations

The model's on-disk size is similar to its FP16 source because the weights are stored dequantized for stock transformers compatibility; the efficiency gains are realized only with hardware and kernels designed for packed trit formats (e.g., Entrit/tritllm-kernel). It is therefore best suited to deployments where a reduced memory footprint and faster inference on such specialized kernels are the priority.
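
Because the published weights are stored dequantized, the checkpoint loads like any other Qwen2.5 model with stock transformers. The sketch below assumes the repository ships the usual config and tokenizer files; taking advantage of the packed trit path requires the separate Entrit/tritllm-kernel tooling, which is not covered here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-0.5B-trit-uniform-d4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Balanced ternary quantization is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```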