Entrit/Qwen2.5-32B-trit-uniform-d4

Text generation · Model size: 32.8B · Context length: 32k · Published: Apr 26, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Entrit/Qwen2.5-32B-trit-uniform-d4 is a 32.8 billion parameter language model based on the Qwen2.5-32B architecture, developed by Entrit Systems. The model applies balanced ternary post-training quantization (PTQ) at a depth of d=4, yielding roughly 6.34 bits per weight. It is optimized for efficient inference on hardware that directly consumes packed trit formats, making it suitable for applications that require a reduced memory footprint and, on specialized hardware, potentially faster processing.
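For reference, the bits-per-weight figure follows directly from the trit depth: each weight takes one of $3^4 = 81$ values, so its information content is

$$\mathrm{bpw} = d \cdot \log_2 3 = 4 \times 1.585 \approx 6.34.$$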


Overview

Entrit/Qwen2.5-32B-trit-uniform-d4 is a 32.8 billion parameter language model derived from the original Qwen/Qwen2.5-32B. This version incorporates balanced ternary post-training quantization (PTQ), a technique developed by Entrit Systems and detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).

Key Quantization Details

  • Quantization Method: Uniform PTQ applied to all 2-D linear weight matrices.
  • Bit Depth: A trit depth of d=4 gives 81 levels per weight, i.e. roughly 6.34 bits per weight (bpw); see the sketch after this list.
  • Information Content: The ~6.34 bpw figure is the information content of the quantized matrices, which is what matters for inference on hardware designed for packed trit formats.
  • FP16 Compatibility: Although the weights are quantized, they are shipped dequantized to FP16 for compatibility with the standard transformers library, so the on-disk size is similar to that of the original FP16 model (a loading example appears at the end of this card).
  • Components Kept in FP16: lm_head, token embeddings, and all *_norm layers remain in FP16, following common quantization practices to preserve model performance.
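To make the scheme concrete, here is a minimal sketch of uniform balanced-ternary quantization at depth d=4, assuming a simple symmetric per-tensor scale and a pack-5-trits-per-byte layout. These are illustrative assumptions; Entrit's actual calibration, scaling granularity, and packed format are specified in the paper and may differ.

```python
import numpy as np

D = 4                        # trit depth
QMAX = (3**D - 1) // 2       # 40: levels run from -40 to +40, 81 in total

def quantize_ternary(w: np.ndarray, d: int = D):
    """Round an FP weight matrix to the 3**d balanced-ternary levels."""
    qmax = (3**d - 1) // 2
    scale = np.abs(w).max() / qmax                # symmetric per-tensor scale (assumption)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_fp16(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover FP16 weights, as shipped in this repo for transformers compatibility."""
    return (q.astype(np.float32) * scale).astype(np.float16)

def to_trits(q: np.ndarray, d: int = D) -> np.ndarray:
    """Decompose each level into d balanced-ternary digits in {-1, 0, +1}."""
    x = q.astype(np.int64)
    trits = np.empty(q.shape + (d,), dtype=np.int8)
    for i in range(d):
        r = ((x + 1) % 3) - 1                     # balanced remainder in {-1, 0, 1}
        trits[..., i] = r
        x = (x - r) // 3
    return trits

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack 5 trits per byte (3**5 = 243 <= 256); one plausible packed-trit layout."""
    u = (trits.reshape(-1) + 1).astype(np.uint8)  # shift {-1,0,1} -> {0,1,2}
    u = np.pad(u, (0, (-len(u)) % 5))             # pad to a multiple of 5
    place = 3 ** np.arange(5, dtype=np.uint16)    # base-3 place values
    return (u.reshape(-1, 5).astype(np.uint16) @ place).astype(np.uint8)

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_ternary(w)
packed = pack_trits(to_trits(q))
print("levels used:     ", q.min(), "to", q.max())    # within [-40, 40]
print("information bpw: ", round(D * np.log2(3), 2))  # 6.34
print("packed bpw:      ", 8 * packed.size / w.size)  # 6.4 with this byte layout
print("mean abs error:  ", np.abs(w - dequantize_fp16(q, s).astype(np.float32)).mean())
```

Note that a 5-trits-per-byte layout stores the 4 trits of each weight in 6.4 bits, close to the 6.34-bit information content.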

Use Cases

This model is particularly relevant for developers and researchers interested in:

  • Efficient Inference: Exploring models with reduced information content for potential memory and computational savings on specialized hardware.
  • Quantization Research: Studying the practical application and performance of balanced ternary quantization techniques.
  • Hardware-Aware Deployment: Deploying large language models on platforms that can leverage packed trit formats for optimized inference.
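Since the shipped weights are plain FP16, loading should work like any standard Qwen2.5 checkpoint with the transformers library. A hedged sketch (the repo id is taken from this card's title; the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Entrit/Qwen2.5-32B-trit-uniform-d4"   # repo id from this card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # weights are stored dequantized to FP16
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("Balanced ternary quantization is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```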