Entrit/Qwen2.5-32B-trit-uniform-d4

Text generation · Model size: 32.8B · Context length: 32k · Published: Apr 26, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Entrit/Qwen2.5-32B-trit-uniform-d4 is a 32.8 billion parameter language model based on the Qwen2.5-32B architecture, developed by Entrit Systems. The model applies balanced ternary post-training quantization (PTQ) at a depth of d=4, yielding roughly 6.34 bits per weight. It is optimized for efficient inference on hardware that directly consumes packed trit formats, making it suitable for applications that require a reduced memory footprint and, on specialized hardware, potentially faster processing.
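For reference, the bits-per-weight figure follows directly from the trit depth: each weight takes one of $3^4 = 81$ values, so its information content is

$$\mathrm{bpw} = d \cdot \log_2 3 = 4 \times 1.585 \approx 6.34.$$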


Overview

Entrit/Qwen2.5-32B-trit-uniform-d4 is a 32.8 billion parameter language model derived from the original Qwen/Qwen2.5-32B. This version incorporates balanced ternary post-training quantization (PTQ), a technique developed by Entrit Systems and detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).

Key Quantization Details

  • Quantization Method: Uniform PTQ applied to all 2-D linear weight matrices.
  • Bit Depth: A trit depth of d=4 gives 81 levels per weight, i.e. roughly 6.34 bits per weight (bpw); see the sketch after this list.
  • Information Content: The ~6.34 bpw figure is the information content of the quantized matrices, which is what matters for inference on hardware designed for packed trit formats.
  • FP16 Compatibility: Although the weights are quantized, they are shipped dequantized to FP16 for compatibility with the standard transformers library, so the on-disk size is similar to that of the original FP16 model (a loading example appears at the end of this card).
  • Components Kept in FP16: lm_head, token embeddings, and all *_norm layers remain in FP16, following common quantization practices to preserve model performance.
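To make the scheme concrete, here is a minimal sketch of uniform balanced-ternary quantization at depth d=4, assuming a simple symmetric per-tensor scale and a pack-5-trits-per-byte layout. These are illustrative assumptions; Entrit's actual calibration, scaling granularity, and packed format are specified in the paper and may differ.

```python
import numpy as np

D = 4                        # trit depth
QMAX = (3**D - 1) // 2       # 40: levels run from -40 to +40, 81 in total

def quantize_ternary(w: np.ndarray, d: int = D):
    """Round an FP weight matrix to the 3**d balanced-ternary levels."""
    qmax = (3**d - 1) // 2
    scale = np.abs(w).max() / qmax                # symmetric per-tensor scale (assumption)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_fp16(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover FP16 weights, as shipped in this repo for transformers compatibility."""
    return (q.astype(np.float32) * scale).astype(np.float16)

def to_trits(q: np.ndarray, d: int = D) -> np.ndarray:
    """Decompose each level into d balanced-ternary digits in {-1, 0, +1}."""
    x = q.astype(np.int64)
    trits = np.empty(q.shape + (d,), dtype=np.int8)
    for i in range(d):
        r = ((x + 1) % 3) - 1                     # balanced remainder in {-1, 0, 1}
        trits[..., i] = r
        x = (x - r) // 3
    return trits

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack 5 trits per byte (3**5 = 243 <= 256); one plausible packed-trit layout."""
    u = (trits.reshape(-1) + 1).astype(np.uint8)  # shift {-1,0,1} -> {0,1,2}
    u = np.pad(u, (0, (-len(u)) % 5))             # pad to a multiple of 5
    place = 3 ** np.arange(5, dtype=np.uint16)    # base-3 place values
    return (u.reshape(-1, 5).astype(np.uint16) @ place).astype(np.uint8)

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_ternary(w)
packed = pack_trits(to_trits(q))
print("levels used:     ", q.min(), "to", q.max())    # within [-40, 40]
print("information bpw: ", round(D * np.log2(3), 2))  # 6.34
print("packed bpw:      ", 8 * packed.size / w.size)  # 6.4 with this byte layout
print("mean abs error:  ", np.abs(w - dequantize_fp16(q, s).astype(np.float32)).mean())
```

Note that a 5-trits-per-byte layout stores the 4 trits of each weight in 6.4 bits, close to the 6.34-bit information content.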

Use Cases

This model is particularly relevant for developers and researchers interested in:

  • Efficient Inference: Exploring models with reduced information content for potential memory and computational savings on specialized hardware.
  • Quantization Research: Studying the practical application and performance of balanced ternary quantization techniques.
  • Hardware-Aware Deployment: Deploying large language models on platforms that can leverage packed trit formats for optimized inference.
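Since the shipped weights are plain FP16, loading should work like any standard Qwen2.5 checkpoint with the transformers library. A hedged sketch (the repo id is taken from this card's title; the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Entrit/Qwen2.5-32B-trit-uniform-d4"   # repo id from this card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # weights are stored dequantized to FP16
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("Balanced ternary quantization is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```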