# Entrit/Qwen2.5-1.5B-trit-uniform-d1
Entrit/Qwen2.5-1.5B-trit-uniform-d1 is a 1.5 billion parameter causal language model from Entrit, based on Qwen/Qwen2.5-1.5B, with balanced ternary post-training quantization applied to its weights. At 1.88 bits per weight, the quantized representation is far more compact than the FP16 original. The model targets efficient inference on specialized hardware that can consume its packed trit format directly, making it suitable for resource-constrained environments.
## Entrit/Qwen2.5-1.5B-trit-uniform-d1: A Quantized Qwen2.5 Model
This model is a 1.5 billion parameter variant of the Qwen/Qwen2.5-1.5B architecture, developed by Entrit. Its primary distinguishing feature is the application of balanced ternary post-training quantization (PTQ), reducing the model's weight representation to 1.88 bits per weight (bpw).
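For intuition, balanced ternary PTQ maps each weight to one of three values, {-1, 0, +1}, times a floating-point scale. A minimal PyTorch sketch, assuming a per-row mean-absolute-value scaling rule (the actual `tritllm v2` codec may scale and group weights differently):

```python
import torch

def quantize_ternary(w: torch.Tensor):
    """Round weights to balanced ternary {-1, 0, +1} with a per-row FP16
    scale. Hypothetical sketch; not the actual tritllm v2 codec."""
    # Mean-absolute-value scaling, as in common ternary schemes (e.g. TWN);
    # the paper's real scaling and grouping rule may differ.
    scale = w.abs().mean(dim=-1, keepdim=True).clamp(min=1e-8)
    trits = torch.clamp(torch.round(w / scale), -1, 1).to(torch.int8)
    return trits, scale.half()

def dequantize_ternary(trits: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reconstruct FP16 weights usable by standard transformers kernels.
    return trits.half() * scale

# Example: quantize a random 4x8 "linear layer" weight matrix.
w = torch.randn(4, 8)
trits, scale = quantize_ternary(w)
w_hat = dequantize_ternary(trits, scale)
```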
## Key Quantization Details
- Source Model: Qwen/Qwen2.5-1.5B
- Quantization Method: Uniform PTQ with depth d=1, i.e. three levels per weight ({-1, 0, +1}).
- Bits per Weight: 1.88 bpw, a small fraction of the 16 bpw of the FP16 source.
- Codec: Utilizes the `tritllm v2` codec, as detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026). A packing sketch follows this list.
- Selective Quantization: All 2D linear matrices are quantized, while `lm_head`, token embeddings, and `*_norm` layers are kept in FP16 for performance and stability.
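Sub-2-bit storage comes from packing trits densely rather than giving each its own byte. The sketch below shows one common approach, offered purely as an illustration (the actual `tritllm v2` layout is defined in the paper): five trits per byte, since 3^5 = 243 ≤ 256, yields 1.6 bits per trit before metadata; overheads such as per-channel scales plausibly account for the gap up to the reported 1.88 bpw.

```python
import numpy as np

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack balanced-ternary digits {-1, 0, +1} five to a byte
    (3**5 = 243 <= 256), i.e. 1.6 bits per trit before metadata.
    Illustrative layout only, not the actual tritllm v2 format."""
    digits = (trits.astype(np.int64) + 1).ravel()        # map to {0, 1, 2}
    digits = np.pad(digits, (0, (-len(digits)) % 5))     # pad to multiple of 5
    powers = 3 ** np.arange(5, dtype=np.int64)           # base-3 place values
    return (digits.reshape(-1, 5) * powers).sum(axis=1).astype(np.uint8)

def unpack_trits(packed: np.ndarray, n: int) -> np.ndarray:
    powers = 3 ** np.arange(5, dtype=np.int64)
    digits = (packed[:, None].astype(np.int64) // powers) % 3
    return (digits.ravel()[:n] - 1).astype(np.int8)      # back to {-1, 0, +1}

# Round-trip check on random trits.
t = np.random.randint(-1, 2, size=123).astype(np.int8)
assert np.array_equal(unpack_trits(pack_trits(t), t.size), t)
```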
## Performance and Use Cases
Because the published checkpoint stores the weights dequantized back to FP16 for compatibility with standard transformers loading, its on-disk size is similar to that of the FP16 source. The 1.88 bpw figure instead reflects the packed trit representation, which enables highly efficient inference on hardware and kernels that consume the trit format directly (e.g., Entrit/tritllm-kernel), reducing memory footprint and speeding up computation. Developers exploring extreme quantization for resource-constrained deployments should consider this model.
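Assuming the checkpoint ships the dequantized FP16 weights in standard transformers format (as the on-disk size suggests), loading should follow the usual Hugging Face pattern; a minimal sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-1.5B-trit-uniform-d1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # weights are stored dequantized in FP16
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("Balanced ternary quantization", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this path runs standard FP16 kernels; the packed-trit speed and memory benefits require a compatible kernel such as Entrit/tritllm-kernel.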