Entrit/Qwen2.5-14B-trit-uniform-d1
Entrit/Qwen2.5-14B-trit-uniform-d1 is a 14.8 billion parameter language model based on Qwen/Qwen2.5-14B, featuring balanced ternary post-training quantization. Developed by Entrit, this model uses a d=1 depth quantization scheme, achieving 1.88 bits per weight for efficient inference. It is optimized for reduced memory footprint and faster processing on hardware supporting packed trit formats, while maintaining compatibility with standard FP16 dequantization for broader use.
Entrit/Qwen2.5-14B-trit-uniform-d1 Overview
This model is a quantized version of the Qwen/Qwen2.5-14B large language model, developed by Entrit. It employs a balanced ternary post-training quantization (PTQ) method, specifically designed for memory and inference efficiency. The quantization process reduces the model's weight representation to 1.88 bits per weight, significantly lowering storage and memory requirements.
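As a rough illustration of the idea (this is a minimal sketch, not the actual `tritllm v2` codec, whose details are in the cited paper), balanced ternary quantization maps each weight to one of three levels {-1, 0, +1} multiplied by a per-channel scale. The absmean-style scaling below is an assumption for illustration:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Quantize a 2D weight matrix to balanced ternary {-1, 0, +1}
    with one FP16 scale per output row (absmean-style scaling).

    Illustrative sketch only; not the tritllm v2 codec.
    """
    # Per-row scale: mean absolute value of the row's weights.
    scale = w.abs().mean(dim=1, keepdim=True).clamp_min(eps)
    # Normalize, round to the nearest integer, clip to {-1, 0, +1}.
    trits = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return trits, scale.half()

def ternary_dequantize(trits: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an FP16 approximation of the original weights."""
    return trits.half() * scale

w = torch.randn(4, 8)
trits, scale = ternary_quantize(w)
w_hat = ternary_dequantize(trits, scale)  # FP16 approximation of w
```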
Key Quantization Details
- Source Model: Based on `Qwen/Qwen2.5-14B`.
- Quantization Method: Uniform PTQ with a depth of d=1, i.e. three levels per weight.
- Bits per Weight: Achieves an effective 1.88 bits per weight, indicating high compression.
- Codec: Uses the `tritllm v2` codec, detailed in the associated research "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
- Layer Coverage: All 2D linear matrices within the model are quantized, while critical components such as `lm_head`, token embeddings, and normalization layers remain in FP16 for stability.
- Compatibility: The on-disk size matches FP16 because weights are dequantized for `transformers` compatibility; the true efficiency is realized on hardware that can consume the packed trit format directly (see the packing sketch after this list).
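A single trit carries log2(3) ≈ 1.585 bits of information, so the effective 1.88 bits per weight presumably also accounts for packing granularity and scale overhead; the card does not specify the exact on-disk layout. One common packing scheme, shown here purely as an assumption, stores five trits per byte (3^5 = 243 ≤ 256), i.e. 8/5 = 1.6 bits per trit before scales:

```python
import numpy as np

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack balanced-ternary values {-1, 0, +1} five to a byte
    (3**5 = 243 <= 256), i.e. 1.6 bits per trit of raw storage.

    Illustrative only; the actual tritllm v2 layout is defined
    by the codec, not by this sketch.
    """
    flat = trits.astype(np.int64).ravel() + 1        # map {-1,0,1} -> {0,1,2}
    pad = (-len(flat)) % 5                           # pad to a multiple of 5
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.int64)])
    groups = flat.reshape(-1, 5)
    # Base-3 encode each group of five trits into one byte.
    weights = 3 ** np.arange(5, dtype=np.int64)
    return (groups @ weights).astype(np.uint8)

def unpack_trits(packed: np.ndarray, n: int) -> np.ndarray:
    """Invert pack_trits, recovering the first n trits."""
    digits = packed.astype(np.int64)[:, None] // (3 ** np.arange(5)) % 3
    return (digits.ravel()[:n] - 1).astype(np.int8)  # back to {-1,0,1}

trits = np.random.randint(-1, 2, size=23).astype(np.int8)
assert np.array_equal(unpack_trits(pack_trits(trits), 23), trits)
```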
Use Cases and Benefits
This model is particularly beneficial for scenarios requiring a reduced memory footprint and potentially faster inference, especially when deployed on specialized hardware that can leverage its packed trit format. It offers a balance between model performance and resource consumption, making it suitable for edge devices or environments with strict memory constraints. Developers can load and use it with the standard transformers library, which dequantizes the weights to FP16 at load time for seamless integration.
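A typical loading pattern follows, assuming the repository id `Entrit/Qwen2.5-14B-trit-uniform-d1` on the Hugging Face Hub and that the checkpoint loads as standard FP16 tensors, as described above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-14B-trit-uniform-d1"  # repo id from this card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # weights dequantize to FP16 for transformers
    device_map="auto",
)

inputs = tokenizer("Balanced ternary quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```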