Entrit/Qwen2.5-14B-trit-uniform-d1
Entrit/Qwen2.5-14B-trit-uniform-d1 is a 14.8 billion parameter language model based on Qwen/Qwen2.5-14B, featuring balanced ternary post-training quantization. Developed by Entrit, this model uses a d=1 depth quantization scheme, achieving 1.88 bits per weight for efficient inference. It is optimized for reduced memory footprint and faster processing on hardware supporting packed trit formats, while maintaining compatibility with standard FP16 dequantization for broader use.
Entrit/Qwen2.5-14B-trit-uniform-d1 Overview
This model is a quantized version of the Qwen/Qwen2.5-14B large language model, developed by Entrit. It employs a balanced ternary post-training quantization (PTQ) method, specifically designed for memory and inference efficiency. The quantization process reduces the model's weight representation to 1.88 bits per weight, significantly lowering storage and memory requirements.
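As a rough illustration of the idea (this is a minimal sketch, not the actual `tritllm v2` codec, whose details are in the cited paper), balanced ternary quantization maps each weight to one of three levels {-1, 0, +1} multiplied by a per-channel scale. The absmean-style scaling below is an assumption for illustration:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Quantize a 2D weight matrix to balanced ternary {-1, 0, +1}
    with one FP16 scale per output row (absmean-style scaling).

    Illustrative sketch only; not the tritllm v2 codec.
    """
    # Per-row scale: mean absolute value of the row's weights.
    scale = w.abs().mean(dim=1, keepdim=True).clamp_min(eps)
    # Normalize, round to the nearest integer, clip to {-1, 0, +1}.
    trits = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return trits, scale.half()

def ternary_dequantize(trits: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an FP16 approximation of the original weights."""
    return trits.half() * scale

w = torch.randn(4, 8)
trits, scale = ternary_quantize(w)
w_hat = ternary_dequantize(trits, scale)  # FP16 approximation of w
```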
Key Quantization Details
- Source Model: Based on `Qwen/Qwen2.5-14B`.
- Quantization Method: Uniform PTQ with a depth of d=1, i.e. three levels per weight.
- Bits per Weight: Achieves an effective 1.88 bits per weight, indicating high compression.
- Codec: Uses the `tritllm v2` codec, detailed in the associated research "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
- Layer Coverage: All 2D linear matrices within the model are quantized, while critical components such as `lm_head`, token embeddings, and normalization layers remain in FP16 for stability.
- Compatibility: The on-disk size matches FP16 because weights are dequantized for `transformers` compatibility; the true efficiency is realized on hardware that can consume the packed trit format directly (see the packing sketch after this list).
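A single trit carries log2(3) ≈ 1.585 bits of information, so the effective 1.88 bits per weight presumably also accounts for packing granularity and scale overhead; the card does not specify the exact on-disk layout. One common packing scheme, shown here purely as an assumption, stores five trits per byte (3^5 = 243 ≤ 256), i.e. 8/5 = 1.6 bits per trit before scales:

```python
import numpy as np

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack balanced-ternary values {-1, 0, +1} five to a byte
    (3**5 = 243 <= 256), i.e. 1.6 bits per trit of raw storage.

    Illustrative only; the actual tritllm v2 layout is defined
    by the codec, not by this sketch.
    """
    flat = trits.astype(np.int64).ravel() + 1        # map {-1,0,1} -> {0,1,2}
    pad = (-len(flat)) % 5                           # pad to a multiple of 5
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.int64)])
    groups = flat.reshape(-1, 5)
    # Base-3 encode each group of five trits into one byte.
    weights = 3 ** np.arange(5, dtype=np.int64)
    return (groups @ weights).astype(np.uint8)

def unpack_trits(packed: np.ndarray, n: int) -> np.ndarray:
    """Invert pack_trits, recovering the first n trits."""
    digits = packed.astype(np.int64)[:, None] // (3 ** np.arange(5)) % 3
    return (digits.ravel()[:n] - 1).astype(np.int8)  # back to {-1,0,1}

trits = np.random.randint(-1, 2, size=23).astype(np.int8)
assert np.array_equal(unpack_trits(pack_trits(trits), 23), trits)
```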
Use Cases and Benefits
This model is particularly beneficial for scenarios requiring a reduced memory footprint and potentially faster inference, especially when deployed on specialized hardware that can leverage its packed trit format. It offers a balance between model performance and resource consumption, making it suitable for edge devices or environments with strict memory constraints. Developers can load and use it with the standard transformers library, which dequantizes the weights to FP16 at load time for seamless integration.
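A typical loading pattern follows, assuming the repository id `Entrit/Qwen2.5-14B-trit-uniform-d1` on the Hugging Face Hub and that the checkpoint loads as standard FP16 tensors, as described above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Entrit/Qwen2.5-14B-trit-uniform-d1"  # repo id from this card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # weights dequantize to FP16 for transformers
    device_map="auto",
)

inputs = tokenizer("Balanced ternary quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```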