Entrit/Llama-3.1-8B-trit-uniform-d1
Entrit/Llama-3.1-8B-trit-uniform-d1 is an 8-billion-parameter language model from Entrit, based on Meta's Llama-3.1-8B architecture with an 8192-token context length. It applies balanced ternary post-training quantization (PTQ) at depth d=1, for 1.88 bits per weight. It is a research artifact intended for exploring efficient inference on hardware that consumes packed trit formats directly, and it offers a highly compressed representation of the original Llama-3.1-8B weights.
Model Overview
Entrit/Llama-3.1-8B-trit-uniform-d1 is a research artifact from Entrit that applies balanced ternary post-training quantization (PTQ) to Meta's Llama-3.1-8B. The 8-billion-parameter model is quantized at depth d=1, meaning each weight takes one of 3 levels (the balanced ternary digits -1, 0, +1), with a reported cost of 1.88 bits per weight. Quantization is uniform PTQ and is applied to all 2D linear weight matrices; lm_head, the token embeddings, and the *_norm layers remain in FP16.
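To make "uniform PTQ at d=1" and the layer selection concrete, here is a minimal sketch. It assumes a per-row scale equal to the row's largest absolute weight and round-to-nearest onto the levels -1, 0, +1; the scale choice, the function names (should_quantize, ternarize_row, dequantize_row), and the name-matching rules are all hypothetical illustrations, not the tritllm v2 codec. Module names follow the usual Hugging Face Llama layout (model.embed_tokens, lm_head, model.norm).

```python
from typing import List, Tuple


def should_quantize(name: str, shape: Tuple[int, ...]) -> bool:
    """Hypothetical filter: quantize 2D linear weights, keep the rest in FP16."""
    if len(shape) != 2:
        return False  # 1D tensors (e.g. norm weights) are never quantized
    # lm_head, token embeddings and normalization layers stay in FP16.
    if name.startswith("lm_head") or "embed_tokens" in name or "norm" in name:
        return False
    return True


def ternarize_row(row: List[float]) -> Tuple[List[int], float]:
    """Assumed uniform 3-level quantizer: step = max|w|, digits in {-1, 0, +1}."""
    scale = max(abs(w) for w in row)
    if scale == 0.0:
        return [0] * len(row), 0.0
    # round() maps w/scale in [-1, 1] to the nearest level; clamp for safety.
    trits = [max(-1, min(1, round(w / scale))) for w in row]
    return trits, scale


def dequantize_row(trits: List[int], scale: float) -> List[float]:
    # Reconstruction step; a checkpoint shipped as FP16 would store these values.
    return [t * scale for t in trits]


if __name__ == "__main__":
    trits, scale = ternarize_row([0.9, -0.2, 0.5, -1.2])
    print(trits, scale)                   # [1, 0, 0, -1] 1.2
    print(dequantize_row(trits, scale))   # [1.2, 0.0, 0.0, -1.2]
    print(should_quantize("model.layers.0.self_attn.q_proj.weight", (4096, 4096)))  # True
    print(should_quantize("lm_head.weight", (128256, 4096)))  # False
```

With only three levels, small weights collapse to 0 and large ones snap to plus or minus the scale, which is why the choice of scale (here a simple row maximum) matters a great deal in practice.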
Key Characteristics
- Base Model: Derived from meta-llama/Llama-3.1-8B.
- Quantization: Balanced ternary PTQ (3 levels per weight), resulting in 1.88 bits per weight.
- Codec: Uses the tritllm v2 codec, described in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
- Compatibility: Weights are stored dequantized to FP16 so that they load with the stock transformers library; as a result, the on-disk size is the same as that of the FP16 source.
- Efficiency Focus: The 1.88-bpw figure indicates the potential for efficient inference on specialized hardware that processes packed trit formats directly.
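For intuition about packed trit formats, the sketch below packs five trits into one byte in base 3 (3^5 = 243 values, which fit in 256). That gives 8/5 = 1.6 bits per trit, close to the log2(3) ≈ 1.585-bit lower bound. This is a hypothetical layout for illustration only, not the tritllm v2 on-disk format; the card's 1.88-bpw figure is higher than either number, presumably because it also counts overhead such as scales, which this sketch does not model.

```python
import math
from typing import List


def pack_trits(trits: List[int]) -> bytes:
    """Pack balanced trits (-1, 0, +1) five per byte, base 3, first trit least significant."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        chunk = trits[i:i + 5]
        value = 0
        for t in reversed(chunk):
            value = value * 3 + (t + 1)  # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)  # at most 3**5 - 1 = 242, so it fits in a byte
    return bytes(out)


def unpack_trits(data: bytes, count: int) -> List[int]:
    """Inverse of pack_trits; count drops padding digits from the last byte."""
    trits: List[int] = []
    for byte in data:
        value = byte
        for _ in range(5):
            trits.append(value % 3 - 1)
            value //= 3
    return trits[:count]


if __name__ == "__main__":
    weights = [1, -1, 0, 0, 1, -1, 1]
    packed = pack_trits(weights)
    print(len(packed))                             # 2 bytes for 7 trits
    print(unpack_trits(packed, len(weights)))      # round-trips to the input
    print(round(math.log2(3), 4), 8 / 5)           # 1.585 1.6
```

Base-3 packing trades a little decode arithmetic (divisions by 3) for density; a simpler 2-bits-per-trit layout decodes with shifts and masks but costs 2.0 bits per weight.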
Use Cases
- Research and Development: Suited to exploring and experimenting with highly quantized LLMs and their performance characteristics.
- Hardware Optimization: Suitable for developers working on custom hardware or inference engines that can leverage balanced ternary representations.
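- Resource-Constrained Deployment: Offers a path toward more memory-efficient LLM deployment when paired with compatible hardware or a runtime that stores weights as packed trits; the FP16 checkpoint as published does not by itself reduce memory use.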
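One reason ternary weights appeal to custom hardware is that a matrix-vector product over {-1, 0, +1} needs only additions and subtractions, plus one multiply per output row for the scale. The sketch below illustrates that idea under the same hypothetical per-row-scale assumption used elsewhere in this card's examples; ternary_matvec is an illustrative name, not an Entrit or tritllm kernel.

```python
from typing import List


def ternary_matvec(trit_rows: List[List[int]], scales: List[float],
                   x: List[float]) -> List[float]:
    """Compute y = diag(scales) @ T @ x, where T has entries in {-1, 0, +1}."""
    y: List[float] = []
    for row, scale in zip(trit_rows, scales):
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
            # t == 0: the weight contributes nothing, so it is skipped
        y.append(acc * scale)  # the only multiply for this output element
    return y


if __name__ == "__main__":
    rows = [[1, 0, -1], [0, 1, 1]]
    print(ternary_matvec(rows, [0.5, 2.0], [1.0, 2.0, 3.0]))  # [-1.0, 10.0]
```

Zero weights also let a kernel skip work entirely, so the sparsity that ternary rounding produces can translate into fewer operations, not just fewer bits.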