Entrit/Llama-3.1-8B-trit-uniform-d1

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 8k | Published: May 4, 2026 | License: llama3.1 | Architecture: Transformer | Warm

Entrit/Llama-3.1-8B-trit-uniform-d1 is an 8-billion-parameter language model developed by Entrit, based on Meta's Llama-3.1-8B with an 8192-token context length. It applies balanced ternary post-training quantization (PTQ) at depth d=1, which the authors report as 1.88 bits per weight. It is a research artifact intended for exploring efficient inference on hardware that consumes packed trit formats directly; the 1.88-bpw figure describes that packed representation of the original Llama-3.1-8B weights.


Model Overview

Entrit/Llama-3.1-8B-trit-uniform-d1 is a research artifact developed by Entrit that applies balanced ternary post-training quantization (PTQ) to Meta's Llama-3.1-8B. The 8-billion-parameter model is quantized at depth d=1, meaning each weight takes one of 3 levels, for a reported information content of 1.88 bits per weight. The quantization method is uniform PTQ, applied to all 2D linear matrices; lm_head, the token embeddings, and the *_norm layers remain in FP16.
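To make "3 levels per weight" concrete, here is a minimal sketch of uniform balanced ternary quantization for one row of weights. It is not the tritllm codec: the function names and the mean-absolute-value scale rule are assumptions chosen for illustration, since the card does not describe how scales are picked.

```python
def absmean_scale(weights):
    # Hypothetical scale rule (an assumption, not taken from the card):
    # the mean absolute value of the weights.
    return sum(abs(w) for w in weights) / len(weights)


def quantize_ternary(weights, scale):
    # Round each weight to the nearest multiple of `scale`,
    # then clamp to the three balanced ternary levels {-1, 0, +1}.
    return [max(-1, min(1, round(w / scale))) for w in weights]


def dequantize_ternary(trits, scale):
    # Map trits back to floating-point values: each becomes -scale, 0, or +scale.
    return [t * scale for t in trits]


row = [0.42, -0.07, -0.55, 0.03, 0.61, -0.38]
scale = absmean_scale(row)
trits = quantize_ternary(row, scale)
print(trits)  # [1, 0, -1, 0, 1, -1]
print(dequantize_ternary(trits, scale))
```

After dequantization every weight in the row is one of three values, which is the form in which the quantized matrices are stored when they are written back out as FP16.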

Key Characteristics

  • Base Model: Derived from meta-llama/Llama-3.1-8B.
  • Quantization: Balanced ternary PTQ (3 levels per weight) resulting in 1.88 bits per weight.
  • Codec: Utilizes the tritllm v2 codec, detailed in the paper "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
  • Compatibility: Weights are dequantized to FP16 so they load with the stock transformers library; as a result, the on-disk size is the same as that of the FP16 source.
  • Efficiency Focus: The 1.88-bpw figure describes the packed trit representation; it translates into efficient inference only on specialized hardware designed to process packed trit formats directly.
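As an illustration of what a "packed trit format" can look like, the sketch below stores 5 trits per byte, which works because 3^5 = 243 fits in the 256 values of a byte and costs 8/5 = 1.6 bits per trit (the theoretical minimum is log2(3) ≈ 1.585). This layout is a generic assumption, not the tritllm v2 on-disk format, and it does not account for the card's 1.88-bpw figure, which the card does not break down.

```python
TRITS_PER_BYTE = 5  # 3**5 = 243 distinct values, which fits in one byte


def pack_trits(trits):
    # Encode each group of up to 5 trits as a base-3 number in one byte.
    # Trits in {-1, 0, +1} are shifted to digits in {0, 1, 2} first.
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for t in reversed(trits[i:i + TRITS_PER_BYTE]):
            value = value * 3 + (t + 1)
        out.append(value)
    return bytes(out)


def unpack_trits(packed, count):
    # Decode base-3 digits, least significant first, and drop the padding
    # digits that a short final group leaves behind.
    trits = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]


trits = [1, 0, -1, 0, 1, -1, 1]
packed = pack_trits(trits)
assert unpack_trits(packed, len(trits)) == trits
print(len(packed), "bytes for", len(trits), "trits")
```

The caller has to keep the trit count alongside the bytes, because a short final group is indistinguishable from one padded with -1 trits.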

Use Cases

  • Research and Development: Ideal for exploring and experimenting with highly quantized LLMs and their performance characteristics.
  • Hardware Optimization: Suitable for developers working on custom hardware or inference engines that can leverage balanced ternary representations.
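  • Resource-Constrained Deployment: Offers a path towards more memory-efficient LLM deployment when paired with hardware or an inference engine that consumes packed trits; loaded through stock transformers, the weights occupy the same memory as FP16.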
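A rough back-of-the-envelope sketch of the potential memory saving, assuming a runtime that stores the quantized matrices at 1.88 bits per weight and keeps the remaining tensors in FP16. The parameter split used below (7 billion quantized, 1 billion FP16) is a round-number assumption for illustration, not a figure from the card.

```python
def estimate_weight_bytes(quantized_params, fp16_params, bits_per_weight=1.88):
    # Quantized matrices cost `bits_per_weight` bits each;
    # tensors left in FP16 cost 2 bytes each.
    quantized_bytes = quantized_params * bits_per_weight / 8
    fp16_bytes = fp16_params * 2
    return quantized_bytes + fp16_bytes


# Hypothetical split of an 8B-parameter model (assumed, not from the card).
packed = estimate_weight_bytes(quantized_params=7e9, fp16_params=1e9)
all_fp16 = estimate_weight_bytes(quantized_params=0, fp16_params=8e9)
print(f"packed: {packed / 1e9:.2f} GB, all FP16: {all_fp16 / 1e9:.2f} GB")
```

Under these assumptions the tensors left in FP16 account for more than half of the packed total, so the overall saving is smaller than the per-weight ratio alone suggests.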