jsyeom/llama-2-13b-hf-smooth

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 13B · Quant: FP8 · Ctx Length: 4k · Published: Mar 24, 2026 · Architecture: Transformer

jsyeom/llama-2-13b-hf-smooth is a 13-billion-parameter Llama 2-based causal language model that has undergone SmoothQuant smoothing without any quantization. Developed by jsyeom, it retains the full precision of the original meta-llama/Llama-2-13b-hf while rescaling its activations to migrate outliers into the weights. This preprocessing prepares the model for future quantization, making it particularly suitable for research and development in efficient inference. It keeps the original 4096-token context length, offering a smoothed foundation for downstream quantization experiments and general natural language processing tasks.


jsyeom/llama-2-13b-hf-smooth: A Smoothed Llama 2-13B Model

This model, developed by jsyeom, is a 13-billion-parameter variant of the meta-llama/Llama-2-13b-hf base model. Its distinguishing feature is the application of SmoothQuant smoothing to its internal activations. Crucially, no quantization has been applied; the model retains its full precision.

Key Characteristics

  • Base Model: meta-llama/Llama-2-13b-hf
  • Smoothing Technique: SmoothQuant smoothing applied to activations.
  • Smoothing Alpha: Configured with an alpha value of 0.85, which sets the migration strength — how much of the activation outlier magnitude is shifted into the weights.
  • Act Scales Source: Utilizes mit-han-lab/smoothquant-scales for activation scales.
  • Precision: Remains a full-precision model, as only smoothing, not quantization, has been performed.
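
The smoothing step behind this model can be sketched as follows. This is a minimal NumPy illustration of the SmoothQuant scale formula s_j = max|X_j|^α / max|W_j|^(1−α), not the actual conversion script; the function names and tensor shapes here are assumptions for illustration.

```python
import numpy as np

def smooth_scales(act_max, weight, alpha=0.85):
    """SmoothQuant per-input-channel scales: s_j = max|X_j|**a / max|W_j|**(1-a).

    act_max: per-channel max-abs activation from calibration, shape [in_features]
    weight:  linear-layer weight, shape [out_features, in_features]
    """
    w_max = np.maximum(np.abs(weight).max(axis=0), 1e-5)
    a_max = np.maximum(act_max, 1e-5)
    return a_max ** alpha / w_max ** (1 - alpha)

def smooth_linear(weight, act_max, alpha=0.85):
    """Migrate activation outliers into the weights: W' = W * s, X' = X / s."""
    s = smooth_scales(act_max, weight, alpha)
    return weight * s, s
```

Because X @ W.T == (X / s) @ (W * s).T, smoothing is mathematically a no-op at full precision — which is why this repository can remain unquantized. In practice the 1/s factor is folded into the preceding normalization layer rather than applied at runtime.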

Purpose and Potential Use Cases

This model is particularly relevant for researchers and developers exploring post-training quantization (PTQ) or quantization-aware training. By providing a smoothed version of Llama 2-13B, it offers a pre-processed foundation that can lead to better accuracy when subsequently quantized to lower bit-widths, because the activation outliers that make naive quantization lossy have already been migrated into the weights. It also allows experimentation with SmoothQuant's effects in full precision, before any quantization is applied, making it a useful intermediate artifact in an optimization pipeline for efficient deployment of large language models.
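
The payoff of smoothing shows up once activations are later quantized. The sketch below is an illustrative NumPy experiment (not this model's evaluation): it fake-quantizes activations to 8 bits with and without the outlier migration and compares each result against the full-precision matmul. Weight quantization is omitted for clarity; the layer sizes and outlier magnitude are invented for the demo.

```python
import numpy as np

def fake_quant(x, bits=8):
    """Symmetric per-tensor fake quantization: snap to an int grid and back."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))          # a toy linear layer
X = rng.normal(size=(32, 16))         # activations at inference time
X[:, 0] *= 80.0                       # one outlier channel, as seen in LLMs
ref = X @ W.T                         # full-precision reference output

# Naive A8: the outlier channel dominates the per-tensor scale, so all
# other channels are quantized very coarsely.
naive = fake_quant(X) @ W.T

# SmoothQuant (alpha = 0.85, as in this model): divide activations by s and
# multiply weights by s; the product is unchanged but X/s is easy to quantize.
alpha = 0.85
s = (np.abs(X).max(axis=0) ** alpha
     / np.maximum(np.abs(W).max(axis=0), 1e-5) ** (1 - alpha))
smoothed = fake_quant(X / s) @ (W * s).T

err_naive = np.abs(naive - ref).mean()
err_smooth = np.abs(smoothed - ref).mean()
```

In this setup the smoothed path yields a noticeably lower mean absolute error than the naive one, which is the effect that makes a pre-smoothed checkpoint like this a convenient starting point for W8A8-style PTQ.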