open-machine/Llama-3.1-8B-FlashNorm

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Tool Calling: Supported · Published: Apr 22, 2026 · License: llama3.1 · Architecture: Transformer

open-machine/Llama-3.1-8B-FlashNorm is an 8-billion-parameter Llama-3.1 checkpoint prepared with FlashNorm by OpenMachine-ai. It is mathematically equivalent to the original Llama-3.1-8B, but the per-channel RMSNorm weights are folded into the linear layers that follow, as detailed in the FlashNorm paper. The change aims to improve inference efficiency without altering the model's outputs, which makes it suitable for applications that need efficient transformer inference within an 8192-token context.


Overview

open-machine/Llama-3.1-8B-FlashNorm is a FlashNorm-prepared version of Meta's Llama-3.1-8B model, published by OpenMachine-ai. FlashNorm reformulates each RMSNorm -> Linear pair by folding the per-channel normalization weights into the weight matrix of the following linear layer. As detailed in the FlashNorm paper, the resulting model is mathematically equivalent to the original Llama-3.1-8B, while the RMSNorm layer no longer applies a per-channel scale at runtime, which may make inference faster.
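The fold can be checked numerically on a toy example. The following is a minimal sketch in plain Python, not code from the FlashNorm repository: it assumes weights stored as (in_features, out_features) and a small epsilon inside the square root, and the helper names (rmsnorm, linear, fold_norm_weights) are illustrative.

```python
import math
import random

def rmsnorm(x, g=None, eps=1e-6):
    """RMSNorm over one vector; g is the optional per-channel scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms for v in x]
    if g is None:
        return normed  # unscaled variant, as used after folding
    return [n * gi for n, gi in zip(normed, g)]

def linear(x, W):
    """y[j] = sum_i x[i] * W[i][j], with W stored as (in_features, out_features)."""
    n_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]

def fold_norm_weights(g, W):
    """Fold the scale into the matrix: W_folded[i][j] = g[i] * W[i][j]."""
    return [[g[i] * w for w in row] for i, row in enumerate(W)]

random.seed(0)
d_in, d_out = 8, 4  # toy sizes, chosen only for illustration
x = [random.gauss(0, 1) for _ in range(d_in)]
g = [random.uniform(0.5, 1.5) for _ in range(d_in)]
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]

# Original path: scaled RMSNorm, then the linear layer.
y_ref = linear(rmsnorm(x, g), W)
# Folded path: unscaled RMSNorm, then the linear layer with merged weights.
y_folded = linear(rmsnorm(x), fold_norm_weights(g, W))

max_abs_diff = max(abs(a - b) for a, b in zip(y_ref, y_folded))
```

In this sketch the two paths differ only by floating-point rounding, because multiplying by g before the matrix product is the same linear map as scaling the rows of W by g beforehand.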

Key Characteristics

  • FlashNorm Optimization: The primary differentiator is FlashNorm, which folds the RMSNorm weights into the following linear layers and so removes the per-channel scale from the RMSNorm layer at runtime.
  • Mathematical Equivalence: Despite the structural change, the model is mathematically equivalent to the original Llama-3.1-8B, as proven by Proposition 1 of the FlashNorm paper. In practice, outputs should match up to floating-point rounding.
  • HuggingFace Transformers Support: The checkpoint loads with HuggingFace Transformers. Loading prints a warning about missing norm weights; this is expected, because the missing weights default to ones, which is the correct value after folding.
  • vLLM Compatibility: Currently, vLLM does not natively support this checkpoint due to the absence of norm weight tensors, but a patch is being tracked.
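To make the missing-norm-weights behavior concrete, here is a toy illustration in plain Python of what "defaulting to ones" means for a checkpoint that ships without norm tensors. It is a hypothetical helper, not the logic used inside Transformers and not the vLLM patch mentioned above; the parameter names follow the usual Llama naming in Transformers, and the sizes are made up.

```python
def fill_missing_norm_weights(checkpoint, expected_sizes):
    """Return a copy of `checkpoint` with absent '*norm.weight' entries set to ones.

    checkpoint:     dict mapping parameter name -> flat list of floats
    expected_sizes: dict mapping parameter name -> number of elements the model expects
    """
    filled = dict(checkpoint)
    for name, size in expected_sizes.items():
        if name not in filled and name.endswith("norm.weight"):
            # After folding, the per-channel scale lives in the next linear
            # layer, so the neutral value for the norm weight is 1.0.
            filled[name] = [1.0] * size
    return filled

# Toy model description with a hidden size of 4 (illustrative only).
expected_sizes = {
    "model.layers.0.input_layernorm.weight": 4,
    "model.layers.0.self_attn.q_proj.weight": 16,
    "model.layers.0.post_attention_layernorm.weight": 4,
    "model.norm.weight": 4,
}

# A FlashNorm-style checkpoint: linear weights present, norm weights absent.
checkpoint = {"model.layers.0.self_attn.q_proj.weight": [0.1] * 16}

filled = fill_missing_norm_weights(checkpoint, expected_sizes)
missing_before = sorted(set(expected_sizes) - set(checkpoint))
```

A loader that instead requires every expected tensor to be present would reject such a checkpoint, which is consistent with the vLLM limitation described above.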

Usage Considerations

This model is aimed at developers who want the Llama-3.1-8B architecture with the potential inference savings of FlashNorm. It is most relevant where throughput and latency matter, provided the chosen inference framework can load a checkpoint without norm weight tensors.
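A minimal loading sketch with HuggingFace Transformers might look as follows. It assumes the transformers and torch packages are installed, that the repository is accessible from your account, and that default loading options are acceptable; the function names are illustrative. The warning about missing norm weights described above is expected on load.

```python
MODEL_ID = "open-machine/Llama-3.1-8B-FlashNorm"  # repository name from this page

def load_flashnorm_model(model_id=MODEL_ID):
    """Load tokenizer and model; a missing-norm-weights warning is expected."""
    # Imported lazily so the heavy dependency is only needed when loading.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model

def generate(prompt, max_new_tokens=32):
    """Generate a short completion for `prompt` with default decoding settings."""
    tokenizer, model = load_flashnorm_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate("Hello") downloads the checkpoint on first use, so it is best tried on a machine with enough memory for an 8B model.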